Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), type-feature (A feature request or enhancement)
Description
Feature or enhancement
Proposal:
DuckDB and orjson don't use PyUnicode_FromStringAndSize(); they use PyUnicode_4BYTE_DATA() instead because they consider Python's UTF-8 decoder slow.
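For context, here is a minimal sketch of the two styles. This is not the actual DuckDB or orjson code; the helper names and the pre-decoded `codepoints` array are made up for illustration.

```c
#include <assert.h>
#include <Python.h>

/* The simple route: let CPython's decoder pick the narrowest
   representation (ASCII/latin1/UCS2/UCS4) for us. */
static PyObject *
decode_with_cpython(const char *buf, Py_ssize_t size)
{
    return PyUnicode_FromStringAndSize(buf, size);
}

/* The "do it yourself" route: the extension decodes the UTF-8 itself,
   so it already knows the code point count and the maximum code point,
   and writes straight into the new string's buffer. Only valid when
   max_char > 0xFFFF, which makes PyUnicode_New() return a UCS4
   (4-byte) string, so PyUnicode_4BYTE_DATA() is the right accessor. */
static PyObject *
build_ucs4_string(const Py_UCS4 *codepoints, Py_ssize_t n, Py_UCS4 max_char)
{
    assert(max_char > 0xFFFF);
    PyObject *str = PyUnicode_New(n, max_char);
    if (str == NULL) {
        return NULL;
    }
    Py_UCS4 *data = PyUnicode_4BYTE_DATA(str);
    for (Py_ssize_t i = 0; i < n; i++) {
        data[i] = codepoints[i];
    }
    return str;
}
```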
vs DuckDB
- DuckDB decoder: https://github.com/duckdb/duckdb/blob/e791508e9bc2eb84bc87eb794074f4893093b743/tools/pythonpkg/src/numpy/array_wrapper.cpp#L215
- PyUnicode_FromStringAndSize() is 4x faster when decoding a long ASCII string.
- DuckDB's decoder is 1.5x faster when decoding a short non-ASCII string.
vs orjson
I didn't create a simple extension module containing only orjson's UTF-8 decoder because I am not familiar with Rust yet.
When running their benchmark suite, using PyUnicode_FromStringAndSize() slows down decoding twitter.json. twitter.json contains many non-ASCII (almost all UCS2) strings; most of them are Japanese texts of ~140 characters.
Why the Python decoder is slow & possible optimization
- The current Python decoder tries ASCII first. When the input is not ASCII, the decoder needs to convert the buffer into a latin1/UCS2/UCS4 string.
- When the UTF-8 input is 120 bytes, the decoder allocates 120 codepoints. But if all codepoints are 3-byte UTF-8 sequences, the result string has only 40 codepoints, so the decoder needs to reallocate the string at the end.
When the text is long, the reallocation cost is relatively small. But for short (~200-byte) strings, reallocation is much slower than the decoding itself.
Can we avoid some of the reallocation without slowing down ASCII decoding?
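One building block for this, roughly what the linked find_first_nonascii PRs are about, is a portable scan for the first non-ASCII byte. Below is a sketch, assuming word-at-a-time loads done with memcpy to avoid unaligned-access undefined behavior; the function name and structure are illustrative, not the exact unicodeobject.c code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan word-at-a-time for a byte with the high bit set, using memcpy for
   the unaligned loads so the code stays portable and UBSan-clean.
   Returns the index of the first non-ASCII byte, or `size` if the buffer
   is pure ASCII. */
static size_t
first_nonascii(const unsigned char *p, size_t size)
{
    const uint64_t high_bits = 0x8080808080808080ULL;
    size_t i = 0;
    while (i + sizeof(uint64_t) <= size) {
        uint64_t chunk;
        memcpy(&chunk, p + i, sizeof(chunk));  /* unaligned load, no UB */
        if (chunk & high_bits) {
            break;  /* some byte in this word is >= 0x80 */
        }
        i += sizeof(uint64_t);
    }
    while (i < size && p[i] < 0x80) {
        i++;
    }
    return i;
}
```

Knowing how far the pure-ASCII prefix extends lets the decoder keep the ASCII fast path for that prefix and make a better guess about the final length before allocating, which is where the reallocation cost for short non-ASCII strings could be reduced.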
Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
Linked PRs
- gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025
- gh-126024: fix UBSan failure in unicodeobject.c:find_first_nonascii #127566
- Revert "gh-126024: fix UBSan failure in unicodeobject.c:find_first_nonascii (GH-127566)" #127695
- gh-126024: Use only memcpy for unaligned loads in find_first_nonascii #127769