Improve UTF-8 decode speed #126024

@methane

Description

Feature or enhancement

Proposal:

DuckDB and orjson don't use PyUnicode_FromStringAndSize(); they use PyUnicode_4BYTE_DATA() to write code points into the string directly, because they consider Python's UTF-8 decoder slow.
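
For illustration, here is a minimal sketch of the two styles (not DuckDB's or orjson's actual code; the code points are hard-coded for brevity and only the allocation failure is handled):

```c
/* A minimal sketch of the two styles (not code from DuckDB or orjson).
 * Both build the same 4-character string "hi 😀". */
#include <Python.h>

static PyObject *
build_via_decoder(void)
{
    /* Let CPython's UTF-8 decoder size and allocate the string. */
    const char utf8[] = "hi \xF0\x9F\x98\x80";          /* "hi 😀" in UTF-8 */
    return PyUnicode_FromStringAndSize(utf8, sizeof(utf8) - 1);
}

static PyObject *
build_via_direct_write(void)
{
    /* Allocate the string at its final size and width up front, then
     * write the code points straight into its buffer. */
    static const Py_UCS4 cps[] = {'h', 'i', ' ', 0x1F600};
    PyObject *str = PyUnicode_New(4, 0x1F600);          /* needs UCS4 storage */
    if (str == NULL)
        return NULL;
    Py_UCS4 *data = PyUnicode_4BYTE_DATA(str);
    for (Py_ssize_t i = 0; i < 4; i++)
        data[i] = cps[i];
    return str;
}
```

The direct-write style requires knowing the length and the widest code point before allocating, which is exactly what a UTF-8 decoder has to discover while scanning the input.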

vs DuckDB

vs orjson

I didn't create a simple extension module containing only the orjson UTF-8 decoder because I am not familiar with Rust yet.

When running their benchmark suite, using PyUnicode_FromStringAndSize() slows down decoding twitter.json. twitter.json contains many non-ASCII (mostly UCS2) strings; most of them are Japanese text of ~140 characters.

Why the Python decoder is slow & possible optimizations

  • The current Python decoder tries to build an ASCII string first. When the input is not ASCII, the decoder has to convert the buffer into a latin1/UCS2/UCS4 string.
  • When the UTF-8 input is 120 bytes, the decoder allocates room for 120 code points. But if every code point is a 3-byte UTF-8 sequence, the result string has only 40 code points, so the decoder has to reallocate (shrink) the string at the end.

When the text is long, the reallocation cost is relatively small. But for short (~200-byte) strings, the reallocation is much slower than the decoding itself.

Can we avoid some of this reallocation without slowing down ASCII decoding?
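
One possible shape for such an optimization, purely as a sketch (this is not the approach taken in any linked PR; it assumes valid, non-overlong UTF-8 and omits error handling): keep the ASCII fast path, but once the first non-ASCII byte is found, pre-scan the remaining bytes to learn the exact code point count and storage width, so the result string is allocated at its final size and never shrunk.

```c
#include <Python.h>

static PyObject *
decode_utf8_no_shrink(const unsigned char *s, Py_ssize_t size)
{
    /* ASCII fast path: find the first non-ASCII byte. */
    Py_ssize_t ascii_len = 0;
    while (ascii_len < size && s[ascii_len] < 0x80)
        ascii_len++;
    if (ascii_len == size)                      /* pure ASCII input */
        return PyUnicode_FromStringAndSize((const char *)s, size);

    /* Pre-scan the tail: count code points and pick the storage width from
     * the lead bytes (0xC2-0xC3 stay in latin1, 0xC4-0xDF and 3-byte leads
     * need UCS2, 4-byte leads need UCS4). */
    Py_ssize_t nchars = ascii_len;
    Py_UCS4 maxchar = 0x7F;
    for (Py_ssize_t i = ascii_len; i < size; nchars++) {
        unsigned char c = s[i];
        if (c < 0x80)      { i += 1; }
        else if (c < 0xC4) { i += 2; if (maxchar < 0xFF)   maxchar = 0xFF; }
        else if (c < 0xE0) { i += 2; if (maxchar < 0x7FF)  maxchar = 0x7FF; }
        else if (c < 0xF0) { i += 3; if (maxchar < 0xFFFF) maxchar = 0xFFFF; }
        else               { i += 4; maxchar = 0x10FFFF; }
    }

    /* Allocate once at the exact final size: no resize at the end. */
    PyObject *str = PyUnicode_New(nchars, maxchar);
    if (str == NULL)
        return NULL;
    int kind = PyUnicode_KIND(str);
    void *data = PyUnicode_DATA(str);

    Py_ssize_t j = 0;
    for (; j < ascii_len; j++)                  /* copy the ASCII prefix */
        PyUnicode_WRITE(kind, data, j, s[j]);

    for (Py_ssize_t i = ascii_len; j < nchars; j++) {   /* decode the tail */
        unsigned char c = s[i];
        Py_UCS4 ch;
        if (c < 0x80)      { ch = c; i += 1; }
        else if (c < 0xE0) { ch = ((Py_UCS4)(c & 0x1F) << 6) | (s[i+1] & 0x3F);
                             i += 2; }
        else if (c < 0xF0) { ch = ((Py_UCS4)(c & 0x0F) << 12)
                                  | ((Py_UCS4)(s[i+1] & 0x3F) << 6)
                                  | (s[i+2] & 0x3F);
                             i += 3; }
        else               { ch = ((Py_UCS4)(c & 0x07) << 18)
                                  | ((Py_UCS4)(s[i+1] & 0x3F) << 12)
                                  | ((Py_UCS4)(s[i+2] & 0x3F) << 6)
                                  | (s[i+3] & 0x3F);
                             i += 4; }
        PyUnicode_WRITE(kind, data, j, ch);
    }
    return str;
}
```

The extra pre-scan pass is the trade-off: it touches the non-ASCII tail twice, but for short strings that can still be cheaper than allocating a too-large string and resizing it afterwards.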

Has this already been discussed elsewhere?

I have already discussed this feature proposal on Discourse

Links to previous discussion of this feature:

https://discuss.python.org/t/pep-756-c-api-add-pyunicode-export-and-pyunicode-import-c-functions/63891/53

Linked PRs
