Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage), type-feature (A feature request or enhancement)
Description
Feature or enhancement
Proposal:
DuckDB and orjson don't use PyUnicode_FromStringAndSize(); they use PyUnicode_4BYTE_DATA() instead because they consider Python's UTF-8 decoder slow.
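For context, here is a minimal sketch of the two styles. This is not the actual DuckDB or orjson code; the helper names and the pre-decoded `codepoints` array are made up for illustration.

```c
#include <assert.h>
#include <Python.h>

/* The simple route: let CPython's decoder pick the narrowest
   representation (ASCII/latin1/UCS2/UCS4) for us. */
static PyObject *
decode_with_cpython(const char *buf, Py_ssize_t size)
{
    return PyUnicode_FromStringAndSize(buf, size);
}

/* The "do it yourself" route: the extension decodes the UTF-8 itself,
   so it already knows the code point count and the maximum code point,
   and writes straight into the new string's buffer. Only valid when
   max_char > 0xFFFF, which makes PyUnicode_New() return a UCS4
   (4-byte) string, so PyUnicode_4BYTE_DATA() is the right accessor. */
static PyObject *
build_ucs4_string(const Py_UCS4 *codepoints, Py_ssize_t n, Py_UCS4 max_char)
{
    assert(max_char > 0xFFFF);
    PyObject *str = PyUnicode_New(n, max_char);
    if (str == NULL) {
        return NULL;
    }
    Py_UCS4 *data = PyUnicode_4BYTE_DATA(str);
    for (Py_ssize_t i = 0; i < n; i++) {
        data[i] = codepoints[i];
    }
    return str;
}
```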
vs DuckDB
- DuckDB decoder: https://github.com/duckdb/duckdb/blob/e791508e9bc2eb84bc87eb794074f4893093b743/tools/pythonpkg/src/numpy/array_wrapper.cpp#L215
- PyUnicode_FromStringAndSize() is 4x faster when decoding a long ASCII string.
- DuckDB's decoder is 1.5x faster when decoding a short non-ASCII string.
vs orjson
I didn't create a simple extension module containing only orjson's UTF-8 decoder because I am not familiar with Rust yet.
When running their benchmark suite, using PyUnicode_FromStringAndSize() slows down decoding twitter.json. twitter.json contains many non-ASCII (almost all UCS2) strings; most of them are Japanese texts of ~140 characters.
Why the Python decoder is slow & possible optimization
- The current Python decoder tries ASCII first. When the input is not ASCII, the decoder needs to convert the buffer into a latin1/UCS2/UCS4 string.
- When the UTF-8 input is 120 bytes, the decoder allocates 120 codepoints. But if all codepoints are 3-byte UTF-8 sequences, the result string has only 40 codepoints, so the decoder needs to reallocate the string at the end.
When the text is long, the reallocation cost is relatively small. But for short (~200-byte) strings, reallocation is much slower than the decoding itself.
Can we avoid some of the reallocation without slowing down ASCII decoding?
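One building block for this, roughly what the linked find_first_nonascii PRs are about, is a portable scan for the first non-ASCII byte. Below is a sketch, assuming word-at-a-time loads done with memcpy to avoid unaligned-access undefined behavior; the function name and structure are illustrative, not the exact unicodeobject.c code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan word-at-a-time for a byte with the high bit set, using memcpy for
   the unaligned loads so the code stays portable and UBSan-clean.
   Returns the index of the first non-ASCII byte, or `size` if the buffer
   is pure ASCII. */
static size_t
first_nonascii(const unsigned char *p, size_t size)
{
    const uint64_t high_bits = 0x8080808080808080ULL;
    size_t i = 0;
    while (i + sizeof(uint64_t) <= size) {
        uint64_t chunk;
        memcpy(&chunk, p + i, sizeof(chunk));  /* unaligned load, no UB */
        if (chunk & high_bits) {
            break;  /* some byte in this word is >= 0x80 */
        }
        i += sizeof(uint64_t);
    }
    while (i < size && p[i] < 0x80) {
        i++;
    }
    return i;
}
```

Knowing how far the pure-ASCII prefix extends lets the decoder keep the ASCII fast path for that prefix and make a better guess about the final length before allocating, which is where the reallocation cost for short non-ASCII strings could be reduced.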
Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
Linked PRs
- gh-126024: optimize UTF-8 decoder for short non-ASCII string #126025
- gh-126024: fix UBSan failure in unicodeobject.c:find_first_nonascii #127566
- Revert "gh-126024: fix UBSan failure in unicodeobject.c:find_first_nonascii (GH-127566)" #127695
- gh-126024: Use only memcpy for unaligned loads in find_first_nonascii #127769