gh-133968: Add PyUnicodeWriter_WriteASCII() function #133973
Conversation
Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII().
JSON benchmark: #133832 (comment)
Benchmark hidden because not significant (11): encode 100 floats, encode ascii string len=100, encode Unicode string len=100, encode 1000 integers, encode 1000 floats, encode 1000 "ascii" strings, encode ascii string len=1000, encode escaped string len=896, encode 10000 integers, encode 10000 floats, encode 10000 "ascii" strings.

The speedup of up to 1.20x for encoding booleans is interesting, given that these strings are very short: "true" (4 characters) and "false" (5 characters).
@serhiy-storchaka: What do you think of this function?
Some nits
Well, we had But We can add private
Co-authored-by: Peter Bierma <[email protected]>
I don't think that it can become as fast or faster than a function which takes an ASCII string as argument. If we know that the input string is ASCII, there is no need to scan the string for non-ASCII characters, and we can take the fast path. You're right that the UTF-8 decoder is already highly optimized.
In short:
It's hard to beat
Yes, although it was close, at least for moderately large strings. Could it be optimized even more? I don't know. But the decision about
I created capi-workgroup/decisions#65 issue. |
Benchmark:

On long strings (10,000 bytes), PyUnicodeWriter_WriteASCII() is up to 2x faster (1.36 us => 690 ns) than PyUnicodeWriter_WriteUTF8().

    from _testcapi import PyUnicodeWriter
    import pyperf

    range_100 = range(100)

    def bench_write_utf8(text, size):
        writer = PyUnicodeWriter(0)
        for _ in range_100:
            writer.write_utf8(text, size)
            writer.write_utf8(text, size)
            writer.write_utf8(text, size)
            writer.write_utf8(text, size)
            writer.write_utf8(text, size)
            writer.write_utf8(text, size)
            writer.write_utf8(text, size)
            writer.write_utf8(text, size)
            writer.write_utf8(text, size)
            writer.write_utf8(text, size)

    def bench_write_ascii(text, size):
        writer = PyUnicodeWriter(0)
        for _ in range_100:
            writer.write_ascii(text, size)
            writer.write_ascii(text, size)
            writer.write_ascii(text, size)
            writer.write_ascii(text, size)
            writer.write_ascii(text, size)
            writer.write_ascii(text, size)
            writer.write_ascii(text, size)
            writer.write_ascii(text, size)
            writer.write_ascii(text, size)
            writer.write_ascii(text, size)

    runner = pyperf.Runner()
    sizes = (10, 100, 1_000, 10_000)

    for size in sizes:
        text = b'x' * size
        runner.bench_func(f'write_utf8 size={size:,}', bench_write_utf8, text, size,
                          inner_loops=1_000)

    for size in sizes:
        text = b'x' * size
        runner.bench_func(f'write_ascii size={size:,}', bench_write_ascii, text, size,
                          inner_loops=1_000)
Do we know where the bottleneck is for long strings?
WriteUTF8() has to check for non-ASCII characters: this check has a cost. That's the bottleneck.
Maybe, I don't know if it would be faster.
I tried but failed to modify the code to copy while reading (checking whether the string is ASCII while it is being copied). The code is quite complicated.
Co-authored-by: Bénédikt Tran <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>
I'm happy to have this function public. I always preferred using the faster versions of the writer API when I hardcoded strings, but they were private.
Sorry for the late review, LGTM as well.
The C API Working Group voted in favor of adding the function.
…3973) Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII().

Unrelated change to please the linter: remove an unused import in test_ctypes.

Co-authored-by: Peter Bierma <[email protected]>
Co-authored-by: Bénédikt Tran <[email protected]>

(cherry picked from commit f49a07b)
GH-134974 is a backport of this pull request to the 3.14 branch.
Replace most PyUnicodeWriter_WriteUTF8() calls with PyUnicodeWriter_WriteASCII().
📚 Documentation preview 📚: https://cpython-previews--133973.org.readthedocs.build/