gh-133968: Use private unicode writer for json #133832

nineteendo · 2025-05-10T14:29:45Z

pyperformance (with `--enable-optimizations` and `--with-lto`)

main.json
=========

Performance version: 1.11.0
Python version: 3.15.0a0 (64-bit) revision c600310663
Report on macOS-13.7.6-x86_64-i386-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2025-05-30 16:28:48.633929
End date: 2025-05-30 16:29:21.986698

feature.json
============

Performance version: 1.11.0
Python version: 3.15.0a0 (64-bit) revision 566637c24a
Report on macOS-13.7.6-x86_64-i386-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2025-05-30 16:29:28.723283
End date: 2025-05-30 16:29:59.110914

### json_loads ###
Mean +- std dev: 34.2 us +- 7.3 us -> 30.7 us +- 0.9 us: 1.11x faster
Significant (t=5.22)

jsonyx-performance-tests (with `--enable-optimizations` and `--with-lto`)

| decode | main | feature | difference |
| --- | --- | --- | --- |
| Dict with 65,536 booleans | 12295.20 μs | 12303.52 μs | no difference |
| List of 65,536 empty strings | 2220.59 μs | 1891.46 μs | 1.17x faster |
| List of 65,536 ASCII strings | 7524.63 μs | 7094.07 μs | 1.06x faster |
| List of 65,536 strings | 168058.21 μs | 179960.84 μs | 1.07x slower |

Issue: Using the public PyUnicodeWriter C API made the json module slower #133968

ZeroIntensity · 2025-05-10T14:47:35Z

What's the point? This just adds more maintenance if we make changes to how PyUnicodeWriter works.

nineteendo · 2025-05-10T14:57:03Z

The switch to the public PyUnicodeWriter API might have caused a performance regression compared to 3.13: faster-cpython/ideas#726
I'm still benchmarking, but I wanted to run the tests already.

nineteendo · 2025-05-10T17:23:52Z

cc @vstinner, @mdboom

ZeroIntensity

I don't think this is a good idea.

json optimizations have been rejected in the past--use ujson or something like that if performance is critical.
If we change how PyUnicodeWriter works, it adds more maintenance, especially if we use this as precedent for doing this elsewhere.
We should aim for optimizing PyUnicodeWriter as a broader change, not speed up each individual case by using the private API.

vstinner

Would it be possible to only replace PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString()? Do you get similar performance in this case?

vstinner · 2025-05-11T16:52:24Z

See also #133186.

ZeroIntensity · 2025-05-11T16:55:31Z

If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

vstinner · 2025-05-11T16:57:29Z

> If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

I chose to not expose it since it generates an invalid string if the input string contains non-ASCII characters. But yeah, maybe we should expose it. The function only validates the input string in debug mode for best performance.
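
As a rough Python analogy for that trade-off (not the C implementation): the check below is an `assert`, which Python strips under `-O`, much as C `assert()` is compiled out of release builds. The `write_ascii` helper and the `bytearray` buffer are hypothetical names used only for illustration.

```python
def write_ascii(buf: bytearray, data: bytes) -> None:
    """Append data to buf, trusting the caller that it is pure ASCII."""
    # Checked only when assertions are enabled (no -O flag); this mirrors
    # validation that exists only in debug builds.
    assert data.isascii(), "caller passed non-ASCII bytes"
    buf += data  # in-place extend of the caller's bytearray


buf = bytearray()
write_ascii(buf, b"true")
write_ascii(buf, b", ")
write_ascii(buf, b"false")
print(buf.decode("ascii"))
```

With assertions disabled, non-ASCII input would be appended unchecked, which is the hazard described above.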

nineteendo · 2025-05-11T17:31:51Z

> Would it be possible to only replace PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString()? Do you get similar performance in this case?

Maybe, but my current benchmark has too much overhead to measure this accurately. I'll have to rewrite it first.

I hope we can figure out how to get the performance of the public API very close to the private one, such that everyone feels comfortable using it.

nineteendo · 2025-05-11T19:49:20Z

I updated the benchmark, but I don't understand why:

- writing integers is now 20% faster
- reading and writing unicode strings is now 2-5% slower (shouldn't be caused by noise)

Does this have something to do with the overallocate parameter?

vstinner · 2025-05-11T20:13:41Z

> Does this have something to do with the overallocate parameter?

The private API doesn't enable overallocation by default.
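
To see why overallocation matters for many small writes, here is a small simulation that counts buffer resizes. The growth policy (grow to the needed size plus 25% slack) is an assumption chosen for illustration, not a statement of what CPython does, and `count_reallocations` is a hypothetical helper.

```python
def count_reallocations(write_sizes, overallocate):
    """Count buffer resizes needed to append chunks of the given sizes."""
    capacity = 0
    used = 0
    reallocations = 0
    for size in write_sizes:
        used += size
        if used > capacity:
            # Assumed policy: grow to the needed size, plus 25% slack
            # when overallocation is enabled.
            capacity = used + (used // 4 if overallocate else 0)
            reallocations += 1
    return reallocations


# 65,536 short writes, similar in spirit to encoding a large JSON list.
writes = [6] * 65_536
print(count_reallocations(writes, overallocate=False))  # one resize per write
print(count_reallocations(writes, overallocate=True))   # far fewer resizes
```

Without slack, every write triggers a resize; with slack, the number of resizes grows only logarithmically with the output size.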

vstinner · 2025-05-12T17:28:32Z

I replaced PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString() in Modules/_json.c and ran a benchmark:

| Benchmark | ref | write_ascii |
| --- | --- | --- |
| encode 100 booleans | 9.54 us | 8.83 us: 1.08x faster |
| encode 1000 booleans | 60.8 us | 53.1 us: 1.15x faster |
| encode escaped string len=896 | 4.11 us | 4.10 us: 1.00x faster |
| encode 10000 booleans | 569 us | 487 us: 1.17x faster |
| encode 10000 integers | 1.03 ms | 1.03 ms: 1.00x slower |
| encode 10000 floats | 2.11 ms | 2.13 ms: 1.01x slower |
| Geometric mean | (ref) | 1.02x faster |

Benchmark hidden because not significant (15): encode 100 integers, encode 100 floats, encode 100 "ascii" strings, encode ascii string len=100, encode escaped string len=128, encode Unicode string len=100, encode 1000 integers, encode 1000 floats, encode 1000 "ascii" strings, encode ascii string len=1000, encode Unicode string len=1000, encode 10000 "ascii" strings, encode ascii string len=10000, encode escaped string len=9984, encode Unicode string len=10000

I built Python with ./configure && make and used CPU Isolation on Linux.

Benchmark code:

import json
import pyperf
runner = pyperf.Runner()

for count in (100, 1_000, 10_000):
    runner.bench_func(f'encode {count} booleans', json.dumps, [True, False] * (count // 2))
    runner.bench_func(f'encode {count} integers', json.dumps, list(range(count)))
    runner.bench_func(f'encode {count} floats', json.dumps, [1.0] * count)
    runner.bench_func(f'encode {count} "ascii" strings', json.dumps, ['ascii'] * count)

    text = 'ascii'
    text *= (count // len(text) or 1)
    runner.bench_func(f'encode ascii string len={len(text)}', json.dumps, text)

    text = ''.join(chr(ch) for ch in range(128))
    text *= (count // len(text) or 1)
    runner.bench_func(f'encode escaped string len={len(text)}', json.dumps, text)

    text = 'abcd€'
    text *= (count // len(text) or 1)
    runner.bench_func(f'encode Unicode string len={len(text)}', json.dumps, text)

vstinner · 2025-05-12T17:42:37Z

I also ran my benchmark on this PR:

| Benchmark | ref | change |
| --- | --- | --- |
| encode 100 booleans | 9.53 us | 6.26 us: 1.52x faster |
| encode 100 integers | 13.8 us | 11.6 us: 1.20x faster |
| encode 100 floats | 24.8 us | 19.0 us: 1.30x faster |
| encode 100 "ascii" strings | 17.1 us | 12.4 us: 1.37x faster |
| encode ascii string len=100 | 902 ns | 877 ns: 1.03x faster |
| encode escaped string len=128 | 1.10 us | 1.07 us: 1.03x faster |
| encode Unicode string len=100 | 1.07 us | 1.04 us: 1.03x faster |
| encode 1000 booleans | 58.9 us | 29.6 us: 1.99x faster |
| encode 1000 integers | 103 us | 81.5 us: 1.26x faster |
| encode 1000 floats | 209 us | 152 us: 1.37x faster |
| encode 1000 "ascii" strings | 131 us | 86.9 us: 1.51x faster |
| encode ascii string len=1000 | 3.48 us | 3.46 us: 1.00x faster |
| encode escaped string len=896 | 4.12 us | 3.96 us: 1.04x faster |
| encode 10000 booleans | 546 us | 257 us: 2.12x faster |
| encode 10000 integers | 1.00 ms | 788 us: 1.27x faster |
| encode 10000 floats | 2.04 ms | 1.46 ms: 1.39x faster |
| encode 10000 "ascii" strings | 1.27 ms | 806 us: 1.57x faster |
| encode ascii string len=10000 | 28.4 us | 28.4 us: 1.00x slower |
| encode escaped string len=9984 | 38.5 us | 36.3 us: 1.06x faster |
| encode Unicode string len=10000 | 42.4 us | 43.2 us: 1.02x slower |
| Geometric mean | (ref) | 1.26x faster |

Benchmark hidden because not significant (1): encode Unicode string len=1000

nineteendo · 2025-05-12T18:26:04Z

> The private API doesn't enable overallocation by default.

Yeah, but both the old code and the public API do, so it's not that. (And you really don't want to turn it off)

| encode | overallocate | normal | slow down |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1222.61 μs | 10456.60 μs | 8.55x slower |
| List of 65,536 ints | 3174.27 μs | 12961.39 μs | 4.08x slower |
| Dict with 65,536 booleans | 10011.93 μs | 29166.14 μs | 2.91x slower |
| List of 65,536 ASCII strings | 13303.33 μs | 23714.00 μs | 1.78x slower |
| List of 65,536 floats | 37103.57 μs | 48716.57 μs | 1.31x slower |
| List of 65,536 strings | 91757.30 μs | 113949.94 μs | 1.24x slower |

| decode | overallocate | normal | slow down |
| --- | --- | --- | --- |
| Dict with 65,536 booleans | 12194.03 μs | 12098.16 μs | 1.01x faster |
| List of 65,536 ASCII strings | 7011.86 μs | 7101.61 μs | 1.01x slower |
| List of 65,536 strings | 36049.66 μs | 36412.55 μs | 1.01x slower |

nineteendo · 2025-05-12T19:05:53Z

This line is inefficient for exact string instances (Py_INCREF is enough):

cpython/Objects/unicodeobject.c

Line 13936 in 86c1d43

PyObject *str = PyObject_Str(obj);
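
The idea behind that remark, sketched in Python rather than C: when the object is exactly a `str`, it can be used as-is, and the generic conversion is needed only for other types. `write_str` and the list-based buffer are hypothetical; the real function writes into a C buffer.

```python
def write_str(chunks: list, obj: object) -> None:
    """Append the text form of obj to chunks."""
    if type(obj) is str:
        # Fast path: exact str instance, no conversion call required.
        chunks.append(obj)
    else:
        # Generic path: subclasses and other types go through str().
        chunks.append(str(obj))


class Name(str):
    """A str subclass that overrides its text form."""

    def __str__(self):
        return "custom"


chunks = []
write_str(chunks, "plain")
write_str(chunks, Name("ignored"))
write_str(chunks, 42)
print("".join(chunks))
```

The check is for the exact type rather than `isinstance`, because a subclass may override `__str__` and must still take the generic path.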

| encode | private | public | slow down |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1217.10 μs | 1854.86 μs | 1.52x slower |
| List of 65,536 ints | 3190.74 μs | 3701.18 μs | 1.16x slower |
| Dict with 65,536 booleans | 8783.92 μs | 11459.45 μs | 1.30x slower |
| List of 65,536 ASCII strings | 12502.92 μs | 14842.52 μs | 1.19x slower |
| List of 65,536 floats | 37008.47 μs | 39790.32 μs | 1.08x slower |
| List of 65,536 strings | 90841.32 μs | 94637.59 μs | 1.04x slower |

| decode | private | public | slow down |
| --- | --- | --- | --- |
| List of 65,536 ASCII strings | 7064.05 μs | 7186.32 μs | 1.02x slower |
| List of 65,536 strings | 36033.15 μs | 36904.06 μs | 1.02x slower |

nineteendo · 2025-05-13T06:58:44Z

Here's the comparison with a minimal PR:

| encode | full | minimal | improvement |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1222.01 μs | 1201.54 μs | 1.02x faster |
| List of 65,536 ints | 3156.75 μs | 3047.81 μs | 1.04x faster |
| Dict with 65,536 booleans | 9442.35 μs | 8518.33 μs | 1.11x faster |
| List of 65,536 ASCII strings | 13405.99 μs | 12027.66 μs | 1.11x faster |
| List of 65,536 floats | 37037.70 μs | 36684.89 μs | 1.01x faster |
| List of 65,536 strings | 92007.41 μs | 87721.40 μs | 1.05x faster |

| decode | full | minimal | improvement |
| --- | --- | --- | --- |
| List of 65,536 ASCII strings | 7029.30 μs | 7431.43 μs | 1.06x slower |
| List of 65,536 strings | 36401.90 μs | 34232.03 μs | 1.06x faster |

nineteendo · 2025-05-13T09:21:18Z

@vstinner, could you add a fast path for exact string instances in PyUnicodeWriter_WriteStr()? or can we merge this as is?

vstinner · 2025-05-13T12:58:58Z

I created issue #133968 to track this work.

> @vstinner, could you add a fast path for exact string instances in PyUnicodeWriter_WriteStr()?

I wrote #133969 to add a fast path.

vstinner · 2025-05-13T13:35:11Z

> I wrote #133969 to add a fast path.

Merged. I confirmed with two benchmarks that this small optimization makes a big difference on some use cases such as encoding short strings in JSON.

vstinner · 2025-05-13T14:36:02Z

@ZeroIntensity:

> If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

Ok, I created #133973 to add PyUnicodeWriter_WriteASCII().

nineteendo · 2025-05-13T15:01:50Z

> Ok, I created #133973 to add PyUnicodeWriter_WriteASCII().

If that's merged, we would use this approach in 3.14, right?

nineteendo · 2025-05-15T08:48:53Z

@vstinner, it looks like the regression in json.loads() is caused by the heap allocation in PyUnicodeWriter_Create().
I've now delayed the allocation until it's necessary. Thoughts?
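
A minimal sketch of what delaying the allocation can look like, written in Python rather than C. The output buffer is created only when the first escape sequence appears, so strings without escapes never pay for it. `decode_string_body` and `SIMPLE_ESCAPES` are hypothetical names, only a few escapes are handled, and this is not the code in `Modules/_json.c`.

```python
SIMPLE_ESCAPES = {'"': '"', "\\": "\\", "/": "/", "n": "\n", "t": "\t"}


def decode_string_body(body: str) -> str:
    """Decode the text between the quotes of a JSON string (subset only)."""
    chunks = None  # created lazily, on the first escape sequence
    start = 0
    i = 0
    while i < len(body):
        if body[i] != "\\":
            i += 1
            continue
        if i + 1 >= len(body) or body[i + 1] not in SIMPLE_ESCAPES:
            raise ValueError(f"unsupported escape at index {i}")
        if chunks is None:
            chunks = []
        chunks.append(body[start:i])                # literal run before the escape
        chunks.append(SIMPLE_ESCAPES[body[i + 1]])  # the decoded escape
        i += 2
        start = i
    if chunks is None:
        # No escapes seen: return the input without building anything.
        return body
    chunks.append(body[start:])
    return "".join(chunks)


print(decode_string_body("no escapes here"))
print(decode_string_body("tab\\tseparated"))
```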

vstinner · 2025-05-16T01:06:57Z

> @vstinner, it looks like the regression in json.loads() is caused by the heap allocation in PyUnicodeWriter_Create().

Are you sure about that? The implementation uses a freelist which avoids the heap allocation in most cases.
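
For readers unfamiliar with the technique, a freelist caches released objects so the next request can reuse one instead of allocating. The sketch below is a Python illustration with a single slot; `Writer` and `WriterFreelist` are hypothetical classes, not the CPython data structures.

```python
class Writer:
    def __init__(self):
        self.chunks = []


class WriterFreelist:
    """Single-slot cache of released Writer objects."""

    def __init__(self):
        self._slot = None
        self.fresh_allocations = 0

    def acquire(self) -> Writer:
        writer = self._slot
        if writer is not None:
            # Reuse the cached object instead of allocating a new one.
            self._slot = None
            return writer
        self.fresh_allocations += 1
        return Writer()

    def release(self, writer: Writer) -> None:
        writer.chunks.clear()
        if self._slot is None:
            self._slot = writer
        # Otherwise the writer is dropped; the slot is already occupied.


pool = WriterFreelist()
for _ in range(1000):
    w = pool.acquire()
    w.chunks.append("x")
    pool.release(w)
print(pool.fresh_allocations)  # sequential use allocates only once

a, b = pool.acquire(), pool.acquire()
print(pool.fresh_allocations)  # a second live writer forces a new allocation
```

Sequential create-and-discard cycles keep hitting the cache; only overlapping lifetimes force a fresh allocation.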

vstinner · 2025-05-16T01:50:09Z

> I've now delayed the allocation until it's necessary. Thoughts?

Would you mind creating a separate PR just for that?

ZeroIntensity · 2025-05-16T02:14:51Z

Are the benchmarks creating an unrealistic number of concurrent writers? That would starve the freelist and create some allocation overhead, but only on the benchmarks.

nineteendo · 2025-05-16T06:41:56Z

> Are you sure about that? The implementation uses a freelist which avoids the heap allocation in most cases.

You're right, it does seem to be using the only entry of the freelist. (I disabled the malloc to check)
There might be some overhead compared to using the stack though.

vstinner · 2025-05-30T11:42:06Z

I suggest closing this PR. It's not worth it anymore (according to the benchmark below) and I prefer to stick to the public C API.

I made two small optimizations in the public PyUnicodeWriter API:

- Add fast path to PyUnicodeWriter_WriteStr(): #133969
- Add PyUnicodeWriter_WriteASCII() function: #133973

With these optimizations, it seems like this PR is less appealing. I ran a benchmark to compare this PR to the current main branch:

| Benchmark | main | pr133832 |
| --- | --- | --- |
| encode 100 booleans | 6.52 us | 6.61 us: 1.01x slower |
| encode 100 integers | 11.9 us | 11.7 us: 1.01x faster |
| encode 100 floats | 19.9 us | 20.7 us: 1.04x slower |
| encode 100 "ascii" strings | 13.3 us | 13.5 us: 1.01x slower |
| encode ascii string len=100 | 901 ns | 884 ns: 1.02x faster |
| encode escaped string len=128 | 1.11 us | 1.07 us: 1.03x faster |
| encode 1000 booleans | 32.6 us | 31.8 us: 1.03x faster |
| encode 1000 integers | 88.3 us | 83.2 us: 1.06x faster |
| encode 1000 floats | 161 us | 168 us: 1.04x slower |
| encode 1000 "ascii" strings | 96.6 us | 94.1 us: 1.03x faster |
| encode ascii string len=1000 | 3.49 us | 3.50 us: 1.00x slower |
| encode escaped string len=896 | 4.14 us | 3.95 us: 1.05x faster |
| encode Unicode string len=1000 | 4.92 us | 5.35 us: 1.09x slower |
| encode 10000 booleans | 284 us | 272 us: 1.05x faster |
| encode 10000 integers | 850 us | 797 us: 1.07x faster |
| encode 10000 floats | 1.56 ms | 1.59 ms: 1.02x slower |
| encode 10000 "ascii" strings | 897 us | 857 us: 1.05x faster |
| encode ascii string len=10000 | 28.5 us | 29.2 us: 1.02x slower |
| encode escaped string len=9984 | 38.5 us | 37.2 us: 1.03x faster |
| encode Unicode string len=10000 | 42.4 us | 46.8 us: 1.10x slower |
| Geometric mean | (ref) | 1.00x faster |

The best speedup is 1.07x faster for "encode 10000 integers".

The worst slowdown is 1.10x slower for "encode Unicode string len=10000".

Overall, the impact is "1.00x faster" which is not impressive.
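
For context on how such a summary figure can be derived, the sketch below computes a geometric mean of per-benchmark speed ratios, assuming that is how the "Geometric mean" row is obtained. It uses only four rows from the table above, chosen arbitrarily, so its result need not match the figure reported for all rows.

```python
import math


def geometric_mean(values):
    """Geometric mean via the average of logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# (main, pr133832) timings in microseconds for a few rows of the table.
rows = {
    "encode 10000 integers": (850.0, 797.0),
    "encode 10000 booleans": (284.0, 272.0),
    "encode 1000 floats": (161.0, 168.0),
    "encode Unicode string len=10000": (42.4, 46.8),
}

# A ratio above 1 means the PR is faster than main for that benchmark.
speedups = [main / pr for main, pr in rows.values()]
print(f"{geometric_mean(speedups):.2f}x")
```

Because the mean is taken over ratios, a 1.07x speedup and a 1.10x slowdown largely cancel out, which is why the overall figure stays close to 1.00x.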

vstinner · 2025-05-30T11:45:09Z

Hum. I would be interested in a change that just removes _PyUnicodeWriter_IsEmpty(), without touching the WriteUTF8/WriteASCII calls.

nineteendo · 2025-05-30T13:58:38Z

> I suggest closing this PR. It's not worth it anymore (according to the benchmark below) and I prefer to stick to the public C API.

According to the pyperformance benchmark, json_loads is still 10% slower because of the freelist. And after I've updated the PR, it will only be using the public API.

nineteendo · 2025-05-30T14:47:45Z

Done. Decoding empty strings is now 17% faster. Annoyingly, decoding strings with escapes is 7% slower.

Use private unicode writer for json

9ea12ad

nineteendo added 5 commits May 10, 2025 17:00

Part 2

46df04f

Restore fast path for integers

ab1aa42

Reduce diff

d18c455

Include necessary headers

51c760f

Use PyUnicodeWriter_WriteRepr

72ae3d0

nineteendo marked this pull request as ready for review May 11, 2025 15:38

bedevere-app bot added the awaiting review label May 11, 2025

ZeroIntensity requested changes May 11, 2025

bedevere-app bot added awaiting core review and removed awaiting review labels May 11, 2025

vstinner reviewed May 11, 2025

Reduce diff

2a6ec43

nineteendo marked this pull request as draft May 13, 2025 07:31

bedevere-app bot removed the awaiting core review label May 13, 2025

nineteendo marked this pull request as ready for review May 13, 2025 09:13

bedevere-app bot added the awaiting review label May 13, 2025

vstinner changed the title ~~Use private unicode writer for json~~ gh-133968: Use private unicode writer for json May 13, 2025

bedevere-app bot mentioned this pull request May 13, 2025

Using the public PyUnicodeWriter C API made the json module slower #133968

Open

vstinner mentioned this pull request May 13, 2025

gh-133968: Add fast path to PyUnicodeWriter_WriteStr() #133969

Merged

nineteendo added 2 commits May 13, 2025 16:03

Merge branch 'main' into json-private-unicode-writer

49a92f3

Reduce diff

822ea86

vstinner mentioned this pull request May 13, 2025

gh-133968: Add PyUnicodeWriter_WriteASCII() function #133973

Merged

Avoid heap allocation

01c45a9

Merge branch 'main' into json-private-unicode-writer

566637c
