gh-133968: Use private unicode writer for json #133832

Open · wants to merge 10 commits into main
Conversation

@nineteendo (Contributor) commented May 10, 2025:

pyperformance (with --enable-optimizations and --with-lto)

main.json
=========

Performance version: 1.11.0
Python version: 3.15.0a0 (64-bit) revision fe9f6e829a
Report on macOS-13.7.6-x86_64-i386-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2025-05-15 08:56:50.561560
End date: 2025-05-15 08:57:59.544685

feature.json
============

Performance version: 1.11.0
Python version: 3.15.0a0 (64-bit) revision 822ea86260
Report on macOS-13.7.6-x86_64-i386-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2025-05-15 10:17:44.676609
End date: 2025-05-15 10:18:53.108198

### json_dumps ###
Mean +- std dev: 12.3 ms +- 0.2 ms -> 12.8 ms +- 1.4 ms: 1.04x slower
Significant (t=-3.98)

### json_loads ###
Mean +- std dev: 32.9 us +- 2.0 us -> 30.5 us +- 0.6 us: 1.08x faster
Significant (t=12.39)

jsonyx-performance-tests (with --enable-optimizations and --with-lto)

| encode | main | feature | improvement |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1815.99 μs | 1281.81 μs | 1.42x faster |

| decode | main | feature | improvement |
| --- | --- | --- | --- |
| Dict with 65,536 booleans | 12384.74 μs | 12252.54 μs | 1.01x faster |
| List of 65,536 ASCII strings | 7688.18 μs | 7393.09 μs | 1.04x faster |
| List of 65,536 strings | 169289.79 μs | 185353.17 μs | 1.09x slower |

@ZeroIntensity (Member):

What's the point? This just adds more maintenance if we make changes to how PyUnicodeWriter works.

@nineteendo (Contributor, Author):

The public writer functions might have caused a performance regression compared to 3.13: faster-cpython/ideas#726
I'm still benchmarking, but wanted to run the tests already.

@nineteendo (Contributor, Author):

cc @vstinner, @mdboom

@nineteendo marked this pull request as ready for review May 11, 2025 15:38
@ZeroIntensity (Member) left a comment:

I don't think this is a good idea.

  • json optimizations have been rejected in the past; use ujson or something similar if performance is critical.
  • If we change how PyUnicodeWriter works, this adds more maintenance, especially if we use it as precedent for doing the same elsewhere.
  • We should aim to optimize PyUnicodeWriter as a broader change, not speed up each individual case by using the private API.

@vstinner (Member) left a comment:

Would it be possible to only replace PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString()? Do you get similar performance in this case?
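
For context, a minimal sketch of the substitution being discussed, for emitting a known-ASCII token such as "true" (the error handling and variables are illustrative, not the actual Modules/_json.c code; note that the private function takes the private _PyUnicodeWriter struct rather than the public opaque handle):

/* Public API: the input buffer is decoded as UTF-8. */
if (PyUnicodeWriter_WriteUTF8(writer, "true", 4) < 0) {
    goto error;
}

/* Private API: assumes pure-ASCII input and skips UTF-8 decoding;
 * it produces an invalid string if given non-ASCII bytes. */
if (_PyUnicodeWriter_WriteASCIIString(&private_writer, "true", 4) < 0) {
    goto error;
}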

@vstinner (Member):

See also #133186.

@ZeroIntensity (Member):

If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

@vstinner (Member):

> If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

I chose not to expose it since it generates an invalid string if the input string contains non-ASCII characters. But yeah, maybe we should expose it. For best performance, the function only validates the input string in debug mode.
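
To illustrate the trade-off, a rough sketch of a debug-only check along these lines (simplified and assumed, not the actual implementation):

/* Release builds trust the caller; only debug builds pay for
 * verifying that every byte is really ASCII. */
#ifndef NDEBUG
for (Py_ssize_t i = 0; i < len; i++) {
    assert((unsigned char)str[i] < 128);
}
#endif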

@nineteendo (Contributor, Author) commented May 11, 2025:

> Would it be possible to only replace PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString()? Do you get similar performance in this case?

Maybe, but my current benchmark has too much overhead to measure this accurately. I'll have to rewrite it first.

I hope we can figure out how to get the performance of the public API very close to the private one, such that everyone feels comfortable using it.

@nineteendo (Contributor, Author):

I updated the benchmark, but I don't understand why:

  • writing integers is now 20% faster
  • reading and writing unicode strings is now 2-5% slower (shouldn't be caused by noise)

Does this have something to do with the overallocate parameter?

@vstinner (Member):

> Does this have something to do with the overallocate parameter?

The private API doesn't enable overallocation by default.
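
For reference, a sketch of the difference (the overallocate field and _PyUnicodeWriter_Init() come from CPython's internal API; usage simplified):

/* Private writer: overallocation is opt-in via a struct field. */
_PyUnicodeWriter writer;
_PyUnicodeWriter_Init(&writer);
writer.overallocate = 1;  /* grow the buffer ahead of need while appending */

/* Public writer: PyUnicodeWriter_Create() enables overallocation
 * internally, so there is no knob to set. */
PyUnicodeWriter *pub = PyUnicodeWriter_Create(0);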

@vstinner (Member):

I replaced PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString() in Modules/_json.c and ran a benchmark:

| Benchmark | ref | write_ascii |
| --- | --- | --- |
| encode 100 booleans | 9.54 us | 8.83 us: 1.08x faster |
| encode 1000 booleans | 60.8 us | 53.1 us: 1.15x faster |
| encode escaped string len=896 | 4.11 us | 4.10 us: 1.00x faster |
| encode 10000 booleans | 569 us | 487 us: 1.17x faster |
| encode 10000 integers | 1.03 ms | 1.03 ms: 1.00x slower |
| encode 10000 floats | 2.11 ms | 2.13 ms: 1.01x slower |
| Geometric mean | (ref) | 1.02x faster |

Benchmark hidden because not significant (15): encode 100 integers, encode 100 floats, encode 100 "ascii" strings, encode ascii string len=100, encode escaped string len=128, encode Unicode string len=100, encode 1000 integers, encode 1000 floats, encode 1000 "ascii" strings, encode ascii string len=1000, encode Unicode string len=1000, encode 10000 "ascii" strings, encode ascii string len=10000, encode escaped string len=9984, encode Unicode string len=10000

I built Python with ./configure && make and used CPU Isolation on Linux.

Benchmark code:

import json
import pyperf

runner = pyperf.Runner()

for count in (100, 1_000, 10_000):
    # Containers of scalars: many short tokens stress per-write overhead.
    runner.bench_func(f'encode {count} booleans', json.dumps, [True, False] * (count // 2))
    runner.bench_func(f'encode {count} integers', json.dumps, list(range(count)))
    runner.bench_func(f'encode {count} floats', json.dumps, [1.0] * count)
    runner.bench_func(f'encode {count} "ascii" strings', json.dumps, ['ascii'] * count)

    # One long pure-ASCII string: no escaping required.
    text = 'ascii'
    text *= (count // len(text) or 1)
    runner.bench_func(f'encode ascii string len={len(text)}', json.dumps, text)

    # All 128 ASCII code points: the control characters force escaping.
    text = ''.join(chr(ch) for ch in range(128))
    text *= (count // len(text) or 1)
    runner.bench_func(f'encode escaped string len={len(text)}', json.dumps, text)

    # Non-ASCII string ('€'): cannot take an ASCII-only fast path.
    text = 'abcd€'
    text *= (count // len(text) or 1)
    runner.bench_func(f'encode Unicode string len={len(text)}', json.dumps, text)
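
For reproduction, the usual pyperf workflow applies (file names here are placeholders): run the script once per build, e.g. python bench.py -o ref.json and python bench.py -o change.json, then compare with pyperf compare_to ref.json change.json --table.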

@vstinner (Member):

I also ran my benchmark on this PR:

| Benchmark | ref | change |
| --- | --- | --- |
| encode 100 booleans | 9.53 us | 6.26 us: 1.52x faster |
| encode 100 integers | 13.8 us | 11.6 us: 1.20x faster |
| encode 100 floats | 24.8 us | 19.0 us: 1.30x faster |
| encode 100 "ascii" strings | 17.1 us | 12.4 us: 1.37x faster |
| encode ascii string len=100 | 902 ns | 877 ns: 1.03x faster |
| encode escaped string len=128 | 1.10 us | 1.07 us: 1.03x faster |
| encode Unicode string len=100 | 1.07 us | 1.04 us: 1.03x faster |
| encode 1000 booleans | 58.9 us | 29.6 us: 1.99x faster |
| encode 1000 integers | 103 us | 81.5 us: 1.26x faster |
| encode 1000 floats | 209 us | 152 us: 1.37x faster |
| encode 1000 "ascii" strings | 131 us | 86.9 us: 1.51x faster |
| encode ascii string len=1000 | 3.48 us | 3.46 us: 1.00x faster |
| encode escaped string len=896 | 4.12 us | 3.96 us: 1.04x faster |
| encode 10000 booleans | 546 us | 257 us: 2.12x faster |
| encode 10000 integers | 1.00 ms | 788 us: 1.27x faster |
| encode 10000 floats | 2.04 ms | 1.46 ms: 1.39x faster |
| encode 10000 "ascii" strings | 1.27 ms | 806 us: 1.57x faster |
| encode ascii string len=10000 | 28.4 us | 28.4 us: 1.00x slower |
| encode escaped string len=9984 | 38.5 us | 36.3 us: 1.06x faster |
| encode Unicode string len=10000 | 42.4 us | 43.2 us: 1.02x slower |
| Geometric mean | (ref) | 1.26x faster |

Benchmark hidden because not significant (1): encode Unicode string len=1000

@nineteendo (Contributor, Author):

> The private API doesn't enable overallocation by default.

Yeah, but both the old code and the public API do, so it's not that. (And you really don't want to turn it off)

| encode | overallocate | normal | slow down |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1222.61 μs | 10456.60 μs | 8.55x slower |
| List of 65,536 ints | 3174.27 μs | 12961.39 μs | 4.08x slower |
| Dict with 65,536 booleans | 10011.93 μs | 29166.14 μs | 2.91x slower |
| List of 65,536 ASCII strings | 13303.33 μs | 23714.00 μs | 1.78x slower |
| List of 65,536 floats | 37103.57 μs | 48716.57 μs | 1.31x slower |
| List of 65,536 strings | 91757.30 μs | 113949.94 μs | 1.24x slower |

| decode | overallocate | normal | slow down |
| --- | --- | --- | --- |
| Dict with 65,536 booleans | 12194.03 μs | 12098.16 μs | 1.01x faster |
| List of 65,536 ASCII strings | 7011.86 μs | 7101.61 μs | 1.01x slower |
| List of 65,536 strings | 36049.66 μs | 36412.55 μs | 1.01x slower |

@nineteendo (Contributor, Author) commented May 12, 2025:

This line is inefficient for exact string instances (Py_INCREF is enough):

PyObject *str = PyObject_Str(obj);
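
A minimal sketch of the suggested fast path (PyUnicode_CheckExact() and Py_NewRef() are the real C-API calls; the surrounding context is illustrative):

PyObject *str;
if (PyUnicode_CheckExact(obj)) {
    /* Exact str: PyObject_Str() would return obj itself anyway,
     * so just take a new reference. */
    str = Py_NewRef(obj);
}
else {
    str = PyObject_Str(obj);
    if (str == NULL) {
        goto error;
    }
}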

| encode | private | public | slow down |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1217.10 μs | 1854.86 μs | 1.52x slower |
| List of 65,536 ints | 3190.74 μs | 3701.18 μs | 1.16x slower |
| Dict with 65,536 booleans | 8783.92 μs | 11459.45 μs | 1.30x slower |
| List of 65,536 ASCII strings | 12502.92 μs | 14842.52 μs | 1.19x slower |
| List of 65,536 floats | 37008.47 μs | 39790.32 μs | 1.08x slower |
| List of 65,536 strings | 90841.32 μs | 94637.59 μs | 1.04x slower |

| decode | private | public | slow down |
| --- | --- | --- | --- |
| List of 65,536 ASCII strings | 7064.05 μs | 7186.32 μs | 1.02x slower |
| List of 65,536 strings | 36033.15 μs | 36904.06 μs | 1.02x slower |

@nineteendo (Contributor, Author) commented May 13, 2025:

Here's the comparison with a minimal PR:

| encode | full | minimal | improvement |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1222.01 μs | 1201.54 μs | 1.02x faster |
| List of 65,536 ints | 3156.75 μs | 3047.81 μs | 1.04x faster |
| Dict with 65,536 booleans | 9442.35 μs | 8518.33 μs | 1.11x faster |
| List of 65,536 ASCII strings | 13405.99 μs | 12027.66 μs | 1.11x faster |
| List of 65,536 floats | 37037.70 μs | 36684.89 μs | 1.01x faster |
| List of 65,536 strings | 92007.41 μs | 87721.40 μs | 1.05x faster |

| decode | full | minimal | improvement |
| --- | --- | --- | --- |
| List of 65,536 ASCII strings | 7029.30 μs | 7431.43 μs | 1.06x slower |
| List of 65,536 strings | 36401.90 μs | 34232.03 μs | 1.06x faster |

@nineteendo marked this pull request as draft May 13, 2025 07:31
@nineteendo marked this pull request as ready for review May 13, 2025 09:13
@nineteendo (Contributor, Author):

@vstinner, could you add a fast path for exact string instances in PyUnicodeWriter_WriteStr()? Or can we merge this as is?

@vstinner changed the title from "Use private unicode writer for json" to "gh-133968: Use private unicode writer for json" May 13, 2025
@vstinner (Member):

I created issue #133968 to track this work.

> @vstinner, could you add a fast path for exact string instances in PyUnicodeWriter_WriteStr()?

I wrote #133969 to add a fast path.

@vstinner (Member):

> I wrote #133969 to add a fast path.

Merged. I confirmed with two benchmarks that this small optimization makes a big difference on some use cases such as encoding short strings in JSON.

@vstinner (Member):

@ZeroIntensity:

> If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

Ok, I created #133973 to add PyUnicodeWriter_WriteASCII().
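
For reference, the proposed public function would mirror the private one's contract; a hedged usage sketch:

/* The caller must guarantee the bytes are ASCII; as discussed above,
 * the input is only expected to be validated in debug builds. */
if (PyUnicodeWriter_WriteASCII(writer, "null", 4) < 0) {
    goto error;
}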

@nineteendo (Contributor, Author):

> Ok, I created #133973 to add PyUnicodeWriter_WriteASCII().

If that's merged, we would use this approach in 3.14, right?

@nineteendo (Contributor, Author):

@vstinner, it looks like the regression in json.loads() is caused by the heap allocation in PyUnicodeWriter_Create().
I've now delayed the allocation until it's necessary. Thoughts?
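
A minimal sketch of the delayed-allocation idea (simplified; the real change lives in this PR's Modules/_json.c):

PyUnicodeWriter *writer = NULL;
/* ... scanning that may finish without producing any output ... */
if (writer == NULL) {
    /* Allocate the writer only once there is something to write. */
    writer = PyUnicodeWriter_Create(0);
    if (writer == NULL) {
        return NULL;
    }
}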

@vstinner (Member):

> @vstinner, it looks like the regression in json.loads() is caused by the heap allocation in PyUnicodeWriter_Create().

Are you sure about that? The implementation uses a freelist which avoids the heap allocation in most cases.
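
For readers unfamiliar with the pattern: a freelist caches recently freed objects for reuse, so a create/discard pair usually costs no malloc/free. A generic sketch (not CPython's actual code; the Writer type is a stand-in):

#include <stdlib.h>

typedef struct { char buf[64]; } Writer;  /* stand-in for the real writer */

static Writer *writer_freelist = NULL;    /* one cached slot */

static Writer *writer_alloc(void)
{
    if (writer_freelist != NULL) {
        Writer *w = writer_freelist;      /* reuse the cached object */
        writer_freelist = NULL;
        return w;
    }
    return malloc(sizeof(Writer));        /* slow path: real allocation */
}

static void writer_free(Writer *w)
{
    if (writer_freelist == NULL) {
        writer_freelist = w;              /* park it for the next alloc */
    }
    else {
        free(w);
    }
}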

@vstinner (Member):

> I've now delayed the allocation until it's necessary. Thoughts?

Would you mind creating a separate PR just for that?

@ZeroIntensity (Member):

Are the benchmarks creating an unrealistic number of concurrent writers? That would starve the freelist and create some allocation overhead, but only on the benchmarks.

@nineteendo (Contributor, Author) commented May 16, 2025:

> Are you sure about that? The implementation uses a freelist which avoids the heap allocation in most cases.

You're right, it does seem to be using the only entry of the freelist (I disabled the malloc to check).
There might be some overhead compared to using the stack, though.
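
To make that last point concrete (declarations simplified from the two APIs):

/* Private API: the writer struct itself lives on the C stack. */
_PyUnicodeWriter w;
_PyUnicodeWriter_Init(&w);

/* Public API: the writer is an opaque heap object, normally served
 * from the freelist discussed above. */
PyUnicodeWriter *pw = PyUnicodeWriter_Create(0);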
