gh-133968: Use private unicode writer for json #133832

Open · wants to merge 10 commits into main
Conversation

@nineteendo (Contributor) commented May 10, 2025:

pyperformance (with --enable-optimizations and --with-lto)

main.json
=========

Performance version: 1.11.0
Python version: 3.15.0a0 (64-bit) revision fe9f6e829a
Report on macOS-13.7.6-x86_64-i386-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2025-05-15 08:56:50.561560
End date: 2025-05-15 08:57:59.544685

feature.json
============

Performance version: 1.11.0
Python version: 3.15.0a0 (64-bit) revision 822ea86260
Report on macOS-13.7.6-x86_64-i386-64bit-Mach-O
Number of logical CPUs: 8
Start date: 2025-05-15 10:17:44.676609
End date: 2025-05-15 10:18:53.108198

### json_dumps ###
Mean +- std dev: 12.3 ms +- 0.2 ms -> 12.8 ms +- 1.4 ms: 1.04x slower
Significant (t=-3.98)

### json_loads ###
Mean +- std dev: 32.9 us +- 2.0 us -> 30.5 us +- 0.6 us: 1.08x faster
Significant (t=12.39)

jsonyx-performance-tests (with --enable-optimizations and --with-lto)

| encode | main | feature | improvement |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1815.99 μs | 1281.81 μs | 1.42x faster |

| decode | main | feature | improvement |
| --- | --- | --- | --- |
| Dict with 65,536 booleans | 12384.74 μs | 12252.54 μs | 1.01x faster |
| List of 65,536 ASCII strings | 7688.18 μs | 7393.09 μs | 1.04x faster |
| List of 65,536 strings | 169289.79 μs | 185353.17 μs | 1.09x slower |

@ZeroIntensity (Member):

What's the point? This just adds more maintenance if we make changes to how PyUnicodeWriter works.

@nineteendo (Contributor, Author):

The public writer functions might have caused a performance regression compared to 3.13: faster-cpython/ideas#726
I'm still benchmarking, but wanted to run the tests already.

@nineteendo (Contributor, Author):

cc @vstinner, @mdboom

@nineteendo marked this pull request as ready for review May 11, 2025 15:38
@ZeroIntensity (Member) left a comment:

I don't think this is a good idea.

  • json optimizations have been rejected in the past; use ujson or something similar if performance is critical.
  • If we change how PyUnicodeWriter works, this adds more maintenance, especially if we use it as precedent for doing the same elsewhere.
  • We should aim to optimize PyUnicodeWriter as a broader change, not speed up each individual case by using the private API.

@vstinner (Member) left a comment:

Would it be possible to only replace PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString()? Do you get similar performance in this case?
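
For context, a minimal sketch of the substitution being discussed, for emitting a known-ASCII token such as "true" (the error handling and variables are illustrative, not the actual Modules/_json.c code; note that the private function takes the private _PyUnicodeWriter struct rather than the public opaque handle):

/* Public API: the input buffer is decoded as UTF-8. */
if (PyUnicodeWriter_WriteUTF8(writer, "true", 4) < 0) {
    goto error;
}

/* Private API: assumes pure-ASCII input and skips UTF-8 decoding;
 * it produces an invalid string if given non-ASCII bytes. */
if (_PyUnicodeWriter_WriteASCIIString(&private_writer, "true", 4) < 0) {
    goto error;
}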

@vstinner (Member):

See also #133186.

@ZeroIntensity (Member):

If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

@vstinner (Member):

> If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

I chose not to expose it since it generates an invalid string if the input string contains non-ASCII characters. But yeah, maybe we should expose it. For best performance, the function only validates the input string in debug mode.
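
To illustrate the trade-off, a rough sketch of a debug-only check along these lines (simplified and assumed, not the actual implementation):

/* Release builds trust the caller; only debug builds pay for
 * verifying that every byte is really ASCII. */
#ifndef NDEBUG
for (Py_ssize_t i = 0; i < len; i++) {
    assert((unsigned char)str[i] < 128);
}
#endif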

@nineteendo (Contributor, Author) commented May 11, 2025:

> Would it be possible to only replace PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString()? Do you get similar performance in this case?

Maybe, but my current benchmark has too much overhead to measure this accurately. I'll have to rewrite it first.

I hope we can figure out how to get the performance of the public API very close to the private one, such that everyone feels comfortable using it.

@nineteendo (Contributor, Author):

I updated the benchmark, but I don't understand why:

  • writing integers is now 20% faster
  • reading and writing unicode strings is now 2-5% slower (shouldn't be caused by noise)

Does this have something to do with the overallocate parameter?

@vstinner (Member):

> Does this have something to do with the overallocate parameter?

The private API doesn't enable overallocation by default.
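
For reference, a sketch of the difference (the overallocate field and _PyUnicodeWriter_Init() come from CPython's internal API; usage simplified):

/* Private writer: overallocation is opt-in via a struct field. */
_PyUnicodeWriter writer;
_PyUnicodeWriter_Init(&writer);
writer.overallocate = 1;  /* grow the buffer ahead of need while appending */

/* Public writer: PyUnicodeWriter_Create() enables overallocation
 * internally, so there is no knob to set. */
PyUnicodeWriter *pub = PyUnicodeWriter_Create(0);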

@vstinner (Member):

I replaced PyUnicodeWriter_WriteUTF8() with _PyUnicodeWriter_WriteASCIIString() in Modules/_json.c and ran a benchmark:

| Benchmark | ref | write_ascii |
| --- | --- | --- |
| encode 100 booleans | 9.54 us | 8.83 us: 1.08x faster |
| encode 1000 booleans | 60.8 us | 53.1 us: 1.15x faster |
| encode escaped string len=896 | 4.11 us | 4.10 us: 1.00x faster |
| encode 10000 booleans | 569 us | 487 us: 1.17x faster |
| encode 10000 integers | 1.03 ms | 1.03 ms: 1.00x slower |
| encode 10000 floats | 2.11 ms | 2.13 ms: 1.01x slower |
| Geometric mean | (ref) | 1.02x faster |

Benchmark hidden because not significant (15): encode 100 integers, encode 100 floats, encode 100 "ascii" strings, encode ascii string len=100, encode escaped string len=128, encode Unicode string len=100, encode 1000 integers, encode 1000 floats, encode 1000 "ascii" strings, encode ascii string len=1000, encode Unicode string len=1000, encode 10000 "ascii" strings, encode ascii string len=10000, encode escaped string len=9984, encode Unicode string len=10000

I built Python with ./configure && make and used CPU Isolation on Linux.

Benchmark code:

import json
import pyperf

runner = pyperf.Runner()

for count in (100, 1_000, 10_000):
    # Containers of scalars: many short tokens stress per-write overhead.
    runner.bench_func(f'encode {count} booleans', json.dumps, [True, False] * (count // 2))
    runner.bench_func(f'encode {count} integers', json.dumps, list(range(count)))
    runner.bench_func(f'encode {count} floats', json.dumps, [1.0] * count)
    runner.bench_func(f'encode {count} "ascii" strings', json.dumps, ['ascii'] * count)

    # One long pure-ASCII string: no escaping required.
    text = 'ascii'
    text *= (count // len(text) or 1)
    runner.bench_func(f'encode ascii string len={len(text)}', json.dumps, text)

    # All 128 ASCII code points: the control characters force escaping.
    text = ''.join(chr(ch) for ch in range(128))
    text *= (count // len(text) or 1)
    runner.bench_func(f'encode escaped string len={len(text)}', json.dumps, text)

    # Non-ASCII string ('€'): cannot take an ASCII-only fast path.
    text = 'abcd€'
    text *= (count // len(text) or 1)
    runner.bench_func(f'encode Unicode string len={len(text)}', json.dumps, text)
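
For reproduction, the usual pyperf workflow applies (file names here are placeholders): run the script once per build, e.g. python bench.py -o ref.json and python bench.py -o change.json, then compare with pyperf compare_to ref.json change.json --table.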

@vstinner (Member):

I also ran my benchmark on this PR:

| Benchmark | ref | change |
| --- | --- | --- |
| encode 100 booleans | 9.53 us | 6.26 us: 1.52x faster |
| encode 100 integers | 13.8 us | 11.6 us: 1.20x faster |
| encode 100 floats | 24.8 us | 19.0 us: 1.30x faster |
| encode 100 "ascii" strings | 17.1 us | 12.4 us: 1.37x faster |
| encode ascii string len=100 | 902 ns | 877 ns: 1.03x faster |
| encode escaped string len=128 | 1.10 us | 1.07 us: 1.03x faster |
| encode Unicode string len=100 | 1.07 us | 1.04 us: 1.03x faster |
| encode 1000 booleans | 58.9 us | 29.6 us: 1.99x faster |
| encode 1000 integers | 103 us | 81.5 us: 1.26x faster |
| encode 1000 floats | 209 us | 152 us: 1.37x faster |
| encode 1000 "ascii" strings | 131 us | 86.9 us: 1.51x faster |
| encode ascii string len=1000 | 3.48 us | 3.46 us: 1.00x faster |
| encode escaped string len=896 | 4.12 us | 3.96 us: 1.04x faster |
| encode 10000 booleans | 546 us | 257 us: 2.12x faster |
| encode 10000 integers | 1.00 ms | 788 us: 1.27x faster |
| encode 10000 floats | 2.04 ms | 1.46 ms: 1.39x faster |
| encode 10000 "ascii" strings | 1.27 ms | 806 us: 1.57x faster |
| encode ascii string len=10000 | 28.4 us | 28.4 us: 1.00x slower |
| encode escaped string len=9984 | 38.5 us | 36.3 us: 1.06x faster |
| encode Unicode string len=10000 | 42.4 us | 43.2 us: 1.02x slower |
| Geometric mean | (ref) | 1.26x faster |

Benchmark hidden because not significant (1): encode Unicode string len=1000

@nineteendo (Contributor, Author):

> The private API doesn't enable overallocation by default.

Yeah, but both the old code and the public API do, so it's not that. (And you really don't want to turn it off)

| encode | overallocate | normal | slow down |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1222.61 μs | 10456.60 μs | 8.55x slower |
| List of 65,536 ints | 3174.27 μs | 12961.39 μs | 4.08x slower |
| Dict with 65,536 booleans | 10011.93 μs | 29166.14 μs | 2.91x slower |
| List of 65,536 ASCII strings | 13303.33 μs | 23714.00 μs | 1.78x slower |
| List of 65,536 floats | 37103.57 μs | 48716.57 μs | 1.31x slower |
| List of 65,536 strings | 91757.30 μs | 113949.94 μs | 1.24x slower |

| decode | overallocate | normal | slow down |
| --- | --- | --- | --- |
| Dict with 65,536 booleans | 12194.03 μs | 12098.16 μs | 1.01x faster |
| List of 65,536 ASCII strings | 7011.86 μs | 7101.61 μs | 1.01x slower |
| List of 65,536 strings | 36049.66 μs | 36412.55 μs | 1.01x slower |

@nineteendo (Contributor, Author) commented May 12, 2025:

This line is inefficient for exact string instances (Py_INCREF is enough):

PyObject *str = PyObject_Str(obj);
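
A minimal sketch of the suggested fast path (PyUnicode_CheckExact() and Py_NewRef() are the real C-API calls; the surrounding context is illustrative):

PyObject *str;
if (PyUnicode_CheckExact(obj)) {
    /* Exact str: PyObject_Str() would return obj itself anyway,
     * so just take a new reference. */
    str = Py_NewRef(obj);
}
else {
    str = PyObject_Str(obj);
    if (str == NULL) {
        goto error;
    }
}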

| encode | private | public | slow down |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1217.10 μs | 1854.86 μs | 1.52x slower |
| List of 65,536 ints | 3190.74 μs | 3701.18 μs | 1.16x slower |
| Dict with 65,536 booleans | 8783.92 μs | 11459.45 μs | 1.30x slower |
| List of 65,536 ASCII strings | 12502.92 μs | 14842.52 μs | 1.19x slower |
| List of 65,536 floats | 37008.47 μs | 39790.32 μs | 1.08x slower |
| List of 65,536 strings | 90841.32 μs | 94637.59 μs | 1.04x slower |

| decode | private | public | slow down |
| --- | --- | --- | --- |
| List of 65,536 ASCII strings | 7064.05 μs | 7186.32 μs | 1.02x slower |
| List of 65,536 strings | 36033.15 μs | 36904.06 μs | 1.02x slower |

@nineteendo (Contributor, Author) commented May 13, 2025:

Here's the comparison with a minimal PR:

| encode | full | minimal | improvement |
| --- | --- | --- | --- |
| List of 65,536 booleans | 1222.01 μs | 1201.54 μs | 1.02x faster |
| List of 65,536 ints | 3156.75 μs | 3047.81 μs | 1.04x faster |
| Dict with 65,536 booleans | 9442.35 μs | 8518.33 μs | 1.11x faster |
| List of 65,536 ASCII strings | 13405.99 μs | 12027.66 μs | 1.11x faster |
| List of 65,536 floats | 37037.70 μs | 36684.89 μs | 1.01x faster |
| List of 65,536 strings | 92007.41 μs | 87721.40 μs | 1.05x faster |

| decode | full | minimal | improvement |
| --- | --- | --- | --- |
| List of 65,536 ASCII strings | 7029.30 μs | 7431.43 μs | 1.06x slower |
| List of 65,536 strings | 36401.90 μs | 34232.03 μs | 1.06x faster |

@nineteendo marked this pull request as draft May 13, 2025 07:31
@nineteendo marked this pull request as ready for review May 13, 2025 09:13
@nineteendo (Contributor, Author):

@vstinner, could you add a fast path for exact string instances in PyUnicodeWriter_WriteStr()? Or can we merge this as is?

@vstinner changed the title from "Use private unicode writer for json" to "gh-133968: Use private unicode writer for json" May 13, 2025
@vstinner (Member):

I created issue #133968 to track this work.

> @vstinner, could you add a fast path for exact string instances in PyUnicodeWriter_WriteStr()?

I wrote #133969 to add a fast path.

@vstinner (Member):

> I wrote #133969 to add a fast path.

Merged. I confirmed with two benchmarks that this small optimization makes a big difference on some use cases such as encoding short strings in JSON.

@vstinner (Member):

@ZeroIntensity:

> If _PyUnicodeWriter_WriteASCIIString is significantly faster than PyUnicodeWriter_WriteUTF8, then we should expose it as a public API.

Ok, I created #133973 to add PyUnicodeWriter_WriteASCII().
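
For reference, the proposed public function would mirror the private one's contract; a hedged usage sketch:

/* The caller must guarantee the bytes are ASCII; as discussed above,
 * the input is only expected to be validated in debug builds. */
if (PyUnicodeWriter_WriteASCII(writer, "null", 4) < 0) {
    goto error;
}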

@nineteendo (Contributor, Author):

> Ok, I created #133973 to add PyUnicodeWriter_WriteASCII().

If that's merged, we would use this approach in 3.14, right?

@nineteendo (Contributor, Author):

@vstinner, it looks like the regression in json.loads() is caused by the heap allocation in PyUnicodeWriter_Create().
I've now delayed the allocation until it's necessary. Thoughts?
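
A minimal sketch of the delayed-allocation idea (simplified; the real change lives in this PR's Modules/_json.c):

PyUnicodeWriter *writer = NULL;
/* ... scanning that may finish without producing any output ... */
if (writer == NULL) {
    /* Allocate the writer only once there is something to write. */
    writer = PyUnicodeWriter_Create(0);
    if (writer == NULL) {
        return NULL;
    }
}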

@vstinner (Member):

> @vstinner, it looks like the regression in json.loads() is caused by the heap allocation in PyUnicodeWriter_Create().

Are you sure about that? The implementation uses a freelist which avoids the heap allocation in most cases.
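
For readers unfamiliar with the pattern: a freelist caches recently freed objects for reuse, so a create/discard pair usually costs no malloc/free. A generic sketch (not CPython's actual code; the Writer type is a stand-in):

#include <stdlib.h>

typedef struct { char buf[64]; } Writer;  /* stand-in for the real writer */

static Writer *writer_freelist = NULL;    /* one cached slot */

static Writer *writer_alloc(void)
{
    if (writer_freelist != NULL) {
        Writer *w = writer_freelist;      /* reuse the cached object */
        writer_freelist = NULL;
        return w;
    }
    return malloc(sizeof(Writer));        /* slow path: real allocation */
}

static void writer_free(Writer *w)
{
    if (writer_freelist == NULL) {
        writer_freelist = w;              /* park it for the next alloc */
    }
    else {
        free(w);
    }
}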

@vstinner (Member):

> I've now delayed the allocation until it's necessary. Thoughts?

Would you mind creating a separate PR just for that?

@ZeroIntensity (Member):

Are the benchmarks creating an unrealistic number of concurrent writers? That would starve the freelist and create some allocation overhead, but only on the benchmarks.

@nineteendo (Contributor, Author) commented May 16, 2025:

> Are you sure about that? The implementation uses a freelist which avoids the heap allocation in most cases.

You're right, it does seem to be using the only entry of the freelist (I disabled the malloc to check).
There might be some overhead compared to using the stack, though.
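
To make that last point concrete (declarations simplified from the two APIs):

/* Private API: the writer struct itself lives on the C stack. */
_PyUnicodeWriter w;
_PyUnicodeWriter_Init(&w);

/* Public API: the writer is an opaque heap object, normally served
 * from the freelist discussed above. */
PyUnicodeWriter *pw = PyUnicodeWriter_Create(0);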
