Bug report
Bug description:
There seems to be a significant performance regression in tokenize.generate_tokens() between 3.11 and 3.12 when tokenizing a (very) large dict on a single line. I searched the existing issues but couldn't find anything about this.
To reproduce, rename the file largedict.py.txt to largedict.py, place it in the same directory as the script below, then run the script. That file comes from nedbat/coveragepy#1785.
import io, time, sys, tokenize
import largedict
text = largedict.d
readline = io.StringIO(text).readline
glob_start = start = time.time()
print(f"{sys.implementation.name} {sys.platform} {sys.version}")
for i, (ttype, ttext, (sline, scol), (_, ecol), _) in enumerate(tokenize.generate_tokens(readline)):
    if i % 500 == 0:
        print(i, ttype, ttext, sline, scol, time.time() - start)
        start = time.time()
    if i % 5000 == 0:
        print(time.time() - glob_start)
print(f"Time taken: {time.time() - glob_start}")
For Python 3.12, this results in:
cpython linux 3.12.3 (main, May 17 2024, 07:19:22) [GCC 11.4.0]
0 1 a_large_dict_literal 1 0 0.04641866683959961
0.046633005142211914
500 3 ':tombol_a_(golongan_darah):' 1 2675 9.689745903015137
1000 3 ':flagge_anguilla:' 1 5261 9.767053604125977
1500 3 ':флаг_Армения:' 1 7879 9.258271932601929
[...]
For Python 3.11, this results in:
cpython linux 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
0 1 a_large_dict_literal 1 0 0.013637304306030273
0.013663768768310547
500 3 ':tombol_a_(golongan_darah):' 1 2675 0.002939462661743164
1000 3 ':flagge_anguilla:' 1 5261 0.0028715133666992188
1500 3 ':флаг_Армения:' 1 7879 0.002806425094604492
[...]
352500 3 'pt' 1 2589077 0.003370046615600586
Time taken: 2.1244866847991943
That is, each batch of 500 tokens takes over 9 seconds to process in Python 3.12, while all 352500 tokens take a bit over 2 seconds in total in Python 3.11.
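To put the two sets of numbers on the same scale, here is a rough extrapolation in Python. It assumes the 3.12 per-batch time stays near the three values printed above for the whole file, which the truncated log does not confirm, so treat the result as an order-of-magnitude estimate only. The function name extrapolate_total is made up for this sketch.

```python
# Timings copied from the logs above.
BATCH_SIZE = 500
TOTAL_TOKENS = 352_500
PY312_BATCH_SECONDS = [9.689745903015137, 9.767053604125977, 9.258271932601929]
PY311_TOTAL_SECONDS = 2.1244866847991943


def extrapolate_total(batch_seconds, total_tokens, batch_size):
    """Estimate total run time, ASSUMING every batch costs the mean of the samples."""
    mean_batch = sum(batch_seconds) / len(batch_seconds)
    return mean_batch * (total_tokens / batch_size)


estimated_312 = extrapolate_total(PY312_BATCH_SECONDS, TOTAL_TOKENS, BATCH_SIZE)
slowdown = estimated_312 / PY311_TOTAL_SECONDS

print(f"Estimated 3.12 total: {estimated_312:.0f} s ({estimated_312 / 3600:.1f} h)")
print(f"Estimated slowdown vs 3.11: about {slowdown:.0f}x")
```

If the per-batch cost changes over the course of the line, the real total will differ from this estimate.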
I can reproduce this on Linux (WSL) and Windows. It also seems to affect 3.13.
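For anyone without largedict.py at hand, the following is a minimal self-contained sketch that builds a synthetic one-line dict literal and times tokenize.generate_tokens() on it. The generated data is hypothetical: the non-ASCII keys only loosely mimic the original file, and I have not verified that this synthetic input triggers the slowdown to the same degree. The helper names make_one_line_dict and time_tokenize are made up for this sketch.

```python
import io
import time
import tokenize


def make_one_line_dict(n_entries: int) -> str:
    """Return source code for a dict literal with n_entries items on ONE line."""
    # Hypothetical data: non-ASCII string keys loosely imitating the original file.
    items = ", ".join(
        f"':флаг_{i}:': {{'en': 'flag_{i}'}}" for i in range(n_entries)
    )
    return f"a_large_dict_literal = {{{items}}}\n"


def time_tokenize(source: str) -> tuple[int, float]:
    """Tokenize source and return (token count, elapsed seconds)."""
    readline = io.StringIO(source).readline
    start = time.perf_counter()
    count = sum(1 for _ in tokenize.generate_tokens(readline))
    return count, time.perf_counter() - start


if __name__ == "__main__":
    # Sizes are kept small so the script finishes quickly even on a slow version.
    for n in (250, 500, 1000):
        count, elapsed = time_tokenize(make_one_line_dict(n))
        print(f"{n} entries: {count} tokens in {elapsed:.3f}s")
```

Comparing how the elapsed time grows as the number of entries doubles, on 3.11 versus 3.12, should show whether the cost per token depends on the length of the line.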
CPython versions tested on:
3.9, 3.10, 3.11, 3.12
Operating systems tested on:
Linux, Windows