ENH: optimize timsort by using binary insertion sort #24843
base: main
Conversation
NumPy's timsort implementation currently uses a straight insertion sort for small runs, which performs O(N^2) comparisons in the worst case. This becomes a bottleneck when element comparisons are expensive, such as in string sorts and generic (object) sorts. This commit replaces the straight insertion sort with a binary insertion sort, which needs only O(N*log(N)) comparisons (data movement remains O(N^2)). The change also aligns with the intent already stated in the source code comments, which suggest using binary insertion sort to improve performance.
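For illustration, here is a minimal Python sketch of the idea (not NumPy's actual C implementation): each element's position is located by bisection, so the number of comparisons drops to O(N log N), while the element shifts remain O(N^2) in the worst case.

```python
import bisect

def binary_insertion_sort(a):
    """Sort list a in place using binary insertion sort."""
    for i in range(1, len(a)):
        x = a[i]
        # O(log i) comparisons to find the insertion point in the
        # already-sorted prefix a[:i]
        pos = bisect.bisect_right(a, x, 0, i)
        # Shifting elements right is still O(i) moves in the worst case
        a[pos + 1:i + 1] = a[pos:i]
        a[pos] = x
    return a

print(binary_insertion_sort([5, 2, 4, 6, 1, 3]))  # -> [1, 2, 3, 4, 5, 6]
```

This is why the win is largest when comparisons dominate the cost (strings, object arrays) and can vanish, or even reverse, for cheap scalar comparisons where the simpler inner loop of straight insertion sort is faster.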
fcb9730 to 441bdb4
Checking the benchmarks locally (x86_64 Linux, gnu, i9-13900K) with `asv continuous -e -b ".*Sort.*" main binary-insertion-sort`, I do see a few regressions on this branch mixed in with the speedups:
| Change | Before [8c129baf] <main> | After [441bdb4d] <binary-insertion-sort> | Ratio | Benchmark (Parameter) |
|----------|----------------------------|--------------------------------------------|---------|-----------------------------------------------------------------------------------|
| + | 41.7±2μs | 66.2±2μs | 1.59 | bench_function_base.Sort.time_argsort('heap', 'float16', ('reversed',)) |
| + | 43.7±2μs | 67.4±1μs | 1.54 | bench_function_base.Sort.time_argsort('heap', 'float16', ('uniform',)) |
| + | 3.27±0.02μs | 4.96±0.8μs | 1.51 | bench_function_base.Sort.time_sort('merge', 'uint32', ('reversed',)) |
| + | 54.7±2μs | 78.5±0.1μs | 1.43 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 10)) |
| + | 58.5±1μs | 78.8±0.2μs | 1.35 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 10)) |
| + | 47.6±0.2μs | 60.9±0.4μs | 1.28 | bench_function_base.Sort.time_argsort('merge', 'int16', ('reversed',)) |
| + | 4.58±0.2μs | 5.80±0.06μs | 1.27 | bench_function_base.Sort.time_argsort('merge', 'int64', ('uniform',)) |
| + | 606±30μs | 756±20μs | 1.25 | bench_function_base.Sort.time_argsort('heap', 'float16', ('sorted_block', 1000)) |
| + | 4.58±0.06μs | 5.71±0.06μs | 1.25 | bench_function_base.Sort.time_argsort('merge', 'int64', ('ordered',)) |
| + | 2.73±0.02μs | 3.38±0.4μs | 1.24 | bench_function_base.Sort.time_sort('merge', 'uint32', ('uniform',)) |
| + | 32.8±0.9μs | 40.3±0.1μs | 1.23 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100)) |
| + | 32.7±0.5μs | 39.8±0.07μs | 1.22 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100)) |
| + | 5.58±0.01μs | 6.80±0.03μs | 1.22 | bench_function_base.Sort.time_argsort('merge', 'int64', ('reversed',)) |
| + | 27.9±0.8μs | 33.8±0.05μs | 1.21 | bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 100)) |
| + | 4.16±0.03μs | 4.85±0.07μs | 1.17 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('uniform',)) |
| + | 35.2±0.4ms | 41.2±0.2ms | 1.17 | bench_function_base.Sort.time_sort_worst |
| + | 6.99±0.2μs | 8.13±0.2μs | 1.16 | bench_function_base.Sort.time_argsort('merge', 'float32', ('reversed',)) |
| + | 49.5±1μs | 56.7±0.7μs | 1.14 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 10)) |
| + | 4.28±0.08μs | 4.83±0.1μs | 1.13 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('ordered',)) |
| + | 5.19±0.02μs | 5.85±0.04μs | 1.13 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('reversed',)) |
| + | 710±30μs | 794±20μs | 1.12 | bench_function_base.Sort.time_argsort('heap', 'float16', ('sorted_block', 100)) |
| + | 30.3±0.02μs | 33.7±0.9μs | 1.12 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',)) |
| + | 4.20±0.1μs | 4.67±0.04μs | 1.11 | bench_function_base.Sort.time_argsort('merge', 'int16', ('uniform',)) |
| + | 6.55±0.1μs | 7.21±0.1μs | 1.1 | bench_function_base.Sort.time_argsort('merge', 'float32', ('uniform',)) |
| + | 4.35±0.05μs | 4.80±0.05μs | 1.1 | bench_function_base.Sort.time_argsort('merge', 'int32', ('uniform',)) |
| + | 26.0±0.03μs | 28.7±0.06μs | 1.1 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('uniform',)) |
| - | 292±7μs | 265±0.9μs | 0.91 | bench_function_base.Sort.time_sort('heap', 'int64', ('reversed',)) |
| - | 359±0.5μs | 321±1μs | 0.9 | bench_function_base.Sort.time_argsort('merge', 'int32', ('random',)) |
| - | 357±7μs | 320±8μs | 0.9 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 1000)) |
| - | 325±5μs | 290±0.6μs | 0.89 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('reversed',)) |
| - | 82.5±3μs | 73.6±0.5μs | 0.89 | bench_function_base.Sort.time_argsort('quick', 'float16', ('reversed',)) |
| - | 279±2μs | 247±0.5μs | 0.89 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10)) |
| - | 268±0.7μs | 237±0.7μs | 0.89 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100)) |
| - | 43.8±0.3μs | 39.1±1μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',)) |
| - | 419±4μs | 367±2μs | 0.88 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('sorted_block', 1000)) |
| - | 321±5μs | 280±7μs | 0.87 | bench_function_base.Sort.time_argsort('heap', 'int16', ('ordered',)) |
| - | 51.4±1μs | 44.1±0.05μs | 0.86 | bench_function_base.Sort.time_argsort('quick', 'int64', ('reversed',)) |
| - | 190±0.4μs | 163±1μs | 0.86 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000)) |
| - | 282±5μs | 243±1μs | 0.86 | bench_function_base.Sort.time_sort('heap', 'int64', ('ordered',)) |
| - | 15.6±0.7μs | 13.2±0.4μs | 0.85 | bench_function_base.Sort.time_sort('merge', 'int16', ('random',)) |
| - | 65.2±2μs | 55.5±0.2μs | 0.85 | bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 10)) |
| - | 82.6±0.6μs | 70.4±0.3μs | 0.85 | bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',)) |
| - | 132±3μs | 107±0.2μs | 0.81 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 10)) |
| - | 82.9±0.6μs | 66.7±0.07μs | 0.8 | bench_function_base.Sort.time_argsort('quick', 'float16', ('ordered',)) |
| - | 65.4±0.1μs | 52.5±1μs | 0.8 | bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 10)) |
| - | 84.5±0.1μs | 64.0±0.2μs | 0.76 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',)) |
| - | 93.9±0.4μs | 68.2±0.3μs | 0.73 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 100)) |
| - | 751±3μs | 506±10μs | 0.67 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',)) |
| - | 699±20μs | 432±1μs | 0.62 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10)) |
| - | 723±20μs | 451±10μs | 0.62 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100)) |
| - | 668±0.9μs | 407±0.5μs | 0.61 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000)) |
| - | 56.7±0.9μs | 33.7±0.9μs | 0.59 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 1000)) |
| - | 605±20μs | 335±0.2μs | 0.55 | bench_function_base.Sort.time_sort('heap', 'float16', ('ordered',)) |
| - | 83.4±2μs | 38.7±0.09μs | 0.46 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 1000)) |
| - | 71.4±2μs | 27.3±0.7μs | 0.38 | bench_function_base.Sort.time_sort('heap', 'float16', ('reversed',)) |
| - | 69.5±3μs | 26.3±0.4μs | 0.38 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',)) |
| - | 29.6±0.02μs | 7.85±0.06μs | 0.27 | bench_function_base.Sort.time_argsort('merge', 'float16', ('uniform',)) |
| - | 30.5±0.9μs | 7.81±0.05μs | 0.26 | bench_function_base.Sort.time_argsort('merge', 'float16', ('ordered',)) |
| - | 30.3±0.8μs | 7.83±0.06μs | 0.26 | bench_function_base.Sort.time_argsort('merge', 'float16', ('reversed',)) |
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
Thank you for running the tests and providing feedback. Based on the data, it appears we haven't achieved a significant speedup, so perhaps I should close this PR for now, reassess the changes, and do further testing. One thing puzzles me, though: I only modified the timsort code, yet as the data above shows, some benchmarks with `kind='heap'` or `kind='quick'` exhibit significant speed changes in both directions. I saw the same behavior during local testing, and even after multiple runs the results consistently showed these variations. Could they be attributed to hardware characteristics or OS scheduling causing discrepancies in the results?
Maybe wait for one of the more performance-oriented core devs to take a look before you close it. I'm pretty sure NumPy now has a performance team that meets every few weeks as well, if you're keen on that. I believe they focus a lot on SIMD stuff, but anyway, in case that is of interest to you.
Straight insertion sort is pretty good for merging small, almost ordered lists; in that case, the simplicity of the inner loop makes a difference. IIRC, the cutoff here is around 15-20 elements. I originally chose that number by timing how long sorting took for different values, and it could be that an adjustment is warranted these days. In any case, getting stable timings can be difficult, especially for small effects. You want to make sure that your OS isn't being power efficient and that there is not a lot of other stuff running. Cache usage can also make an enormous difference: operations can run 6x-10x faster if everything is in cache. When I wrote the original sorting code back around 2003, CPUs and Linux were a lot simpler.
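One rough way to re-examine that cutoff empirically is to time straight insertion sort on short slices of varying length, as in this hypothetical micro-benchmark sketch (the sizes and repetition count are illustrative, and this times Python-level code, not NumPy's C inner loops):

```python
import random
import timeit

def insertion_sort(a):
    """Straight insertion sort: simple inner loop, O(n^2) comparisons."""
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

# Time sorting shuffled runs of a few candidate cutoff sizes
random.seed(0)
for n in (8, 16, 32, 64):
    data = list(range(n))
    random.shuffle(data)
    t = timeit.timeit(lambda: insertion_sort(data[:]), number=10_000)
    print(f"n={n:3d}: {t:.4f}s for 10k runs")
```

Comparing these numbers against a binary-insertion variant at each size would show where the comparison savings start to outweigh the extra bookkeeping, though as noted above, power management and cache effects can easily swamp differences this small.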