ENH: optimize timsort by using binary insertion sort #24843
base: main
Conversation
NumPy's timsort implementation currently uses a straight insertion sort for small runs, which performs O(N^2) comparisons in the worst case. This becomes a bottleneck when element comparisons are expensive, such as in string sorts and generic (object) sorts. This commit replaces the straight insertion sort with a binary insertion sort, which needs only O(N*log(N)) comparisons (data movement remains O(N^2)). The change also aligns with the intent already stated in the source code comments, which suggest using binary insertion sort to improve performance.
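For illustration, here is a minimal Python sketch of the idea (not NumPy's actual C implementation): each element's position is located by bisection, so the number of comparisons drops to O(N log N), while the element shifts remain O(N^2) in the worst case.

```python
import bisect

def binary_insertion_sort(a):
    """Sort list a in place using binary insertion sort."""
    for i in range(1, len(a)):
        x = a[i]
        # O(log i) comparisons to find the insertion point in the
        # already-sorted prefix a[:i]
        pos = bisect.bisect_right(a, x, 0, i)
        # Shifting elements right is still O(i) moves in the worst case
        a[pos + 1:i + 1] = a[pos:i]
        a[pos] = x
    return a

print(binary_insertion_sort([5, 2, 4, 6, 1, 3]))  # -> [1, 2, 3, 4, 5, 6]
```

This is why the win is largest when comparisons dominate the cost (strings, object arrays) and can vanish, or even reverse, for cheap scalar comparisons where the simpler inner loop of straight insertion sort is faster.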
fcb9730 to 441bdb4
Checking the benchmarks locally (x86_64 Linux, gnu, i9-13900K) with `asv continuous -e -b ".*Sort.*" main binary-insertion-sort`, I do see a few regressions on this branch mixed in with the speedups:
| Change | Before [8c129baf] <main> | After [441bdb4d] <binary-insertion-sort> | Ratio | Benchmark (Parameter) |
|----------|----------------------------|--------------------------------------------|---------|-----------------------------------------------------------------------------------|
| + | 41.7±2μs | 66.2±2μs | 1.59 | bench_function_base.Sort.time_argsort('heap', 'float16', ('reversed',)) |
| + | 43.7±2μs | 67.4±1μs | 1.54 | bench_function_base.Sort.time_argsort('heap', 'float16', ('uniform',)) |
| + | 3.27±0.02μs | 4.96±0.8μs | 1.51 | bench_function_base.Sort.time_sort('merge', 'uint32', ('reversed',)) |
| + | 54.7±2μs | 78.5±0.1μs | 1.43 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 10)) |
| + | 58.5±1μs | 78.8±0.2μs | 1.35 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 10)) |
| + | 47.6±0.2μs | 60.9±0.4μs | 1.28 | bench_function_base.Sort.time_argsort('merge', 'int16', ('reversed',)) |
| + | 4.58±0.2μs | 5.80±0.06μs | 1.27 | bench_function_base.Sort.time_argsort('merge', 'int64', ('uniform',)) |
| + | 606±30μs | 756±20μs | 1.25 | bench_function_base.Sort.time_argsort('heap', 'float16', ('sorted_block', 1000)) |
| + | 4.58±0.06μs | 5.71±0.06μs | 1.25 | bench_function_base.Sort.time_argsort('merge', 'int64', ('ordered',)) |
| + | 2.73±0.02μs | 3.38±0.4μs | 1.24 | bench_function_base.Sort.time_sort('merge', 'uint32', ('uniform',)) |
| + | 32.8±0.9μs | 40.3±0.1μs | 1.23 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100)) |
| + | 32.7±0.5μs | 39.8±0.07μs | 1.22 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100)) |
| + | 5.58±0.01μs | 6.80±0.03μs | 1.22 | bench_function_base.Sort.time_argsort('merge', 'int64', ('reversed',)) |
| + | 27.9±0.8μs | 33.8±0.05μs | 1.21 | bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 100)) |
| + | 4.16±0.03μs | 4.85±0.07μs | 1.17 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('uniform',)) |
| + | 35.2±0.4ms | 41.2±0.2ms | 1.17 | bench_function_base.Sort.time_sort_worst |
| + | 6.99±0.2μs | 8.13±0.2μs | 1.16 | bench_function_base.Sort.time_argsort('merge', 'float32', ('reversed',)) |
| + | 49.5±1μs | 56.7±0.7μs | 1.14 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 10)) |
| + | 4.28±0.08μs | 4.83±0.1μs | 1.13 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('ordered',)) |
| + | 5.19±0.02μs | 5.85±0.04μs | 1.13 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('reversed',)) |
| + | 710±30μs | 794±20μs | 1.12 | bench_function_base.Sort.time_argsort('heap', 'float16', ('sorted_block', 100)) |
| + | 30.3±0.02μs | 33.7±0.9μs | 1.12 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',)) |
| + | 4.20±0.1μs | 4.67±0.04μs | 1.11 | bench_function_base.Sort.time_argsort('merge', 'int16', ('uniform',)) |
| + | 6.55±0.1μs | 7.21±0.1μs | 1.1 | bench_function_base.Sort.time_argsort('merge', 'float32', ('uniform',)) |
| + | 4.35±0.05μs | 4.80±0.05μs | 1.1 | bench_function_base.Sort.time_argsort('merge', 'int32', ('uniform',)) |
| + | 26.0±0.03μs | 28.7±0.06μs | 1.1 | bench_function_base.Sort.time_argsort('quick', 'uint32', ('uniform',)) |
| - | 292±7μs | 265±0.9μs | 0.91 | bench_function_base.Sort.time_sort('heap', 'int64', ('reversed',)) |
| - | 359±0.5μs | 321±1μs | 0.9 | bench_function_base.Sort.time_argsort('merge', 'int32', ('random',)) |
| - | 357±7μs | 320±8μs | 0.9 | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 1000)) |
| - | 325±5μs | 290±0.6μs | 0.89 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('reversed',)) |
| - | 82.5±3μs | 73.6±0.5μs | 0.89 | bench_function_base.Sort.time_argsort('quick', 'float16', ('reversed',)) |
| - | 279±2μs | 247±0.5μs | 0.89 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10)) |
| - | 268±0.7μs | 237±0.7μs | 0.89 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100)) |
| - | 43.8±0.3μs | 39.1±1μs | 0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',)) |
| - | 419±4μs | 367±2μs | 0.88 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('sorted_block', 1000)) |
| - | 321±5μs | 280±7μs | 0.87 | bench_function_base.Sort.time_argsort('heap', 'int16', ('ordered',)) |
| - | 51.4±1μs | 44.1±0.05μs | 0.86 | bench_function_base.Sort.time_argsort('quick', 'int64', ('reversed',)) |
| - | 190±0.4μs | 163±1μs | 0.86 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000)) |
| - | 282±5μs | 243±1μs | 0.86 | bench_function_base.Sort.time_sort('heap', 'int64', ('ordered',)) |
| - | 15.6±0.7μs | 13.2±0.4μs | 0.85 | bench_function_base.Sort.time_sort('merge', 'int16', ('random',)) |
| - | 65.2±2μs | 55.5±0.2μs | 0.85 | bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 10)) |
| - | 82.6±0.6μs | 70.4±0.3μs | 0.85 | bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',)) |
| - | 132±3μs | 107±0.2μs | 0.81 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 10)) |
| - | 82.9±0.6μs | 66.7±0.07μs | 0.8 | bench_function_base.Sort.time_argsort('quick', 'float16', ('ordered',)) |
| - | 65.4±0.1μs | 52.5±1μs | 0.8 | bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 10)) |
| - | 84.5±0.1μs | 64.0±0.2μs | 0.76 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',)) |
| - | 93.9±0.4μs | 68.2±0.3μs | 0.73 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 100)) |
| - | 751±3μs | 506±10μs | 0.67 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',)) |
| - | 699±20μs | 432±1μs | 0.62 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10)) |
| - | 723±20μs | 451±10μs | 0.62 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100)) |
| - | 668±0.9μs | 407±0.5μs | 0.61 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000)) |
| - | 56.7±0.9μs | 33.7±0.9μs | 0.59 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 1000)) |
| - | 605±20μs | 335±0.2μs | 0.55 | bench_function_base.Sort.time_sort('heap', 'float16', ('ordered',)) |
| - | 83.4±2μs | 38.7±0.09μs | 0.46 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 1000)) |
| - | 71.4±2μs | 27.3±0.7μs | 0.38 | bench_function_base.Sort.time_sort('heap', 'float16', ('reversed',)) |
| - | 69.5±3μs | 26.3±0.4μs | 0.38 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',)) |
| - | 29.6±0.02μs | 7.85±0.06μs | 0.27 | bench_function_base.Sort.time_argsort('merge', 'float16', ('uniform',)) |
| - | 30.5±0.9μs | 7.81±0.05μs | 0.26 | bench_function_base.Sort.time_argsort('merge', 'float16', ('ordered',)) |
| - | 30.3±0.8μs | 7.83±0.06μs | 0.26 | bench_function_base.Sort.time_argsort('merge', 'float16', ('reversed',)) |
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
Thank you for running the tests and providing feedback. Based on the data, it appears we haven't achieved a significant speedup, so perhaps I should close this PR for now, reassess the changes, and do further testing. One thing puzzles me, though: I only modified the timsort code, yet as the data above shows, some benchmarks with `kind='heap'` or `kind='quick'` exhibit significant speed changes in both directions. I saw the same behavior during local testing, and even after multiple runs the results consistently showed these variations. Could they be attributed to hardware characteristics or OS scheduling causing discrepancies in the results?
Maybe wait for one of the more performance-oriented core devs to take a look before you close it. I'm pretty sure NumPy now has a performance team that meets every few weeks as well, if you're keen on that. I believe they focus a lot on SIMD stuff, but anyway, in case that is of interest to you.
Straight insertion sort is pretty good for merging small, almost ordered lists; in that case, the simplicity of the inner loop makes a difference. IIRC, the cutoff here is around 15-20 elements. I originally chose that number by timing how long sorting took for different values, and it could be that an adjustment is warranted these days. In any case, getting stable timings can be difficult, especially for small effects. You want to make sure that your OS isn't being power efficient and that there is not a lot of other stuff running. Cache usage can also make an enormous difference: operations can run 6x-10x faster if everything is in cache. When I wrote the original sorting code back around 2003, CPUs and Linux were a lot simpler.
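One rough way to re-examine that cutoff empirically is to time straight insertion sort on short slices of varying length, as in this hypothetical micro-benchmark sketch (the sizes and repetition count are illustrative, and this times Python-level code, not NumPy's C inner loops):

```python
import random
import timeit

def insertion_sort(a):
    """Straight insertion sort: simple inner loop, O(n^2) comparisons."""
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a

# Time sorting shuffled runs of a few candidate cutoff sizes
random.seed(0)
for n in (8, 16, 32, 64):
    data = list(range(n))
    random.shuffle(data)
    t = timeit.timeit(lambda: insertion_sort(data[:]), number=10_000)
    print(f"n={n:3d}: {t:.4f}s for 10k runs")
```

Comparing these numbers against a binary-insertion variant at each size would show where the comparison savings start to outweigh the extra bookkeeping, though as noted above, power management and cache effects can easily swamp differences this small.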