Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

ENH: optimize timsort by using binary insertion sort #24843

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
Loading
from

Conversation

visitorckw
Copy link
Contributor

The current sorting algorithm in numpy's source code uses a standard insertion sort, which involves N^2 comparisons, leading to performance bottlenecks in cases where element comparisons are expensive, such as string sorts and generic sorting operations.

This commit replaces the standard insertion sort with a binary insertion sort, which only needs N*log(N) comparisons.

Additionally, this change aligns with the initial intention mentioned in the source code comments, which suggested the use of binary insertion sort to boost performance.

The current sorting algorithm in numpy's source code uses a standard
insertion sort, which involves N^2 comparisons, leading to performance
bottlenecks in cases where element comparisons are expensive, such as
string sorts and generic sorting operations.

This commit replaces the standard insertion sort with a binary
insertion sort, which only needs N*log(N) comparisons.

Additionally, this change aligns with the initial intention mentioned
in the source code comments, which suggested the use of binary
insertion sort to boost performance.
@visitorckw visitorckw force-pushed the binary-insertion-sort branch from fcb9730 to 441bdb4 Compare October 2, 2023 19:53
@visitorckw visitorckw changed the title EHN: optimize timsort by using binary insertion sort ENH: optimize timsort by using binary insertion sort Oct 2, 2023
Copy link
Contributor

@tylerjereddy tylerjereddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking the benchmarks locally (x86_64 Linux, gnu, i9-13900K) with asv continuous -e -b ".*Sort.*" main binary-insertion-sort I do see a few regressions on this branch mixed in with the speedups:

| Change   | Before [8c129baf] <main>   | After [441bdb4d] <binary-insertion-sort>   |   Ratio | Benchmark (Parameter)                                                             |
|----------|----------------------------|--------------------------------------------|---------|-----------------------------------------------------------------------------------|
| +        | 41.7±2μs                   | 66.2±2μs                                   |    1.59 | bench_function_base.Sort.time_argsort('heap', 'float16', ('reversed',))           |
| +        | 43.7±2μs                   | 67.4±1μs                                   |    1.54 | bench_function_base.Sort.time_argsort('heap', 'float16', ('uniform',))            |
| +        | 3.27±0.02μs                | 4.96±0.8μs                                 |    1.51 | bench_function_base.Sort.time_sort('merge', 'uint32', ('reversed',))              |
| +        | 54.7±2μs                   | 78.5±0.1μs                                 |    1.43 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 10))    |
| +        | 58.5±1μs                   | 78.8±0.2μs                                 |    1.35 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 10))     |
| +        | 47.6±0.2μs                 | 60.9±0.4μs                                 |    1.28 | bench_function_base.Sort.time_argsort('merge', 'int16', ('reversed',))            |
| +        | 4.58±0.2μs                 | 5.80±0.06μs                                |    1.27 | bench_function_base.Sort.time_argsort('merge', 'int64', ('uniform',))             |
| +        | 606±30μs                   | 756±20μs                                   |    1.25 | bench_function_base.Sort.time_argsort('heap', 'float16', ('sorted_block', 1000))  |
| +        | 4.58±0.06μs                | 5.71±0.06μs                                |    1.25 | bench_function_base.Sort.time_argsort('merge', 'int64', ('ordered',))             |
| +        | 2.73±0.02μs                | 3.38±0.4μs                                 |    1.24 | bench_function_base.Sort.time_sort('merge', 'uint32', ('uniform',))               |
| +        | 32.8±0.9μs                 | 40.3±0.1μs                                 |    1.23 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('sorted_block', 100))   |
| +        | 32.7±0.5μs                 | 39.8±0.07μs                                |    1.22 | bench_function_base.Sort.time_argsort('merge', 'int32', ('sorted_block', 100))    |
| +        | 5.58±0.01μs                | 6.80±0.03μs                                |    1.22 | bench_function_base.Sort.time_argsort('merge', 'int64', ('reversed',))            |
| +        | 27.9±0.8μs                 | 33.8±0.05μs                                |    1.21 | bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 100))       |
| +        | 4.16±0.03μs                | 4.85±0.07μs                                |    1.17 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('uniform',))            |
| +        | 35.2±0.4ms                 | 41.2±0.2ms                                 |    1.17 | bench_function_base.Sort.time_sort_worst                                          |
| +        | 6.99±0.2μs                 | 8.13±0.2μs                                 |    1.16 | bench_function_base.Sort.time_argsort('merge', 'float32', ('reversed',))          |
| +        | 49.5±1μs                   | 56.7±0.7μs                                 |    1.14 | bench_function_base.Sort.time_sort('merge', 'uint32', ('sorted_block', 10))       |
| +        | 4.28±0.08μs                | 4.83±0.1μs                                 |    1.13 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('ordered',))            |
| +        | 5.19±0.02μs                | 5.85±0.04μs                                |    1.13 | bench_function_base.Sort.time_argsort('merge', 'uint32', ('reversed',))           |
| +        | 710±30μs                   | 794±20μs                                   |    1.12 | bench_function_base.Sort.time_argsort('heap', 'float16', ('sorted_block', 100))   |
| +        | 30.3±0.02μs                | 33.7±0.9μs                                 |    1.12 | bench_function_base.Sort.time_sort('quick', 'float64', ('ordered',))              |
| +        | 4.20±0.1μs                 | 4.67±0.04μs                                |    1.11 | bench_function_base.Sort.time_argsort('merge', 'int16', ('uniform',))             |
| +        | 6.55±0.1μs                 | 7.21±0.1μs                                 |    1.1  | bench_function_base.Sort.time_argsort('merge', 'float32', ('uniform',))           |
| +        | 4.35±0.05μs                | 4.80±0.05μs                                |    1.1  | bench_function_base.Sort.time_argsort('merge', 'int32', ('uniform',))             |
| +        | 26.0±0.03μs                | 28.7±0.06μs                                |    1.1  | bench_function_base.Sort.time_argsort('quick', 'uint32', ('uniform',))            |
| -        | 292±7μs                    | 265±0.9μs                                  |    0.91 | bench_function_base.Sort.time_sort('heap', 'int64', ('reversed',))                |
| -        | 359±0.5μs                  | 321±1μs                                    |    0.9  | bench_function_base.Sort.time_argsort('merge', 'int32', ('random',))              |
| -        | 357±7μs                    | 320±8μs                                    |    0.9  | bench_function_base.Sort.time_sort('heap', 'int64', ('sorted_block', 1000))       |
| -        | 325±5μs                    | 290±0.6μs                                  |    0.89 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('reversed',))            |
| -        | 82.5±3μs                   | 73.6±0.5μs                                 |    0.89 | bench_function_base.Sort.time_argsort('quick', 'float16', ('reversed',))          |
| -        | 279±2μs                    | 247±0.5μs                                  |    0.89 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 10))     |
| -        | 268±0.7μs                  | 237±0.7μs                                  |    0.89 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 100))    |
| -        | 43.8±0.3μs                 | 39.1±1μs                                   |    0.89 | bench_function_base.Sort.time_sort('quick', 'int32', ('reversed',))               |
| -        | 419±4μs                    | 367±2μs                                    |    0.88 | bench_function_base.Sort.time_argsort('heap', 'uint32', ('sorted_block', 1000))   |
| -        | 321±5μs                    | 280±7μs                                    |    0.87 | bench_function_base.Sort.time_argsort('heap', 'int16', ('ordered',))              |
| -        | 51.4±1μs                   | 44.1±0.05μs                                |    0.86 | bench_function_base.Sort.time_argsort('quick', 'int64', ('reversed',))            |
| -        | 190±0.4μs                  | 163±1μs                                    |    0.86 | bench_function_base.Sort.time_argsort('quick', 'int64', ('sorted_block', 1000))   |
| -        | 282±5μs                    | 243±1μs                                    |    0.86 | bench_function_base.Sort.time_sort('heap', 'int64', ('ordered',))                 |
| -        | 15.6±0.7μs                 | 13.2±0.4μs                                 |    0.85 | bench_function_base.Sort.time_sort('merge', 'int16', ('random',))                 |
| -        | 65.2±2μs                   | 55.5±0.2μs                                 |    0.85 | bench_function_base.Sort.time_sort('merge', 'int32', ('sorted_block', 10))        |
| -        | 82.6±0.6μs                 | 70.4±0.3μs                                 |    0.85 | bench_function_base.Sort.time_sort('quick', 'float16', ('reversed',))             |
| -        | 132±3μs                    | 107±0.2μs                                  |    0.81 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 10))      |
| -        | 82.9±0.6μs                 | 66.7±0.07μs                                |    0.8  | bench_function_base.Sort.time_argsort('quick', 'float16', ('ordered',))           |
| -        | 65.4±0.1μs                 | 52.5±1μs                                   |    0.8  | bench_function_base.Sort.time_sort('merge', 'int64', ('sorted_block', 10))        |
| -        | 84.5±0.1μs                 | 64.0±0.2μs                                 |    0.76 | bench_function_base.Sort.time_sort('quick', 'float16', ('ordered',))              |
| -        | 93.9±0.4μs                 | 68.2±0.3μs                                 |    0.73 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 100))  |
| -        | 751±3μs                    | 506±10μs                                   |    0.67 | bench_function_base.Sort.time_sort('heap', 'float16', ('random',))                |
| -        | 699±20μs                   | 432±1μs                                    |    0.62 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 10))       |
| -        | 723±20μs                   | 451±10μs                                   |    0.62 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 100))      |
| -        | 668±0.9μs                  | 407±0.5μs                                  |    0.61 | bench_function_base.Sort.time_sort('heap', 'float16', ('sorted_block', 1000))     |
| -        | 56.7±0.9μs                 | 33.7±0.9μs                                 |    0.59 | bench_function_base.Sort.time_sort('merge', 'float16', ('sorted_block', 1000))    |
| -        | 605±20μs                   | 335±0.2μs                                  |    0.55 | bench_function_base.Sort.time_sort('heap', 'float16', ('ordered',))               |
| -        | 83.4±2μs                   | 38.7±0.09μs                                |    0.46 | bench_function_base.Sort.time_argsort('merge', 'float16', ('sorted_block', 1000)) |
| -        | 71.4±2μs                   | 27.3±0.7μs                                 |    0.38 | bench_function_base.Sort.time_sort('heap', 'float16', ('reversed',))              |
| -        | 69.5±3μs                   | 26.3±0.4μs                                 |    0.38 | bench_function_base.Sort.time_sort('heap', 'float16', ('uniform',))               |
| -        | 29.6±0.02μs                | 7.85±0.06μs                                |    0.27 | bench_function_base.Sort.time_argsort('merge', 'float16', ('uniform',))           |
| -        | 30.5±0.9μs                 | 7.81±0.05μs                                |    0.26 | bench_function_base.Sort.time_argsort('merge', 'float16', ('ordered',))           |
| -        | 30.3±0.8μs                 | 7.83±0.06μs                                |    0.26 | bench_function_base.Sort.time_argsort('merge', 'float16', ('reversed',))          |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@visitorckw
Copy link
Contributor Author

Thank you for conducting the tests and providing feedback. Based on the data, it appears that we haven't achieved a significant speedup.

Perhaps I should consider closing this PR for now, reassessing the changes, and conducting further testing.

Furthermore, there's something that's puzzling me. I've only made modifications to the timsort code, but as indicated in the data above, some tests with 'kind='heap'' or 'quick'' exhibit significant speed fluctuations, either faster or slower. I encountered similar behavior during local testing, and even after multiple tests, the results consistently showed these speed variations.

Could these variations be attributed to hardware characteristics or OS scheduling, causing discrepancies in the results?

@tylerjereddy
Copy link
Contributor

Maybe wait for one of the more performance-oriented core devs to take a look before you close it. I'm pretty sure NumPy now has a performance team that meets every few weeks as well, if you're keen on that. I believe they focus a lot on SIMD stuff, but anyway in case that is of interest to you.

@charris
Copy link
Member

charris commented Oct 8, 2023

Straight insertion sort is pretty good for merging small, almost ordered lists. In that case, the simplicity of the inner loop makes a difference. IIRC, the cutoff here is around 15-20 elements. I originally choose that number by timing how long sorting took for different values. It could be that an adjustment could be made these days. In any case, getting stable timing can be difficult, especially for small effects. You want to make sure that your OS isn't being power efficient and that there is not a lot of other stuff running. Cache usage can also make an enormous difference, operations can run 6x-10x faster if everything is in cache. When I wrote the original sorting code back around 2003, CPUs and Linux were a lot simpler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Awaiting a code review
Development

Successfully merging this pull request may close these issues.

3 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.