Add Neon implementation of minmax #5963

hazzlim · Dec 15, 2025

Add and enable a Neon implementation of minmax on ARM64 platforms. The
existing code is refactored to support an unrolled version for
sufficiently large inputs.

This is stacked on #5949.

Performance numbers:

Benchmark	Speedup MSVC	Speedup Clang
bm<uint8_t, Op::Min_val>/8021	26.191	1.073
bm<uint8_t, Op::Min_val>/63	2.833	0.931
bm<uint8_t, Op::Max_val>/8021	24.444	1.04
bm<uint8_t, Op::Max_val>/63	3.209	1.004
bm<uint8_t, Op::Both_val>/8021	66.888	28.622
bm<uint8_t, Op::Both_val>/63	2.219	2.946
bm<uint16_t, Op::Min_val>/8021	13.457	1.023
bm<uint16_t, Op::Min_val>/31	2.25	1.023
bm<uint16_t, Op::Max_val>/8021	13.457	1.023
bm<uint16_t, Op::Max_val>/31	2.588	1.023
bm<uint16_t, Op::Both_val>/8021	38.111	14.178
bm<uint16_t, Op::Both_val>/31	2.324	2.689
bm<uint32_t, Op::Min_val>/8021	23.837	7.554
bm<uint32_t, Op::Min_val>/15	3.896	3
bm<uint32_t, Op::Max_val>/8021	24.318	7.495
bm<uint32_t, Op::Max_val>/15	3.987	2.921
bm<uint32_t, Op::Both_val>/8021	12.5	8.182
bm<uint32_t, Op::Both_val>/15	2.338	2.857
bm<uint64_t, Op::Min_val>/8021	2.01	0.684
bm<uint64_t, Op::Min_val>/7	1.374	1.5
bm<uint64_t, Op::Max_val>/8021	2.115	0.7
bm<uint64_t, Op::Max_val>/7	1.495	1.5
bm<uint64_t, Op::Both_val>/8021	1.029	0.741
bm<uint64_t, Op::Both_val>/7	0.967	1.321
bm<int8_t, Op::Min_val>/8021	26.191	1.073
bm<int8_t, Op::Min_val>/63	3.103	1.111
bm<int8_t, Op::Max_val>/8021	26.111	1.087
bm<int8_t, Op::Max_val>/63	3.286	1.167
bm<int8_t, Op::Both_val>/8021	66.957	28
bm<int8_t, Op::Both_val>/63	1.87	2.574
bm<int16_t, Op::Min_val>/8021	12.866	0.99
bm<int16_t, Op::Min_val>/31	1.952	0.978
bm<int16_t, Op::Max_val>/8021	12.866	1.023
bm<int16_t, Op::Max_val>/31	1.933	0.996
bm<int16_t, Op::Both_val>/8021	28.667	14.178
bm<int16_t, Op::Both_val>/31	2.066	2.42
bm<int32_t, Op::Min_val>/8021	24.405	7.595
bm<int32_t, Op::Min_val>/15	3.983	2.921
bm<int32_t, Op::Max_val>/8021	24.405	7.667
bm<int32_t, Op::Max_val>/15	4.075	2.921
bm<int32_t, Op::Both_val>/8021	14.659	8.182
bm<int32_t, Op::Both_val>/15	2.442	2.921
bm<int64_t, Op::Min_val>/8021	2.16	0.735
bm<int64_t, Op::Min_val>/7	1.6	1.64
bm<int64_t, Op::Max_val>/8021	2.07	0.713
bm<int64_t, Op::Max_val>/7	1.495	1.605
bm<int64_t, Op::Both_val>/8021	1.23	0.761
bm<int64_t, Op::Both_val>/7	0.946	1.35
bm<float, Op::Min_val>/8021	8.928	4.167
bm<float, Op::Min_val>/15	1.971	1.338
bm<float, Op::Max_val>/8021	9.126	4.148
bm<float, Op::Max_val>/15	2.017	1.307
bm<float, Op::Both_val>/8021	4.767	3.855
bm<float, Op::Both_val>/15	0.954	0.933
bm<double, Op::Min_val>/8021	4.563	2.182
bm<double, Op::Min_val>/7	0.854	0.76
bm<double, Op::Max_val>/8021	4.362	2.121
bm<double, Op::Max_val>/7	0.937	0.767
bm<double, Op::Both_val>/8021	2.118	1.964
bm<double, Op::Both_val>/7	0.899	0.47

Implement the namespace _Sorting algorithms using Neon, and enable _VECTORIZED_MINMAX on ARM64 targets.

Add and enable a Neon implementation of minmax on ARM64 platforms. The existing code is refactored to support an unrolled version for sufficiently large inputs.

AlexGuteniev · Dec 15, 2025

stl/inc/xutility

+_INLINE_VAR constexpr bool _Is_64bit_int_on_arm64_v = false;
+#endif // ^^^ defined(_M_ARM64) ^^^
+
+template <class _Iter, class _Pr, bool _Support64BitIntOnArm = false>


I'd prefer that we expose this quirk less. Instead of adding template parameter, you could do one of this:

Just repeat the _Is_min_max_iterators_safe<_Iter> && _Is_predicate_less<_Iter, _Pr> formula in _Is_min_max_value_optimization_safe

Extract common base for _Is_min_max_optimization_safe and _Is_min_max_value_optimization_safe

AlexGuteniev · Dec 15, 2025

stl/src/vector_algorithms.cpp

+                return vreinterpretq_s64_u64(vcgtq_u64(vreinterpretq_u64_s64(_First), vreinterpretq_u64_s64(_Second)));
+            }
+
+            static _Vec_t _Min(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept {


const _Mask .

I observe it is interesting that both ISA have same limitation of having usual vector max for 32-bit elements, but having to blend for 64bit elements.

AlexGuteniev · Dec 15, 2025

stl/src/vector_algorithms.cpp

+                return _Min(_First, _Second, _Cmp_gt_u(_First, _Second));
+            }
+
+            static _Vec_t _Max(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept {


Ditto const _Mask

AlexGuteniev · Dec 15, 2025

stl/src/vector_algorithms.cpp

                _Advance_bytes(_First, sizeof(_Ty));
            }

+#pragma loop(no_vector)


Is this a bug fix, or just a performance thing?

In either cases may worth commenting

hazzlim added 11 commits December 8, 2025 17:02

Enable vectorized minmax_element using Neon on ARM64

e9ad403

Implement the namespace _Sorting algorithms using Neon, and enable _VECTORIZED_MINMAX on ARM64 targets.

Roll into _Is_min_max_optimization_safe

2329cc4

Don't const-qualify unnamed parameter

58f953b

Don't define minmax,is_sorted_until on ARM64 for now

9b849a3

Remove _Traits_8_neon and don't define *_8 functions

b3cd60c

Use non-floating point types where necessary in _Traits_d_neon

05f1ee9

Unify _Get_v_pos interface

1f1ee32

Use _tzcnt_u32/_lzcnt_u32 on avx

3a0a371

Don't declare _8 functions

0b93055

Add missing const qualifiers

017caa1

Don't add dispatch for minmax,is_sorted_until for now

f995cbc

github-project-automation bot added this to STL Code Reviews Dec 15, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews Dec 15, 2025

Add Neon implementation of minmax

c92dcef

Add and enable a Neon implementation of minmax on ARM64 platforms. The existing code is refactored to support an unrolled version for sufficiently large inputs.

hazzlim force-pushed the minmax-val-pr branch from 3119a3a to c92dcef Compare December 15, 2025 12:24

AlexGuteniev reviewed Dec 15, 2025

View reviewed changes

StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture labels Dec 18, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add Neon implementation of minmax #5963

Add Neon implementation of minmax #5963

hazzlim commented Dec 15, 2025

Uh oh!

AlexGuteniev Dec 15, 2025

Uh oh!

AlexGuteniev Dec 15, 2025

Uh oh!

AlexGuteniev Dec 15, 2025

Uh oh!

AlexGuteniev Dec 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Search code, repositories, users, issues, pull requests...

Add Neon implementation of minmax #5963

Are you sure you want to change the base?

Add Neon implementation of minmax #5963

Conversation

hazzlim commented Dec 15, 2025

Uh oh!

AlexGuteniev Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

AlexGuteniev Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

AlexGuteniev Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

AlexGuteniev Dec 15, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants