Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Conversation

@hazzlim
Copy link
Contributor

@hazzlim hazzlim commented Dec 15, 2025

Add and enable a Neon implementation of minmax on ARM64 platforms. The
existing code is refactored to support an unrolled version for
sufficiently large inputs.

This is stacked on #5949.

Performance numbers:

Benchmark Speedup MSVC Speedup Clang
bm<uint8_t, Op::Min_val>/8021 26.191 1.073
bm<uint8_t, Op::Min_val>/63 2.833 0.931
bm<uint8_t, Op::Max_val>/8021 24.444 1.04
bm<uint8_t, Op::Max_val>/63 3.209 1.004
bm<uint8_t, Op::Both_val>/8021 66.888 28.622
bm<uint8_t, Op::Both_val>/63 2.219 2.946
bm<uint16_t, Op::Min_val>/8021 13.457 1.023
bm<uint16_t, Op::Min_val>/31 2.25 1.023
bm<uint16_t, Op::Max_val>/8021 13.457 1.023
bm<uint16_t, Op::Max_val>/31 2.588 1.023
bm<uint16_t, Op::Both_val>/8021 38.111 14.178
bm<uint16_t, Op::Both_val>/31 2.324 2.689
bm<uint32_t, Op::Min_val>/8021 23.837 7.554
bm<uint32_t, Op::Min_val>/15 3.896 3
bm<uint32_t, Op::Max_val>/8021 24.318 7.495
bm<uint32_t, Op::Max_val>/15 3.987 2.921
bm<uint32_t, Op::Both_val>/8021 12.5 8.182
bm<uint32_t, Op::Both_val>/15 2.338 2.857
bm<uint64_t, Op::Min_val>/8021 2.01 0.684
bm<uint64_t, Op::Min_val>/7 1.374 1.5
bm<uint64_t, Op::Max_val>/8021 2.115 0.7
bm<uint64_t, Op::Max_val>/7 1.495 1.5
bm<uint64_t, Op::Both_val>/8021 1.029 0.741
bm<uint64_t, Op::Both_val>/7 0.967 1.321
bm<int8_t, Op::Min_val>/8021 26.191 1.073
bm<int8_t, Op::Min_val>/63 3.103 1.111
bm<int8_t, Op::Max_val>/8021 26.111 1.087
bm<int8_t, Op::Max_val>/63 3.286 1.167
bm<int8_t, Op::Both_val>/8021 66.957 28
bm<int8_t, Op::Both_val>/63 1.87 2.574
bm<int16_t, Op::Min_val>/8021 12.866 0.99
bm<int16_t, Op::Min_val>/31 1.952 0.978
bm<int16_t, Op::Max_val>/8021 12.866 1.023
bm<int16_t, Op::Max_val>/31 1.933 0.996
bm<int16_t, Op::Both_val>/8021 28.667 14.178
bm<int16_t, Op::Both_val>/31 2.066 2.42
bm<int32_t, Op::Min_val>/8021 24.405 7.595
bm<int32_t, Op::Min_val>/15 3.983 2.921
bm<int32_t, Op::Max_val>/8021 24.405 7.667
bm<int32_t, Op::Max_val>/15 4.075 2.921
bm<int32_t, Op::Both_val>/8021 14.659 8.182
bm<int32_t, Op::Both_val>/15 2.442 2.921
bm<int64_t, Op::Min_val>/8021 2.16 0.735
bm<int64_t, Op::Min_val>/7 1.6 1.64
bm<int64_t, Op::Max_val>/8021 2.07 0.713
bm<int64_t, Op::Max_val>/7 1.495 1.605
bm<int64_t, Op::Both_val>/8021 1.23 0.761
bm<int64_t, Op::Both_val>/7 0.946 1.35
bm<float, Op::Min_val>/8021 8.928 4.167
bm<float, Op::Min_val>/15 1.971 1.338
bm<float, Op::Max_val>/8021 9.126 4.148
bm<float, Op::Max_val>/15 2.017 1.307
bm<float, Op::Both_val>/8021 4.767 3.855
bm<float, Op::Both_val>/15 0.954 0.933
bm<double, Op::Min_val>/8021 4.563 2.182
bm<double, Op::Min_val>/7 0.854 0.76
bm<double, Op::Max_val>/8021 4.362 2.121
bm<double, Op::Max_val>/7 0.937 0.767
bm<double, Op::Both_val>/8021 2.118 1.964
bm<double, Op::Both_val>/7 0.899 0.47

Add and enable a Neon implementation of minmax on ARM64 platforms. The
existing code is refactored to support an unrolled version for
sufficiently large inputs.
_INLINE_VAR constexpr bool _Is_64bit_int_on_arm64_v = false;
#endif // ^^^ defined(_M_ARM64) ^^^

template <class _Iter, class _Pr, bool _Support64BitIntOnArm = false>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that we expose this quirk less. Instead of adding template parameter, you could do one of this:

  • Just repeat the _Is_min_max_iterators_safe<_Iter> && _Is_predicate_less<_Iter, _Pr> formula in _Is_min_max_value_optimization_safe
  • Extract common base for _Is_min_max_optimization_safe and _Is_min_max_value_optimization_safe

return vreinterpretq_s64_u64(vcgtq_u64(vreinterpretq_u64_s64(_First), vreinterpretq_u64_s64(_Second)));
}

static _Vec_t _Min(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const _Mask .

I observe it is interesting that both ISA have same limitation of having usual vector max for 32-bit elements, but having to blend for 64bit elements.

return _Min(_First, _Second, _Cmp_gt_u(_First, _Second));
}

static _Vec_t _Max(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto const _Mask

_Advance_bytes(_First, sizeof(_Ty));
}

#pragma loop(no_vector)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a bug fix, or just a performance thing?

In either cases may worth commenting

@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture labels Dec 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ARM64 Related to the ARM64 architecture performance Must go faster

Projects

Status: Initial Review

Development

Successfully merging this pull request may close these issues.

3 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.