-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Add Neon implementation of minmax #5963
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Implement the namespace _Sorting algorithms using Neon, and enable _VECTORIZED_MINMAX on ARM64 targets.
Add and enable a Neon implementation of minmax on ARM64 platforms. The existing code is refactored to support an unrolled version for sufficiently large inputs.
3119a3a to
c92dcef
Compare
| _INLINE_VAR constexpr bool _Is_64bit_int_on_arm64_v = false; | ||
| #endif // ^^^ defined(_M_ARM64) ^^^ | ||
|
|
||
| template <class _Iter, class _Pr, bool _Support64BitIntOnArm = false> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer that we expose this quirk less. Instead of adding template parameter, you could do one of this:
- Just repeat the
_Is_min_max_iterators_safe<_Iter> && _Is_predicate_less<_Iter, _Pr>formula in_Is_min_max_value_optimization_safe - Extract common base for
_Is_min_max_optimization_safeand_Is_min_max_value_optimization_safe
| return vreinterpretq_s64_u64(vcgtq_u64(vreinterpretq_u64_s64(_First), vreinterpretq_u64_s64(_Second))); | ||
| } | ||
|
|
||
| static _Vec_t _Min(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const _Mask .
I observe it is interesting that both ISA have same limitation of having usual vector max for 32-bit elements, but having to blend for 64bit elements.
| return _Min(_First, _Second, _Cmp_gt_u(_First, _Second)); | ||
| } | ||
|
|
||
| static _Vec_t _Max(const _Vec_t _First, const _Vec_t _Second, _Vec_t _Mask) noexcept { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto const _Mask
| _Advance_bytes(_First, sizeof(_Ty)); | ||
| } | ||
|
|
||
| #pragma loop(no_vector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a bug fix, or just a performance thing?
In either cases may worth commenting
This is stacked on #5949.
Performance numbers: