BUG: Address interaction between SME and FPSR #29223
This is intended to resolve numpy#28687

The root cause is an interaction between Arm Scalable Matrix Extension (SME) and the floating point status register (FPSR).

As noted in Arm docs for FPSR, "On entry to or exit from Streaming SVE mode, FPSR.{IOC, DZC, OFC, UFC, IXC, IDC, QC} are set to 1 and the remaining bits are set to 0". This means that floating point status flags are all raised when SME is used, regardless of values or operations performed.

These are manifesting now because Apple Silicon M4 supports SME and macOS 15.4 enables SME codepaths for Accelerate BLAS / LAPACK. However, SME / FPSR behavior is not specific to Apple Silicon M4 and will occur on non-Apple chips using SME as well.

Changes add compile and runtime checks to determine whether BLAS / LAPACK might use SME (macOS / Accelerate only at the moment). If so, special handling of floating-point error (FPE) is added, which includes:

- clearing FPE after some BLAS calls
- short-circuiting FPE read after some BLAS calls

All tests pass

Performance is similar

Another approach would have been to wrap all BLAS / LAPACK calls with save / restore FPE. However, it added a lot of overhead for the inner loops that utilize BLAS / LAPACK. Some benchmarks were 8x slower.
31s to failure! Is that a record to beat or avoid!? Taking a look!
Address the linker & linter failures
The PyPy failure can be ignored.
I am curious about the reasoning involved in having SME behave like that. I assume there was a good reason ...
I wonder why we do not see that in other places where we use SME. Are we not testing SME, or do Highway/intrinsics take care of this for us?
I don't think we've enabled SME anywhere else yet. OpenBLAS doesn't enable it by default, and Highway only supports up to SVE currently.
Thanks @Developer-Ecosystem-Engineering.
    if(major >= 15 && minor >= 4){
        ret = true;
    }
Suggested change:

    -if(major >= 15 && minor >= 4){
    -    ret = true;
    -}
    +if(major > 15 || (major == 15 && minor >= 4)){
    +    ret = true;
    +}
As written, this comparison will return false for e.g. 26.0.
I fixed this in the backport.
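For reference, the difference between the two comparisons in this thread can be checked with a small sketch. The function names below are hypothetical and exist only for illustration; only the two boolean expressions come from the diff.

```c
#include <stdbool.h>

/* Original check from the diff: each field is tested independently,
 * so a newer major version with a small minor (e.g. 26.0) is rejected. */
bool version_at_least_15_4_original(int major, int minor)
{
    return major >= 15 && minor >= 4;
}

/* Suggested check: (major, minor) is compared lexicographically
 * against (15, 4), so any major above 15 passes regardless of minor. */
bool version_at_least_15_4_suggested(int major, int minor)
{
    return major > 15 || (major == 15 && minor >= 4);
}
```

Both versions agree on 15.4 and on 15.3; they differ for inputs such as 26.0 or 16.1, where only the suggested form returns true.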
For next time, also use hard line breaks in the commit messages and don't depend on word wrap. The editor should take care of that.
* BUG: Address interaction between SME and FPSR
* add blas_supports_fpe and ifdef check

Address the linker & linter failures
BUG: Address interaction between SME and FPSR (#29223)
This is intended to resolve #28687
The root cause is an interaction between Arm Scalable Matrix Extension (SME) and the floating point status register (FPSR).
As noted in Arm docs for FPSR, "On entry to or exit from Streaming SVE mode, FPSR.{IOC, DZC, OFC, UFC, IXC, IDC, QC} are set to 1 and the remaining bits are set to 0". This means that floating point status flags are all raised when SME is used, regardless of values or operations performed.
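To make the quoted behavior concrete, here is a minimal sketch of the FPSR value that results from such a transition. The bit positions are taken from the Arm architecture's FPSR layout as best I recall it and should be checked against the Arm documentation; the macro and function names are hypothetical and not part of this change.

```c
#include <stdint.h>

/* Cumulative status bits of the AArch64 FPSR (positions are an
 * assumption to verify against the Arm Architecture Reference Manual). */
#define FPSR_IOC (UINT32_C(1) << 0)   /* invalid operation */
#define FPSR_DZC (UINT32_C(1) << 1)   /* divide by zero */
#define FPSR_OFC (UINT32_C(1) << 2)   /* overflow */
#define FPSR_UFC (UINT32_C(1) << 3)   /* underflow */
#define FPSR_IXC (UINT32_C(1) << 4)   /* inexact */
#define FPSR_IDC (UINT32_C(1) << 7)   /* input denormal */
#define FPSR_QC  (UINT32_C(1) << 27)  /* cumulative saturation */

/* Models the quoted rule: the listed flags are set to 1 and all
 * remaining bits are set to 0 on Streaming SVE mode entry or exit. */
uint32_t fpsr_after_streaming_mode_change(void)
{
    return FPSR_IOC | FPSR_DZC | FPSR_OFC | FPSR_UFC |
           FPSR_IXC | FPSR_IDC | FPSR_QC;
}

/* A caller that treats invalid, divide-by-zero, overflow, or underflow
 * as an error would see one after every transition. */
int fpsr_reports_error(uint32_t fpsr)
{
    return (fpsr & (FPSR_IOC | FPSR_DZC | FPSR_OFC | FPSR_UFC)) != 0;
}
```

Under these assumptions, any code that inspects the status flags after an SME-backed call observes every exception as raised, whatever the inputs were.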
These are manifesting now because Apple Silicon M4 supports SME and macOS 15.4 enables SME codepaths for Accelerate BLAS / LAPACK. However, SME / FPSR behavior is not specific to Apple Silicon M4 and will occur on non-Apple chips using SME as well.
Changes add compile and runtime checks to determine whether BLAS / LAPACK might use SME (macOS / Accelerate only at the moment). If so, special handling of floating-point error (FPE) is added, which includes:
- Clearing FPE after some BLAS calls
- Short-circuiting FPE read after some BLAS calls
All tests pass
Performance is similar
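The two kinds of special handling described above can be sketched as follows. This is a simplified model, not the code in this change: the floating-point status is a plain variable rather than the hardware register, and every name (`simulated_fp_status`, `fake_sme_blas_call`, `blas_may_use_sme`, and the two wrappers) is hypothetical. Real code would go through `<fenv.h>` or NumPy's own floating-point status helpers instead.

```c
#include <stdbool.h>

/* Stand-in for the hardware status flags (assumption: one bit per flag). */
#define ALL_FP_FLAGS 0x1Fu
unsigned simulated_fp_status = 0;

/* Hypothetical BLAS call on an SME-enabled backend: every flag ends up
 * raised, regardless of the values that were computed. */
void fake_sme_blas_call(void)
{
    simulated_fp_status |= ALL_FP_FLAGS;
}

/* Handling 1: clear the flags right after the BLAS call when the
 * backend might use SME, so spurious flags do not leak to later checks. */
void blas_call_then_clear(bool blas_may_use_sme)
{
    fake_sme_blas_call();
    if (blas_may_use_sme) {
        simulated_fp_status = 0;
    }
}

/* Handling 2: short-circuit the read. When the backend might use SME the
 * flags carry no information, so report "no error" without reading them. */
unsigned read_fp_status_after_blas(bool blas_may_use_sme)
{
    if (blas_may_use_sme) {
        return 0;
    }
    return simulated_fp_status;
}
```

In this sketch, either handling also discards any genuine flag raised inside the BLAS call, which is the cost of not being able to tell real flags from the ones set by the mode transition.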
Another approach would have been to wrap all BLAS / LAPACK calls with save / restore FPE. However, it added notable overhead for the inner loops that utilize BLAS / LAPACK. Some benchmarks were 8x slower with that approach.