BUG: Address interaction between SME and FPSR #29223


Merged

Conversation

Developer-Ecosystem-Engineering
Contributor

This is intended to resolve #28687

The root cause is an interaction between Arm Scalable Matrix Extension (SME) and the floating point status register (FPSR).

As noted in Arm docs for FPSR, "On entry to or exit from Streaming SVE mode, FPSR.{IOC, DZC, OFC, UFC, IXC, IDC, QC} are set to 1 and the remaining bits are set to 0". This means that floating point status flags are all raised when SME is used, regardless of values or operations performed.
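The quoted rule can be illustrated with a small model. The sketch below is not NumPy or Accelerate code: it treats the FPSR as a plain 32-bit value, uses the architectural bit positions of the named flags, and the function names are hypothetical, chosen only for illustration.

```c
#include <stdint.h>

/* Bit positions of the cumulative status flags in the AArch64 FPSR. */
#define FPSR_IOC (UINT32_C(1) << 0)   /* invalid operation */
#define FPSR_DZC (UINT32_C(1) << 1)   /* divide by zero */
#define FPSR_OFC (UINT32_C(1) << 2)   /* overflow */
#define FPSR_UFC (UINT32_C(1) << 3)   /* underflow */
#define FPSR_IXC (UINT32_C(1) << 4)   /* inexact */
#define FPSR_IDC (UINT32_C(1) << 7)   /* input denormal */
#define FPSR_QC  (UINT32_C(1) << 27)  /* cumulative saturation */

#define FPSR_ALL_STATUS \
    (FPSR_IOC | FPSR_DZC | FPSR_OFC | FPSR_UFC | FPSR_IXC | FPSR_IDC | FPSR_QC)

/* Toy model (hypothetical name): the FPSR value after entering or leaving
 * Streaming SVE mode. The previous contents are discarded: the listed
 * flags become 1 and every other bit becomes 0. */
static uint32_t fpsr_after_streaming_mode_change(uint32_t fpsr_before) {
    (void)fpsr_before;
    return FPSR_ALL_STATUS;
}

/* What a caller inspecting the flags afterwards would conclude. */
static int reports_divide_by_zero(uint32_t fpsr) {
    return (fpsr & FPSR_DZC) != 0;
}
```

In this model a caller that starts with a clean FPSR and checks it after an SME-backed routine sees divide-by-zero, overflow, and invalid all set, even though no such operation took place. That matches the spurious "divide by zero encountered in matmul" warnings from the linked issue.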

These are manifesting now because Apple Silicon M4 supports SME and macOS 15.4 enables SME codepaths for Accelerate BLAS / LAPACK. However, SME / FPSR behavior is not specific to Apple Silicon M4 and will occur on non-Apple chips using SME as well.

Changes add compile and runtime checks to determine whether BLAS / LAPACK might use SME (macOS / Accelerate only at the moment). If so, special handling of floating-point error (FPE) is added, which includes:

  • Clearing FPE after some BLAS calls

  • Short-circuiting FPE read after some BLAS calls

All tests pass, and performance is similar.

Another approach would have been to wrap all BLAS / LAPACK calls with save / restore FPE. However, it added notable overhead for the inner loops that utilize BLAS / LAPACK. Some benchmarks were 8x slower with that approach.

@Developer-Ecosystem-Engineering
Contributor Author

Developer-Ecosystem-Engineering commented Jun 17, 2025

31s to failure! Is that a record to beat or avoid!? Taking a look!

Address the linker & linter failures
@charris
Member

charris commented Jun 18, 2025

The PyPy failure can be ignored.

@charris charris added the 09 - Backport-Candidate label (PRs tagged should be backported) Jun 18, 2025
@charris
Member

charris commented Jun 18, 2025

I am curious about the reasoning involved in having SME behave like that. I assume there was a good reason ...

@mattip
Member

mattip commented Jun 18, 2025

> This means that floating point status flags are all raised when SME is used, regardless of values or operations performed

I wonder why we do not see that on other places we use SME. Are we not testing SME or do Highway/intrinsics take care of this for us?

@seberg seberg added this to the 2.3.1 release milestone Jun 18, 2025
@Mousius
Member

Mousius commented Jun 18, 2025

> > This means that floating point status flags are all raised when SME is used, regardless of values or operations performed
>
> I wonder why we do not see that on other places we use SME. Are we not testing SME or do Highway/intrinsics take care of this for us?

I don't think we've enabled SME anywhere else yet. OpenBLAS doesn't enable it by default, and Highway only supports up to SVE currently.

@charris charris merged commit f15a116 into numpy:main Jun 18, 2025
73 of 74 checks passed
@charris
Member

charris commented Jun 18, 2025

Thanks @Developer-Ecosystem-Engineering .

Comment on lines +46 to +48
if(major >= 15 && minor >= 4){
ret = true;
}

@BertalanD BertalanD Jun 19, 2025


Suggested change
- if(major >= 15 && minor >= 4){
-     ret = true;
- }
+ if(major > 15 || (major == 15 && minor >= 4)){
+     ret = true;
+ }

As written, this comparison will return false for e.g. 26.0.
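To make the difference concrete, here is a minimal sketch of both checks as standalone functions. The function names are hypothetical and are not taken from the PR; only the two conditions come from the review comment.

```c
#include <stdbool.h>

/* Original condition: both components must clear their own threshold,
 * so a later major version with a small minor (e.g. 26.0) fails. */
static bool at_least_15_4_original(int major, int minor) {
    return major >= 15 && minor >= 4;
}

/* Suggested condition: any later major version passes outright, and the
 * minor component only matters when the major version is exactly 15. */
static bool at_least_15_4_suggested(int major, int minor) {
    return major > 15 || (major == 15 && minor >= 4);
}
```

Both functions agree for 15.x versions. They differ for versions such as 16.0 or 26.0, which the original condition rejects and the suggested one accepts.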

Copy link
Member


I fixed this in the backport.

@charris
Member

charris commented Jun 19, 2025

For next time, also use hard line breaks in the commit messages and don't depend on word wrap. The editor should take care of that.

charris pushed a commit to charris/numpy that referenced this pull request Jun 19, 2025
* BUG: Address interaction between SME and FPSR

This is intended to resolve numpy#28687

The root cause is an interaction between Arm Scalable Matrix Extension
(SME) and the floating point status register (FPSR).

 As noted in Arm docs for FPSR, "On entry to or exit from Streaming SVE
 mode, FPSR.{IOC, DZC, OFC, UFC, IXC, IDC, QC} are set to 1 and the
 remaining bits are set to 0".  This means that floating point status
 flags are all raised when SME is used, regardless of values or
 operations performed.

These are manifesting now because Apple Silicon M4 supports SME and
macOS 15.4 enables SME codepaths for Accelerate BLAS / LAPACK.  However,
SME / FPSR behavior is not specific to Apple Silicon M4 and will occur
on non-Apple chips using SME as well.

Changes add compile and runtime checks to determine whether BLAS /
LAPACK might use SME (macOS / Accelerate only at the moment).  If so,
special handling of floating-point error (FPE) is added, which includes:
- clearing FPE after some BLAS calls
- short-circuiting FPE read after some BLAS calls

All tests pass
Performance is similar

Another approach would have been to wrap all BLAS / LAPACK calls with
save / restore FPE.  However, it added a lot of overhead for the inner
loops that utilize BLAS / LAPACK.  Some benchmarks were 8x slower.

* add blas_supports_fpe and ifdef check

Address the linker & linter failures
@charris charris removed the 09 - Backport-Candidate label (PRs tagged should be backported) Jun 19, 2025
charris added a commit that referenced this pull request Jun 19, 2025
BUG: Address interaction between SME and FPSR (#29223)
Successfully merging this pull request may close these issues.

BUG: wrong errors e.g. "divide by zero encountered in matmul" on MacOS M4