[MPS] Regression from macOS 14.3 to 14.4 in PyTorch 2.2.0/2.2.1 #122016

Closed
@AuroraWright

Description


🐛 Describe the bug

I've been using a PyTorch model daily through transformers (https://github.com/kha-white/manga-ocr) on MPS. Everything was fine with the latest PyTorch and the latest transformers until the macOS Sonoma 14.4 update, after which it started crashing on startup (I was on 14.3 before, on an M1 Max Mac, for reference). Since CPU mode still worked fine, I also tried older versions of PyTorch and found that anything before 2.2.0 works.
Bisecting, I found this commit is the cause: 056d624. Building the latest main with that commit reverted makes it work again.

I apologize that I haven't been able to pinpoint this further or to write a proper standalone example (I don't really know what I'm doing here).
I've looked at what's happening from within transformers; I'm not sure whether any of this helps:

It seems that at https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L318 the first element of each row of input_ids differs depending on whether that commit is present:

e.g.
With the commit:
tensor([[3463342888, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
         2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
         1482, 5578, 78],
        [3463342888, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
         2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
         1482, 5578, 934],
        [3463342888, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
         2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
         1482, 5578, 935],
        [3463342888, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
         2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
         1482, 5578, 28]], device='mps:0')

Without the commit:
tensor([[ 2, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
1482, 5578, 78],
[ 2, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
1482, 5578, 934],
[ 2, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
1482, 5578, 935],
[ 2, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
1482, 5578, 21]], device='mps:0')

This difference is ultimately what causes the transformers crash.
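For what it's worth, this is roughly the kind of comparison I've been doing by hand. It's only a hypothetical sketch with made-up shapes and token values (I haven't confirmed this is the exact operation transformers performs), not a standalone repro:

```python
import torch

# Hypothetical sketch: a beam-search-style reorder (index the rows by beam_idx,
# then append the next tokens), done once on CPU and once on MPS, then compared.
# Shapes and values are made up; this is not a confirmed repro of the root cause.
assert torch.backends.mps.is_available()

torch.manual_seed(0)
input_ids = torch.randint(0, 6000, (4, 26))
beam_idx = torch.tensor([1, 0, 3, 2])
next_tokens = torch.tensor([[78], [934], [935], [28]])

cpu_out = torch.cat([input_ids[beam_idx, :], next_tokens], dim=-1)
mps_out = torch.cat(
    [input_ids.to("mps")[beam_idx.to("mps"), :], next_tokens.to("mps")],
    dim=-1,
)

# On a correct build these should match; a mismatch in the first column would
# look like the garbage value (3463342888) shown above.
print(torch.equal(cpu_out, mps_out.cpu()))
```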
Another weird thing I noticed is at https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L366 , which in this particular case (launching manga_ocr) evaluates to:
sent_lengths[0] = 26
This assignment only sticks if I read sent_lengths[0] back immediately after it's set, inside the loop (e.g. "test_var = sent_lengths[0] + 1", or a print); doing the read outside the loop doesn't help. Adding that read fixes the crash in manga_ocr's case, though the first element of the "decoded" tensor returned at https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L408 is still an incorrect/very high value, like in input_ids above. The sketch below illustrates the pattern.
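To illustrate what I mean (made-up values; the commented-out sync line is just my guess about lazy MPS evaluation, nothing I've verified):

```python
import torch

# Hypothetical illustration with made-up values: an in-place item assignment on
# an MPS tensor that appears to be dropped unless the value is read back right
# after the write.
assert torch.backends.mps.is_available()

sent_lengths = torch.zeros(4, dtype=torch.long, device="mps")
for i in range(4):
    sent_lengths[i] = 26
    _ = sent_lengths[i].item()   # reading the value back here makes the write "stick"
    # torch.mps.synchronize()    # possibly equivalent; not verified

print(sent_lengths)  # expected: tensor([26, 26, 26, 26], device='mps:0')
```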

Versions

Collecting environment information...
PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.28.3
Libc version: N/A

Python version: 3.12.2 (main, Feb 6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)] (64-bit runtime)
Python platform: macOS-14.4-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Max

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.17.1
[pip3] rapidocr-onnxruntime==1.2.3
[pip3] torch==2.2.1
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.17.1
[conda] Could not collect

cc @ezyang @gchanan @zou3519 @kadeng @kulinseth @albanD @malfet @DenisVieriu97 @razarmehr

Labels

high priority, module: correctness (silent), module: mps, module: regression, triaged
