[MPS] Regression from macOS 14.3 to 14.4 in PyTorch 2.2.0/2.2.1 #122016

Closed
@AuroraWright

Description


🐛 Describe the bug

I've been using a PyTorch model daily through transformers (https://github.com/kha-white/manga-ocr) on MPS. Everything was fine with the latest PyTorch and the latest transformers until the macOS Sonoma 14.4 update, after which it started crashing on startup (I was on 14.3 before, on an M1 Max Mac, for reference). Since CPU mode still worked fine, I also tried older versions of PyTorch and found that anything before 2.2.0 works.
Bisecting, I found this commit is the cause: 056d624. Building the latest main with that commit reverted makes it work again.

I apologize that I haven't been able to pinpoint this further or to write a proper standalone example (I don't really know what I'm doing here).
I've looked at what's happening from within transformers; I'm not sure whether any of this helps:

It seems that at https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L318 the first element of each row of input_ids differs depending on whether that commit is present:

e.g.
With the commit:
tensor([[3463342888, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
         2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
         1482, 5578, 78],
        [3463342888, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
         2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
         1482, 5578, 934],
        [3463342888, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
         2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
         1482, 5578, 935],
        [3463342888, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
         2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
         1482, 5578, 28]], device='mps:0')

Without the commit:
tensor([[ 2, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
1482, 5578, 78],
[ 2, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
1482, 5578, 934],
[ 2, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
1482, 5578, 935],
[ 2, 2, 2312, 4080, 5063, 4885, 3600, 5764, 5554, 4798, 1074, 3386,
2865, 2166, 1524, 1858, 2872, 4014, 5250, 1225, 2119, 1104, 1682, 5730,
1482, 5578, 21]], device='mps:0')

This difference is ultimately what causes the transformers crash.
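For what it's worth, this is roughly the kind of comparison I've been doing by hand. It's only a hypothetical sketch with made-up shapes and token values (I haven't confirmed this is the exact operation transformers performs), not a standalone repro:

```python
import torch

# Hypothetical sketch: a beam-search-style reorder (index the rows by beam_idx,
# then append the next tokens), done once on CPU and once on MPS, then compared.
# Shapes and values are made up; this is not a confirmed repro of the root cause.
assert torch.backends.mps.is_available()

torch.manual_seed(0)
input_ids = torch.randint(0, 6000, (4, 26))
beam_idx = torch.tensor([1, 0, 3, 2])
next_tokens = torch.tensor([[78], [934], [935], [28]])

cpu_out = torch.cat([input_ids[beam_idx, :], next_tokens], dim=-1)
mps_out = torch.cat(
    [input_ids.to("mps")[beam_idx.to("mps"), :], next_tokens.to("mps")],
    dim=-1,
)

# On a correct build these should match; a mismatch in the first column would
# look like the garbage value (3463342888) shown above.
print(torch.equal(cpu_out, mps_out.cpu()))
```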
Another weird thing I noticed is at https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L366 , which in this particular case (launching manga_ocr) evaluates to:
sent_lengths[0] = 26
This assignment only sticks if I read sent_lengths[0] back immediately after it's set, inside the loop (e.g. "test_var = sent_lengths[0] + 1", or a print); doing the read outside the loop doesn't help. Adding that read fixes the crash in manga_ocr's case, though the first element of the "decoded" tensor returned at https://github.com/huggingface/transformers/blob/main/src/transformers/generation/beam_search.py#L408 is still an incorrect/very high value, like in input_ids above. The sketch below illustrates the pattern.
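To illustrate what I mean (made-up values; the commented-out sync line is just my guess about lazy MPS evaluation, nothing I've verified):

```python
import torch

# Hypothetical illustration with made-up values: an in-place item assignment on
# an MPS tensor that appears to be dropped unless the value is read back right
# after the write.
assert torch.backends.mps.is_available()

sent_lengths = torch.zeros(4, dtype=torch.long, device="mps")
for i in range(4):
    sent_lengths[i] = 26
    _ = sent_lengths[i].item()   # reading the value back here makes the write "stick"
    # torch.mps.synchronize()    # possibly equivalent; not verified

print(sent_lengths)  # expected: tensor([26, 26, 26, 26], device='mps:0')
```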

Versions

Collecting environment information...
PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.4 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.28.3
Libc version: N/A

Python version: 3.12.2 (main, Feb 6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)] (64-bit runtime)
Python platform: macOS-14.4-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Max

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] onnxruntime==1.17.1
[pip3] rapidocr-onnxruntime==1.2.3
[pip3] torch==2.2.1
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.17.1
[conda] Could not collect

cc @ezyang @gchanan @zou3519 @kadeng @kulinseth @albanD @malfet @DenisVieriu97 @razarmehr

Labels

high priority, module: correctness (silent), module: mps, module: regression, triaged
