[CUDA] Change slim-wheel libraries load order #145638
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/145638. Note: links to docs will display an error until the docs builds have completed. ⏳ No Failures, 60 Pending as of commit d9f0922 with merge base b808774. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Previously this did not work for cu118, because libnvjitlink does not exist for cu118, so the preload would raise an exception there (since we no longer guard cu118 for this). The exception transfers control to the exception handler, skipping the libnvrtc preload.
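The failure mode described above can be sketched with stand-in helpers (the `_preload` stub and the `available` set below are illustrative; the real helper is torch's internal `_preload_cuda_deps`):

```python
def _preload(lib_folder, lib_name, available):
    # Stand-in for _preload_cuda_deps: raise OSError when the wheel (and
    # hence the library) does not exist, as with nvjitlink on CUDA 11.8.
    if lib_folder not in available:
        raise OSError(f"{lib_name}: cannot open shared object file")

def preload_old(available):
    """Pre-PR order: nvjitlink first, so its absence skips nvrtc too."""
    loaded = []
    try:
        _preload("nvjitlink", "libnvJitLink.so.*[0-9]", available)
        loaded.append("nvjitlink")
        _preload("cuda_nvrtc", "libnvrtc.so.*[0-9]", available)
        loaded.append("nvrtc")
    except OSError:
        pass  # one failure aborts every remaining preload
    return loaded

def preload_new(available):
    """Post-PR order: nvrtc first, since it exists on cu118 and cu12x wheels."""
    loaded = []
    try:
        _preload("cuda_nvrtc", "libnvrtc.so.*[0-9]", available)
        loaded.append("nvrtc")
        _preload("nvjitlink", "libnvJitLink.so.*[0-9]", available)
        loaded.append("nvjitlink")
    except OSError:
        pass
    return loaded

cu118 = {"cuda_nvrtc"}  # no nvjitlink wheel exists for CUDA 11.x
print(preload_old(cu118))  # -> [] : nvrtc never preloaded
print(preload_new(cu118))  # -> ['nvrtc'] : nvrtc preloaded before the failure
```

With the old order, the missing nvjitlink aborts the whole block before nvrtc is touched; reordering makes the failure harmless.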
lgtm
The whole thing is very brittle; maybe at least each library needs to be in its own try/except block.
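The reviewer's suggestion of one try/except per library could be sketched like this (a hypothetical refactor, not torch's actual code; `load_one` stands in for `_preload_cuda_deps`):

```python
def preload_all(libs, load_one):
    """Preload each library independently so one missing library
    cannot skip the rest."""
    loaded, missing = [], []
    for folder, pattern in libs:
        try:
            load_one(folder, pattern)
            loaded.append(folder)
        except OSError:
            # Tolerate absence (e.g. nvjitlink on cu118) and keep going.
            missing.append(folder)
    return loaded, missing
```

With this shape the load order no longer matters for correctness, only the per-library error handling does.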
@nWEIdia can you please provide a bit better explanation?
To clarify, libnvjitlink does not depend on libnvrtc. |
```diff
-# If all abovementioned conditions are met, preload nvjitlink and nvrtc
+# If all above-mentioned conditions are met, preload nvrtc and nvjitlink
+# Please note that order are important for CUDA-11.8 , as nvjitlink does not exist there
 _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
```
nvjitlink does not exist there, but we do preload it, just for the sake of fixing the break.
Fix first, optimize later.
We should restore the logic lost in #145582, where

```python
if version.cuda not in ["12.4", "12.6"]:  # type: ignore[name-defined]
    return
```

was removed, exposing the libnvjitlink preload on cu118. Which is not that great.
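The guard being discussed could be restored along these lines (a sketch only; `preload` stands in for `_preload_cuda_deps`, and the function name here is hypothetical):

```python
def maybe_preload_nvjitlink(cuda_version, preload):
    """Only preload nvjitlink on CUDA versions whose wheels actually ship it."""
    if cuda_version not in ["12.4", "12.6"]:
        return False  # e.g. "11.8": no nvjitlink wheel exists, skip quietly
    preload("nvjitlink", "libnvJitLink.so.*[0-9]")
    return True
```

This keeps cu118 out of the nvjitlink path entirely instead of relying on the exception handler.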
Ok, let's discuss it during the meeting. I remember Andrey had to remove this one because 11.8 failed to load cudnn, but maybe his environment was corrupted.
We seem to only have nvidia-nvjitlink-cu12 on PyPI, not a cu11 variant; that explains why libnvjitlink does not exist with cu118.
Actually, it does not solve the problem for me at all (but it also does not manifest during the load).

Please make sure you test with vanilla Ubuntu 24.04.

You are right: my workarounds must have clouded my experiments; indeed this PR does nothing to fix the issue. Open to fix.

Oh, never mind. "This PR does nothing to fix the issue" was another mistake on my part: I was loading libnvjitlink first while repeating the experiment.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot merge -f "Lint is green"

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot cherry-pick --onto release/2.6 -c critical

There is no libnvjitlink in CUDA-11.x, so attempts to load it first will abort the execution and prevent the script from preloading nvrtc.

Fixes issues reported in #145614 (comment)

Pull Request resolved: #145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 2a70de7)

Cherry picking #145638: the cherry-pick PR is at #145662, and it is recommended to link a critical cherry-pick PR with an issue. The following tracker issues are updated. Details for Dev Infra team: raised by workflow job.
[CUDA] Change slim-wheel libraries load order (#145638)

There is no libnvjitlink in CUDA-11.x, so attempts to load it first will abort the execution and prevent the script from preloading nvrtc.

Fixes issues reported in #145614 (comment)

Pull Request resolved: #145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 2a70de7)
Co-authored-by: Wei Wang <weiwan@nvidia.com>
This is included in 2.6.0, hence removing it from the 2.6.1 milestone.
There is no libnvjitlink in CUDA-11.x, so attempts to load it first will abort the execution and prevent the script from preloading nvrtc.
Fixes issues reported in #145614 (comment)
cc @atalman @malfet @ptrblck @eqy @tinglvv