[CUDA] Change slim-wheel libraries load order #145638


Status: Closed · wants to merge 3 commits

Conversation

nWEIdia
Collaborator

@nWEIdia nWEIdia commented Jan 24, 2025

There is no libnvjitlink in CUDA-11.x, so attempting to load it first aborts execution and prevents the script from preloading nvrtc.

Fixes issues reported in #145614 (comment)

cc @atalman @malfet @ptrblck @eqy @tinglvv


pytorch-bot bot commented Jan 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145638

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 60 Pending

As of commit d9f0922 with merge base b808774 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@nWEIdia
Collaborator Author

nWEIdia commented Jan 24, 2025

Previously this did not work for cu118: libnvjitlink does not exist with cu118, so loading it raised an exception (since we no longer guard cu118 here). Control then jumped to the exception handler, skipping the libnvrtc preload.
Now we switch the order, ensuring libnvrtc finishes preloading first; the subsequent exception for libnvjitlink then becomes a don't-care error.
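The failure mode described above can be sketched with a stand-in for the preload helper (the `_preload` and `preload_all` functions, the `AVAILABLE` set, and the library filenames below are illustrative, not torch's actual implementation):

```python
# Simulated environment: CUDA 11.8 slim wheels ship nvrtc but not nvjitlink.
AVAILABLE = {"libnvrtc.so.11.2"}

def _preload(pattern: str, loaded: list) -> None:
    """Stand-in for torch's _preload_cuda_deps: raise when the library
    cannot be found, like a failed dlopen."""
    if pattern not in AVAILABLE:
        raise OSError(f"{pattern}: cannot open shared object file")
    loaded.append(pattern)

def preload_all(order: list) -> list:
    """All preloads share one try block, so the first missing library
    skips every preload after it."""
    loaded: list = []
    try:
        for pattern in order:
            _preload(pattern, loaded)
    except OSError:
        pass  # swallowed: the remaining libraries are never attempted
    return loaded

# Old order: nvjitlink first, so on cu118 nothing is preloaded at all.
print(preload_all(["libnvJitLink.so.12", "libnvrtc.so.11.2"]))  # []
# New order: nvrtc is preloaded before nvjitlink's failure is swallowed.
print(preload_all(["libnvrtc.so.11.2", "libnvJitLink.so.12"]))  # ['libnvrtc.so.11.2']
```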

@nWEIdia nWEIdia requested review from atalman and malfet January 24, 2025 18:53
@nWEIdia nWEIdia added the release notes: build release notes category label Jan 24, 2025
Contributor

@atalman atalman left a comment


lgtm

@kit1980
Contributor

kit1980 commented Jan 24, 2025

The whole thing is very brittle; maybe at least each library needs its own try block, and maybe add some logging?
Not now, for the release, but later.
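The suggestion above might look like the following sketch (the function names and the fake loader are hypothetical, not the actual torch code):

```python
import logging

logger = logging.getLogger(__name__)

def preload_independently(patterns, loader):
    """Give each library its own try block and log failures, so one
    missing library cannot silently skip the others."""
    loaded = []
    for pattern in patterns:
        try:
            loader(pattern)
            loaded.append(pattern)
        except OSError as exc:
            logger.warning("Could not preload %s: %s", pattern, exc)
    return loaded

def fake_loader(pattern):
    # Simulate a cu118 environment where nvjitlink is absent.
    if "JitLink" in pattern:
        raise OSError("not found")

loaded = preload_independently(
    ["libnvJitLink.so.12", "libnvrtc.so.11.2"], fake_loader
)
print(loaded)  # ['libnvrtc.so.11.2'] even though nvjitlink came first
```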

@malfet
Contributor

malfet commented Jan 24, 2025

@nWEIdia can you please provide a slightly better explanation than "On cu118, libnvrtc wants to be preloaded first"? (Actually, let me just edit it myself.)

@malfet malfet changed the title [CI][CD][CUDA][11.8] On cu118, libnvrtc wants to be preloaded first [CUDA] Change slim-wheel libraries load order Jan 24, 2025
Review comment on torch/__init__.py (outdated):
@nWEIdia
Collaborator Author

nWEIdia commented Jan 24, 2025

To clarify, libnvjitlink does not depend on libnvrtc.
It simply does not exist on cu118, so loading it first skips loading the one after it (libnvrtc).

    # If all abovementioned conditions are met, preload nvjitlink and nvrtc
    _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
    # If all above-mentioned conditions are met, preload nvrtc and nvjitlink
    # Please note that the order is important for CUDA-11.8, as nvjitlink does not exist there
@nWEIdia (Collaborator, Author)

nvjitlink does not exist there, but we still attempt the preload, just for the sake of fixing the break.
Fix first, optimize later.
We should restore the logic lost in #145582, where

    if version.cuda not in ["12.4", "12.6"]:  # type: ignore[name-defined]
        return

was removed, exposing the libnvjitlink preload on cu118. Which is not that great.
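Combining the restored guard with the swapped load order could look roughly like the following sketch (the function name, signature, and loader callback are hypothetical, not the actual torch/__init__.py code):

```python
def preload_slim_wheel_deps(cuda_version: str, loader) -> None:
    """Sketch of both fixes together: preload nvrtc unconditionally, and
    only attempt nvjitlink on CUDA versions whose wheels actually ship it,
    restoring the guard removed in #145582."""
    loader("nvrtc", "libnvrtc.so.*[0-9]")
    if cuda_version in ("12.4", "12.6"):
        loader("nvjitlink", "libnvJitLink.so.*[0-9]")

calls = []
preload_slim_wheel_deps("11.8", lambda name, pat: calls.append(name))
print(calls)  # ['nvrtc']: nvjitlink is never attempted on cu118
```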

@malfet (Contributor) Jan 24, 2025

Ok, let's discuss it during the meeting. I remember Andrey had to remove this one because 11.8 failed to load cudnn, but maybe his environment was corrupted.

@nWEIdia (Collaborator, Author)

We seem to only have nvidia-nvjitlink-cu12 on PyPI, not a cu11 variant; that explains why libnvjitlink does not exist with cu118.

@malfet (Contributor) left a comment

Actually, it does not solve the problem for me at all (but then, the problem also does not manifest during the load).

@nWEIdia
Collaborator Author

nWEIdia commented Jan 24, 2025

Please make sure you test with vanilla Ubuntu 24.04.

@nWEIdia
Collaborator Author

nWEIdia commented Jan 24, 2025

You are right: my workarounds must have clouded my experiments; indeed, this PR does nothing to fix the issue. Open to a fix.

@nWEIdia nWEIdia closed this Jan 24, 2025
@malfet malfet reopened this Jan 24, 2025
@nWEIdia
Collaborator Author

nWEIdia commented Jan 24, 2025

Oh, never mind "this PR does nothing to fix the issue": I made another mistake. I was loading libnvjitlink first while repeating the experiment.
I am confident my PR DOES fix the issue. (At one point I thought it did not fix the issue either.)

@malfet
Contributor

malfet commented Jan 24, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 24, 2025
@malfet malfet added this to the 2.6.1 milestone Jan 24, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@malfet
Contributor

malfet commented Jan 24, 2025

@pytorchbot merge -f "Lint is green"

@pytorchmergebot
Collaborator

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job waited more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.
For more information see pytorch-bot wiki.

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@kit1980
Contributor

kit1980 commented Jan 24, 2025

@pytorchbot cherry-pick --onto release/2.6 -c critical

@pytorch pytorch deleted a comment from pytorch-bot bot Jan 24, 2025
pytorchbot pushed a commit that referenced this pull request Jan 24, 2025
There is no libnvjitlink in  CUDA-11.x , so attempts to load it first will abort the execution and prevent the script from preloading nvrtc

Fixes issues reported in #145614 (comment)

Pull Request resolved: #145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 2a70de7)
@pytorchbot
Collaborator

Cherry picking #145638

The cherry pick PR is at #145662 and it is recommended to link a critical cherry pick PR with an issue. The following tracker issues are updated:


malfet pushed a commit that referenced this pull request Jan 24, 2025
[CUDA] Change slim-wheel libraries load order (#145638)


Co-authored-by: Wei Wang <weiwan@nvidia.com>
nWEIdia added a commit to nWEIdia/pytorch that referenced this pull request Jan 27, 2025
@atalman atalman removed this from the 2.6.1 milestone Feb 4, 2025
@atalman
Copy link
Contributor

atalman commented Feb 4, 2025

This is included in 2.6.0, hence removing it from the 2.6.1 milestone.

Labels
ciflow/trunk · Merged · open source · release notes: build
6 participants