[CUDA] Change slim-wheel libraries load order #145638
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/145638. Note: links to docs will display an error until the docs builds have completed. ⏳ No Failures, 60 Pending as of commit d9f0922 with merge base b808774. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Previously this did not work for cu118, because libnvjitlink does not exist for cu118, so the preload would raise an exception there (since we no longer guard cu118 for this). The exception transfers control to the exception handler, skipping the libnvrtc preload.
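The failure mode described above can be sketched with stand-in helpers (the `_preload` stub and the `available` set below are illustrative; the real helper is torch's internal `_preload_cuda_deps`):

```python
def _preload(lib_folder, lib_name, available):
    # Stand-in for _preload_cuda_deps: raise OSError when the wheel (and
    # hence the library) does not exist, as with nvjitlink on CUDA 11.8.
    if lib_folder not in available:
        raise OSError(f"{lib_name}: cannot open shared object file")

def preload_old(available):
    """Pre-PR order: nvjitlink first, so its absence skips nvrtc too."""
    loaded = []
    try:
        _preload("nvjitlink", "libnvJitLink.so.*[0-9]", available)
        loaded.append("nvjitlink")
        _preload("cuda_nvrtc", "libnvrtc.so.*[0-9]", available)
        loaded.append("nvrtc")
    except OSError:
        pass  # one failure aborts every remaining preload
    return loaded

def preload_new(available):
    """Post-PR order: nvrtc first, since it exists on cu118 and cu12x wheels."""
    loaded = []
    try:
        _preload("cuda_nvrtc", "libnvrtc.so.*[0-9]", available)
        loaded.append("nvrtc")
        _preload("nvjitlink", "libnvJitLink.so.*[0-9]", available)
        loaded.append("nvjitlink")
    except OSError:
        pass
    return loaded

cu118 = {"cuda_nvrtc"}  # no nvjitlink wheel exists for CUDA 11.x
print(preload_old(cu118))  # -> [] : nvrtc never preloaded
print(preload_new(cu118))  # -> ['nvrtc'] : nvrtc preloaded before the failure
```

With the old order, the missing nvjitlink aborts the whole block before nvrtc is touched; reordering makes the failure harmless.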
lgtm
The whole thing is very brittle; maybe at least each library needs to be in its own try/except block.
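The reviewer's suggestion of one try/except per library could be sketched like this (a hypothetical refactor, not torch's actual code; `load_one` stands in for `_preload_cuda_deps`):

```python
def preload_all(libs, load_one):
    """Preload each library independently so one missing library
    cannot skip the rest."""
    loaded, missing = [], []
    for folder, pattern in libs:
        try:
            load_one(folder, pattern)
            loaded.append(folder)
        except OSError:
            # Tolerate absence (e.g. nvjitlink on cu118) and keep going.
            missing.append(folder)
    return loaded, missing
```

With this shape the load order no longer matters for correctness, only the per-library error handling does.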
@nWEIdia can you please provide a bit better explanation?
To clarify, libnvjitlink does not depend on libnvrtc. |
```diff
-# If all abovementioned conditions are met, preload nvjitlink and nvrtc
+# If all above-mentioned conditions are met, preload nvrtc and nvjitlink
+# Please note that order are important for CUDA-11.8 , as nvjitlink does not exist there
 _preload_cuda_deps("nvjitlink", "libnvJitLink.so.*[0-9]")
```
nvjitlink does not exist there, but we do preload it, just for the sake of fixing the break.
Fix first, optimize later.
We should restore the logic lost in #145582, where

```python
if version.cuda not in ["12.4", "12.6"]:  # type: ignore[name-defined]
    return
```

was removed, exposing the libnvjitlink preload on cu118. Which is not that great.
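The guard being discussed could be restored along these lines (a sketch only; `preload` stands in for `_preload_cuda_deps`, and the function name here is hypothetical):

```python
def maybe_preload_nvjitlink(cuda_version, preload):
    """Only preload nvjitlink on CUDA versions whose wheels actually ship it."""
    if cuda_version not in ["12.4", "12.6"]:
        return False  # e.g. "11.8": no nvjitlink wheel exists, skip quietly
    preload("nvjitlink", "libnvJitLink.so.*[0-9]")
    return True
```

This keeps cu118 out of the nvjitlink path entirely instead of relying on the exception handler.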
Ok, let's discuss it during the meeting. I remember Andrey had to remove this one because 11.8 failed to load cudnn, but maybe his environment was corrupted.
We seem to only have nvidia-nvjitlink-cu12 on PyPI, not a cu11 variant; that explains why libnvjitlink does not exist with cu118.
Actually, it does not solve the problem for me at all (but it also does not manifest during the load).

Please make sure you test with vanilla Ubuntu 24.04.

You are right: my workarounds must have clouded my experiments; indeed this PR does nothing to fix the issue. Open to fix.

Oh, never mind. "This PR does nothing to fix the issue" was another mistake on my part: I was loading libnvjitlink first while repeating the experiment.
@pytorchbot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot merge -f "Lint is green"

The merge job was canceled or timed out. This most often happens if two merge requests were issued for the same PR, or if the merge job was waiting for more than 6 hours for tests to finish. In the latter case, please do not hesitate to reissue the merge command.

Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
@pytorchbot cherry-pick --onto release/2.6 -c critical

There is no libnvjitlink in CUDA-11.x, so attempts to load it first will abort the execution and prevent the script from preloading nvrtc.

Fixes issues reported in #145614 (comment)

Pull Request resolved: #145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 2a70de7)

Cherry picking #145638: the cherry-pick PR is at #145662, and it is recommended to link a critical cherry-pick PR with an issue. The following tracker issues are updated. Details for Dev Infra team: raised by workflow job.
[CUDA] Change slim-wheel libraries load order (#145638)

There is no libnvjitlink in CUDA-11.x, so attempts to load it first will abort the execution and prevent the script from preloading nvrtc.

Fixes issues reported in #145614 (comment)

Pull Request resolved: #145638
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
(cherry picked from commit 2a70de7)
Co-authored-by: Wei Wang <weiwan@nvidia.com>
This is included in 2.6.0, hence removing it from the 2.6.1 milestone.
There is no libnvjitlink in CUDA-11.x, so attempts to load it first will abort the execution and prevent the script from preloading nvrtc.
Fixes issues reported in #145614 (comment)
cc @atalman @malfet @ptrblck @eqy @tinglvv