
chore!: Update PyTorch to 2.5 #2865

Merged
mergify[bot] merged 1 commit into instructlab:main from fabiendupont:update-pytorch-2.5 on Jan 8, 2025

Conversation

@fabiendupont (Contributor)
This change raises the upper bound on the PyTorch version to allow 2.5. For the AMD variant, it also switches to ROCm 6.2, which PyTorch 2.5 requires.

Resolves #2864
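A minimal sketch of what the range bump means for dependency resolution. The bounds below are illustrative assumptions, not the PR's actual requirements diff:

```python
# Sketch of the effect of raising the PyTorch upper bound.
# The bounds here (>=2.3.0,<2.5.0 before, <2.6.0 after) are
# illustrative assumptions, not the PR's actual requirements diff.

def version_tuple(version: str) -> tuple:
    """Convert '2.5.1' into (2, 5, 1) for simple comparisons."""
    return tuple(int(part) for part in version.split("."))

def in_range(candidate: str, lower: str, upper_exclusive: str) -> bool:
    """True if lower <= candidate < upper_exclusive."""
    return (version_tuple(lower)
            <= version_tuple(candidate)
            < version_tuple(upper_exclusive))

# Before the bump, torch 2.5.1 is rejected; after it, accepted.
print(in_range("2.5.1", "2.3.0", "2.5.0"))  # False
print(in_range("2.5.1", "2.3.0", "2.6.0"))  # True
```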

mergify bot added the documentation, dependencies, and ci-failure labels on Jan 7, 2025
mergify bot re-added and then removed the ci-failure label on Jan 7, 2025
mergify bot commented on Jan 7, 2025:

This pull request has merge conflicts that must be resolved before it can be merged. @fabiendupont, please rebase it. See https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jan 7, 2025
mergify bot removed the needs-rebase and ci-failure labels on Jan 7, 2025
@cdoern cdoern requested a review from nathan-weinberg January 7, 2025 16:08
@nathan-weinberg (Member)

@fabiendupont have you done any kind of testing around this?

@fabiendupont (Contributor, Author)

@nathan-weinberg, I have used our downstream build pipeline to manually test instructlab v0.22.1 with PyTorch 2.5.1. I was able to run the chat, serve, data generate, and train steps without any issue. That's why I went ahead and proposed this PR, as it seems pretty safe. Do you have any specific concerns?

@nathan-weinberg (Member)

@fabiendupont if you look at the existing issues (I linked them in the additional one you opened, which I think may be a duplicate), the reason we've been holding off is that we want to make sure the training library (@JamesKunstle) is functional with this version before we make the bump here.

@prarit curious around your thoughts on this as well

@fabiendupont (Contributor, Author)

@nathan-weinberg, the training library doesn't seem to cap the PyTorch version, so I would have expected it to have already been tested with PyTorch 2.5.1 during a previous PR in instructlab/training.

@JamesKunstle, is there anywhere besides the requirements file where the PyTorch version is controlled?

@JamesKunstle (Contributor)

@fabiendupont No, it isn't capped elsewhere, but we don't quite have an independent test yet that runs through everything with a higher torch version; that'll be ready in a bit. The instructlab/instructlab tests should be good enough to roughly confirm it, though.

@nathan-weinberg (Member)

I'm going to trigger a couple of E2E jobs on this (Large and XLarge) just as a sanity check.

github-actions bot commented on Jan 7, 2025:

E2E (NVIDIA L40S x8) workflow launched on this PR: View run

github-actions bot commented on Jan 7, 2025:

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

prarit commented on Jan 7, 2025:

@nathan-weinberg LGTM

@nathan-weinberg (Member)

Okay, if the CI jobs pass I'm fine to approve this. @JamesKunstle, can you approve as well if you are signing off, which it sounds like you are?

@JamesKunstle (Contributor) left a review:

Since all our CI succeeded with torch<2.6, I think we're okay to bump it.

mergify bot added the one-approval label on Jan 7, 2025
github-actions bot commented on Jan 7, 2025:

e2e workflow failed on this PR: View run, please investigate.

@JamesKunstle (Contributor)

@nathan-weinberg I don't think that failure is real; I looked at the logs and it seems to be just a regex mismatch or something.

@nathan-weinberg (Member)

Yeah, let me rerun it here.

github-actions bot commented on Jan 7, 2025:

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions bot commented on Jan 7, 2025:

e2e workflow failed on this PR: View run, please investigate.

github-actions bot commented on Jan 7, 2025:

e2e workflow failed on this PR: View run, please investigate.

@reidliu41 (Contributor)

It seems it still isn't updated? The grep pattern is still the old one:

```
ᕦ(òᴗóˇ)ᕤ Accelerated model training completed successfully! ᕦ(òᴗóˇ)ᕤ
Best final checkpoint: /tmp/tmp.uAmOJhmRbo/.local/share/instructlab/skills-only/phase2/checkpoints/hf_format/samples_662 with score: 7.894736842105263
Journal: /tmp/tmp.uAmOJhmRbo/.local/share/instructlab/skills-only/journalfile.yaml
+ grep -o '/[^ ]*'
+ grep 'Training finished! Best final checkpoint: ' <<<<<<<<<<<<-------- /tmp/tmp.uAmOJhmRbo/skills_only_training.log
+ rm -rf /tmp/tmp.uAmOJhmRbo
```
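The mismatch can be reproduced in isolation: the script still greps for the old "Training finished! Best final checkpoint: " message, while the log now prints "Best final checkpoint: ... with score: ...". A hypothetical sketch of the fix follows; the real e2e script isn't shown in this thread, and the sample log line and path below are made up:

```shell
# Hypothetical reproduction of the grep mismatch; the sample log line
# and path are illustrative, not taken from the real e2e script.
log_line='Best final checkpoint: /tmp/ckpt/hf_format/samples_662 with score: 7.89'

# Old pattern: no longer matches, so the extraction comes back empty.
old_match=$(printf '%s\n' "$log_line" | grep -o 'Training finished! Best final checkpoint: [^ ]*' || true)

# Updated pattern: matches the new message, then captures the path.
new_match=$(printf '%s\n' "$log_line" | grep 'Best final checkpoint: ' | grep -o '/[^ ]*')

echo "old: '$old_match'"
echo "new: '$new_match'"
```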

github-actions bot commented on Jan 7, 2025:

e2e workflow failed on this PR: View run, please investigate.

github-actions bot commented on Jan 7, 2025:

e2e workflow succeeded on this PR: View run, congrats!

github-actions bot commented on Jan 8, 2025:

e2e workflow failed on this PR: View run, please investigate.

mergify bot commented on Jan 8, 2025:

This pull request has merge conflicts that must be resolved before it can be merged. @fabiendupont, please rebase it. See https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jan 8, 2025
This change increases the upper version of PyTorch to allow version 2.5.
For the AMD variant, it also switches to ROCm 6.2, which is required for
PyTorch 2.5.

Resolves instructlab#2864

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
mergify bot removed the needs-rebase label on Jan 8, 2025
bbrowning (Contributor) commented on Jan 8, 2025:

It seems we ran out of disk space on the e2e-xlarge test that was kicked off? From its logs at https://github.com/instructlab/instructlab/actions/runs/12659779418/job/35279654906:

2025-01-08T00:55:49.8129077Z ##[warning]You are running out of disk space. The runner will stop working when the machine runs out of disk space. Free space left: 9 MB

However, we did get past the final-checkpoint grep issue. So the question is: did PyTorch 2.5 increase our disk space usage, or is this just a flake of that e2e test setup?

@bbrowning (Contributor)

Looking at the e2e-xlarge-test history, I see it has never passed on main, so the fact that it failed may not be surprising. It did get past the point of training a model at least, which gives some indication that PyTorch 2.5 works properly on that setup.

@nathan-weinberg (Member) left a review:

Given that the Large job passed with no issues and the various other approvals/sign-offs, I am going to go ahead and approve this.

nathan-weinberg removed the hold label on Jan 8, 2025
mergify bot merged commit 61c877b into instructlab:main on Jan 8, 2025
30 checks passed
mergify bot removed the one-approval label on Jan 8, 2025
@bbrowning (Contributor)

So, looking at the logs, I'm not sure this actually ran any CI job with Torch 2.5.x. CI was green with this change, but I'm still seeing multiple references to torch 2.4.x in the CI logs when installing ilab. Are we sure we have confidence that torch 2.5.x works properly if CI isn't actually using torch 2.5.x in our tests?
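A lightweight guard along these lines could be added to a CI step to confirm which torch version was actually resolved at install time. The function names and the 2.5 threshold are illustrative assumptions, not part of this PR:

```python
# Hypothetical CI guard to confirm which torch version was actually
# resolved at install time; the names and the 2.5 threshold are
# illustrative, not part of this PR.
from importlib import metadata

def installed_major_minor(package: str) -> tuple:
    """Return (major, minor) of an installed distribution, e.g. (2, 5)."""
    major, minor = metadata.version(package).split(".")[:2]
    return int(major), int(minor)

def assert_torch_at_least(minor: int = 5) -> None:
    """Fail the CI step if the resolved torch is older than 2.<minor>."""
    version = installed_major_minor("torch")
    if version < (2, minor):
        raise SystemExit(f"CI resolved torch {version}; expected >= 2.{minor}")
    print(f"torch {version} is new enough")
```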

fabiendupont added a commit to fabiendupont/instructlab that referenced this pull request Jan 9, 2025
This PR is a follow-up to instructlab#2865 that relaxed the PyTorch version range.
Even with that range extension, we realized that PyTorch 2.4 is still
used when installing `instructlab[vllm-cuda]`, because vLLM 0.6.2 has a
requirement on PyTorch 2.4.

This new PR updates the version of vLLM to 0.6.6.post1, which is the
latest available in the Open Data Hub fork of vLLM. The vLLM changelog
doesn't highlight much risk in this version bump.

Resolves instructlab#2702

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
fabiendupont added a commit to fabiendupont/instructlab that referenced this pull request Jan 10, 2025
This PR is a follow-up to instructlab#2865 that relaxed the PyTorch version range.
Even with that range extension, we realized that PyTorch 2.4 is still
used when installing `instructlab[vllm-cuda]`, because vLLM 0.6.2 has a
requirement on PyTorch 2.4.

This new PR updates the version of vLLM to 0.6.6.post1, which is the
latest available in the Open Data Hub fork of vLLM. The vLLM changelog
doesn't highlight much risk in this version bump.

It also bumps the version of SDG to 0.6.3, which relaxes PyTorch
dependency to allow 2.5.

Resolves instructlab#2702

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
mergify bot added a commit that referenced this pull request Jan 28, 2025
This PR is a follow-up to #2865 that relaxed the PyTorch version range. Even with that range extension, we realized that PyTorch 2.4 is still used when installing `instructlab[vllm-cuda]`, because vLLM 0.6.2 has a requirement on PyTorch 2.4.

This new PR updates the version of vLLM to 0.6.6.post1, which is the latest available in the Open Data Hub fork of vLLM. The vLLM changelog doesn't highlight much risk in this version bump.

Resolves #2702


Approved-by: nathan-weinberg

Approved-by: alinaryan
