
chore!: Update PyTorch to 2.5 #2865

Merged
mergify[bot] merged 1 commit into instructlab:main from fabiendupont:update-pytorch-2.5 on Jan 8, 2025

Conversation

@fabiendupont (Contributor)
This change raises the upper bound on the PyTorch version to allow 2.5. For the AMD variant, it also switches to ROCm 6.2, which PyTorch 2.5 requires.

Resolves #2864
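A minimal sketch of what the range bump means for dependency resolution. The bounds below are illustrative assumptions, not the PR's actual requirements diff:

```python
# Sketch of the effect of raising the PyTorch upper bound.
# The bounds here (>=2.3.0,<2.5.0 before, <2.6.0 after) are
# illustrative assumptions, not the PR's actual requirements diff.

def version_tuple(version: str) -> tuple:
    """Convert '2.5.1' into (2, 5, 1) for simple comparisons."""
    return tuple(int(part) for part in version.split("."))

def in_range(candidate: str, lower: str, upper_exclusive: str) -> bool:
    """True if lower <= candidate < upper_exclusive."""
    return (version_tuple(lower)
            <= version_tuple(candidate)
            < version_tuple(upper_exclusive))

# Before the bump, torch 2.5.1 is rejected; after it, accepted.
print(in_range("2.5.1", "2.3.0", "2.5.0"))  # False
print(in_range("2.5.1", "2.3.0", "2.6.0"))  # True
```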

mergify bot added the documentation, dependencies, and ci-failure labels on Jan 7, 2025
mergify bot re-added and then removed the ci-failure label on Jan 7, 2025
mergify bot commented on Jan 7, 2025:

This pull request has merge conflicts that must be resolved before it can be merged. @fabiendupont, please rebase it. See https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jan 7, 2025
mergify bot removed the needs-rebase and ci-failure labels on Jan 7, 2025
@cdoern cdoern requested a review from nathan-weinberg January 7, 2025 16:08
@nathan-weinberg (Member)

@fabiendupont have you done any kind of testing around this?

@fabiendupont (Contributor, Author)

@nathan-weinberg, I have used our downstream build pipeline to manually test instructlab v0.22.1 with PyTorch 2.5.1. I was able to run the chat, serve, data generate, and train steps without any issue. That's why I went ahead and proposed this PR, as it seems pretty safe. Do you have any specific concerns?

@nathan-weinberg (Member)

@fabiendupont if you look at the existing issues (I linked them in the additional one you opened, which I think may be a duplicate), the reason we've been holding off is that we want to make sure the training library (@JamesKunstle) is functional with this version before we make the bump here.

@prarit curious around your thoughts on this as well

@fabiendupont (Contributor, Author)

@nathan-weinberg, the training library doesn't seem to cap the PyTorch version, so I would have expected it to have already been tested with PyTorch 2.5.1 during a previous PR in instructlab/training.

@JamesKunstle, is there anywhere besides the requirements file where the PyTorch version is controlled?

@JamesKunstle (Contributor)

@fabiendupont No, it isn't capped elsewhere, but we don't quite have an independent test yet that runs through everything with a higher torch version; that'll be ready in a bit. The instructlab/instructlab tests should be good enough to roughly confirm it, though.

@nathan-weinberg (Member)

I'm going to trigger a couple of E2E jobs on this (Large and XLarge) just as a sanity check.

github-actions bot commented on Jan 7, 2025:

E2E (NVIDIA L40S x8) workflow launched on this PR: View run

github-actions bot commented on Jan 7, 2025:

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

prarit commented on Jan 7, 2025:

@nathan-weinberg LGTM

@nathan-weinberg (Member)

Okay, if the CI jobs pass I'm fine to approve this. @JamesKunstle, can you approve as well if you are signing off, which it sounds like you are?

@JamesKunstle (Contributor) left a review:

Since all our CI succeeded with torch<2.6, I think we're okay to bump it.

mergify bot added the one-approval label on Jan 7, 2025
github-actions bot commented on Jan 7, 2025:

e2e workflow failed on this PR: View run, please investigate.

@JamesKunstle (Contributor)

@nathan-weinberg I don't think that failure is real; I looked at the logs and it seems to be just a regex mismatch or something.

@nathan-weinberg (Member)

Yeah, let me rerun it here.

github-actions bot commented on Jan 7, 2025:

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

github-actions bot commented on Jan 7, 2025:

e2e workflow failed on this PR: View run, please investigate.

github-actions bot commented on Jan 7, 2025:

e2e workflow failed on this PR: View run, please investigate.

@reidliu41 (Contributor)

It seems it still isn't updated? The grep pattern is still the old one:

```
ᕦ(òᴗóˇ)ᕤ Accelerated model training completed successfully! ᕦ(òᴗóˇ)ᕤ
Best final checkpoint: /tmp/tmp.uAmOJhmRbo/.local/share/instructlab/skills-only/phase2/checkpoints/hf_format/samples_662 with score: 7.894736842105263
Journal: /tmp/tmp.uAmOJhmRbo/.local/share/instructlab/skills-only/journalfile.yaml
+ grep -o '/[^ ]*'
+ grep 'Training finished! Best final checkpoint: ' <<<<<<<<<<<<-------- /tmp/tmp.uAmOJhmRbo/skills_only_training.log
+ rm -rf /tmp/tmp.uAmOJhmRbo
```
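The mismatch can be reproduced in isolation: the script still greps for the old "Training finished! Best final checkpoint: " message, while the log now prints "Best final checkpoint: ... with score: ...". A hypothetical sketch of the fix follows; the real e2e script isn't shown in this thread, and the sample log line and path below are made up:

```shell
# Hypothetical reproduction of the grep mismatch; the sample log line
# and path are illustrative, not taken from the real e2e script.
log_line='Best final checkpoint: /tmp/ckpt/hf_format/samples_662 with score: 7.89'

# Old pattern: no longer matches, so the extraction comes back empty.
old_match=$(printf '%s\n' "$log_line" | grep -o 'Training finished! Best final checkpoint: [^ ]*' || true)

# Updated pattern: matches the new message, then captures the path.
new_match=$(printf '%s\n' "$log_line" | grep 'Best final checkpoint: ' | grep -o '/[^ ]*')

echo "old: '$old_match'"
echo "new: '$new_match'"
```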

github-actions bot commented on Jan 7, 2025:

e2e workflow failed on this PR: View run, please investigate.

github-actions bot commented on Jan 7, 2025:

e2e workflow succeeded on this PR: View run, congrats!

github-actions bot commented on Jan 8, 2025:

e2e workflow failed on this PR: View run, please investigate.

mergify bot commented on Jan 8, 2025:

This pull request has merge conflicts that must be resolved before it can be merged. @fabiendupont, please rebase it. See https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label on Jan 8, 2025
This change increases the upper version of PyTorch to allow version 2.5.
For the AMD variant, it also switches to ROCm 6.2, which is required for
PyTorch 2.5.

Resolves instructlab#2864

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
mergify bot removed the needs-rebase label on Jan 8, 2025
bbrowning (Contributor) commented on Jan 8, 2025:

It seems we ran out of disk space on the e2e-xlarge test that was kicked off? From its logs at https://github.com/instructlab/instructlab/actions/runs/12659779418/job/35279654906:

2025-01-08T00:55:49.8129077Z ##[warning]You are running out of disk space. The runner will stop working when the machine runs out of disk space. Free space left: 9 MB

However, we did get past the final-checkpoint grep issue. So the question is: did PyTorch 2.5 increase our disk space usage, or is this just a flake of that e2e test setup?

@bbrowning (Contributor)

Looking at the e2e-xlarge-test history, I see it has never passed on main, so the fact that it failed may not be surprising. It did get past the point of training a model at least, which gives some indication that PyTorch 2.5 works properly on that setup.

@nathan-weinberg (Member) left a review:

Given that the Large job passed with no issues and the various other approvals/sign-offs, I am going to go ahead and approve this.

nathan-weinberg removed the hold label on Jan 8, 2025
mergify bot merged commit 61c877b into instructlab:main on Jan 8, 2025
30 checks passed
mergify bot removed the one-approval label on Jan 8, 2025
@bbrowning (Contributor)

So, looking at the logs, I'm not sure this actually ran any CI job with Torch 2.5.x. CI was green with this change, but I'm still seeing multiple references to torch 2.4.x in the CI logs when installing ilab. Are we sure we have confidence that torch 2.5.x works properly if CI isn't actually using torch 2.5.x in our tests?
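A lightweight guard along these lines could be added to a CI step to confirm which torch version was actually resolved at install time. The function names and the 2.5 threshold are illustrative assumptions, not part of this PR:

```python
# Hypothetical CI guard to confirm which torch version was actually
# resolved at install time; the names and the 2.5 threshold are
# illustrative, not part of this PR.
from importlib import metadata

def installed_major_minor(package: str) -> tuple:
    """Return (major, minor) of an installed distribution, e.g. (2, 5)."""
    major, minor = metadata.version(package).split(".")[:2]
    return int(major), int(minor)

def assert_torch_at_least(minor: int = 5) -> None:
    """Fail the CI step if the resolved torch is older than 2.<minor>."""
    version = installed_major_minor("torch")
    if version < (2, minor):
        raise SystemExit(f"CI resolved torch {version}; expected >= 2.{minor}")
    print(f"torch {version} is new enough")
```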

fabiendupont added a commit to fabiendupont/instructlab that referenced this pull request Jan 9, 2025
This PR is a follow-up to instructlab#2865 that relaxed the PyTorch version range.
Even with that range extension, we realized that PyTorch 2.4 is still
used when installing `instructlab[vllm-cuda]`, because vLLM 0.6.2 has a
requirement on PyTorch 2.4.

This new PR updates the version of vLLM to 0.6.6.post1, which is the
latest available in the Open Data Hub fork of vLLM. The vLLM changelog
doesn't highlight much risk in this version bump.

Resolves instructlab#2702

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
fabiendupont added a commit to fabiendupont/instructlab that referenced this pull request Jan 10, 2025
This PR is a follow-up to instructlab#2865 that relaxed the PyTorch version range.
Even with that range extension, we realized that PyTorch 2.4 is still
used when installing `instructlab[vllm-cuda]`, because vLLM 0.6.2 has a
requirement on PyTorch 2.4.

This new PR updates the version of vLLM to 0.6.6.post1, which is the
latest available in the Open Data Hub fork of vLLM. The vLLM changelog
doesn't highlight much risk in this version bump.

It also bumps the version of SDG to 0.6.3, which relaxes PyTorch
dependency to allow 2.5.

Resolves instructlab#2702

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
mergify bot added a commit that referenced this pull request Jan 28, 2025
This PR is a follow-up to #2865 that relaxed the PyTorch version range. Even with that range extension, we realized that PyTorch 2.4 is still used when installing `instructlab[vllm-cuda]`, because vLLM 0.6.2 has a requirement on PyTorch 2.4.

This new PR updates the version of vLLM to 0.6.6.post1, which is the latest available in the Open Data Hub fork of vLLM. The vLLM changelog doesn't highlight much risk in this version bump.

Resolves #2702


Approved-by: nathan-weinberg

Approved-by: alinaryan
