chore!: Update PyTorch to 2.5 #2865
mergify[bot] merged 1 commit into instructlab:main from fabiendupont:update-pytorch-2.5
Conversation
Force-pushed from 8c1b921 to 47fb5a2
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 47fb5a2 to 5997ffd
@fabiendupont have you done any kind of testing around this?

@nathan-weinberg, I have used our downstream build pipeline to test instructlab v0.22.1 with PyTorch 2.5.1 manually. I was able to run the chat, serve, data generate, and train steps without any issue. That's why I went ahead and proposed this MR, as it seems pretty safe. Do you have any specific concerns?

@fabiendupont if you look at the existing issues (I linked them in the additional one you opened, which I think may be a dup), the reason we've been holding off is that we want to make sure the training library (@JamesKunstle) is functional with this version before we make the bump here. @prarit, curious about your thoughts on this as well.

@nathan-weinberg, the training library doesn't seem to cap the PyTorch version, so I would have expected it to have already been tested with PyTorch 2.5.1 during a previous PR in instructlab/training. @JamesKunstle, is there somewhere the PyTorch version is controlled besides the requirements file?

@fabiendupont No, it isn't capped elsewhere, but we don't have an independent test quite yet to run through everything w/ a higher torch version; that'll be ready in a bit. The 'instructlab/instructlab' tests should be good enough to roughly confirm it, though.
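As a rough illustration of what such an independent check could look like, here is a minimal sketch that runs a single optimizer step on a toy model with whatever torch build is installed; the model, sizes, and device selection are illustrative assumptions, not the training library's actual test.

```python
# Minimal smoke-test sketch for a newer torch build (illustrative only;
# the toy model and sizes are assumptions, not the training library's test).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 1).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 8, device=device)
y = torch.randn(4, 1, device=device)

loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
print(f"torch {torch.__version__} on {device}: one training step ok, loss={loss.item():.4f}")
```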
I'm going to trigger a couple of E2E jobs on this (Large and XLarge) just as a sanity check.

E2E (NVIDIA L40S x8) workflow launched on this PR: View run

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

@nathan-weinberg LGTM

Okay, if the CI jobs pass I'm fine to approve this. @JamesKunstle, can you approve as well if you are signing off, which it sounds like you are?
JamesKunstle left a comment
Since all our CI succeeded w/ torch<2.6 I think we're okay to bump it.
e2e workflow failed on this PR: View run, please investigate.

@nathan-weinberg I don't think that failure is real; I looked at the logs and I think it's just a regex mismatch or something.

Yeah, let me rerun it here.

E2E (NVIDIA L40S x4) workflow launched on this PR: View run

e2e workflow failed on this PR: View run, please investigate.

e2e workflow failed on this PR: View run, please investigate.

Seems it's still not updated? The

e2e workflow failed on this PR: View run, please investigate.

e2e workflow succeeded on this PR: View run, congrats!

e2e workflow failed on this PR: View run, please investigate.

This pull request has merge conflicts that must be resolved before it can be merged.
This change raises the upper bound on the PyTorch version to allow 2.5. For the AMD variant, it also switches to ROCm 6.2, which is required for PyTorch 2.5.
Resolves instructlab#2864
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
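For anyone validating the bump locally, a minimal sanity check could look like the sketch below; the exact version bounds in it are illustrative assumptions, and the requirements files in this PR remain the authoritative source.

```python
# Sanity-check sketch for the widened PyTorch constraint (the bounds below
# are illustrative assumptions; the requirements files are authoritative).
from importlib.metadata import version

from packaging.version import Version

import torch

torch_version = Version(version("torch"))
assert Version("2.3.0") <= torch_version < Version("2.6.0"), f"unexpected torch {torch_version}"

# On the AMD variant the wheel should be built against ROCm 6.2;
# torch.version.hip is None on CUDA/CPU builds.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("hip:", torch.version.hip)
```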
Force-pushed from 6e9bfeb to ab51236
It seems we ran out of disk space on the e2e-xlarge test that was kicked off? (See its logs at https://github.com/instructlab/instructlab/actions/runs/12659779418/job/35279654906.) However, we did get past the final checkpoint grep issue. So the question is: did PyTorch 2.5 increase our disk space usage, or is this just a flake of that e2e test setup?

Looking at the e2e-xlarge-test history, I see it has never passed on main, so the fact that it failed may not be surprising. It did get past the point of training a model at least, so that gives some indication that PyTorch 2.5 is working properly on that setup.
nathan-weinberg left a comment
Given the large job passed with no issue and the various other approvals/signoffs, I am going to go ahead and approve this.
So, looking at the logs, I'm not sure any CI run here actually used Torch 2.5.x. CI was green with this change, but I'm still seeing multiple references to torch 2.4.x in the CI logs when installing ilab. Are we sure we have confidence that torch 2.5.x works properly if CI isn't actually using torch 2.5.x in our tests?
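One way to see what the resolved environment actually installed, and whether a transitive dependency is holding torch back, is a quick check along these lines (a sketch; the package names assume the vllm-cuda variant is the one being inspected):

```python
# Sketch: inspect the resolved environment to see which torch got installed
# and whether a transitive dependency such as vllm declares a torch pin.
import re
from importlib.metadata import PackageNotFoundError, requires, version

for pkg in ("instructlab", "torch", "vllm"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")

# Print any torch requirement that vllm declares, if vllm is present.
try:
    for req in requires("vllm") or []:
        if re.match(r"torch\b", req.strip(), flags=re.IGNORECASE):
            print("vllm requires:", req)
except PackageNotFoundError:
    pass
```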
This PR is a follow-up to #2865, which relaxed the PyTorch version range. Even with that range extension, we realized that PyTorch 2.4 is still used when installing `instructlab[vllm-cuda]`, because vLLM 0.6.2 has a requirement on PyTorch 2.4. This new PR updates the version of vLLM to 0.6.6.post1, which is the latest available in the Open Data Hub fork of vLLM. The vLLM changelog doesn't highlight much risk in this version bump.
Resolves #2702
Approved-by: nathan-weinberg
Approved-by: alinaryan
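As a small follow-up check, a sketch like the one below could confirm that the vLLM bump actually takes effect in a fresh install of the `vllm-cuda` extra; the threshold versions mirror the ones named above and are the only assumptions.

```python
# Sketch: after the vLLM bump, a fresh `instructlab[vllm-cuda]` environment
# should pair vLLM >= 0.6.6 with torch >= 2.5 (thresholds taken from the
# description above; adjust if the pins change).
from importlib.metadata import version

from packaging.version import Version

vllm_v = Version(version("vllm"))
torch_v = Version(version("torch"))
assert vllm_v >= Version("0.6.6"), f"vllm still at {vllm_v}"
assert torch_v >= Version("2.5"), f"torch still at {torch_v}"
print(f"vllm {vllm_v} / torch {torch_v}: expected pairing")
```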