-
Notifications
You must be signed in to change notification settings - Fork 24.4k
[c10d] PGNCCL refactor part 1: adds assert size==1 #119099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119099
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit fd49e29 with merge base 9afd539 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sgtm
When should we update docs?
@@ -286,7 +286,6 @@ TEST_F(ProcessGroupNCCLErrorsTest, testNCCLErrorsBlocking) { | ||
|
||
auto work = pg.allreduce(tensors_); | ||
work->wait(); | ||
EXPECT_TRUE(work->isSuccess()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is bc single device style is blocking and multi was non blocking ? Or what
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it is just because the is_success
API has been deprecated for a long time at the pybind place:
https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/init.cpp#L2556-L2560
so taking this chance to take out the c++ code as well.
Also because its c++ code would complicate my refactorization.
It makes sense (to deprecate it) -- how can we tell a collective is successful? Are we telling user that the result is correct? We obviously don't/cannot do data check.. So at best we can only tell is_complete()
.
@@ -2926,7 +2907,7 @@ def tearDown(self): | ||
|
||
@property | ||
def op_timeout_sec(self): | ||
return 1 | ||
return 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wut
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to increase the timeout limit by a little to avoid a false timeout, so that the chance of being flaky is smaller. Didn't mean that my change requires increase of this timeout. Just a minor tweak as I see it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this gonna sleep for 3x longer or is it a nice timeout that only waits that long if needed? i'm a little mindful of our reputation for too-long CI jobs..
of course, avoiding flaky tests is the most important imo. But if all that stands between your test and a flake is a 1-2 sec knob tuning, it might be a bad test design? haven't looked at where this is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"a nice timeout that only waits that long if needed"
The doc has been updated for a while now: |
@kwen2501 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
// maybe refactor the guts rather than copy-pasta, but it may not be | ||
// worth it. | ||
for (auto test : {&test1, &test2}) { | ||
const int numDevices = test->numDevices(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont really understand what this test is doing. is this a new test btw?
do we expect numDevices > 1 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an old test, refactored from
TEST_F(ProcessGroupNCCLTest, testSplittingCommunicator)
test->numDevices()
should return 1 now (I hard set numDevices_
to 1 in test
class).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice cleanup.
@kwen2501 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Breaking pytorch#118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: pytorch#119099 Approved by: https://github.com/wconstab
@pytorchmergebot revert -c ghfirst -m "Breaks internal tests" |
@pytorchbot successfully started a revert job. Check the current status here. |
This reverts commit 4ab852b. Reverted #119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](#119099 (comment)))
@kwen2501 your PR has been successfully reverted. |
@kwen2501 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thx for splitting the original PR into small PRs.
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: #119099 Approved by: https://github.com/wconstab
This reverts commit 4ab852b. Reverted #119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](#119099 (comment)))
Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: #119099 Approved by: https://github.com/wconstab, https://github.com/XilunWu
…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang
…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang
Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: #119099 Approved by: https://github.com/wconstab
This reverts commit 4ab852b. Reverted #119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](#119099 (comment)))
Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: #119099 Approved by: https://github.com/wconstab, https://github.com/XilunWu
…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang
…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang
Breaking #118674 into multiple smaller PRs.
This is the first one.
It adds
assert size==1
to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @ezyang @gchanan