[c10d] PGNCCL refactor part 1: adds assert size==1 #119099


Closed · wants to merge 5 commits
Conversation

@kwen2501 (Contributor) commented Feb 2, 2024

Breaking #118674 into multiple smaller PRs. This is the first one: it adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).
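The change can be illustrated with a toy Python sketch (all names here are hypothetical; the real change lives in the C++ ProcessGroupNCCL): a rank now owns exactly one device, so the old multi-device-per-rank calling style fails fast at construction.

```python
# Toy illustration of the BC-breaking change (hypothetical names, not
# the actual PGNCCL implementation): each rank now owns exactly one
# device, so collectives operate on a single tensor rather than a
# list of per-device tensors.
class ToyProcessGroup:
    def __init__(self, device_ids):
        # Part 1 of the refactor: reject the multi-device-per-rank style.
        assert len(device_ids) == 1, (
            "ProcessGroupNCCL expects one device per rank; "
            "use one process (or thread) per device instead"
        )
        self.device_id = device_ids[0]

    def allreduce(self, tensors):
        # The API still takes a list, but it must have length 1 now.
        assert len(tensors) == 1
        return tensors[0]  # placeholder for a real "work" handle

pg = ToyProcessGroup(device_ids=[0])   # OK: single-device style
# ToyProcessGroup(device_ids=[0, 1])   # would now fail the assert
```

Tests written in the old style trip the assert at construction time, which is why they had to be refactored in the same PR.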

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @ezyang @gchanan

@kwen2501 kwen2501 added the module: bc-breaking Related to a BC-breaking change label Feb 2, 2024
@kwen2501 kwen2501 requested a review from albanD February 2, 2024 23:38

pytorch-bot bot commented Feb 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119099

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fd49e29 with merge base 9afd539:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Feb 2, 2024
@wconstab (Contributor) left a comment

Sgtm
When should we update docs?

@@ -286,7 +286,6 @@ TEST_F(ProcessGroupNCCLErrorsTest, testNCCLErrorsBlocking) {

auto work = pg.allreduce(tensors_);
work->wait();
EXPECT_TRUE(work->isSuccess());
Contributor

Is this because single-device style is blocking and multi was non-blocking? Or what?

@kwen2501 (Contributor, Author) commented Feb 3, 2024

No, it is just that the `is_success` API has long been deprecated at the pybind layer:
https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/init.cpp#L2556-L2560
so I am taking this chance to remove the C++ code as well. Keeping it would also complicate my refactoring.

Deprecating it makes sense -- how can we tell a collective is successful? Are we telling the user that the result is correct? We obviously don't/cannot do a data check, so at best we can only tell `is_complete()`.
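That argument can be sketched with a toy work handle (hypothetical names, not the real c10d `Work` API): once `wait()` surfaces errors by raising, a separate `is_success()` can never tell you more than "the op completed".

```python
# Toy sketch (hypothetical, not the real c10d Work API): errors surface
# from wait() as exceptions, so an is_success() query is redundant --
# all it could ever report is completion, never result correctness.
class ToyWork:
    def __init__(self, error=None):
        self._error = error
        self._complete = False

    def wait(self):
        self._complete = True
        if self._error is not None:
            raise RuntimeError(self._error)  # errors are raised here

    def is_complete(self):
        # The most we can honestly report: the op finished. We cannot
        # validate the data, so "success" is not checkable.
        return self._complete

work = ToyWork()
work.wait()              # returns normally: the op completed
assert work.is_complete()
```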

@@ -2926,7 +2907,7 @@ def tearDown(self):

@property
def op_timeout_sec(self):
return 1
return 3
Contributor

Wut

@kwen2501 (Contributor, Author) commented Feb 3, 2024

Just increasing the timeout limit a little to avoid a false timeout, so the chance of flakiness is smaller. It doesn't mean my change requires increasing this timeout; it's just a minor tweak as I see it.

@wconstab (Contributor) commented Feb 5, 2024

is this gonna sleep for 3x longer or is it a nice timeout that only waits that long if needed? i'm a little mindful of our reputation for too-long CI jobs..

of course, avoiding flaky tests is the most important imo. But if all that stands between your test and a flake is a 1-2 sec knob tuning, it might be a bad test design? haven't looked at where this is used.

@kwen2501 (Contributor, Author) replied:

"a nice timeout that only waits that long if needed"

@kwen2501 (Contributor, Author) commented Feb 3, 2024

> Sgtm When should we update docs?

The doc has been updated for a while now:
https://pytorch.org/docs/stable/distributed.html#multi-gpu-collective-functions
(I can add more words to it in a following PR advising alternatives, e.g. multi-proc or multi-thread)
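The "one worker per device" alternative can be sketched in pure Python (illustrative only; real PyTorch code would typically spawn one process per GPU with `torch.multiprocessing` and `init_process_group`): instead of one rank driving a list of devices, each worker owns exactly one device.

```python
import threading

# Illustrative sketch of the multi-thread alternative (hypothetical;
# real code would usually use one *process* per GPU instead). Each
# worker touches exactly its own device -- single-device style.
NUM_DEVICES = 4
results = [None] * NUM_DEVICES

def worker(device_id):
    # Stand-in for per-device work on exactly one device.
    results[device_id] = device_id * 10

threads = [threading.Thread(target=worker, args=(d,))
           for d in range(NUM_DEVICES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```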

@facebook-github-bot (Contributor)
@kwen2501 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

// maybe refactor the guts rather than copy-pasta, but it may not be
// worth it.
for (auto test : {&test1, &test2}) {
const int numDevices = test->numDevices();
Contributor

I don't really understand what this test is doing. Is this a new test, btw?

Do we expect numDevices > 1 here?

Contributor Author

This is an old test, refactored from

TEST_F(ProcessGroupNCCLTest, testSplittingCommunicator)

test->numDevices() should return 1 now (I hard-set numDevices_ to 1 in the test class).

@wconstab (Contributor) left a comment

nice cleanup.

@facebook-github-bot (Contributor)
@kwen2501 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kwen2501 (Contributor, Author) commented Feb 6, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Feb 6, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

smalltalkman pushed a commit to smalltalkman/pytorch that referenced this pull request Feb 6, 2024
Breaking pytorch#118674 into multiple smaller PRs.
This is the first one.
It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert).

Pull Request resolved: pytorch#119099
Approved by: https://github.com/wconstab
@atalman (Contributor) commented Feb 6, 2024

@pytorchmergebot revert -c ghfirst -m "Breaks internal tests"

@pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Feb 6, 2024
This reverts commit 4ab852b.

Reverted #119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](#119099 (comment)))
@pytorchmergebot (Collaborator)

@kwen2501 your PR has been successfully reverted.

@facebook-github-bot (Contributor)

@kwen2501 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@XilunWu (Contributor) left a comment

thx for splitting the original PR into small PRs.

@kwen2501 (Contributor, Author) commented Feb 7, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge started: your change will be merged once all checks pass (ETA 0-4 Hours).

pytorch-bot bot pushed commits that referenced this pull request Feb 8, 2024: the original merge, its revert, and the re-merge (Pull Request resolved: #119099; approved by: https://github.com/wconstab, https://github.com/XilunWu).
pytorchmergebot pushed a commit that referenced this pull request Feb 9, 2024
…device style (#119421)

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

Pull Request resolved: #119421
Approved by: https://github.com/shuqiangzhang
pytorchmergebot pushed the same #119421 commit again Feb 12, 2024.
clee2000 pushed commits that referenced this pull request Feb 14, 2024: the same merge, revert, re-merge, and #119421 commits as above.
@github-actions github-actions bot deleted the single_dev_test branch March 9, 2024 01:51
Labels
ciflow/trunk · Merged · module: bc-breaking · oncall: distributed · release notes: distributed (c10d) · Reverted
6 participants