Patch all_gather to support HSDP + TP #118638


Closed
wants to merge 1 commit into from

Conversation

mvpatel2000
Contributor

@mvpatel2000 mvpatel2000 commented Jan 30, 2024

Update all_gather to support HSDP + TP.

Currently, the `_all_gather_dtensor` function for DTensors replaces only the first dimension (the FSDP dimension) with replicate and leaves the second dimension untouched, since it is assumed to be the TP dimension. With HSDP, there are two dimensions ahead of the TP dimension instead of one, so replacing only the first is not enough. This PR replaces every dimension other than the TP dimension with replicate before running the all-gather.
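
As a rough illustration of the change described above, here is a minimal sketch in plain Python. It is not the PyTorch implementation: `Replicate` and `Shard` are stand-in dataclasses rather than the real DTensor placement types, the helper names `placements_before_patch` and `placements_after_patch` are hypothetical, and it assumes that the TP dimension is the last entry in the placement list and that HSDP + TP placements look like `[Replicate(), Shard(0), <TP placement>]`.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Replicate:
    """Stand-in for a 'replicated' placement on one mesh dimension."""


@dataclass(frozen=True)
class Shard:
    """Stand-in for a 'sharded along tensor dim `dim`' placement."""
    dim: int


Placement = Union[Replicate, Shard]


def placements_before_patch(placements: List[Placement]) -> List[Placement]:
    """Old rule as described: only the first mesh dimension becomes Replicate."""
    out = list(placements)
    out[0] = Replicate()
    return out


def placements_after_patch(placements: List[Placement]) -> List[Placement]:
    """New rule as described: every mesh dimension except the last (assumed TP)
    becomes Replicate."""
    out = list(placements)
    for i in range(len(out) - 1):
        out[i] = Replicate()
    return out


# Assumed HSDP + TP layout: replicate group, shard group, then the TP placement.
hsdp_tp = [Replicate(), Shard(0), Shard(1)]
old = placements_before_patch(hsdp_tp)  # the Shard(0) entry is left in place
new = placements_after_patch(hsdp_tp)   # both leading entries are Replicate
```

Under these assumptions, the old rule leaves the HSDP shard entry untouched, while the new rule replicates both leading dimensions and keeps the TP placement as it was. For FSDP + TP, where only one dimension precedes TP, both rules give the same result.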

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225

pytorch-bot bot commented Jan 30, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118638

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 2ab8355 with merge base 923a7c7 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (fsdp) release notes category label Jan 30, 2024
@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jan 30, 2024
@fegin fegin added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 30, 2024

pytorch-bot bot commented Jan 30, 2024

Please seek CI approval before scheduling CIFlow labels

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Jan 30, 2024
@fegin fegin added ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Jan 30, 2024

pytorch-bot bot commented Jan 30, 2024

Please seek CI approval before scheduling CIFlow labels

@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Jan 30, 2024
@Skylion007 Skylion007 requested a review from awgu January 30, 2024 19:29
Contributor

@fegin fegin left a comment

@mvpatel2000 Can you add more description? Like what your PR actually does and what it solves. The description provides very little information for people who didn't write this code piece.

@mvpatel2000
Contributor Author

> @mvpatel2000 Can you add more description? Like what your PR actually does and what it solves. The description provides very little information for people who didn't write this code piece.

Updated, sorry!

@mvpatel2000 mvpatel2000 requested a review from fegin January 30, 2024 19:35
Contributor

@fegin fegin left a comment

LGTM

@fegin fegin added ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Jan 30, 2024

pytorch-bot bot commented Jan 30, 2024

Please seek CI approval before scheduling CIFlow labels

@pytorch-bot pytorch-bot bot removed the ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR label Jan 30, 2024
Contributor

@wz337 wz337 left a comment

LGTM

@mvpatel2000
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased patch-6 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout patch-6 && git pull --rebase)

@mvpatel2000
Contributor Author

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@Skylion007
Collaborator

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: pull / linux-focal-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorch-bot bot pushed a commit that referenced this pull request Feb 8, 2024
Update all_gather to support HSDP + TP.

Currently, the `_all_gather_dtensor` function for dtensors only replaces the first dimension with replicate (the FSDP dimension) and does not touch the second dimension (which is assumed to be the TP dimension). With HSDP, we have two dimensions ahead of the TP dimension as opposed to 1. This PR updates to replace all other dimensions with replicate to run the all-gather.

Pull Request resolved: #118638
Approved by: https://github.com/fegin, https://github.com/awgu, https://github.com/wz337
@mvpatel2000 mvpatel2000 deleted the patch-6 branch February 13, 2024 19:07
mvpatel2000 added a commit to mvpatel2000/pytorch that referenced this pull request Feb 13, 2024
atalman pushed a commit that referenced this pull request Feb 14, 2024
Co-authored-by: Andrew Gu <andgu@fb.com>
resolved: #112435
resolved: #118620
Fixed `device_mesh` and auto wrap (#119064)
fix #118906.
resolved: #119064
resolved: #118638
Fixes #118639.
resolved: #119481
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue open source release notes: distributed (fsdp) release notes category
7 participants