HSDP + DTensor Support in FSDP #118618
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/118618
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 New Failures as of commit 831141c with merge base 5dfcf07.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "module: distributed_checkpoint"
Please seek CI approval before scheduling CIFlow labels
@wz337 Do we still support process group for HSDP?
Is it possible to wrap the PG you have in a device_mesh, or construct a 2D mesh early on and pull a PG out of it for other uses, but then only have a device_mesh inside FSDP? It doesn't seem great to have mixed pg+mesh, IMO.
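A minimal sketch of that suggestion, assuming an 8-rank job laid out as 2 replicas x 4 shards and made-up mesh dim names (the toy nn.Linear stands in for the real model):

```python
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
import torch.nn as nn

# Hypothetical layout: 2 replica groups x 4 shard ranks = 8 GPUs total
# (assumes the process group / env has already been set up, e.g. by torchrun).
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))

# A process group for other uses can be pulled back out of the same mesh.
replicate_pg = mesh_2d.get_group(mesh_dim="replicate")

# Inside FSDP, pass only the mesh -- no process_group argument.
model = FSDP(
    nn.Linear(16, 16).cuda(),  # toy module standing in for the user's model
    device_mesh=mesh_2d,
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
)
```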
Users can still use ProcessGroup as input for HSDP, but they need to do some additional work for checkpointing due to the duplicate FQNs. If they use DCP, they would need to pass only the one replicate group as the process group for dcp.save().
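For illustration, a hedged sketch of that workaround (the `single_replica_pg` group and checkpoint path are assumptions; which group covers a single model replica depends on how the HSDP process groups were constructed):

```python
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter

def save_hsdp_checkpoint(model, single_replica_pg, path="/tmp/hsdp_ckpt"):
    # With process-group-based HSDP, every replica holds the same FQNs, so only
    # the ranks of one replica participate in the save, and that group is passed
    # explicitly to dcp.save(); dist.get_rank(group) is -1 outside the group.
    state_dict = {"model": model.state_dict()}
    if dist.get_rank(single_replica_pg) >= 0:
        dcp.save(
            state_dict,
            storage_writer=FileSystemWriter(path),
            process_group=single_replica_pg,
        )
```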
I think we should error out or throw a warning, either on the HSDP side or on the DCP side, when we find the user is using a process group for HSDP. We don't want a silent issue for the user.
@mvpatel2000 Could you give a little bit more information regarding the current issue without the change? For example, the error trace.
This PR does break some unit tests.
Removes raising error if a device_mesh has a parent. The comment says that HSDP + TP is not supported, but I'm able to do 2D parallelism + HSDP fine. The only issues are:
- this check
- #118618
- a series of PRs related to checkpointing with 3D meshes that I will open

We currently monkeypatch for the above, which I am slowly upstreaming. I imagine torch will have a better, native integration eventually, but this check seems too aggressive in the meantime given DTensor now lets users do some things themselves (which is amazing 🎉)!

Pull Request resolved: #118620
Approved by: https://github.com/wz337, https://github.com/wanchaol
Essentially, the issue is that the outermost root FSDP module is passed a device_mesh but no process_group (which is correct). But this line then forwards the root's process_group to the child modules, and the trace ends with a ValueError complaining that both are specified. I unfortunately don't have a trace handy as we've monkeypatched this for a while. I'm not sure what the correct fix is given the unit test failures. @fegin or @wz337, do either of you have recommendations?
Sorry, I have not been following closely, but is the issue the same as #118906?
Let me look into this.
-        if sharding_strategy in HYBRID_SHARDING_STRATEGIES:
+        if (
+            sharding_strategy in HYBRID_SHARDING_STRATEGIES
+            and device_mesh is not None
This might just be a typo? I think we want device_mesh is None (i.e., the user did not pass device_mesh) -- then we forward the process_group constructed by the root to the children. I opened a PR with this fix, along with a basic test: #119064
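For clarity, a small self-contained illustration of the condition being suggested (the enum and helper below are stand-ins, not the actual FSDP source):

```python
from enum import Enum, auto

class ShardingStrategy(Enum):  # stand-in for torch.distributed.fsdp.ShardingStrategy
    FULL_SHARD = auto()
    HYBRID_SHARD = auto()
    _HYBRID_SHARD_ZERO2 = auto()

HYBRID_SHARDING_STRATEGIES = {
    ShardingStrategy.HYBRID_SHARD,
    ShardingStrategy._HYBRID_SHARD_ZERO2,
}

def should_forward_root_process_group(sharding_strategy, device_mesh) -> bool:
    # Forward the root's process_group to child FSDP instances only for hybrid
    # sharding when the user did NOT pass a device_mesh; if a mesh was passed,
    # the groups are derived from the mesh instead, avoiding the ValueError about
    # both process_group and device_mesh being specified.
    return sharding_strategy in HYBRID_SHARDING_STRATEGIES and device_mesh is None
```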
I am going to close this in favor of @awgu's PR. Thanks for taking it over!
FSDP should take either process groups or a device_mesh. When a device_mesh is specified with DTensor, passing in process groups as well seems to make things blow up.
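As a rough sketch of the mutual-exclusion check implied here (the function name and message are illustrative, not FSDP's actual code):

```python
def _validate_pg_vs_mesh(process_group, device_mesh):
    # Illustrative only: FSDP treats process_group and device_mesh as mutually
    # exclusive inputs, so supplying both should fail fast with a clear error.
    if process_group is not None and device_mesh is not None:
        raise ValueError(
            "Pass either process_group or device_mesh to FSDP, not both; "
            "for HSDP, prefer a 2D DeviceMesh and let FSDP derive the groups."
        )
```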
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @LucasLLC