[DSD] Remove the support of Dict[nn.Module, Dict[str, Any]] state_dict #127070
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/127070. Note: links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ You can merge normally! (2 unrelated failures.) As of commit 59df8e4 with merge base a60b06b. BROKEN TRUNK: the following jobs failed but were present on the merge base. 👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
LGTM!
@pytorchbot merge
Merge failed. Reason: This PR needs a label. To add a label, you can comment to pytorchbot. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…lattening when loading (#127071)

Fixes #126595

**What does this PR do?**

This PR flattens the optimizer state_dict when saving and unflattens it when loading, similar to what TorchRec does. The current `get_optimizer_state_dict()` converts the parameter IDs to FQNs to avoid conflicts between different optimizers on different ranks. The currently returned optimizer state_dict looks like the following:

```
{
    "state": {
        "layer1.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor},
        "layer2.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor},
    },
    "param_group": [
        {"lr": 0.0, "betas": (0.9, 0.95), ..., "params": ["layer1.weight", "layer2.weight"]}
    ]
}
```

While this avoids the conflict and supports the use case of merging multiple optimizers (e.g., optimizer in backward), the current optimizer state_dict still cannot support MPMD (e.g., pipeline parallelism). The root cause is `param_group`: it cannot generate unique keys during saving. DCP will flatten the dict, but for `param_group` DCP produces keys such as `param_group.lr` or `param_group.params`, and these keys conflict when using pipeline parallelism. This PR flattens the optimizer state_dict to the following form:

```
{
    "state.layer1.weight.step": 10,
    "state.layer2.weight.step": 10,
    "state.layer1.weight.exp_avg": SomeTensor,
    "state.layer2.weight.exp_avg": SomeTensor,
    "state.layer1.weight.exp_avg_sq": SomeTensor,
    "state.layer2.weight.exp_avg_sq": SomeTensor,
    "param_group.layer1.weight.lr": 0.1,
    "param_group.layer2.weight.lr": 0.1,
    "param_group.layer1.weight.betas": (0.9, 0.95),
    "param_group.layer2.weight.betas": (0.9, 0.95),
}
```

This allows distributed state_dict (DSD) to support MPMD (e.g., pipeline parallelism).

**Pros and Cons**

*Pros*
1. Supports optimizer resharding (e.g., changing the parallelism from 3D to 2D or changing the number of workers).
2. Users don't need to manually add prefixes to different optimizers.
3. Allows users to merge optimizer states easily. One use case is loop-based pipeline parallelism.

*Cons*
1. The implementation makes strong assumptions about the structure of `param_groups` and its values. If the assumptions change or some customized optimizers do not meet them, the implementation will break.
2. Extra values are saved in the checkpoints. The assumption here is that `param_group` generally contains scalars, which are cheap to save.

Pull Request resolved: #127071
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #127070
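For illustration, here is a minimal sketch of the flattening scheme described above, written against the example dicts in the message. `flatten_optim_state_dict` is a hypothetical helper, not DSD's or DCP's actual implementation.

```python
# Sketch only: joins nested keys with "." and re-keys every "param_group"
# hyperparameter per parameter FQN so that keys stay unique across ranks.
from typing import Any, Dict


def flatten_optim_state_dict(osd: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten {"state": {...}, "param_group": [...]} into dotted keys."""
    flat: Dict[str, Any] = {}

    # "state" is already keyed by FQN; only the inner dict needs joining:
    # state.<fqn>.<field> -> value
    for fqn, fields in osd.get("state", {}).items():
        for field, value in fields.items():
            flat[f"state.{fqn}.{field}"] = value

    # "param_group" hyperparameters are duplicated per parameter FQN so they
    # do not collide when different ranks hold different parameters:
    # param_group.<fqn>.<hyperparam> -> value
    for group in osd.get("param_group", []):
        hyperparams = {k: v for k, v in group.items() if k != "params"}
        for fqn in group["params"]:
            for k, v in hyperparams.items():
                flat[f"param_group.{fqn}.{k}"] = v
    return flat


example = {
    "state": {"layer1.weight": {"step": 10}},
    "param_group": [{"lr": 0.1, "betas": (0.9, 0.95), "params": ["layer1.weight"]}],
}
# {"state.layer1.weight.step": 10,
#  "param_group.layer1.weight.lr": 0.1,
#  "param_group.layer1.weight.betas": (0.9, 0.95)}
print(flatten_optim_state_dict(example))
```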
…zer_state_dict (#127384)

Summary: Allow the optim_state_dict argument to be a positional argument. This makes sense since it is a required argument, and it makes the function signature consistent with set_model_state_dict without causing BC issues.

Pull Request resolved: #127384
Approved by: https://github.com/wz337
ghstack dependencies: #127070, #127071
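A hedged usage sketch of the change described above, assuming the `get_optimizer_state_dict` / `set_optimizer_state_dict` APIs from `torch.distributed.checkpoint.state_dict` and an environment in which they can run (a process group, or the non-initialized support added later in this stack):

```python
import torch
from torch.distributed.checkpoint.state_dict import (
    get_optimizer_state_dict,
    set_optimizer_state_dict,
)

model = torch.nn.Linear(4, 4)
optim = torch.optim.AdamW(model.parameters(), lr=1e-3)
model(torch.randn(2, 4)).sum().backward()
optim.step()

osd = get_optimizer_state_dict(model, optim)

# Before #127384: optim_state_dict had to be passed by keyword.
set_optimizer_state_dict(model, optim, optim_state_dict=osd)

# After #127384: it can also be passed positionally, mirroring
# set_model_state_dict(model, model_state_dict).
set_optimizer_state_dict(model, optim, osd)
```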
Summary: Getting a partial state_dict and setting the state_dict with the type Dict[nn.Module, Dict[str, Any]] is too complicated and can confuse users. The feature can be achieved with simple pre-processing and post-processing by users, so this PR adds a deprecation warning for the feature. The previous PR, #127070, assumed no one was using the feature and removed it without a grace period. That was too aggressive and caused some concerns. This PR adds the deprecation warning and tests. We will remove the support in 2.5.

Pull Request resolved: #127793
Approved by: https://github.com/LucasLLC
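A hedged sketch of the "simple pre-processing and post-processing" mentioned above: instead of the deprecated Dict[nn.Module, Dict[str, Any]] form, a user can filter the flat FQN-keyed state_dict by a submodule's prefix. The helper name and the prefix-filtering approach are illustrative, not the library's prescribed replacement.

```python
# Sketch: recover a per-submodule view from a flat FQN-keyed state_dict.
from typing import Any, Dict

import torch
import torch.nn as nn


def extract_submodule_state_dict(full_sd: Dict[str, Any], prefix: str) -> Dict[str, Any]:
    """Post-processing: keep only keys under `prefix`, with the prefix stripped."""
    prefix = prefix + "."
    return {k[len(prefix):]: v for k, v in full_sd.items() if k.startswith(prefix)}


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
full_sd = model.state_dict()  # flat, FQN-keyed: "0.weight", "0.bias", "2.weight", ...

# Pre-processing before loading directly into the submodule:
sub_sd = extract_submodule_state_dict(full_sd, "0")
model[0].load_state_dict(sub_sd)
```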
#127070)

Summary: This is a very complicated signature that is hard for users to reason about. Remove the support of this feature.

Pull Request resolved: #127070
Approved by: https://github.com/wz337
(cherry picked from commit 6b1b8d0)
…itialized case (#127385)

Fixes #124942

Summary: Allow DSD to support loading a regular optimizer state_dict, so it can be used when torch.distributed.is_initialized() is False.

Pull Request resolved: #127385
Approved by: https://github.com/wz337
ghstack dependencies: #127070, #127071, #127384
(cherry picked from commit 64c581a)
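A hedged sketch of the non-initialized use case described above, assuming `set_optimizer_state_dict` from `torch.distributed.checkpoint.state_dict`; no process group is created, and the regular (parameter-ID keyed) optimizer state_dict comes straight from the optimizer:

```python
import torch
from torch.distributed.checkpoint.state_dict import set_optimizer_state_dict

# No torch.distributed.init_process_group() call anywhere.
assert not torch.distributed.is_initialized()

model = torch.nn.Linear(8, 8)
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
model(torch.randn(2, 8)).sum().backward()
optim.step()

# A regular, parameter-ID keyed optimizer state_dict (e.g., loaded from disk).
regular_osd = optim.state_dict()

# Load it into a fresh optimizer through DSD while distributed is not initialized.
new_optim = torch.optim.Adam(model.parameters(), lr=1e-2)
set_optimizer_state_dict(model, new_optim, regular_osd)
```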
Stack from ghstack (oldest at bottom):
Summary:
This is a very complicated signature that is hard for users to reason about. This PR removes the support of this feature.
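For context, a minimal illustration (the variable names are hypothetical) contrasting the flat, FQN-keyed state_dict that remains supported with the nn.Module-keyed Dict[nn.Module, Dict[str, Any]] form whose support this PR removes:

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# Still supported: a flat Dict[str, Any] keyed by fully qualified names,
# e.g. {"0.weight": ..., "0.bias": ..., "1.weight": ..., "1.bias": ...}.
flat_sd = model.state_dict()

# Removed by this PR: a Dict[nn.Module, Dict[str, Any]] keyed by submodule,
# where each value is that submodule's own (prefix-stripped) state_dict.
module_keyed_sd = {
    model[0]: model[0].state_dict(),
    model[1]: model[1].state_dict(),
}
```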
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @LucasLLC