Issues: pytorch/pytorch
Issues list

[FSDP2] set_reduce_scatter_divide_factor errors with non-trivial MixedPrecisionPolicy
Labels: module: fsdp, triaged
#155223 opened Jun 5, 2025 by garrett361

[FSDP2] Slower Convergence with fully_shard() Compared to DDP during Qwen2-VL Fine-Tuning
Labels: module: fsdp, oncall: distributed, triaged
#154984 opened Jun 3, 2025 by mingdianliu

[FSDP2] all_gather_copy_in for cpu offload
Labels: module: fsdp, topic: new features, triaged
#154960 opened Jun 3, 2025 by weifengpy

Potential Bug with HYBRID_SHARD and (n, 1) Device Mesh Falling Back to NO_SHARD
Labels: module: fsdp, oncall: distributed, triaged
#154888 opened Jun 2, 2025 by origin-bio

[FSDP2] fix unit test test_all_gather_extension_outer_size_stride
Labels: module: fsdp, oncall: distributed, triaged
#154836 opened Jun 2, 2025 by weifengpy

[FSDP2] offer public API to share communication context across fsdp roots
Labels: module: fsdp, oncall: distributed
#154657 opened May 29, 2025 by weifengpy

[FSDP2] for mixed precision, input casting can get blocked when cuda streams are full
Labels: module: fsdp, oncall: distributed, triaged
#154272 opened May 23, 2025 by weifengpy

torch.compile fails in FSDP due to .data assignment with different floating type
Labels: module: aotdispatch, module: dynamo, module: fsdp, module: pt2-dispatcher, oncall: pt2, triaged
#152162 opened Apr 25, 2025 by kbabiuchx

Unexpected memory usage in FSDP 2 Hybrid Sharding (HSDP)
Labels: module: fsdp, oncall: distributed, triaged
#151030 opened Apr 10, 2025 by Craigacp

FSDP in hybrid mode throws _saved_grad_shard error when backward is called on cross-rank all-gathered loss
Labels: module: fsdp, oncall: distributed, triaged
#150799 opened Apr 7, 2025 by TianyiXiong1998

Training/Fine-tuning fails with PyTorch 2.8 + 4x 5090 GPUs using DDP/FSDP/DeepSpeed
Labels: module: ddp, module: fsdp, oncall: distributed, triaged
#150734 opened Apr 5, 2025 by felixliufei

FSDP2 issue with mp_policy, checkpoint() and float input
Labels: module: fsdp, oncall: distributed, triaged
#150140 opened Mar 27, 2025 by mori360

FSDP OOM during sync_params_and_buffers
Labels: module: fsdp, oncall: distributed, triaged
#150096 opened Mar 27, 2025 by KimmiShi

[FSDP2][DTensor] numeric bug for DTensor + python float in gradient clipping
Labels: module: fsdp, oncall: distributed, triaged
#149768 opened Mar 21, 2025 by weifengpy

DISABLED test_unshard_async (__main__.TestFullyShardUnshardMultiProcess)
Labels: module: flaky-tests, module: fsdp, oncall: distributed, oncall: pt2, skipped, triaged
#149349 opened Mar 17, 2025 by pytorch-bot bot

Memory leak when using get_model_state_dict with FSDP-sharded models
Labels: module: fsdp, oncall: distributed, triaged
#149100 opened Mar 13, 2025 by mertyg

FSDP2 and autocast compatibility issue
Labels: module: fsdp, oncall: distributed, triaged
#148831 opened Mar 9, 2025 by yjxiong

FSDP ValueError: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE
Labels: module: fsdp, oncall: distributed, triaged
#148756 opened Mar 7, 2025 by nikonikolov

[FSDP2] improve error msg for duplicate wraps
Labels: module: fsdp, oncall: distributed, triaged
#148504 opened Mar 4, 2025 by weifengpy

[FSDP2] HSDP with globally sharded fp32 weights and optimizer states
Labels: module: fsdp, oncall: distributed, triaged
#148257 opened Mar 1, 2025 by ChrisLiu6

copy_() fails with HSDP in FSDP2
Labels: module: dtensor, module: fsdp, oncall: distributed, triaged
#147568 opened Feb 21, 2025 by ad8e

[FSDP2] OOM when using an integer reshard_after_forward smaller than the DP size
Labels: module: fsdp, oncall: distributed, triaged
#147179 opened Feb 14, 2025 by FindDefinition

Use device agnostic APIs for device_count and backend in common_fsdp
Labels: ciflow/trunk, module: fsdp, module: hpu, open source, Stale, topic: not user facing, triaged
#146289 opened Feb 3, 2025 by ankurneog

With FSDP2, a small tensor on a 1-GPU world size has grad=0
Labels: module: fsdp, oncall: distributed, triaged
#144045 opened Jan 1, 2025 by ad8e

Error with fused AdamW
Labels: module: fsdp, module: optimizer, triaged
#140514 opened Nov 13, 2024 by ad8e