Issues: pytorch/pytorch
[RFC] Proposed Changes to Feature Tracking & Classification f...
#152134
opened Apr 24, 2025 by
atalman
Open
4
Issues list
[FSDP2] set_reduce_scatter_divide_factor errors with non-trivial MixedPrecisionPolicy
module: fsdp
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
#155223
opened Jun 5, 2025 by
garrett361
[FSDP2] Slower Convergence with fully_shard() Compared to DDP during Qwen2-VL Fine-Tuning
module: fsdp
oncall: distributed
Add this issue/PR to distributed oncall triage queue
triaged
#154984
opened Jun 3, 2025 by
mingdianliu
[FSDP2] all_gather_copy_in for cpu offload
module: fsdp
topic: new features
topic category
triaged
#154960
opened Jun 3, 2025 by
weifengpy
Potential Bug with HYBRID_SHARD and (n, 1) Device Mesh Falling Back to NO_SHARD
module: fsdp
oncall: distributed
triaged
#154888
opened Jun 2, 2025 by
origin-bio
[FSDP2] fix unit test test_all_gather_extension_outer_size_stride
module: fsdp
oncall: distributed
triaged
#154836
opened Jun 2, 2025 by
weifengpy
[FSDP2] offer public API to share communication context across fsdp roots
module: fsdp
oncall: distributed
#154657
opened May 29, 2025 by
weifengpy
[FSDP2] for mixed precision, input casting can get blocked when cuda streams are full
module: fsdp
oncall: distributed
triaged
#154272
opened May 23, 2025 by
weifengpy
torch.compile fails in FSDP due to .data assignment with different floating type
module: aotdispatch
umbrella label for AOTAutograd issues
module: dynamo
module: fsdp
module: pt2-dispatcher
PT2 dispatcher-related issues (e.g., aotdispatch, functionalization, faketensor, custom-op,
oncall: pt2
triaged
#152162
opened Apr 25, 2025 by
kbabiuchx
Unexpected memory usage in FSDP 2 Hybrid Sharding (HSDP)
module: fsdp
oncall: distributed
triaged
#151030
opened Apr 10, 2025 by
Craigacp
FSDP in hybrid mode throws _saved_grad_shard error when backward is called on cross-rank all-gathered loss
module: fsdp
oncall: distributed
triaged
#150799
opened Apr 7, 2025 by
TianyiXiong1998
Training/Fine-tuning fails with PyTorch 2.8 + 4x 5090 GPUs using DDP/FSDP/DeepSpeed
module: ddp
Issues/PRs related distributed data parallel training
module: fsdp
oncall: distributed
triaged
#150734
opened Apr 5, 2025 by
felixliufei
FSDP2 issue with mp_policy, checkpoint() and float input
module: fsdp
oncall: distributed
triaged
#150140
opened Mar 27, 2025 by
mori360
FSDP OOM during sync_params_and_buffers
module: fsdp
oncall: distributed
triaged
#150096
opened Mar 27, 2025 by
KimmiShi
[FSDP2][DTensor] numeric bug for DTensor + python float in gradient clipping
module: fsdp
oncall: distributed
triaged
#149768
opened Mar 21, 2025 by
weifengpy
DISABLED test_unshard_async (__main__.TestFullyShardUnshardMultiProcess)
module: flaky-tests
Problem is a flaky test in CI
module: fsdp
oncall: distributed
oncall: pt2
skipped
Denotes a (flaky) test currently skipped in CI.
triaged
#149349
opened Mar 17, 2025 by
pytorch-bot
bot
Memory leak when using get_model_state_dict with FSDP-sharded models
module: fsdp
oncall: distributed
triaged
#149100
opened Mar 13, 2025 by
mertyg
FSDP2 and autocast compatibility issue
module: fsdp
oncall: distributed
triaged
#148831
opened Mar 9, 2025 by
yjxiong
FSDP ValueError: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE
module: fsdp
oncall: distributed
triaged
#148756
opened Mar 7, 2025 by
nikonikolov
[FSDP2] improve error msg for duplicate wraps
module: fsdp
oncall: distributed
triaged
#148504
opened Mar 4, 2025 by
weifengpy
[FSDP2] HSDP with globally sharded fp32 weights and optimizer states
module: fsdp
oncall: distributed
triaged
#148257
opened Mar 1, 2025 by
ChrisLiu6
copy_() fails with HSDP in FSDP2
module: dtensor
#147568
opened Feb 21, 2025 by
ad8e
[FSDP2] OOM when use integer reshard_after_forward that smaller than DP size
module: fsdp
oncall: distributed
triaged
#147179
opened Feb 14, 2025 by
FindDefinition
Use device agnostic APIs for device_count and backend in common_fsdp
ciflow/trunk
Trigger trunk jobs on your pull request
module: fsdp
module: hpu
Issues related to the hpu device (Habana/Gaudi)
open source
Stale
topic: not user facing
triaged
#146289
opened Feb 3, 2025 by
ankurneog
With FSDP2, a small tensor on a 1-GPU world size has grad=0
module: fsdp
oncall: distributed
triaged
#144045
opened Jan 1, 2025 by
ad8e
Error with fused AdamW
module: fsdp
module: optimizer
Related to torch.optim
triaged
#140514
opened Nov 13, 2024 by
ad8e