[FSDP][optim_state_dict] Call synchronize() to ensure DTensors.to_local() is synchronized #117799


Closed
fegin wants to merge 4 commits

Conversation

fegin (Contributor) commented Jan 18, 2024

Stack from ghstack (oldest at bottom):

If a tensor is converted from a DTensor via to_local(), there may be async communication that has not finished yet. Calling clone() on that tensor does not seem to work (and may increase memory usage; users report OOMs). This is a temporary fix.
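
For illustration, a minimal sketch of the pattern this temporary fix introduces (names are illustrative, not the exact _optim_utils.py code; torch.distributed._tensor is the private DTensor module path at the time of this PR):

import torch
from torch.distributed._tensor import DTensor

def _gather_local_value(dtensor: DTensor) -> torch.Tensor:
    # to_local() may hand back a tensor whose backing collective has not finished yet.
    local = dtensor.to_local()
    # Temporary fix: a device-wide sync guarantees the communication is complete
    # before the values are cloned/read.
    torch.cuda.synchronize()
    return local.clone()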

Differential Revision: D52890462

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225

…ensors being recycled

Temporary tensors cannot be recycled until the operations on them are finished. Calling synchronize() ensures all the operations are finished, which can prevent OOM from happening.

Differential Revision: [D52890462](https://our.internmc.facebook.com/intern/diff/D52890462/)

[ghstack-poisoned]

pytorch-bot (bot) commented Jan 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/117799

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit ead91ac with merge base 5c17f66:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot added the release notes: distributed (fsdp) label Jan 18, 2024
fegin added a commit that referenced this pull request Jan 18, 2024
…ensors being recycled

Temporary tensors cannot be recycled until the operations on them are finished. Calling synchronize() ensures all the operations are finished, which can prevent OOM from happening.

Differential Revision: [D52890462](https://our.internmc.facebook.com/intern/diff/D52890462/)

ghstack-source-id: 212464431
Pull Request resolved: #117799
github-actions added the oncall: distributed and ciflow/inductor labels Jan 18, 2024

awgu (Collaborator) commented Jan 18, 2024

Does this part of the optim state dict load use multiple streams (which is why memory is not freed immediately when there are no more references)?

fegin (Contributor, author) commented Jan 18, 2024

No multiple streams. But even if the temporary tensors are not referenced in Python, they can still be in use, IIUC. For example, clone() will keep the source tensor (and its memory) alive until the operation is done. Or do you think this should not happen?

awgu (Collaborator) commented Jan 18, 2024

I was wondering if you have tried del-ing the source that is being cloned and seeing if that frees it at the time of del.

Suppose at first we have this:

a = torch.empty((3,), device="cuda")
b = a.clone()
c = torch.empty((3,), device="cuda")

c cannot reuse the memory of a since a is still alive (due to the Python reference).

Now, suppose we del a:

a = torch.empty((3,), device="cuda")
b = a.clone()
del a  # <--- add this
c = torch.empty((3,), device="cuda")

c can reuse the memory of a. It does not need to wait until the GPU copy kernel from clone() finishes, because any subsequent GPU kernel using the memory for c is ordered after the copy kernel on the same stream.

I did not see any special handling of clone that would record a CUDA event and free the memory later (like in the case of multiple streams and recordStream). I would be curious whether del-ing frees the memory instantly when there is only a single stream.

clone implementation

Tensor clone(const Tensor& src, c10::optional<c10::MemoryFormat> optional_memory_format) {
  auto memory_format =
      optional_memory_format.value_or(MemoryFormat::Preserve);
  Tensor self;
  if (memory_format == MemoryFormat::Preserve) {
    if (src.is_non_overlapping_and_dense()) {
      // Copy all strides, this is marginally faster than calling empty_like
      self = at::empty_strided_symint(src.sym_sizes(), src.sym_strides(), src.options());
    } else {
      self = at::empty_like(src);
    }
  } else {
    self = at::empty_like(src, src.options(), memory_format);
  }
  if (src._is_zerotensor()) {
    self.zero_();
  } else {
    self.copy_(src);
  }
  return self;
}
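
For reference, a small sketch of the experiment suggested above (single default stream, illustrative sizes), checking via the caching-allocator counters whether del frees the clone source immediately:

import torch

a = torch.empty((1 << 20,), device="cuda")   # 4 MiB of float32
b = a.clone()                                # enqueues an async GPU copy kernel
before = torch.cuda.memory_allocated()
del a                                        # drop the only Python reference to the source
after = torch.cuda.memory_allocated()
# With a single stream, the caching allocator should free a's block right away,
# even if the copy kernel is still running on the GPU.
print((before - after) / 2**20, "MiB freed at del time")
c = torch.empty((1 << 20,), device="cuda")   # free to reuse a's block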

fegin (Contributor, author) commented Jan 19, 2024

@awgu that makes sense. However, the tensor should already be deleted, as it has no references after the inner util function returns. So I don't think del makes any difference in our case. I will do more tests to verify.

fegin (Contributor, author) commented Jan 19, 2024

@awgu I can confirm that del alone is not enough. I'll check if multiple streams are being used.

…temporary tensors being recycled"

Temporary tensors cannot be recycled until the operations on them are finished. Calling synchronize() ensures all the operations are finished, which can prevent OOM from happening.

Differential Revision: [D52890462](https://our.internmc.facebook.com/intern/diff/D52890462/)

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
@fegin fegin changed the title [FSDP][optim_state_dict] Call synchronize() to ensure the temporary tensors being recycled [FSDP][optim_state_dict] Call synchronize() to ensure the memory used by DTensors.to_local() being recycled Jan 22, 2024

fegin (Contributor, author) commented Jan 22, 2024

@awgu The root cause is that DTensor.to_local() performs asynchronous communication by default. So I believe we need to call synchronize() in such a case.

awgu (Collaborator) commented Jan 22, 2024

@awgu The root cause is that DTensor.to_local() performs asynchronous communication by default. So I believe we need to call synchronize() in such a case.

Thanks for looking into this!

@wanchaol Is there any way to tell to_local() to use synchronous collectives? We want to avoid recordStream, and IIUC, we only avoid it for synchronous collectives. (cc: @kwen2501)

If we can run synchronous collectives and avoid recordStream, then I think we can avoid the CPU sync from synchronize() and instead just rely on the PG NCCL calling current_stream.wait_stream(nccl_stream). Avoiding the CPU sync here could prevent some CPU boundedness (though I have not looked at any profiles).
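
Roughly, the two options look like this (comm_stream is a stand-in; in practice the NCCL stream is owned by ProcessGroupNCCL and not exposed directly):

import torch

comm_stream = torch.cuda.Stream()  # placeholder for the collective's stream

# Option A: device-wide CPU sync. Simple, but the host blocks until all GPU work finishes.
torch.cuda.synchronize()

# Option B: GPU-side ordering only. The current stream waits on the comm stream, so later
# kernels (e.g. the clone) run after the collective without stalling the CPU.
torch.cuda.current_stream().wait_stream(comm_stream)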

fegin (Contributor, author) commented Jan 22, 2024

@awgu This is the optimizer state_dict code path. What's the downside of using torch.cuda.synchronize()?

awgu (Collaborator) commented Jan 22, 2024

@fegin The downside is just performance from synchronizing the CPU. If there is not much performance downside, then using synchronize sounds good to me!

(I do not have a good mental model of the optimizer state dict performance, so maybe this is not really an issue. However, in general, I do think that considering the performance of state dict makes sense.)

fegin (Contributor, author) commented Jan 22, 2024

@awgu Yup, agreed that performance is important. Currently, synchronize() is probably not the main bottleneck. The main bottleneck is the all-gather, which is done per parameter without batching. However, it is hard to batch the all-gathers for optimizer state dict load.

…memory used by DTensors.to_local() being recycled"


If a tensor is converted from a DTensor via to_local(), there may be async communication that has not finished yet. Calling synchronize() ensures all the operations are finished, which can prevent OOM from happening.

Differential Revision: [D52890462](https://our.internmc.facebook.com/intern/diff/D52890462/)

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
@fegin fegin changed the title [FSDP][optim_state_dict] Call synchronize() to ensure the memory used by DTensors.to_local() being recycled [FSDP][optim_state_dict] Call synchronize() to ensure the memory used by DTensors.to_local() is synchronized Jan 23, 2024
@fegin fegin requested review from wz337, awgu and LucasLLC January 23, 2024 01:02

fegin (Contributor, author) commented Jan 23, 2024

This has become more than a performance issue. If we don't call torch.cuda.synchronize(), the subsequent clone() will not work -- the value will not be correctly cloned.

cc., @wanchaol @awgu @wz337 @LucasLLC

@fegin fegin changed the title [FSDP][optim_state_dict] Call synchronize() to ensure the memory used by DTensors.to_local() is synchronized [FSDP][optim_state_dict] Call synchronize() to ensure DTensors.to_local() is synchronized Jan 23, 2024
…sors.to_local() is synchronized"


If a tensor is converted from a DTensor via to_local(), there may be async communication that has not finished yet. Calling `clone()` on that tensor does not seem to work (and may increase memory usage; users report OOMs). This is a temporary fix.

Differential Revision: [D52890462](https://our.internmc.facebook.com/intern/diff/D52890462/)

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
fegin added a commit that referenced this pull request Jan 23, 2024
…al() is synchronized

Pull Request resolved: #117799

If a tensor is converted from a DTensor via to_local(), there may be async communication that has not finished yet. Calling clone() on that tensor does not seem to work (and may increase memory usage; users report OOMs). This is a temporary fix.

ghstack-source-id: 212762982
@exported-using-ghexport

Differential Revision: [D52890462](https://our.internmc.facebook.com/intern/diff/D52890462/)

kwen2501 (Contributor):

Can we use stream sync instead of device sync?

LucasLLC (Contributor) left a comment:

Lacking a bit of context but approving to unblock the fix

kwen2501 (Contributor):

If only device sync works and stream sync doesn't, that means there is something wrong with DTensor's to_local() -- it should sync its communication work back to the "current stream" (or else provide a Work handle for the user to sync on at some point).

kwen2501 (Contributor):

If (1) proper stream dependency is maintained by DTensor's to_local() and (2) torch.clone observes stream properly, you don't even need to call stream sync.

fegin (Contributor, author) commented Jan 23, 2024

@kwen2501 _optim_utils.py calls DTensor.to_local() to gather the tensor. It does not know which stream DTensor.to_local() uses, so I don't think I'm able to use a stream wait.

kwen2501 (Contributor):

If to_local()'s API does not have flags like async_op=True|False, it must always sync back to the main stream, so that you don't need to figure out which stream to wait on. Said plainly, to_local() must always call work.wait() internally or expose the work handle.
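
For reference, the Work-handle contract with a plain c10d collective looks roughly like this (assumes an initialized process group; to_local() itself does not expose async_op, this only illustrates the pattern):

import torch
import torch.distributed as dist

# dist.init_process_group(...) is assumed to have been called already.
t = torch.ones(4, device="cuda")
work = dist.all_reduce(t, async_op=True)  # returns a Work handle instead of blocking
work.wait()  # orders the current stream after the collective before later ops read `t`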

awgu (Collaborator) left a comment:

Landing this as a temporary fix sounds good to me!

We should separately figure out how to make AsyncCollectiveTensor wait before clone() (if that is the root issue).

wconstab (Contributor):

In the more general sense of a user calling DTensor.to_local, I don't know if it makes sense that the operation has to be synchronous. Wouldn't it be reasonable to expect a to_local call to return a new AsyncCollectiveTensor ('ACT') that represents ongoing tensor work?

value = value.flatten()[intra_param_start_idx : intra_param_end_idx + 1].clone()  # type: ignore[operator]
if fsdp_state._device_mesh is not None:
    # We have to call synchronize() if the tensor is gathered from
    # DTensor. Otherwise, the later `clone()` will cause errors.

Collaborator:

Are you using the full_tensor() API to do the gathering, or something else? IIRC full_tensor gives sync behavior.

Contributor:

Are you using the full_tensor() API to do the gathering, or something else? IIRC full_tensor gives sync behavior.

No. We are still using redistribute for the all_gather, since the full_tensor() API was introduced later.
Maybe we could update this line to use the full_tensor API? https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_optim_utils.py#L1455
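
A hedged sketch of the two paths being compared (the private torch.distributed._tensor module path and the sync behavior of full_tensor() are assumptions based on this discussion):

import torch
from torch.distributed._tensor import DTensor, Replicate

def gather_via_full_tensor(dt: DTensor) -> torch.Tensor:
    # Suggested path: full_tensor() all-gathers and, per the comment above,
    # is expected to return an already-synchronized plain tensor.
    return dt.full_tensor()

def gather_via_redistribute(dt: DTensor) -> torch.Tensor:
    # Roughly the current per-parameter path: redistribute + to_local(), where
    # to_local() may return a tensor backed by pending async communication.
    return dt.redistribute(dt.device_mesh, [Replicate()]).to_local()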

kwen2501 (Contributor) commented Jan 23, 2024

Wouldn't it be reasonable to expect a to_local call to return a new AsyncCollectiveTensor ('ACT') that represents ongoing tensor work?

The reasonableness depends on how likely users are to write
x.to_local()
versus
x = x.to_local()

In a world where tensor.to(...) is prevalent, I'd say that tendency is non-negligible...

awgu (Collaborator) commented Jan 23, 2024

@kwen2501 tensor.to(...) is not in-place, though. Only nn.Module.to is in-place.

kwen2501 (Contributor):

@awgu You are right. Thanks for the correction!

fegin added a commit that referenced this pull request Jan 24, 2024
See the discussion in #117799.

There are some issues when returning an AsyncCollectiveTensor (we haven't found the root causes), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

[ghstack-poisoned]
fegin (Contributor, author) commented Jan 24, 2024

I have not found the root cause. Since _gather_state_dict() is used not only by FSDP but also by PP-FSDP, I decided to change the behavior of _gather_state_dict() to always call wait() before returning the tensor. The new PR is #118197.
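
A minimal sketch of the behavior adopted in #118197, assuming the AsyncCollectiveTensor wrapper from torch.distributed._functional_collectives (the helper name is illustrative, not the exact state-dict utility code):

import torch
from torch.distributed._functional_collectives import AsyncCollectiveTensor

def _wait_if_async(value: torch.Tensor) -> torch.Tensor:
    # If the gathered value is still an AsyncCollectiveTensor, wait on the underlying
    # collective so callers always receive a plain, ready-to-use tensor.
    if isinstance(value, AsyncCollectiveTensor):
        return value.wait()
    return value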

@fegin fegin closed this Jan 24, 2024
pytorchmergebot pushed a commit that referenced this pull request Jan 25, 2024
See the discussion in #117799.

There are some issues when returning an AsyncCollectiveTensor (we haven't found the root causes), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

Pull Request resolved: #118197
Approved by: https://github.com/wz337, https://github.com/LucasLLC
fegin added a commit that referenced this pull request Jan 25, 2024
…to_local() result"


See the discussion in #117799.

There are some issues when returning an AsyncCollectiveTensor (we haven't found the root causes), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
fegin added a commit that referenced this pull request Jan 25, 2024
See the discussion in #117799.

There are some issues when returning an AsyncCollectiveTensor (we haven't found the root causes), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
Skylion007 pushed a commit to Skylion007/pytorch that referenced this pull request Feb 12, 2024
…118197)

See the discussion in pytorch#117799.

There are some issues when returning an AsyncCollectiveTensor (we haven't found the root causes), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

Pull Request resolved: pytorch#118197
Approved by: https://github.com/wz337, https://github.com/LucasLLC
@github-actions github-actions bot deleted the gh/fegin/199/head branch February 24, 2024 01:50
Labels
ciflow/inductor, oncall: distributed, release notes: distributed (fsdp)