[FSDP] feat: Add FSDP forward prefetch and recompute chunking entropy #1927
Conversation
```yaml
ulysses_sequence_parallel_size: 1

# calculate entropy with chunking to reduce memory peak
entropy_from_logits_with_chunking: False
```
Can we remove these two options? I guess they should be on by default for NPU
Not only for NPU but also for GPU: chunking plus recomputation reduces the memory used by the entropy calculation (the end-to-end memory peak). If memory is sufficient, it's recommended to disable these two options.
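A minimal sketch of the chunked entropy computation discussed above, written in NumPy for illustration (verl's actual implementation operates on PyTorch tensors); the function names and the small shapes here are assumptions for the example, not verl's code:

```python
import numpy as np

def entropy_from_logits(logits):
    # Row-wise entropy: H = logsumexp(logits) - sum(softmax(logits) * logits)
    m = logits.max(axis=-1, keepdims=True)
    lse = np.log(np.exp(logits - m).sum(axis=-1, keepdims=True)) + m
    p = np.exp(logits - lse)  # softmax
    return lse.squeeze(-1) - (p * logits).sum(axis=-1)

def entropy_from_logits_with_chunking(logits, chunk_size=2048):
    # Process rows in chunks so the [rows, vocab]-sized temporaries
    # (shifted logits, softmax output) only materialize per chunk.
    out = np.empty(logits.shape[0], dtype=logits.dtype)
    for start in range(0, logits.shape[0], chunk_size):
        stop = start + chunk_size
        out[start:stop] = entropy_from_logits(logits[start:stop])
    return out

rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 32))          # tiny stand-in for [bsz*seq_len, voc]
full = entropy_from_logits(logits)
chunked = entropy_from_logits_with_chunking(logits, chunk_size=3)
```

The chunked variant is numerically identical to the full computation; only the size of the intermediate buffers changes.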
…olcengine#1927)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

1. Add FSDP1 forward prefetch configuration.
2. Add chunked entropy computation.
3. Add `torch.utils.checkpoint` to the entropy computation.
4. Move the data-to-device transfer from `ActorRolloutRefWorker.update_actor` to `DataParallelPPOActor.update_policy`.
5. Add the `npu_cross_entropy_loss` fusion kernel.

### High-Level Design

1. For more detail, see [FSDP forward_prefetch](https://docs.pytorch.org/docs/stable/fsdp.html#module-torch.distributed.fsdp).
2. `logits` is usually a large tensor of shape [bsz\*seq_len, voc]; `compute_entropy_from_logits` uses [bsz\*seq_len, voc] * (4 (float32) + 2 (autocast of softmax+logsumexp) + 1 (output of softmax)) memory. To reduce this memory peak, we can compute in chunks, changing [bsz\*seq_len, voc] to [chunk_size (2048), voc].
3. During the training phase, `enable_gradient_checkpointing=True` does not apply to the entropy calculation, so this PR adds recomputation of the entropy to reduce the memory peak during training.
4. `ActorRolloutRefWorker.update_actor` moves the entire batch to the device, but this is unnecessary: `DataParallelPPOActor.update_policy` moves the data to the device for each micro batch.

### Specific Changes

> List the specific changes.

### API

Add 3 new configurations in actor/ref, 1 new configuration in critic/reward.

- actor_rollout_ref.actor.fsdp_config.forward_prefetch: False
- actor_rollout_ref.actor.entropy_from_logits_with_chunking: False
- actor_rollout_ref.actor.entropy_checkpointing: False
- actor_rollout_ref.ref.fsdp_config.forward_prefetch: False
- actor_rollout_ref.ref.entropy_from_logits_with_chunking: False
- actor_rollout_ref.ref.entropy_checkpointing: False
- critic.model.fsdp_config.forward_prefetch: False
- reward_model.model.fsdp_config.forward_prefetch: False

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] New CI unit test(s) are added to cover the code path.
- [x] Rely on existing unit tests on CI that cover the code path.
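The memory accounting in point 2 of the High-Level Design can be sanity-checked with simple arithmetic; the batch shape and vocabulary size below are illustrative assumptions, not values from the PR:

```python
# Hypothetical shapes chosen for illustration only.
bsz, seq_len, vocab = 4, 4096, 152064   # vocab roughly Qwen-sized
rows = bsz * seq_len

# Per the PR's accounting: 4 bytes (float32 upcast) + 2 bytes (autocast
# temporaries of softmax + logsumexp) + 1 unit (output of softmax)
# per element of the [rows, vocab] logits tensor.
bytes_per_elem = 4 + 2 + 1

full_peak = rows * vocab * bytes_per_elem            # all rows at once
chunk_size = 2048
chunked_peak = chunk_size * vocab * bytes_per_elem   # one chunk at a time

print(f"full:    {full_peak / 2**30:.1f} GiB")
print(f"chunked: {chunked_peak / 2**30:.1f} GiB")
```

With these assumed shapes the transient allocation shrinks by a factor of `rows / chunk_size`, which is why chunking lowers the end-to-end memory peak even though the total work is unchanged.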
@CurryRice233 thx for the contribution! Would you mind adding these options to https://verl.readthedocs.io/en/latest/perf/perf_tuning.html (https://github.com/volcengine/verl/blob/main/docs/perf/perf_tuning.rst)?
No problem, I will create a PR within this week.
…2322)

### What does this PR do?

@eric-haibin-lin As noted in #1927 (comment), add FSDP forward prefetch and the entropy-calculation memory optimizations to the performance tuning guide.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

> List the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ).