[fsdp] feat: Memory efficient cross entropy with a linear layer fused #462
Conversation
Force-pushed from 1221335 to a14f31d.
Could you please format the code according to the README?
The integration has an OOM problem with the current fake-weight approach. We will reconsider how to fuse the linear layer with cross entropy.
One success of the integration is that max_token_len can be increased significantly compared to not using this kernel.
Liger has a similar kernel.
The kernel in Liger can't satisfy the requirement, because there is additional loss computation after the kernel that the Liger kernel can't support. (A sketch of what this means in practice follows below.)
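For context, here is a minimal sketch (not verl's actual loss code) of the kind of post-kernel computation meant here: because `linear_cross_entropy` returns both the log-prob loss and a per-token entropy, further loss terms such as an entropy bonus can still be applied after the fused kernel. The tensor sizes, the `entropy_coeff` value, and the way the terms are combined are illustrative assumptions, not taken from this PR.

```python
import torch
from verl.utils.kernel import linear_cross_entropy

# Illustrative shapes only.
num_tokens, hidden_size, vocab_size = 4096, 1024, 32000
hidden = torch.randn(num_tokens, hidden_size, dtype=torch.bfloat16, device="cuda", requires_grad=True)
weight = torch.randn(hidden_size, vocab_size, dtype=torch.bfloat16, device="cuda", requires_grad=True)
labels = torch.randint(0, vocab_size, (num_tokens,), device="cuda")

# The fused op returns both outputs, so downstream loss terms remain possible.
loss, entropy = linear_cross_entropy(hidden, weight, labels, reduction="mean")

entropy_coeff = 0.01  # hypothetical coefficient, for illustration only
# Additional loss computation after the kernel; the exact combination and sign
# convention depend on the training objective -- this line is just a placeholder.
total_loss = loss - entropy_coeff * entropy.mean()
total_loss.backward()
```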
There are multiple CI failures. Could you please fix them? Thanks.
Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>
Sorry for the close-and-open operations. Opening a PR from the main branch can be a dangerous operation for maintainers to cooperate on and rebase (QaQ). Next time I will open the PR from a branch in my own fork instead.
Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>
[fsdp] feat: Memory efficient cross entropy with a linear layer fused (volcengine#462)

Implemented forward and backward of the following compute logic, eliminating many intermediate storage tensors and reducing peak memory usage.

## Equivalent compute logic:

```python
import typing
import torch

def run_torch_entropy(hidden: torch.Tensor,
                      weight: torch.Tensor,
                      labels: torch.Tensor) -> typing.List[torch.Tensor]:
    logits = torch.matmul(hidden.to(torch.float32), weight.to(torch.float32))  # [num_tokens, vocab_size]
    pd = torch.nn.functional.softmax(logits, dim=-1)  # [num_tokens, vocab_size]
    entropy_a = torch.logsumexp(logits, dim=-1)  # [num_tokens]
    entropy_b = torch.sum(pd * logits, dim=-1)  # [num_tokens]
    entropy = entropy_a - entropy_b
    logprobs = torch.nn.functional.cross_entropy(logits, labels)  # [1]
    logprobs = torch.neg(logprobs)
    return logprobs, entropy
```

## API

```python
from verl.utils.kernel import linear_cross_entropy

hidden = torch.randn(num_tokens, hidden_size, dtype=torch.bfloat16, device="cuda")
weight = torch.randn(hidden_size, vocab_size, dtype=torch.bfloat16, device="cuda")
labels = torch.randint(0, vocab_size, (num_tokens,), device="cuda")

loss, entropy = linear_cross_entropy(hidden, weight, labels, reduction="mean")
```

## Storage and latency

<img width="636" alt="image" src="https://github.com/user-attachments/assets/396b7303-a46a-46b1-a261-917fda034b02" />

## Unit test

```shell
$ cd verl/
$ python3 tests/kernel/test_memory_efficient_entropy.py
```

# NOTE

For compatibility, `torch.library.triton_op` was not applied to these APIs, so `torch.compile` might not be usable on top of them.

---------

Signed-off-by: Jianbing Dong <jianbingd@nvidia.com>
Co-authored-by: ETOgaosion <gaoziyuan19@mails.ucas.ac.cn>
Co-authored-by: gaoziyuan.955 <gaoziyuan.955@bytedance.com>
Co-authored-by: Blue Space <57280232+ETOgaosion@users.noreply.github.com>
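To make the "Storage and latency" claim easy to check locally, here is a minimal sketch (not part of the PR) that compares peak CUDA memory of the fused op against the plain-torch reference from the description above. It assumes verl is installed so that `verl.utils.kernel.linear_cross_entropy` is importable; the tensor sizes and the scalar used to drive backward are illustrative.

```python
import torch
from verl.utils.kernel import linear_cross_entropy

def run_torch_entropy(hidden, weight, labels):
    # Plain-torch reference, same math as in the PR description.
    logits = torch.matmul(hidden.to(torch.float32), weight.to(torch.float32))
    pd = torch.nn.functional.softmax(logits, dim=-1)
    entropy = torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)
    logprobs = torch.neg(torch.nn.functional.cross_entropy(logits, labels))
    return logprobs, entropy

num_tokens, hidden_size, vocab_size = 8192, 4096, 32768  # illustrative sizes
hidden = torch.randn(num_tokens, hidden_size, dtype=torch.bfloat16, device="cuda", requires_grad=True)
weight = torch.randn(hidden_size, vocab_size, dtype=torch.bfloat16, device="cuda", requires_grad=True)
labels = torch.randint(0, vocab_size, (num_tokens,), device="cuda")

def peak_gib(fn):
    """Run fn once and report the peak CUDA memory it allocated, in GiB."""
    hidden.grad = weight.grad = None
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**30

def fused():
    loss, entropy = linear_cross_entropy(hidden, weight, labels, reduction="mean")
    (loss + entropy.mean()).backward()  # arbitrary scalar, just to exercise backward

def reference():
    logprobs, entropy = run_torch_entropy(hidden, weight, labels)
    (logprobs + entropy.mean()).backward()

print(f"fused:     {peak_gib(fused):.2f} GiB")
print(f"reference: {peak_gib(reference):.2f} GiB")
```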
Does this PR improve the GRPO loss computation in terms of peak memory? I've come across https://unsloth.ai/blog/grpo, which describes how to implement GRPO in a chunked/fused style as well, so I wonder if verl implements such a technique too.
I was wondering whether the recent introduction of this feature might have contributed to the issue described below.
Curious, why Wouldn't it be better to be also be able to use torch.compile on the whole model / loss? |
I noticed some weird results after enabling kernel fusion, as described in #2656. Wondering if it's a bug or if I didn't use it correctly. @Jianbing-D
@WindowsXp-Beta do the problems show up with both the torch and the triton fused backends?
Sorry for the late response. I was testing whether it's caused by our internal model. Our current results show
@vadimkantorov sorry for the late update. I spent some time setting up the environment to run Qwen2.5-VL with the mainline code. We found that the log_probs and entropy calculated by the fused kernel and the vanilla torch implementation matched for Qwen2.5-VL. So it looks like the problem is on our side, and we're still working on it.
Hi @vadimkantorov, after more tests we suspect the triton kernel may have bugs on certain
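For readers hitting similar mismatches, here is a minimal sketch of the kind of cross-check described above: comparing the fused kernel's outputs against the plain-torch math from the PR description. The shapes, the tolerances, and the assumption that the fused loss follows the same sign/reduction convention as the reference are mine, not from this thread; the repository's own check is `tests/kernel/test_memory_efficient_entropy.py`.

```python
import torch
from verl.utils.kernel import linear_cross_entropy

num_tokens, hidden_size, vocab_size = 2048, 1024, 32000  # illustrative; sweep the shapes you suspect
hidden = torch.randn(num_tokens, hidden_size, dtype=torch.bfloat16, device="cuda")
weight = torch.randn(hidden_size, vocab_size, dtype=torch.bfloat16, device="cuda")
labels = torch.randint(0, vocab_size, (num_tokens,), device="cuda")

fused_loss, fused_entropy = linear_cross_entropy(hidden, weight, labels, reduction="mean")

# Reference in plain torch, same math as run_torch_entropy in the PR description.
logits = torch.matmul(hidden.to(torch.float32), weight.to(torch.float32))
pd = torch.nn.functional.softmax(logits, dim=-1)
ref_entropy = torch.logsumexp(logits, dim=-1) - torch.sum(pd * logits, dim=-1)
ref_loss = -torch.nn.functional.cross_entropy(logits, labels)

# bf16 inputs with fp32 accumulation: use loose tolerances. The sign/reduction
# convention of the fused loss is assumed to match the reference here.
torch.testing.assert_close(fused_loss.float(), ref_loss.float(), rtol=1e-2, atol=1e-2)
torch.testing.assert_close(fused_entropy.float(), ref_entropy.float(), rtol=1e-2, atol=1e-2)
```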