[None][feat] Fuse all_reduce with norm for nemotron_h models #12410
Wanli-Jiang merged 1 commit into NVIDIA/TensorRT-LLM:main from Wanli-Jiang:user/williamj/fuse_allreduce_norm
Conversation
/bot run --only-multi-gpu-test --disable-fail-fast
PR_Github #39732 [ run ] triggered by Bot. Commit:
📝 Walkthrough
Two targeted enhancements: the NemotronH model gains AllReduce fusion support with RMSNorm via new parameters, and the Mamba2 mixer gains an option to skip its internal output reduction (`reduce_output=False`).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Layer as NemotronHLayer
    participant Norm as RMSNorm
    participant Fusion as AllReduceFusionOp
    participant Mixer as Mixer Layer
    rect rgba(100, 150, 200, 0.5)
    Note over Layer,Mixer: With fuse_allreduce_norm enabled
    Layer->>Norm: Input tensor
    Norm->>Fusion: Normalized output + hp_output
    Fusion->>Mixer: Fused AllReduce-Norm output
    Mixer->>Layer: Mixer output
    end
    rect rgba(200, 150, 100, 0.5)
    Note over Layer,Mixer: With fuse_allreduce_norm disabled
    Layer->>Norm: Input tensor
    Norm->>Layer: Normalized output
    Layer->>Mixer: Pre-norm output
    Mixer->>Layer: Mixer output
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor: Bump the NVIDIA header year for this modified file.
The SPDX copyright line still ends at 2024 even though this file changes in this PR. Please update it to 2026.
As per coding guidelines: "Add NVIDIA copyright header on ALL new files, and update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py` at line 1, update the SPDX copyright header year: the existing SPDX line currently reads "SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved." Change the end year from 2024 to 2026 so it reads "...2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved."

tensorrt_llm/_torch/models/modeling_nemotron_h.py (1)
233-240: ⚠️ Potential issue | 🔴 Critical: Keep the attention-DP guard on the MoE output all-reduce.
Lines 575-577 force `fuse_allreduce_norm` off when `enable_attention_dp` is on, so this constructor now sees `reduce_output=True` in exactly the mode where lines 553-556 say all-reduce is invalid. `AllReduce.forward()` in `tensorrt_llm/_torch/distributed/ops.py` does not suppress that case for you, so lines 331-332 can hang or shape-fail on multi-rank MoE layers unless you preserve the old `not self.enable_attention_dp` check here.
🛠️ Proposed fix
```diff
-        if reduce_output:
+        if reduce_output and not self.enable_attention_dp:
             # AllReduce for combining shared and routed expert outputs in multi-GPU settings.
             self.allreduce = AllReduce(
                 mapping=model_config.mapping,
                 strategy=model_config.allreduce_strategy,
             )
         else:
             self.allreduce = None
```
Also applies to: 331-332
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_h.py` around lines 233 - 240, Constructor currently sets self.allreduce when reduce_output is true even if attention data-parallel is enabled; restore the attention-DP guard so AllReduce is only created when reduce_output is true AND attention DP is disabled. Modify the constructor logic around reduce_output/AllReduce (the block that assigns self.allreduce and uses AllReduce(mapping=..., strategy=...)) to also check self.enable_attention_dp (negated) before instantiating AllReduce; this preserves the previous behavior that avoids calling AllReduce.forward() in the attention-DP mode (see related checks around lines invoking fuse_allreduce_norm and the AllReduce.forward usage).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py`:
- Around line 145-148: In __init__, validate tensor-parallel divisibility before
computing tp_ngroups, tp_d_inner/group_size, and head_group_ratio: ensure
n_groups is evenly divisible by tp_size (so tp_ngroups = n_groups // tp_size is
exact and >0), ensure self.tp_d_inner is divisible by tp_ngroups (so group_size
= self.tp_d_inner // self.tp_ngroups is valid), and ensure self.tp_nheads is
divisible by tp_ngroups (so head_group_ratio = self.tp_nheads // self.tp_ngroups
is exact); if any check fails either raise a clear ValueError or fall back to
disabling FlashInfer (set self._use_flashinfer = False) instead of proceeding.
Update the head_group_ratio computation to use the validated tp_ngroups (no
conditional truncation) and only consider FlashInfer/supported_head_group_ratios
when all divisibility checks succeeded so RMSNormGated and kernel dispatch
receive valid group_size and ratios.
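The divisibility checks requested above can be sketched as a small standalone helper. This is an illustrative validator only (the function name and the standalone-function shape are made up here; the attribute names mirror those in the comment), not the actual mixer code:

```python
# Hypothetical validation helper for the tensor-parallel group math described
# in the review comment. Raises early instead of silently truncating.
def validate_tp_groups(n_groups: int, tp_size: int, tp_d_inner: int,
                       tp_nheads: int):
    if n_groups % tp_size != 0:
        raise ValueError(
            f"n_groups ({n_groups}) must be divisible by tp_size ({tp_size})")
    tp_ngroups = n_groups // tp_size  # exact and > 0 after the check above
    if tp_d_inner % tp_ngroups != 0:
        raise ValueError(
            f"tp_d_inner ({tp_d_inner}) must be divisible by "
            f"tp_ngroups ({tp_ngroups})")
    if tp_nheads % tp_ngroups != 0:
        raise ValueError(
            f"tp_nheads ({tp_nheads}) must be divisible by "
            f"tp_ngroups ({tp_ngroups})")
    group_size = tp_d_inner // tp_ngroups
    head_group_ratio = tp_nheads // tp_ngroups
    return tp_ngroups, group_size, head_group_ratio
```

Per the comment, a fallback that disables FlashInfer (`self._use_flashinfer = False`) instead of raising would also satisfy the finding.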
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 196b529f-4836-4be7-a0fd-7a0be6441e56
📒 Files selected for processing (2)
tensorrt_llm/_torch/models/modeling_nemotron_h.py
tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
PR_Github #39732 [ run ] completed with state
Signed-off-by: Wanli Jiang <35160485+Wanli-Jiang@users.noreply.github.com>
Force-pushed d8ef936 to b827ea8
/bot run --only-multi-gpu-test --disable-fail-fast
PR_Github #39898 [ run ] triggered by Bot. Commit:
PR_Github #39898 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #40010 [ run ] triggered by Bot. Commit:
PR_Github #40010 [ run ] completed with state
Features
Fuses the AllReduce communication with RMS normalization + residual addition into a single kernel for Nemotron-H models when running with tensor parallelism (`tp_size > 1`).

How It Works
Before (unfused):
After (fused):
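The two dataflows can be simulated in plain Python to show that they compute the same thing. This is a sketch of the arithmetic only, not the CUDA kernels; the function names here are made up for illustration:

```python
# Simulation of unfused (three passes) vs fused (one pass) allreduce + norm.
# "partials" stands in for the per-rank partial mixer outputs that the real
# allreduce would combine across GPUs.
import math

def rmsnorm(x, weight, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def unfused(partials, residual, weight):
    reduced = [sum(col) for col in zip(*partials)]             # pass 1: allreduce
    new_residual = [r + s for r, s in zip(reduced, residual)]  # pass 2: residual add
    return rmsnorm(new_residual, weight), new_residual         # pass 3: rmsnorm

def fused(partials, residual, weight):
    # One pass: accumulate peer contributions plus residual "in registers",
    # then normalize and write once.
    new_residual = [sum(p[i] for p in partials) + residual[i]
                    for i in range(len(residual))]
    return rmsnorm(new_residual, weight), new_residual
```

Both paths return the same `(norm_output, new_residual)` pair; the fused variant just avoids materializing the intermediate results between passes.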
Each layer's `pre_allreduce` handles the AllReduce of the previous layer's mixer output, fused with the current layer's norm. Layer 0 skips this (the embedding output is already reduced). A `final_allreduce` handles the last layer's output → final norm.

All mixer types participate in the fusion, each skipping its own output reduction (via `reduce_output=False` or `AllReduceParams(enable_allreduce=False)`).

This ensures exactly one allreduce per layer, shifted to the next layer's `pre_allreduce`.

Scenarios That Benefit
- All layers share the same `pre_allreduce` path, enabling fusion across the full model
- NVFP4 models via `RESIDUAL_RMS_NORM_QUANT_NVFP4`: the norm output is quantized in-place, avoiding a separate quant kernel and memory round-trip

Scenarios That Are Unchanged
- `tp_size == 1` or `enable_attention_dp`
- Fusion explicitly disabled (`fuse_allreduce_norm = False`)

Detailed Benefits
Kernel launch reduction: Each layer saves 2-3 kernel launches (separate allreduce, residual add, rmsnorm → one fused kernel). For Nemotron-H with ~60+ layers, this is 120-180 fewer kernel launches per forward pass.
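A quick back-of-envelope check of that claim:

```python
# Illustrative numbers only: "~60+ layers" and "2-3 launches saved per layer"
# are taken from the description above.
num_layers = 60
saved_per_layer = (2, 3)  # separate allreduce, residual add, rmsnorm collapse into one kernel
total_saved = tuple(n * num_layers for n in saved_per_layer)
print(total_saved)  # (120, 180) fewer kernel launches per forward pass
```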
Memory bandwidth savings: The fused kernel reads peer buffers, accumulates in registers, applies residual + norm, and writes once. Without fusion, intermediate results hit DRAM 3-4 times.
NVFP4 quantization fusion: When NVFP4 is active, the norm output is quantized within the same kernel, saving yet another kernel + memory round-trip.
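As a toy illustration of that idea, with simple integer rounding standing in for the real NVFP4 encoding (which needs hardware support): the quantization happens inside the same loop as the normalization, so the full-precision norm output never takes a separate trip through memory.

```python
# Hypothetical sketch: rmsnorm with a fused quantize epilogue. The rounding
# and scale handling are placeholders, not the NVFP4 format.
import math

def rmsnorm_quant(x, weight, scale, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    # Quantize each normalized element in the same pass: no second kernel,
    # no extra DRAM round-trip for the full-precision output.
    return [round((v / rms) * w / scale) for v, w in zip(x, weight)]
```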
Latency hiding: `trigger_completion_at_end=False` allows the allreduce to overlap its completion signaling with subsequent compute.

Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.