[None][fix] Fix nemotron super MTP crash on SM90 #11807
mikeiovine merged 5 commits into NVIDIA/TensorRT-LLM:main from sunnyqgg:nemotron-super-h100
Conversation
📝 Walkthrough
This PR updates the flashinfer-python dependency, modifies dtype handling in an argmax kernel, refactors speculative decoding and Mamba2 selective state update logic, adds **kwargs propagation to Nemotron model forward methods, and introduces new NVFP4 MTP integration tests.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 2 passed
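The **kwargs propagation mentioned in the walkthrough follows a simple pattern. Here is a runnable sketch with illustrative names (`DecoderLayer`, `Mixer`, and the `spec_metadata` key are stand-ins, not the actual Nemotron classes), showing how extra metadata reaches sub-modules once forward signatures pass it through:

```python
import torch
from torch import nn


class Mixer(nn.Module):
    """Illustrative sub-module that needs extra metadata (not the real class)."""

    def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        # 'spec_metadata' is a hypothetical key standing in for whatever
        # speculative-decoding state the real modules consume.
        assert kwargs.get("spec_metadata") is not None
        return hidden_states


class DecoderLayer(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.mixer = Mixer()

    def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
        # The fix: forward **kwargs instead of dropping them, so metadata
        # reaches sub-modules like the mixer.
        return self.mixer(hidden_states, **kwargs)


layer = DecoderLayer()
out = layer(torch.zeros(1, 4), spec_metadata={"draft_len": 1})
```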
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/cute_dsl_kernels/argmax.py (1)
Lines 650-653: ⚠️ Potential issue | 🟡 Minor — Inconsistent dtype in CUTLASS DSL fallback.
The fallback function used when CUTLASS DSL is unavailable still converts indices to `x.dtype` instead of `float32`, which is inconsistent with the main `argmax` function's output contract (lines 600-603).

🐛 Proposed fix for dtype consistency:

```diff
 else:
     # Fallback if CUTLASS DSL is not available
     def argmax(x: torch.Tensor) -> torch.Tensor:
         """Fallback argmax using PyTorch when CUTLASS DSL is not available."""
         max_vals, max_indices = torch.max(x, dim=-1, keepdim=True)
-        return torch.cat([max_vals, max_indices.to(x.dtype)], dim=-1)
+        return torch.cat([max_vals.to(torch.float32), max_indices.to(torch.float32)], dim=-1)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/cute_dsl_kernels/argmax.py` around lines 650 - 653, The fallback argmax in function argmax currently casts max indices to x.dtype which mismatches the main implementation's contract; change the cast so indices are converted to torch.float32 (not x.dtype) before concatenation with max_vals, ensuring the returned tensor matches the main argmax output dtype/shape (use torch.max(..., keepdim=True) then torch.cat([max_vals, max_indices.to(torch.float32)], dim=-1)).
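For reference, here is a self-contained version of the proposed fallback that can be sanity-checked locally (a sketch of the suggested change, not the upstream function):

```python
import torch


def argmax_fallback(x: torch.Tensor) -> torch.Tensor:
    """PyTorch fallback mirroring the CUTLASS DSL argmax output contract:
    a float32 tensor holding [max_value, max_index] along the last dim."""
    max_vals, max_indices = torch.max(x, dim=-1, keepdim=True)
    # Cast both halves to float32 so the fallback matches the main kernel's
    # output dtype regardless of the input dtype (e.g. bfloat16 logits).
    return torch.cat(
        [max_vals.to(torch.float32), max_indices.to(torch.float32)], dim=-1)


out = argmax_fallback(torch.randn(2, 8, dtype=torch.bfloat16))
assert out.dtype == torch.float32 and out.shape == (2, 2)
```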
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments (a combined sketch of both fixes follows these items):
In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py`:
- Around line 5865-5867: Guard against divide-by-zero when computing
accept_rate: before computing accept_rate = num_accepted / num_drafted, check
num_drafted and handle the zero case (e.g., assert with a clear message like "No
drafted tokens for prompt {i}" or skip the prompt) so the code doesn't raise
ZeroDivisionError; update the block around the variables accept_rate,
num_accepted, num_drafted and the prompt index i accordingly.
- Around line 5833-5868: The LLM instance is created without deterministic
cleanup; change the creation of llm_spec to use a context manager (e.g., with
LLM(**llm_common_config, speculative_config=mtp_config) as llm_spec:) so the LLM
is deterministically closed after the test block; update the block that uses
llm_spec.tokenizer, llm_spec.generate_async, and related variables to be inside
that with scope (or alternatively call llm_spec.close() in a finally) to ensure
proper teardown in the integration test.
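Taken together, the two inline comments amount to a pattern like this runnable sketch. `FakeLLM` stands in for the real `LLM` class so the pattern runs without a model, and the per-prompt counters are illustrative; the actual test would use `LLM(**llm_common_config, speculative_config=mtp_config)`:

```python
class FakeLLM:
    """Stand-in for tensorrt_llm.LLM so the pattern runs without a model."""

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()  # deterministic teardown, even if an assert fires

    def close(self):
        print("engine released")


# (num_accepted, num_drafted) per prompt -- illustrative values only.
per_prompt_counts = [(6, 10), (3, 8)]

with FakeLLM() as llm_spec:
    for i, (num_accepted, num_drafted) in enumerate(per_prompt_counts):
        # Guard the division: a prompt with zero drafted tokens should fail
        # with a clear message instead of raising ZeroDivisionError.
        assert num_drafted > 0, f"No drafted tokens for prompt {i}"
        accept_rate = num_accepted / num_drafted
        print(f"prompt {i}: accept_rate={accept_rate:.2f}")
```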
---
Outside diff comments:
In `@tensorrt_llm/_torch/cute_dsl_kernels/argmax.py`:
- Around line 650-653: The fallback argmax in function argmax currently casts
max indices to x.dtype which mismatches the main implementation's contract;
change the cast so indices are converted to torch.float32 (not x.dtype) before
concatenation with max_vals, ensuring the returned tensor matches the main
argmax output dtype/shape (use torch.max(..., keepdim=True) then
torch.cat([max_vals, max_indices.to(torch.float32)], dim=-1)).
ℹ️ Review info
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- requirements.txt
- tensorrt_llm/_torch/cute_dsl_kernels/argmax.py
- tensorrt_llm/_torch/models/modeling_nemotron_h.py
- tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
- tensorrt_llm/_torch/speculative/mtp.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tests/integration/test_lists/qa/llm_function_core.txt
- tests/integration/test_lists/test-db/l0_dgx_b200.yml
Force-pushed 46cb498 to 2aff41d
Force-pushed 2aff41d to 8e09f8e
/bot run
PR_Github #37270 [ run ] triggered by Bot. Commit:
/bot run
PR_Github #37270 [ run ] completed with state
PR_Github #37274 [ run ] triggered by Bot. Commit:
/bot run
PR_Github #37276 [ run ] triggered by Bot. Commit:
/bot run
PR_Github #37282 [ run ] triggered by Bot. Commit:
/bot run
PR_Github #37417 [ run ] triggered by Bot. Commit:
Signed-off-by: qgai <qgai@nvidia.com>
/bot run
PR_Github #37423 [ run ] triggered by Bot. Commit:
/bot run
PR_Github #37434 [ run ] triggered by Bot. Commit:
…e test Signed-off-by: qgai <qgai@nvidia.com>
Force-pushed de34a7f to ab64049
PR_Github #37434 [ run ] completed with state
/bot run
PR_Github #37523 [ run ] triggered by Bot. Commit:
PR_Github #37523 [ run ] completed with state
/bot run
PR_Github #37588 [ run ] triggered by Bot. Commit:
PR_Github #37588 [ run ] completed with state
/bot run
PR_Github #37626 [ run ] triggered by Bot. Commit:
PR_Github #37626 [ run ] completed with state
/bot run
PR_Github #37640 [ run ] triggered by Bot. Commit:
PR_Github #37640 [ run ] completed with state
/bot run
PR_Github #37779 [ run ] triggered by Bot. Commit:
PR_Github #37779 [ run ] completed with state
Signed-off-by: qgai <qgai@nvidia.com>
Signed-off-by: qgai <qgai@nvidia.com>
Signed-off-by: qgai <qgai@nvidia.com>
Summary

- Fix MTP crash on SM90 (_torch/speculative/mtp.py)
- Fix MTP with enable_attention_dp
- Add Nemotron accuracy tests (test_llm_api_pytorch.py)

Changes

- tensorrt_llm/_torch/speculative/mtp.py: Fix MTP speculative decoding crash on SM90
- tensorrt_llm/_torch/models/modeling_nemotron_h.py: Fix MTP with attention DP enabled
- tensorrt_llm/_torch/cute_dsl_kernels/argmax.py: Related kernel fixes
- tests/integration/defs/accuracy/test_llm_api_pytorch.py: Add Nemotron accuracy tests
- tests/integration/test_lists/: Update test lists and DB configs
- requirements.txt: Update dependencies

Test plan

- Verify enable_attention_dp works correctly (see the configuration sketch below)
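A minimal configuration sketch of the scenario under test. The model path is a placeholder; `MTPDecodingConfig`, `enable_attention_dp`, and `num_nextn_predict_layers` are taken from the TensorRT-LLM LLM API and should be checked against the installed version:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MTPDecodingConfig

# One MTP draft layer; tune per model.
mtp_config = MTPDecodingConfig(num_nextn_predict_layers=1)

with LLM(
        model="/models/nemotron-super",  # placeholder checkpoint path
        enable_attention_dp=True,  # the attention-DP path fixed in this PR
        speculative_config=mtp_config,
) as llm:
    # On SM90 (H100) this combination previously crashed in
    # _torch/speculative/mtp.py.
    output = llm.generate("The capital of France is")
    print(output.outputs[0].text)
```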
Summary by CodeRabbit

- New Features
- Bug Fixes
- Performance Improvements
- Chores