[None][fix] Fix nemotron super MTP crash on SM90 #11807

Merged
mikeiovine merged 5 commits into NVIDIA:main from sunnyqgg:nemotron-super-h100 on Mar 5, 2026
Conversation

@sunnyqgg (Collaborator) commented Feb 28, 2026

Summary

  • Fix MTP speculative decoding crash on SM90 (H200) in _torch/speculative/mtp.py
  • Fix Nemotron MTP when enable_attention_dp is enabled
  • Add accuracy regression tests for Nemotron models (test_llm_api_pytorch.py)
  • Update test lists and requirements

Changes

  • tensorrt_llm/_torch/speculative/mtp.py: Fix MTP speculative decoding crash on SM90
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py: Fix MTP with attention DP enabled
  • tensorrt_llm/_torch/cute_dsl_kernels/argmax.py: Keep argmax output in float32 (related kernel fix)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py: Add Nemotron accuracy tests
  • tests/integration/test_lists/: Update test lists and DB configs
  • requirements.txt: Update dependencies

Test plan

  • Verify MTP speculative decoding on H200 (SM90) no longer crashes
  • Verify Nemotron model with enable_attention_dp works correctly
  • Run added accuracy tests on DGX B200

Summary by CodeRabbit

  • New Features

    • Added test coverage for NVFP4 quantization with 4-GPU multi-token prediction setup.
  • Bug Fixes

    • Improved speculative decoding state management and KV cache handling in multi-token prediction.
  • Performance Improvements

    • Enhanced precision handling for argmax operations, ensuring float32 output for better accuracy.
  • Chores

    • Pinned flashinfer-python dependency to exact version for stability.
    • Refactored internal function handling in selective state updates for consistency.

@coderabbitai (Contributor, bot) commented Feb 28, 2026

📝 Walkthrough

This PR updates flashinfer-python dependency, modifies dtype handling in an argmax kernel, refactors speculative decoding and Mamba2 selective state update logic, adds **kwargs propagation to Nemotron model forward methods, and introduces new NVFP4 MTP integration tests.

Changes

Dependency Updates (requirements.txt)
Updated flashinfer-python from a compatible-release specifier (~0.6.2) to an exact pin (==0.6.4).
Kernel & Type Handling (tensorrt_llm/_torch/cute_dsl_kernels/argmax.py)
Changed the argmax output tensor allocation from the input dtype to float32, so both max values and indices are returned in float32, with comments explaining the dtype choice.
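
For illustration, a minimal sketch of the described dtype behavior, assuming a simple PyTorch fallback path (the function name and shape handling are illustrative, not the actual kernel code):

    import torch

    def argmax_rowwise_f32(x: torch.Tensor) -> torch.Tensor:
        """Row-wise argmax returning [max_value, index] pairs along the last dim.

        Both columns are cast to float32 so the output dtype no longer depends
        on the input dtype (e.g. bfloat16 logits), mirroring the change above.
        """
        max_vals, max_idx = torch.max(x, dim=-1, keepdim=True)
        return torch.cat([max_vals.to(torch.float32), max_idx.to(torch.float32)], dim=-1)

    # e.g. argmax_rowwise_f32(torch.randn(4, 32000, dtype=torch.bfloat16)) -> shape (4, 2), float32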
Model Architecture Updates (tensorrt_llm/_torch/models/modeling_nemotron_h.py)
Added **kwargs propagation to the forward methods of NemotronHMPDecoderLayer, NemotronHMTP, and NemotronHMTPDecoderLayer so optional parameters can flow through to downstream components.
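
As a hedged sketch of the **kwargs propagation pattern (module and argument names below are placeholders, not the actual Nemotron classes):

    import torch
    from torch import nn

    class InnerLayer(nn.Module):
        def __init__(self, d_model: int):
            super().__init__()
            self.proj = nn.Linear(d_model, d_model)

        def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
            # Downstream layer accepts optional keyword arguments (e.g. attention
            # metadata) without the caller having to enumerate them.
            return self.proj(hidden_states)

    class WrapperLayer(nn.Module):
        def __init__(self, d_model: int):
            super().__init__()
            self.inner = InnerLayer(d_model)

        def forward(self, hidden_states: torch.Tensor, **kwargs) -> torch.Tensor:
            # Propagate **kwargs unchanged so new parameters reach submodules
            # without touching every intermediate forward signature.
            return self.inner(hidden_states, **kwargs)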
Mamba2 Selective State Update Refactor (tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py)
Consolidated the two selective-state-update hooks (selective_state_update_func_no_mtp and selective_state_update_func_mtp) into a single selective_state_update_func, simplifying initialization and the forward pass.
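
Conceptually, the consolidation replaces two per-mode hooks with one, along these lines (a sketch with hypothetical names; the real mixer binds an optimized kernel, not a plain Python callable):

    from typing import Callable

    class MixerSketch:
        def __init__(self, state_update_impl: Callable):
            # Before: separate selective_state_update_func_no_mtp and
            # selective_state_update_func_mtp attributes, chosen at forward time.
            # After: a single hook used by both the MTP and non-MTP decode paths.
            self.selective_state_update_func = state_update_impl

        def decode_step(self, ssm_state, x, dt, A, B, C):
            # One call site regardless of whether multi-token prediction is active.
            return self.selective_state_update_func(ssm_state, x, dt, A, B, C)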
Speculative Decoding Logic (tensorrt_llm/_torch/speculative/mtp.py)
Modified position_ids and kv_lens handling in prepare_position_ids_and_last_tokens: speculative decoding is disabled after the updates, and the compiled in-loop kernel for kv_lens is replaced with a direct in-place increment.
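
A rough sketch of swapping a compiled in-loop kernel for a direct in-place increment (tensor names are assumptions; the actual update in mtp.py may differ in shape and offset handling):

    import torch

    def advance_kv_lens_inplace(kv_lens: torch.Tensor, new_tokens_per_request: torch.Tensor) -> None:
        """Advance per-request KV-cache lengths in place.

        kv_lens: int tensor of shape [num_requests]
        new_tokens_per_request: int tensor of shape [num_requests]
        """
        # A single in-place add replaces the compiled per-iteration kernel
        # described above; no extra kernel is launched inside the loop.
        kv_lens += new_tokens_per_request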
MTP Integration Tests (tests/integration/defs/accuracy/test_llm_api_pytorch.py, tests/integration/test_lists/qa/llm_function_core.txt, tests/integration/test_lists/test-db/l0_dgx_b200.yml)
Added the test_nvfp4_4gpu_mtp_ar integration test (NVFP4, 4 GPUs, MTP with max_draft_len=7) to validate acceptance rates, and registered it in the test list and test-db configuration.
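
The acceptance-rate check in such a test plausibly follows a pattern like the one below (result fields, thresholds, and fixture wiring are assumptions, not the actual test code):

    def assert_acceptance_rates(per_prompt_stats, min_accept_rate: float = 0.5):
        """Validate draft-token acceptance rates for an MTP run.

        per_prompt_stats: iterable of (num_accepted, num_drafted) pairs per prompt.
        """
        for i, (num_accepted, num_drafted) in enumerate(per_prompt_stats):
            # Guard against a divide-by-zero if no tokens were drafted for a prompt.
            assert num_drafted > 0, f"No drafted tokens for prompt {i}"
            accept_rate = num_accepted / num_drafted
            assert accept_rate >= min_accept_rate, (
                f"Prompt {i}: acceptance rate {accept_rate:.3f} below {min_accept_rate}"
            )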

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 warning

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 9.09%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description check (✅ Passed): The PR description clearly explains the core issues being fixed (MTP speculative decoding crash on SM90, Nemotron MTP with enable_attention_dp), lists all modified files and their purposes, and provides a test plan with completed verification steps.
  • Title check (✅ Passed): The title accurately summarizes the primary fix, resolving an MTP crash on SM90 (H200) for Nemotron, which aligns with the main objective stated in the PR summary.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai (Contributor, bot) left a review comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/cute_dsl_kernels/argmax.py (1)

650-653: ⚠️ Potential issue | 🟡 Minor

Inconsistent dtype in CUTLASS DSL fallback.

The fallback function when CUTLASS DSL is unavailable still converts indices to x.dtype instead of float32, which is inconsistent with the main argmax function's output contract (lines 600-603).

🐛 Proposed fix for dtype consistency
 else:
     # Fallback if CUTLASS DSL is not available
     def argmax(x: torch.Tensor) -> torch.Tensor:
         """Fallback argmax using PyTorch when CUTLASS DSL is not available."""
         max_vals, max_indices = torch.max(x, dim=-1, keepdim=True)
-        return torch.cat([max_vals, max_indices.to(x.dtype)], dim=-1)
+        return torch.cat([max_vals.to(torch.float32), max_indices.to(torch.float32)], dim=-1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/cute_dsl_kernels/argmax.py` around lines 650 - 653, The
fallback argmax in function argmax currently casts max indices to x.dtype which
mismatches the main implementation's contract; change the cast so indices are
converted to torch.float32 (not x.dtype) before concatenation with max_vals,
ensuring the returned tensor matches the main argmax output dtype/shape (use
torch.max(..., keepdim=True) then torch.cat([max_vals,
max_indices.to(torch.float32)], dim=-1)).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integration/defs/accuracy/test_llm_api_pytorch.py`:
- Around line 5865-5867: Guard against divide-by-zero when computing
accept_rate: before computing accept_rate = num_accepted / num_drafted, check
num_drafted and handle the zero case (e.g., assert with a clear message like "No
drafted tokens for prompt {i}" or skip the prompt) so the code doesn't raise
ZeroDivisionError; update the block around the variables accept_rate,
num_accepted, num_drafted and the prompt index i accordingly.
- Around line 5833-5868: The LLM instance is created without deterministic
cleanup; change the creation of llm_spec to use a context manager (e.g., with
LLM(**llm_common_config, speculative_config=mtp_config) as llm_spec:) so the LLM
is deterministically closed after the test block; update the block that uses
llm_spec.tokenizer, llm_spec.generate_async, and related variables to be inside
that with scope (or alternatively call llm_spec.close() in a finally) to ensure
proper teardown in the integration test.
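
Both suggestions combined look roughly like the following (a sketch only: config objects, prompts, and the way per-request draft statistics are read back are placeholders):

    from tensorrt_llm import LLM  # LLM API referenced by the review comments

    def run_mtp_test(llm_common_config: dict, mtp_config, prompts, sampling_params):
        # The context manager guarantees deterministic teardown of the LLM even if
        # an assertion fails mid-test (equivalent to try/finally with llm_spec.close()).
        with LLM(**llm_common_config, speculative_config=mtp_config) as llm_spec:
            futures = [llm_spec.generate_async(p, sampling_params) for p in prompts]
            # Result retrieval and draft-token accounting are elided here; the
            # acceptance-rate computation should guard against zero drafted tokens
            # as sketched earlier.
            return [f.result() for f in futures]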

---

Outside diff comments:
In `@tensorrt_llm/_torch/cute_dsl_kernels/argmax.py`:
- Around line 650-653: The fallback argmax in function argmax currently casts
max indices to x.dtype which mismatches the main implementation's contract;
change the cast so indices are converted to torch.float32 (not x.dtype) before
concatenation with max_vals, ensuring the returned tensor matches the main
argmax output dtype/shape (use torch.max(..., keepdim=True) then
torch.cat([max_vals, max_indices.to(torch.float32)], dim=-1)).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb5cf9b and a1bca4c.

📒 Files selected for processing (8)
  • requirements.txt
  • tensorrt_llm/_torch/cute_dsl_kernels/argmax.py
  • tensorrt_llm/_torch/models/modeling_nemotron_h.py
  • tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/test_lists/qa/llm_function_core.txt
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml

Two review comment threads on tests/integration/defs/accuracy/test_llm_api_pytorch.py
@sunnyqgg force-pushed the nemotron-super-h100 branch 3 times, most recently from 46cb498 to 2aff41d on February 28, 2026 13:30
@sunnyqgg changed the title from "Fix Nemotron MTP speculative decoding crash on SM90 (H200)" to "[None][fix] Fix MTP speculative decoding crash on SM90 (H200)" on Feb 28, 2026
@sunnyqgg changed the title from "[None][fix] Fix MTP speculative decoding crash on SM90 (H200)" to "[None][fix] Fix nemotron super MTP crash on SM90" on Feb 28, 2026
@sunnyqgg force-pushed the nemotron-super-h100 branch from 2aff41d to 8e09f8e on February 28, 2026 13:37
Review comment thread on tests/integration/defs/accuracy/test_llm_api_pytorch.py (outdated)
@sunnyqgg (Collaborator, Author) commented Mar 2, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37270 [ run ] triggered by Bot. Commit: de31baf Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 2, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37270 [ run ] completed with state FAILURE. Commit: de31baf
/LLM/main/L0_MergeRequest_PR pipeline #28844 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #37274 [ run ] triggered by Bot. Commit: da41308 Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 2, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37276 [ run ] triggered by Bot. Commit: 8f09385 Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 2, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37282 [ run ] triggered by Bot. Commit: ff662fa Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 3, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37417 [ run ] triggered by Bot. Commit: b3e71b1 Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 3, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37423 [ run ] triggered by Bot. Commit: 8e8629d Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 3, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37434 [ run ] triggered by Bot. Commit: de34a7f Link to invocation

…e test

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg force-pushed the nemotron-super-h100 branch from de34a7f to ab64049 on March 3, 2026 05:07
@tensorrt-cicd (Collaborator)

PR_Github #37434 [ run ] completed with state SUCCESS. Commit: de34a7f
/LLM/main/L0_MergeRequest_PR pipeline #28977 completed with status: 'ABORTED'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chzblych (Collaborator) commented Mar 3, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37523 [ run ] triggered by Bot. Commit: ab64049 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #37523 [ run ] completed with state SUCCESS. Commit: ab64049
/LLM/main/L0_MergeRequest_PR pipeline #29031 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 4, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37588 [ run ] triggered by Bot. Commit: ab64049 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #37588 [ run ] completed with state SUCCESS. Commit: ab64049
/LLM/main/L0_MergeRequest_PR pipeline #29087 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 4, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37626 [ run ] triggered by Bot. Commit: ab64049 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #37626 [ run ] completed with state SUCCESS. Commit: ab64049
/LLM/main/L0_MergeRequest_PR pipeline #29116 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 4, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37640 [ run ] triggered by Bot. Commit: ab64049 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #37640 [ run ] completed with state SUCCESS. Commit: ab64049
/LLM/main/L0_MergeRequest_PR pipeline #29130 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@sunnyqgg (Collaborator, Author) commented Mar 5, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #37779 [ run ] triggered by Bot. Commit: ab64049 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #37779 [ run ] completed with state SUCCESS. Commit: ab64049
/LLM/main/L0_MergeRequest_PR pipeline #29246 completed with status: 'SUCCESS'

Link to invocation

@mikeiovine merged commit 517ee94 into NVIDIA:main on Mar 5, 2026 (5 checks passed)
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Mar 9, 2026
tianyuz-nv pushed a commit to wanqian-nv/TensorRT-LLM that referenced this pull request Mar 19, 2026
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026