[https://nvbugs/5863806][fix] Fix Python string truthiness bug in FMHA cubin selection #11909

Merged

luyiyun1021 merged 5 commits into NVIDIA/TensorRT-LLM:main from luyiyun1021:fix/nvbug-5863806-l40s-qwen3moe-fp8-accuracy-regression on Mar 9, 2026

Conversation

@luyiyun1021 (Collaborator) commented Mar 4, 2026


Description

Fix a Python string truthiness bug in cpp/kernels/fmha_v2/setup.py that causes all SM89 E4M3 FMHA kernels to lose their precompiled cubins and fall back to source compilation, resulting in a catastrophic accuracy regression on SM89 GPUs (L40S, L20, L40).

NVBugs: 5863806, 5868502, 5785465, 5785485
JIRA: TRTLLM-10833, TRTLLM-10867, TRTLLM-10257, TRTLLM-10258

Problem

FP8 accuracy tests fail catastrophically on SM89 GPUs:

| Test | GPU | Expected | Actual |
|---|---|---|---|
| TestQwen3_30B_A3B::test_fp8 | L40S | ≥ 80.227 | 0.531 |
| TestLlama3_2_1B::test_fp8_prequantized | L40 | ≥ 25.039 | 3.022 |
| TestMinistral8BInstruct::test_fp8 | L40S, L40 | ≥ 75.147 | 0.000 |

Tests pass on H100 (sm90) but consistently fail on L40S, L20, and L40 (all sm89).

Git Bisect Result

| Step | Commit | Result |
|---|---|---|
| good | dcd3f7b5e: [fix] Fix accuracy test OOM (#10173) | PASS |
| bad | a66eeab53: [TRTLLM-9805][feat] Skip Softmax Attention (#9821) | FAIL |

Root Cause

The use_cubin_header() function in setup.py accepts a boolean enable_skip_softmax parameter, but callers pass a C++ string ('false') instead of a Python boolean:

```
'_skipSoftmax' in kname           → False      (Python bool)
pythonBoolean2cpp[False]          → 'false'    (C++ string literal)
use_cubin_header(..., 'false')                 (string passed as bool)
if 'false':                       → True       (non-empty string is truthy!)
    return False                               (cubins disabled for ALL kernels)
```

This disables cubins for ALL SM89 E4M3 FMHA kernels (not just skip-softmax variants), forcing source compilation which has known accuracy regressions on SM89.
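
The pitfall is easy to reproduce in isolation. Below is a minimal standalone sketch of the failure mode; the kernel name and the simplified one-argument use_cubin_header() are illustrative stand-ins, not the upstream code:

```python
# Standalone reproduction of the truthiness bug (illustrative names only;
# the real use_cubin_header() in setup.py takes more parameters).
pythonBoolean2cpp = {True: 'true', False: 'false'}

def use_cubin_header(enable_skip_softmax):
    # Intended contract: enable_skip_softmax is a Python bool.
    if enable_skip_softmax:
        return False  # skip-softmax variants have no precompiled cubins
    return True

kname = 'fmha_v2_flash_attention_e4m3_128_64_S_qkv_sm89'  # hypothetical kernel name
flag = pythonBoolean2cpp['_skipSoftmax' in kname]  # → 'false', a C++ code-gen string

print(use_cubin_header(flag))                     # False: 'false' is truthy (the bug)
print(use_cubin_header('_skipSoftmax' in kname))  # True: the intended behavior
```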

Fix

Separate the Python boolean from the C++ string: use enable_skip_softmax (Python True/False) for control flow, keep enable_skip_softmax_flag (C++ 'true'/'false') for code generation.
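
A hedged sketch of that separation, reusing the names from the sentence above (the real call sites in setup.py pass additional arguments, so this is a pattern illustration rather than the upstream diff):

```python
# Keep two distinct values with two distinct jobs (illustrative sketch).
pythonBoolean2cpp = {True: 'true', False: 'false'}

kname = 'fmha_v2_flash_attention_e4m3_128_64_S_qkv_sm89'  # hypothetical kernel name

enable_skip_softmax = '_skipSoftmax' in kname                      # Python bool: control flow
enable_skip_softmax_flag = pythonBoolean2cpp[enable_skip_softmax]  # C++ string: codegen only

# Control flow sees only the bool ...
use_cubins = not enable_skip_softmax  # stands in for the use_cubin_header() check

# ... and only generated C++ sees the string literal (hypothetical template line).
generated = f'bool const enable_skip_softmax = {enable_skip_softmax_flag};'
print(use_cubins)  # True
print(generated)   # bool const enable_skip_softmax = false;
```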

Impact

  • SM89 E4M3 FMHA cubins are restored, fixing accuracy for all FP8 models on SM89 GPUs (L40S, L20, L40)
  • SM90 cubins are restored for all existing kernel types. The only new kernel variant added since the bug (bidirectional sliding window, commit e699f232) lacks cubin entries, so use_cubin_header() now accepts an attention_mask_type parameter and skips cubins specifically for BIDIRECTIONAL_SLIDING_WINDOW kernels (see the sketch after this list).
  • Fixes NVBugs 5863806, 5868502 (accuracy component), 5785465, and 5785485
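
A sketch of the targeted exclusion from the second bullet; the BIDIRECTIONAL_SLIDING_WINDOW value comes from the PR text, but the parameter list and the string comparison are assumptions for illustration:

```python
# Illustrative version of the targeted SM90 exclusion (not the exact signature).
def use_cubin_header(sm, dtype, enable_skip_softmax, attention_mask_type):
    if enable_skip_softmax:
        return False  # skip-softmax variants were never precompiled
    if attention_mask_type == 'BIDIRECTIONAL_SLIDING_WINDOW':
        return False  # variant added after cubins were last generated (e699f232)
    return True  # all other kernels use precompiled cubins again

print(use_cubin_header(90, 'e4m3', False, 'CAUSAL'))                        # True
print(use_cubin_header(90, 'e4m3', False, 'BIDIRECTIONAL_SLIDING_WINDOW'))  # False
```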

Test Coverage

| Test | GPU | Before Fix | After Fix |
|---|---|---|---|
| TestQwen3_30B_A3B::test_fp8[latency-torch_compile=False] | L40S | 0.531 (FAIL) | ≥ 80.227 (PASS) |
| TestLlama3_2_1B::test_fp8_prequantized | L40S | 3.022 (FAIL) | ≥ 25.039 (PASS) |
| TestMinistral8BInstruct::test_fp8 | L40S | 0.000 (FAIL) | ≥ 75.147 (PASS) |

  • Direct verification: the generated fmha_cubin.cpp correctly references cubin data instead of `nullptr, 0` for SM89 E4M3 entries

PR Checklist

  • PR description clearly explains what and why.

  • PR Follows TRT-LLM CODING GUIDELINES.

  • Test cases are provided for new code paths.

  • Any new dependencies have been scanned for license and vulnerabilities.

  • CODEOWNERS updated if ownership changes.

  • Documentation updated as needed.

  • Update tava architecture diagram if significant design change.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@coderabbitai bot (Contributor) commented Mar 4, 2026

📝 Walkthrough

Refactors the skip-softmax flag handling in setup.py by introducing an intermediate boolean variable enable_skip_softmax_bool and deriving enable_skip_softmax_flag from it, replacing direct name-based flag computation. Observable behavior remains unchanged.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Skip-Softmax Flag Refactoring: cpp/kernels/fmha_v2/setup.py | Introduced an enable_skip_softmax_bool variable and updated the flag-derivation logic; refactored call sites to use the new intermediate boolean for consistent flag handling. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the bug fix: a Python string truthiness bug in FMHA cubin selection. It directly matches the core issue described in the PR objectives. |
| Description check | ✅ Passed | The PR description comprehensively explains the bug (Python string truthiness), the root-cause analysis, the fix, and the impact, with test coverage and verification details. |



Comment @coderabbitai help to get the list of available commands and usage tips.

luyiyun1021 changed the title from "[https://nvbugs/5863806][https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" to "[https://nvbugs/5863806 & https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" on Mar 4, 2026
@tensorrt-cicd (Collaborator):

PR_Github #37686 [ run ] triggered by Bot. Commit: fc7259b Link to invocation

luyiyun1021 changed the title from "[https://nvbugs/5863806 & https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" to "[https://nvbugs/5863806&https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" on Mar 4, 2026
luyiyun1021 changed the title from "[https://nvbugs/5863806&https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" to "[https://nvbugs/5863806][fix] Fix Python string truthiness bug in FMHA cubin selection" on Mar 4, 2026
luyiyun1021 force-pushed the fix/nvbug-5863806-l40s-qwen3moe-fp8-accuracy-regression branch 2 times, most recently from de02c4b to 0b7bee2, on March 4, 2026 at 12:13
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37692 [ run ] triggered by Bot. Commit: 0b7bee2 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37692 [ run ] completed with state SUCCESS. Commit: 0b7bee2
/LLM/main/L0_MergeRequest_PR pipeline #29174 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…A cubin selection

In get_cubin_header(), enable_skip_softmax_flag (C++ string "false") was passed to use_cubin_header() which expects a Python bool. Non-empty strings are truthy in Python, so "false" evaluated to True, disabling precompiled cubins for all SM89 E4M3 FMHA kernels. This caused them to fall back to source compilation which has known accuracy regressions on SM89 (L40S/L20). Fix: use a Python bool (enable_skip_softmax_bool) for control flow, keep the C++ string (enable_skip_softmax_flag) only for code generation templates.

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
luyiyun1021 force-pushed the fix/nvbug-5863806-l40s-qwen3moe-fp8-accuracy-regression branch from 0b7bee2 to 464250b on March 5, 2026 at 04:24
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37806 [ run ] triggered by Bot. Commit: 464250b Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37806 [ run ] completed with state SUCCESS. Commit: 464250b
/LLM/main/L0_MergeRequest_PR pipeline #29269 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…y cubin restoration

Remove waives for three tests whose accuracy failures are resolved by the
setup.py fix in the previous commit:
- TestQwen3_30B_A3B::test_fp8 (nvbugs/5863806)
- TestLlama3_2_1B::test_fp8_prequantized (nvbugs/5785465)
- TestMinistral8BInstruct::test_fp8 (nvbugs/5785485)

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37988 [ run ] triggered by Bot. Commit: 2a74b53 Link to invocation

@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37999 [ run ] triggered by Bot. Commit: b643be0 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37999 [ run ] completed with state SUCCESS. Commit: b643be0
/LLM/main/L0_MergeRequest_PR pipeline #29431 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

… CUDA_ERROR_NOT_FOUND

SM90 cubins were inadvertently disabled since commit a66eeab (Skip
Softmax Attention). During that period, new SM90 kernel variants were
added without corresponding cubin entries. Restoring SM90 cubin usage
causes CUDA_ERROR_NOT_FOUND at runtime. Keep SM90 on source compilation
until cubin binaries are regenerated.

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #38025 [ run ] triggered by Bot. Commit: 2feb413 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #38025 [ run ] completed with state SUCCESS. Commit: 2feb413
/LLM/main/L0_MergeRequest_PR pipeline #29455 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Link to invocation

(Outdated review comment thread on cpp/kernels/fmha_v2/setup.py)
…ndow kernels on SM90

Replace the blanket SM90 cubin disable with a targeted check: only
bidirectional sliding window kernels (added after cubins were last
generated) fall back to source compilation. All other SM90 kernels
now correctly use precompiled cubins again.

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #38224 [ run ] triggered by Bot. Commit: 010669f Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #38224 [ run ] completed with state SUCCESS. Commit: 010669f
/LLM/main/L0_MergeRequest_PR pipeline #29614 completed with status: 'SUCCESS'

Link to invocation

luyiyun1021 merged commit 34a9153 into NVIDIA:main on Mar 9, 2026
9 checks passed
tianyuz-nv pushed a commit to wanqian-nv/TensorRT-LLM that referenced this pull request Mar 19, 2026
…A cubin selection (NVIDIA#11909)

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
Co-authored-by: Jie Li <76780849+jieli-matrix@users.noreply.github.com>
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026
…A cubin selection (NVIDIA#11909)

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
Co-authored-by: Jie Li <76780849+jieli-matrix@users.noreply.github.com>
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026
…A cubin selection (NVIDIA#11909)

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
Co-authored-by: Jie Li <76780849+jieli-matrix@users.noreply.github.com>
