[https://nvbugs/5863806][fix] Fix Python string truthiness bug in FMHA cubin selection #11909

Merged

luyiyun1021 merged 5 commits into NVIDIA/TensorRT-LLM:main from luyiyun1021:fix/nvbug-5863806-l40s-qwen3moe-fp8-accuracy-regression on Mar 9, 2026

Conversation

@luyiyun1021 (Collaborator) commented Mar 4, 2026


Description

Fix a Python string truthiness bug in cpp/kernels/fmha_v2/setup.py that causes all SM89 E4M3 FMHA kernels to lose their precompiled cubins and fall back to source compilation, resulting in a catastrophic accuracy regression on SM89 GPUs (L40S, L20, L40).

NVBugs: 5863806, 5868502, 5785465, 5785485
JIRA: TRTLLM-10833, TRTLLM-10867, TRTLLM-10257, TRTLLM-10258

Problem

FP8 accuracy tests fail catastrophically on SM89 GPUs:

| Test | GPU | Expected | Actual |
|---|---|---|---|
| TestQwen3_30B_A3B::test_fp8 | L40S | ≥ 80.227 | 0.531 |
| TestLlama3_2_1B::test_fp8_prequantized | L40 | ≥ 25.039 | 3.022 |
| TestMinistral8BInstruct::test_fp8 | L40S, L40 | ≥ 75.147 | 0.000 |

Tests pass on H100 (sm90) but consistently fail on L40S, L20, and L40 (all sm89).

Git Bisect Result

| Step | Commit | Result |
|---|---|---|
| good | dcd3f7b5e: [fix] Fix accuracy test OOM (#10173) | PASS |
| bad | a66eeab53: [TRTLLM-9805][feat] Skip Softmax Attention (#9821) | FAIL |

Root Cause

The use_cubin_header() function in setup.py accepts a boolean enable_skip_softmax parameter, but callers pass a C++ string ('false') instead of a Python boolean:

```
'_skipSoftmax' in kname           → False      (Python bool)
pythonBoolean2cpp[False]          → 'false'    (C++ string literal)
use_cubin_header(..., 'false')                 (string passed as bool)
if 'false':                       → True       (non-empty string is truthy!)
    return False                               (cubins disabled for ALL kernels)
```

This disables cubins for ALL SM89 E4M3 FMHA kernels (not just skip-softmax variants), forcing source compilation which has known accuracy regressions on SM89.
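
The pitfall is easy to reproduce in isolation. Below is a minimal standalone sketch of the failure mode; the kernel name and the simplified one-argument use_cubin_header() are illustrative stand-ins, not the upstream code:

```python
# Standalone reproduction of the truthiness bug (illustrative names only;
# the real use_cubin_header() in setup.py takes more parameters).
pythonBoolean2cpp = {True: 'true', False: 'false'}

def use_cubin_header(enable_skip_softmax):
    # Intended contract: enable_skip_softmax is a Python bool.
    if enable_skip_softmax:
        return False  # skip-softmax variants have no precompiled cubins
    return True

kname = 'fmha_v2_flash_attention_e4m3_128_64_S_qkv_sm89'  # hypothetical kernel name
flag = pythonBoolean2cpp['_skipSoftmax' in kname]  # → 'false', a C++ code-gen string

print(use_cubin_header(flag))                     # False: 'false' is truthy (the bug)
print(use_cubin_header('_skipSoftmax' in kname))  # True: the intended behavior
```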

Fix

Separate the Python boolean from the C++ string: use enable_skip_softmax (Python True/False) for control flow, keep enable_skip_softmax_flag (C++ 'true'/'false') for code generation.
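
A hedged sketch of that separation, reusing the names from the sentence above (the real call sites in setup.py pass additional arguments, so this is a pattern illustration rather than the upstream diff):

```python
# Keep two distinct values with two distinct jobs (illustrative sketch).
pythonBoolean2cpp = {True: 'true', False: 'false'}

kname = 'fmha_v2_flash_attention_e4m3_128_64_S_qkv_sm89'  # hypothetical kernel name

enable_skip_softmax = '_skipSoftmax' in kname                      # Python bool: control flow
enable_skip_softmax_flag = pythonBoolean2cpp[enable_skip_softmax]  # C++ string: codegen only

# Control flow sees only the bool ...
use_cubins = not enable_skip_softmax  # stands in for the use_cubin_header() check

# ... and only generated C++ sees the string literal (hypothetical template line).
generated = f'bool const enable_skip_softmax = {enable_skip_softmax_flag};'
print(use_cubins)  # True
print(generated)   # bool const enable_skip_softmax = false;
```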

Impact

  • SM89 E4M3 FMHA cubins are restored, fixing accuracy for all FP8 models on SM89 GPUs (L40S, L20, L40)
  • SM90 cubins are restored for all existing kernel types. The only new kernel variant added since the bug (bidirectional sliding window, commit e699f232) lacks cubin entries, so use_cubin_header() now accepts an attention_mask_type parameter and skips cubins specifically for BIDIRECTIONAL_SLIDING_WINDOW kernels (see the sketch after this list).
  • Fixes NVBugs 5863806, 5868502 (accuracy component), 5785465, and 5785485
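
A sketch of the targeted exclusion from the second bullet; the BIDIRECTIONAL_SLIDING_WINDOW value comes from the PR text, but the parameter list and the string comparison are assumptions for illustration:

```python
# Illustrative version of the targeted SM90 exclusion (not the exact signature).
def use_cubin_header(sm, dtype, enable_skip_softmax, attention_mask_type):
    if enable_skip_softmax:
        return False  # skip-softmax variants were never precompiled
    if attention_mask_type == 'BIDIRECTIONAL_SLIDING_WINDOW':
        return False  # variant added after cubins were last generated (e699f232)
    return True  # all other kernels use precompiled cubins again

print(use_cubin_header(90, 'e4m3', False, 'CAUSAL'))                        # True
print(use_cubin_header(90, 'e4m3', False, 'BIDIRECTIONAL_SLIDING_WINDOW'))  # False
```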

Test Coverage

| Test | GPU | Before Fix | After Fix |
|---|---|---|---|
| TestQwen3_30B_A3B::test_fp8[latency-torch_compile=False] | L40S | 0.531 (FAIL) | ≥ 80.227 (PASS) |
| TestLlama3_2_1B::test_fp8_prequantized | L40S | 3.022 (FAIL) | ≥ 25.039 (PASS) |
| TestMinistral8BInstruct::test_fp8 | L40S | 0.000 (FAIL) | ≥ 75.147 (PASS) |

  • Direct verification: the generated fmha_cubin.cpp correctly references cubin data instead of `nullptr, 0` for SM89 E4M3 entries

PR Checklist

  • PR description clearly explains what and why.

  • PR Follows TRT-LLM CODING GUIDELINES.

  • Test cases are provided for new code paths.

  • Any new dependencies have been scanned for license and vulnerabilities.

  • CODEOWNERS updated if ownership changes.

  • Documentation updated as needed.

  • Update tava architecture diagram if significant design change.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@coderabbitai bot (Contributor) commented Mar 4, 2026

📝 Walkthrough

Refactors the skip-softmax flag handling in setup.py by introducing an intermediate boolean variable enable_skip_softmax_bool and deriving enable_skip_softmax_flag from it, replacing direct name-based flag computation. Observable behavior remains unchanged.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Skip-Softmax Flag Refactoring: cpp/kernels/fmha_v2/setup.py | Introduced an enable_skip_softmax_bool variable and updated the flag-derivation logic; refactored call sites to use the new intermediate boolean for consistent flag handling. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the bug fix: a Python string truthiness bug in FMHA cubin selection. It directly matches the core issue described in the PR objectives. |
| Description check | ✅ Passed | The PR description comprehensively explains the bug (Python string truthiness), the root-cause analysis, the fix, and the impact, with test coverage and verification details. |



Comment @coderabbitai help to get the list of available commands and usage tips.

luyiyun1021 changed the title from "[https://nvbugs/5863806][https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" to "[https://nvbugs/5863806 & https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" on Mar 4, 2026
@tensorrt-cicd (Collaborator):

PR_Github #37686 [ run ] triggered by Bot. Commit: fc7259b Link to invocation

luyiyun1021 changed the title from "[https://nvbugs/5863806 & https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" to "[https://nvbugs/5863806&https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" on Mar 4, 2026
luyiyun1021 changed the title from "[https://nvbugs/5863806&https://nvbugs/5868502][fix] Fix Python string truthiness bug in FMHA cubin selection" to "[https://nvbugs/5863806][fix] Fix Python string truthiness bug in FMHA cubin selection" on Mar 4, 2026
luyiyun1021 force-pushed the fix/nvbug-5863806-l40s-qwen3moe-fp8-accuracy-regression branch 2 times, most recently from de02c4b to 0b7bee2, on March 4, 2026 at 12:13
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37692 [ run ] triggered by Bot. Commit: 0b7bee2 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37692 [ run ] completed with state SUCCESS. Commit: 0b7bee2
/LLM/main/L0_MergeRequest_PR pipeline #29174 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…A cubin selection

In get_cubin_header(), enable_skip_softmax_flag (C++ string "false") was passed to use_cubin_header() which expects a Python bool. Non-empty strings are truthy in Python, so "false" evaluated to True, disabling precompiled cubins for all SM89 E4M3 FMHA kernels. This caused them to fall back to source compilation which has known accuracy regressions on SM89 (L40S/L20). Fix: use a Python bool (enable_skip_softmax_bool) for control flow, keep the C++ string (enable_skip_softmax_flag) only for code generation templates.

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
luyiyun1021 force-pushed the fix/nvbug-5863806-l40s-qwen3moe-fp8-accuracy-regression branch from 0b7bee2 to 464250b on March 5, 2026 at 04:24
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37806 [ run ] triggered by Bot. Commit: 464250b Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37806 [ run ] completed with state SUCCESS. Commit: 464250b
/LLM/main/L0_MergeRequest_PR pipeline #29269 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

…y cubin restoration

Remove waives for three tests whose accuracy failures are resolved by the
setup.py fix in the previous commit:
- TestQwen3_30B_A3B::test_fp8 (nvbugs/5863806)
- TestLlama3_2_1B::test_fp8_prequantized (nvbugs/5785465)
- TestMinistral8BInstruct::test_fp8 (nvbugs/5785485)

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37988 [ run ] triggered by Bot. Commit: 2a74b53 Link to invocation

@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37999 [ run ] triggered by Bot. Commit: b643be0 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37999 [ run ] completed with state SUCCESS. Commit: b643be0
/LLM/main/L0_MergeRequest_PR pipeline #29431 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

… CUDA_ERROR_NOT_FOUND

SM90 cubins were inadvertently disabled since commit a66eeab (Skip
Softmax Attention). During that period, new SM90 kernel variants were
added without corresponding cubin entries. Restoring SM90 cubin usage
causes CUDA_ERROR_NOT_FOUND at runtime. Keep SM90 on source compilation
until cubin binaries are regenerated.

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #38025 [ run ] triggered by Bot. Commit: 2feb413 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #38025 [ run ] completed with state SUCCESS. Commit: 2feb413
/LLM/main/L0_MergeRequest_PR pipeline #29455 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Link to invocation

(Outdated review comment thread on cpp/kernels/fmha_v2/setup.py)
…ndow kernels on SM90

Replace the blanket SM90 cubin disable with a targeted check: only
bidirectional sliding window kernels (added after cubins were last
generated) fall back to source compilation. All other SM90 kernels
now correctly use precompiled cubins again.

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
@luyiyun1021 (Collaborator, Author):

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #38224 [ run ] triggered by Bot. Commit: 010669f Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #38224 [ run ] completed with state SUCCESS. Commit: 010669f
/LLM/main/L0_MergeRequest_PR pipeline #29614 completed with status: 'SUCCESS'

Link to invocation

luyiyun1021 merged commit 34a9153 into NVIDIA:main on Mar 9, 2026
9 checks passed
tianyuz-nv pushed a commit to wanqian-nv/TensorRT-LLM that referenced this pull request Mar 19, 2026
…A cubin selection (NVIDIA#11909)

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
Co-authored-by: Jie Li <76780849+jieli-matrix@users.noreply.github.com>
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026
…A cubin selection (NVIDIA#11909)

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
Co-authored-by: Jie Li <76780849+jieli-matrix@users.noreply.github.com>
longcheng-nv pushed a commit to longcheng-nv/TensorRT-LLM that referenced this pull request Mar 31, 2026
…A cubin selection (NVIDIA#11909)

Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
Co-authored-by: Jie Li <76780849+jieli-matrix@users.noreply.github.com>
