[https://nvbugs/5863806][fix] Fix Python string truthiness bug in FMHA cubin selection (#11909)
luyiyun1021 merged 5 commits into NVIDIA/TensorRT-LLM:main from luyiyun1021:fix/nvbug-5863806-l40s-qwen3moe-fp8-accuracy-regression
Conversation
/bot run --disable-fail-fast
📝 Walkthrough: refactors the skip-softmax flag handling in setup.py by introducing an intermediate boolean variable.
PR_Github #37686 [ run ] triggered by Bot. Commit:
Force-pushed: de02c4b to 0b7bee2
/bot run --disable-fail-fast
PR_Github #37692 [ run ] triggered by Bot. Commit:
PR_Github #37692 [ run ] completed with state
…A cubin selection In get_cubin_header(), enable_skip_softmax_flag (C++ string "false") was passed to use_cubin_header() which expects a Python bool. Non-empty strings are truthy in Python, so "false" evaluated to True, disabling precompiled cubins for all SM89 E4M3 FMHA kernels. This caused them to fall back to source compilation which has known accuracy regressions on SM89 (L40S/L20). Fix: use a Python bool (enable_skip_softmax_bool) for control flow, keep the C++ string (enable_skip_softmax_flag) only for code generation templates. Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
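The separation described in this commit message can be sketched as follows. This is a simplified illustration, not the actual setup.py code: the helper name `make_skip_softmax_vars` is hypothetical, though the two variable names follow the commit message.

```python
# Simplified sketch of the fix described above: a Python bool drives
# control flow, while the C++ literal string is kept only for code
# generation templates.

def make_skip_softmax_vars(enable_skip_softmax: bool):
    # Hypothetical helper: derive the C++ codegen string from the bool.
    enable_skip_softmax_bool = enable_skip_softmax  # Python bool for control flow
    enable_skip_softmax_flag = "true" if enable_skip_softmax else "false"  # C++ string for codegen
    return enable_skip_softmax_bool, enable_skip_softmax_flag

use_bool, cpp_flag = make_skip_softmax_vars(False)
assert not use_bool          # control flow sees a genuinely falsy value
assert cpp_flag == "false"   # templates still receive the C++ literal
```

The key design point is that the string never participates in an `if` test, so Python's truthiness rules can no longer flip the cubin-selection decision.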
Force-pushed: 0b7bee2 to 464250b
/bot run --disable-fail-fast
PR_Github #37806 [ run ] triggered by Bot. Commit:
PR_Github #37806 [ run ] completed with state
…y cubin restoration Remove waives for three tests whose accuracy failures are resolved by the setup.py fix in the previous commit: - TestQwen3_30B_A3B::test_fp8 (nvbugs/5863806) - TestLlama3_2_1B::test_fp8_prequantized (nvbugs/5785465) - TestMinistral8BInstruct::test_fp8 (nvbugs/5785485) Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #37988 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #37999 [ run ] triggered by Bot. Commit:
PR_Github #37999 [ run ] completed with state
… CUDA_ERROR_NOT_FOUND SM90 cubins were inadvertently disabled since commit a66eeab (Skip Softmax Attention). During that period, new SM90 kernel variants were added without corresponding cubin entries. Restoring SM90 cubin usage causes CUDA_ERROR_NOT_FOUND at runtime. Keep SM90 on source compilation until cubin binaries are regenerated. Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #38025 [ run ] triggered by Bot. Commit:
PR_Github #38025 [ run ] completed with state
…ndow kernels on SM90 Replace the blanket SM90 cubin disable with a targeted check: only bidirectional sliding window kernels (added after cubins were last generated) fall back to source compilation. All other SM90 kernels now correctly use precompiled cubins again. Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com>
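The targeted check described in this commit message might look roughly like the sketch below. The names are hypothetical stand-ins: the real `use_cubin_header()` in setup.py takes different parameters, and the mask-type tag here is assumed.

```python
# Sketch of the targeted SM90 fallback: only bidirectional sliding-window
# kernels (added after the cubins were last generated, so they have no
# cubin entries) fall back to source compilation; everything else on SM90
# uses precompiled cubins again.

BIDIRECTIONAL_SLIDING_WINDOW = "bidirectional_sliding_window"  # assumed tag

def use_cubin_header(sm: int, attention_mask_type: str) -> bool:
    # Hypothetical logic: skip cubins only for the one SM90 kernel family
    # that lacks precompiled entries.
    if sm == 90 and attention_mask_type == BIDIRECTIONAL_SLIDING_WINDOW:
        return False
    return True
```

Compared with the earlier blanket disable, this keeps the cubin path for every SM90 kernel that actually has a cubin, avoiding the CUDA_ERROR_NOT_FOUND failure while preserving the accuracy fix.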
/bot run --disable-fail-fast
PR_Github #38224 [ run ] triggered by Bot. Commit:
PR_Github #38224 [ run ] completed with state
…A cubin selection (NVIDIA#11909) Signed-off-by: Yiyun Lu <55233584+luyiyun1021@users.noreply.github.com> Co-authored-by: Jie Li <76780849+jieli-matrix@users.noreply.github.com>
Description
Fix a Python string truthiness bug in cpp/kernels/fmha_v2/setup.py that causes all SM89 E4M3 FMHA kernels to lose their precompiled cubins and fall back to source compilation, resulting in catastrophic accuracy regression on SM89 GPUs (L40S, L20, L40).

NVBugs: 5863806, 5868502, 5785465, 5785485
JIRA: TRTLLM-10833, TRTLLM-10867, TRTLLM-10257, TRTLLM-10258
Problem
FP8 accuracy tests fail catastrophically on SM89 GPUs:
- TestQwen3_30B_A3B::test_fp8
- TestLlama3_2_1B::test_fp8_prequantized
- TestMinistral8BInstruct::test_fp8

Tests pass on H100 (sm90) but consistently fail on L40S, L20, and L40 (all sm89).
Git Bisect Result
- dcd3f7b5e — [fix] Fix accuracy test OOM (#10173)
- a66eeab53 — [TRTLLM-9805][feat] Skip Softmax Attention (#9821)

Root Cause
The use_cubin_header() function in setup.py accepts a boolean enable_skip_softmax parameter, but callers pass a C++ string ('false') instead of a Python boolean. Because non-empty strings are truthy in Python, this disables cubins for ALL SM89 E4M3 FMHA kernels (not just skip-softmax variants), forcing source compilation, which has known accuracy regressions on SM89.
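The pitfall can be demonstrated in isolation. This is a minimal, self-contained illustration; the stand-in `use_cubin_header` below only mimics the shape of the real helper, whose actual signature and logic in setup.py differ.

```python
# Any non-empty Python string is truthy -- including the C++ literal "false".
assert bool("false") is True

def use_cubin_header(enable_skip_softmax):
    # Simplified stand-in for the real helper: cubins are used only when
    # skip-softmax is disabled.
    return not enable_skip_softmax

# Buggy call: the C++ codegen string "false" is truthy, so cubins are
# unexpectedly disabled.
assert use_cubin_header("false") is False

# Intended call with a real Python bool: cubins are used.
assert use_cubin_header(False) is True
```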
Fix
Separate the Python boolean from the C++ string: use enable_skip_softmax (Python True/False) for control flow, and keep enable_skip_softmax_flag (C++ 'true'/'false') for code generation.

Impact
The bidirectional sliding-window kernel family added after the last cubin generation (e699f232) lacks cubin entries, so use_cubin_header() now accepts an attention_mask_type parameter and skips cubins specifically for BIDIRECTIONAL_SLIDING_WINDOW kernels.

Test Coverage
- TestQwen3_30B_A3B::test_fp8[latency-torch_compile=False]
- TestLlama3_2_1B::test_fp8_prequantized
- TestMinistral8BInstruct::test_fp8
- fmha_cubin.cpp correctly references cubin data instead of nullptr, 0 for SM89 E4M3 entries
PR description clearly explains what and why.
PR Follows TRT-LLM CODING GUIDELINES.
Test cases are provided for new code paths.
Any new dependencies have been scanned for license and vulnerabilities.
CODEOWNERS updated if ownership changes.
Documentation updated as needed.
Update tava architecture diagram if significant design change.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.