[Bug] FP4 CUTLASS GEMM fails on GB10 (SM121) — shared memory overflow from B200-sized tile configs #11368

@MikkoParkkola

Summary

nvfp4_gemm_cutlass fails on GB10 (SM121) with "Error Internal no error". The SM120 tile configs require more shared memory than GB10 provides. nvfp4_gemm_cublaslt works fine on the same hardware.

Problem: fp4_gemm_template.h routes SM121 to the SM120 path (line ~378: mSm == 120 || mSm == 121), but both SM120 tile configs (128×128×256B and 256×128×128B) need >99 KiB SMEM. GB10 only has 99 KiB vs B200's ~228 KiB.
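A back-of-the-envelope sketch makes the overflow concrete. The numbers below are assumptions, not taken from the TRT-LLM source: NVFP4 packs two 4-bit elements per byte, and each pipeline stage buffers one A tile (M×K) and one B tile (N×K); scale factors and epilogue storage are ignored, so real usage is strictly higher.

```python
# Rough shared-memory estimate for the two SM120 tile shapes cited above.
# Assumes FP4 (0.5 bytes/element) and one A tile + one B tile per stage;
# ignores scale factors and epilogue carveout, so real usage is higher.
KIB = 1024

def smem_per_stage(m, n, k, bits=4):
    bytes_per_elem = bits / 8
    return (m * k + n * k) * bytes_per_elem  # A tile + B tile

for (m, n, k) in [(128, 128, 256), (256, 128, 128)]:
    per_stage = smem_per_stage(m, n, k)
    b200_stages = int(228 * KIB // per_stage)  # ~228 KiB on B200
    gb10_stages = int(99 * KIB // per_stage)   # 99 KiB on GB10
    print(f"{m}x{n}x{k}: {per_stage / KIB:.0f} KiB/stage, "
          f"B200 fits {b200_stages} stages, GB10 fits {gb10_stages}")
```

Under these assumptions the 128×128×256 shape costs ~32 KiB per stage: a stage count tuned for B200's ~228 KiB budget blows well past GB10's 99 KiB.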

Proof it's not a hardware limitation: CUTLASS example 79 (blackwell_geforce_gemm) uses GB10-appropriate tiles and runs FP4 successfully at 41.6 TFLOPS.

Environment

  • GPU: NVIDIA GB10 (SM 12.1, DGX Spark)
  • TRT-LLM: 1.3.0rc2 (commit f42a6cb)
  • CUDA: 12.8
  • PyTorch: 2.7 (CUDA 12.6)

Failure Path

  1. fp4_gemm_template.h:378 — SM121 dispatches to the SM120 path
  2. Both SM120 tile configs require >99 KiB SMEM
  3. gemm_universal_adapter.h:342-353 — cudaFuncSetAttribute(cudaFuncAttributeMaxDynamicSharedMemorySize) fails
  4. kErrorInternal is returned and surfaces as "Error Internal no error"

Benchmarks on GB10

| Backend | Peak TFLOPS | Status |
| --- | --- | --- |
| cuBLASLt FP4 (nvfp4_gemm_cublaslt) | 99.6 | ✅ Works |
| CUTLASS example 79 (SM121 tiles) | 41.6 | ✅ Works |
| TRT-LLM CUTLASS FP4 (nvfp4_gemm_cutlass) | — | ❌ SMEM overflow |
| BF16 baseline | 8.4 | ✅ Reference |

CUTLASS example 79 results:

| Problem Size | Avg Runtime | TFLOPS |
| --- | --- | --- |
| 3072×3072×3072 | 1.57 ms | 37.0 |
| 4096×4096×4096 | 3.35 ms | 41.0 |
| 2048×5120×8192 | 4.13 ms | 41.6 |
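As a sanity check, the TFLOPS column follows from the standard GEMM flop count of 2·M·N·K divided by the measured runtime (a quick consistency check, not taken from the example 79 harness):

```python
# Reproduce the TFLOPS column from the runtimes: 2*M*N*K flops per GEMM.
def tflops(m, n, k, runtime_ms):
    return 2 * m * n * k / (runtime_ms * 1e-3) / 1e12

for m, n, k, ms in [(3072, 3072, 3072, 1.57),
                    (4096, 4096, 4096, 3.35),
                    (2048, 5120, 8192, 4.13)]:
    print(f"{m}x{n}x{k}: {tflops(m, n, k, ms):.1f} TFLOPS")
```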

Additional Blockers

Two other FP4 backends are also blocked on GB10:

  1. CuteDSL: cute_dsl_custom_ops.py:766 gates on sm_version not in (100, 103). Patching that reveals a second gate in the nvidia-cutlass-dsl package — MmaMXF4NVF4Op at cutlass/cute/nvgpu/tcgen05/mma.py:1271 hardcodes sm_100a/sm_103a only.

  2. CUDA Core FP4: No dispatch path for SM121.

Suggested Fix

Add GB10-specific tile configs to fp4_gemm_template.h with fewer pipeline stages so they fit in 99 KiB. Example 79's GeForce tile configs could serve as a reference.

Alternatively, StageCountAutoCarveout could query runtime SMEM via cudaDeviceGetAttribute(cudaDevAttrMaxSharedMemoryPerBlockOptin) instead of using compile-time SM120 assumptions — this would make the same binary work on both B200 and GB10 by adjusting pipeline depth.
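The runtime-adaptive idea can be sketched as follows. This is written in Python for clarity (the real fix would live in the C++ dispatch/builder code); smem_budget_bytes stands in for the value returned by cudaDeviceGetAttribute(cudaDevAttrMaxSharedMemoryPerBlockOptin), and the carveout and minimum stage count are illustrative assumptions, not CUTLASS constants.

```python
# Sketch: pick the pipeline stage count from the device's actual SMEM
# budget instead of a compile-time SM120 assumption. The 8 KiB carveout
# and min_stages=2 are illustrative values, not from CUTLASS.
def pick_stage_count(smem_budget_bytes, per_stage_bytes,
                     carveout_bytes=8 * 1024, min_stages=2):
    usable = smem_budget_bytes - carveout_bytes
    stages = usable // per_stage_bytes
    if stages < min_stages:
        return None  # this tile shape cannot run on this device
    return stages

per_stage = 32 * 1024  # ~32 KiB of A+B buffers for a 128x128x256 FP4 tile
print(pick_stage_count(228 * 1024, per_stage))  # B200-class budget -> 6
print(pick_stage_count(99 * 1024, per_stage))   # GB10 budget -> 2
```

The same binary would then run on both parts: deep pipelines where SMEM allows, shallow ones on GB10, and a clean fallback (rather than "Error Internal no error") when a tile shape truly cannot fit.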

Reproduction

```python
import torch
import tensorrt_llm

a = torch.randn(256, 512, dtype=torch.bfloat16, device="cuda")
b = torch.randn(256, 512, dtype=torch.bfloat16, device="cuda")
global_sf = torch.tensor(1.0, dtype=torch.float32, device="cuda")
a_q, a_sf = torch.ops.trtllm.fp4_quantize(a, global_sf, 16)
b_q, b_sf = torch.ops.trtllm.fp4_quantize(b, global_sf, 16)

# Works:
result = torch.ops.trtllm.nvfp4_gemm_cublaslt(a_q, a_sf, b_q, b_sf, global_sf, 256, 256, 512)

# Fails with "Error Internal no error":
result = torch.ops.trtllm.nvfp4_gemm_cutlass(a_q, a_sf, b_q, b_sf, global_sf, 256, 256, 512)
```

Relevant Files

  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp4_gemm/fp4_gemm_template.h — dispatch + tile configs
  • cpp/tensorrt_llm/kernels/cutlass_kernels/fp4_gemm/nvfp4_nvfp4_gemm_template_sm120.h — SM120 kernel
  • cpp/build/_deps/cutlass-src/include/cutlass/gemm/device/gemm_universal_adapter.h:342-353 — failure point
  • cpp/build/_deps/cutlass-src/examples/79_blackwell_geforce_gemm/ — working GB10 FP4 reference

Tested on: NVIDIA GB10 (DGX Spark, compute 12.1)

Labels

Labels

Customized kernels<NV>: Specialized/modified CUDA kernels in TRTLLM for LLM ops, beyond standard TRT. Dev & perf.
