[None][feat] Add support for expert_number<=2048 and K<=32 #11510

Merged

byshiue merged 1 commit into NVIDIA/TensorRT-LLM:main from ChristinaZ/TensorRT-LLM:add_2048_support on Mar 5, 2026

Conversation

@ChristinaZ (Collaborator) commented Feb 13, 2026

Summary by CodeRabbit

  • New Features

    • Extended expert count support from 512 to 2048 experts.
    • Raised the maximum supported top-K value from 22 to 32.
    • Added multi-expert-per-thread processing capability for improved scalability.
  • Performance Improvements

    • Optimized kernel dispatch with parallel compilation support.
    • Enhanced top-K sorting with adaptive bitonic and odd-even strategies.

Description

  • Refactor the code to reduce compile time.
  • Extend top-K support to up to 2048 expert candidates and K up to 32.

Test Coverage

cd cpp/build
./tests/unit_tests/kernels/routingKernelsTest

pytest tests/unittest/_torch/thop/serial/test_moe.py -k "test_no_autotune and RoutingRenormalize_large_experts" -v
pytest tests/unittest/_torch/thop/serial/test_moe.py::TestMoeFp4::test_autotune -k "RoutingRenormalize_large_experts" -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@coderabbitai coderabbitai bot (Contributor) commented Feb 13, 2026

📝 Walkthrough

The PR refactors MOE routing kernels from monolithic designs to a split-compile launcher pattern, introduces per-thread multi-expert handling for larger expert counts, increases MaxSupportedTopExperts from 22 to 32 and total expert support to 2048, and updates validator constraints accordingly.
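
The per-thread multi-expert handling mentioned above can be pictured with a short sketch. This is an illustrative CUDA fragment, not the PR's kernel code: the names (countExpertHistogram, selectedExperts, smemExpertCount) and the exact layout are assumptions based on this walkthrough.

// Illustrative sketch only: when MaxNumExperts exceeds the block size, each
// thread owns ExpertsPerThread expert slots so one block can still cover up
// to 2048 experts. Names and layout are hypothetical.
template <int MaxNumExperts, int NumThreads>
__global__ void countExpertHistogram(int const* selectedExperts, int numSelections, int* expertCount)
{
    constexpr int ExpertsPerThread = (MaxNumExperts + NumThreads - 1) / NumThreads;
    __shared__ int smemExpertCount[MaxNumExperts];

    // Each thread clears the expert slots it owns (with a bounds check).
#pragma unroll
    for (int i = 0; i < ExpertsPerThread; ++i)
    {
        int const expert = threadIdx.x + i * NumThreads;
        if (expert < MaxNumExperts)
        {
            smemExpertCount[expert] = 0;
        }
    }
    __syncthreads();

    // Grid-stride over the selected expert indices, counting in shared memory.
    for (int k = blockIdx.x * NumThreads + threadIdx.x; k < numSelections; k += gridDim.x * NumThreads)
    {
        atomicAdd(&smemExpertCount[selectedExperts[k]], 1);
    }
    __syncthreads();

    // Each thread flushes the slots it owns to the global histogram.
#pragma unroll
    for (int i = 0; i < ExpertsPerThread; ++i)
    {
        int const expert = threadIdx.x + i * NumThreads;
        if (expert < MaxNumExperts)
        {
            atomicAdd(&expertCount[expert], smemExpertCount[expert]);
        }
    }
}

In this sketch, ExpertsPerThread is derived at compile time from MaxNumExperts and the block size, so raising the expert limit does not require wider thread blocks.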

Changes

  • Build Configuration (CMakeLists.txt): Adds the CUDA --split-compile=0 flag to parallelize ptxas compilation for the FP8 block scale MOE target.
  • Kernel Parameter Infrastructure (RoutingKernel.h, DevKernel.h): Extends KernelParamsBase with a new MaxNumTopExperts_ template parameter; updates routing macros (LAUNCH_ROUTING_LLAMA4, LAUNCH_ROUTING_WITH_NUM_EXPERTS*) to propagate the numTopExperts parameter and restructures PDL launch blocks into unified do-while control flow.
  • Routing Kernel Core Logic (RoutingKernel.cuh): Implements a multi-expert-per-thread strategy with an ExpertsPerThread constant, converts single-expert-per-thread arrays into MaxNumExperts-sized arrays, and updates histogram/offset kernels and shared memory layouts to handle larger expert counts with per-thread expert loops and bounds checks.
  • Top-K Sorting Infrastructure (RoutingKernelTopK.cuh): Increases MaxSupportedTopExperts from 22 to 32; introduces an IsPowerOf2 compile-time utility; replaces the templated Sort with a dispatch-based implementation that selects bitonic sort for power-of-two N and odd-even sort for non-power-of-two N; renames reduceTopKFunc to reduceTopK; relaxes constraints (K ≤ 32, N ≤ 64). A simplified standalone sketch of this dispatch follows after this list.
  • DeepSeek Routing Organization (RoutingDeepSeek.cu, routingDeepSeek/RoutingDeepSeekCommon.cuh): Replaces monolithic kernel implementations with forward declarations for six launch wrappers; introduces a common header defining expert count tiers, helper dispatch macros (LAUNCH_DEEPSEEK_WITH_TOPK, LAUNCH_ROUTING_DEEPSEEK), and a getMaxNumExperts() function for tier selection.
  • DeepSeek Kernel Launchers (routingDeepSeek/launchMainKernel.cu, routingDeepSeek/launchClusterKernel.cu, routingDeepSeek/launchCoopKernel.cu, routingDeepSeek/launch{Histogram,InitExpertCounts,Offsets}Kernel.cu): Implements six new kernel entry points: main (top-k routing with group support and score computation), cluster (per-expert token assignment), coop (cooperative multi-expert mapping with PDL), and three utility kernels (histogram, init expert counts, offsets), each with a dedicated launcher function.
  • Renormalize Routing Organization (RoutingRenormalize.cu, routingRenormalize/RoutingRenormalizeCommon.cuh): Replaces large kernel implementations with forward declarations and a common header containing tier constants (NumExperts128/512Experts, MaxSupportedExperts=2048), the helper function getMaxNumExperts(), and dispatch macros (LAUNCH_ROUTING_WITH_TOPK, LAUNCH_ROUTING_RENORNALIZE).
  • Renormalize Kernel Launchers (routingRenormalize/launch{Block,Cluster,Histogram,HistogramScores,InitExpertCounts,Offsets}Kernel.cu): Implements six new launcher functions coordinating block-level, cluster-level, and histogram-based routing with top-K extraction, shared memory management, expert count initialization, and offset computation, with optional softmax pre-processing.
  • Constant and Constraint Updates (RoutingLlama4.cu): Renames NumExpertsLimit to MaxSupportedExperts (value 128).
  • Host-Side Validation (thop/fp4BlockScaleMoe.cpp, thop/fp8BlockScaleMoe.cpp, thop/fp8PerTensorScaleMoe.cpp, thop/mxFp4BlockScaleMoe.cpp): Increases the top_k maximum from 10 to 32 in the Renormalize/RenormalizeNaive paths; increases the num_experts maximum from 512 to 2048; adds new validation for the Renormalize routing method in the FP8 PerTensor variant.
  • Tests (routingRenormalizeTest.cpp, routingTest.cpp, test_moe.py): Updates test parameters to exercise larger expert counts (512→2048) and top-K values (8→10, 10→32); relaxes Python test bounds for top-K from 22 to 32; removes minor whitespace.
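
To make the bitonic/odd-even dispatch in the Top-K Sorting Infrastructure entry concrete, here is a small, self-contained host-side sketch. It is illustrative only: the function names and the descending-order convention are assumptions, and the real RoutingKernelTopK.cuh code sorts per-thread register arrays inside the kernels rather than std::array values on the host.

// Illustrative sketch: compile-time dispatch between a bitonic sorting
// network (power-of-two sizes only) and odd-even transposition (any size),
// both ordering scores largest-first.
#include <array>
#include <cstdio>
#include <utility>

constexpr bool isPowerOf2(int n)
{
    return n > 0 && (n & (n - 1)) == 0;
}

template <int N, typename T>
void oddEvenSort(std::array<T, N>& a)
{
    // N rounds of alternating even/odd compare-exchange passes.
    for (int round = 0; round < N; ++round)
    {
        for (int i = round % 2; i + 1 < N; i += 2)
        {
            if (a[i] < a[i + 1])
            {
                std::swap(a[i], a[i + 1]);
            }
        }
    }
}

template <int N, typename T>
void bitonicSort(std::array<T, N>& a)
{
    static_assert(isPowerOf2(N), "bitonic network needs a power-of-two size");
    for (int k = 2; k <= N; k *= 2)
    {
        for (int j = k / 2; j > 0; j /= 2)
        {
            for (int i = 0; i < N; ++i)
            {
                int const partner = i ^ j;
                if (partner > i)
                {
                    // Segments with (i & k) == 0 are ordered largest-first.
                    bool const ascending = (i & k) != 0;
                    if ((a[i] < a[partner]) != ascending)
                    {
                        std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }
}

template <int N, typename T>
void sortTopK(std::array<T, N>& a)
{
    if constexpr (isPowerOf2(N))
    {
        bitonicSort<N, T>(a);
    }
    else
    {
        oddEvenSort<N, T>(a);
    }
}

int main()
{
    std::array<float, 6> scores{0.1f, 0.7f, 0.3f, 0.9f, 0.2f, 0.5f};  // N = 6 takes the odd-even path
    sortTopK<6, float>(scores);
    for (float s : scores)
    {
        std::printf("%.1f ", s);  // prints 0.9 0.7 0.5 0.3 0.2 0.1
    }
    return 0;
}

A bitonic network is compact and branch-free but only valid for power-of-two input sizes, while odd-even transposition works for any N, so dispatching between the two covers every K up to 32.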

Sequence Diagram(s)

sequenceDiagram
    actor Host
    participant Launcher as Launch Wrapper
    participant MainKernel as Main Kernel
    participant GridSync as PDL Grid Sync
    participant HistoKernel as Histogram Kernel
    participant OffsetKernel as Offset Kernel

    Host->>Launcher: launchMainKernel()
    Launcher->>MainKernel: Execute (compute top-k scores)
    MainKernel-->>Host: (if no PDL: complete)
    MainKernel->>GridSync: Trigger PDL
    
    Host->>Launcher: launchHistogramKernel()
    Launcher->>HistoKernel: Execute (count experts)
    HistoKernel-->>Host: (if no PDL: complete)
    HistoKernel->>GridSync: Trigger PDL
    
    Host->>Launcher: launchOffsetsKernel()
    Launcher->>OffsetKernel: Execute (compute offsets)
    OffsetKernel-->>Host: Complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • MatthiasKohl
  • syuoni
🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 38.78%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Merge Conflict Detection (⚠️ Warning): ❌ Merge conflicts detected (127 files):

⚔️ README.md (content)
⚔️ cpp/CMakeLists.txt (content)
⚔️ cpp/include/tensorrt_llm/batch_manager/llmRequest.h (content)
⚔️ cpp/include/tensorrt_llm/executor/executor.h (content)
⚔️ cpp/include/tensorrt_llm/executor/serialization.h (content)
⚔️ cpp/include/tensorrt_llm/executor/types.h (content)
⚔️ cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (content)
⚔️ cpp/tensorrt_llm/common/ncclUtils.cpp (content)
⚔️ cpp/tensorrt_llm/common/opUtils.h (content)
⚔️ cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/executor/multimodalInput.cpp (content)
⚔️ cpp/tensorrt_llm/executor/serialization.cpp (content)
⚔️ cpp/tensorrt_llm/executor/serializeUtils.h (content)
⚔️ cpp/tensorrt_llm/kernels/causalConv1d/causalConv1d.cu (content)
⚔️ cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_v2.cpp (content)
⚔️ cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplPrecompiled.cpp (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/layernorm_param.h (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/low_latency_layernorm.cuh (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm.cuh (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm.h (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm_fp4_traits.cu (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h (content)
⚔️ cpp/tensorrt_llm/nanobind/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (content)
⚔️ cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (content)
⚔️ cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (content)
⚔️ cpp/tensorrt_llm/nanobind/executor/bindings.cpp (content)
⚔️ cpp/tensorrt_llm/nanobind/executor/request.cpp (content)
⚔️ cpp/tensorrt_llm/plugins/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp (content)
⚔️ cpp/tensorrt_llm/runtime/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/runtime/ipcNvlsMemory.cu (content)
⚔️ cpp/tensorrt_llm/thop/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/thop/allreduceOp.cpp (content)
⚔️ cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (content)
⚔️ cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (content)
⚔️ cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (content)
⚔️ cpp/tensorrt_llm/thop/fusedAddRMSNormQuant.cpp (content)
⚔️ cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (content)
⚔️ cpp/tests/e2e_tests/batch_manager/trtGptModelTest.cpp (content)
⚔️ cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (content)
⚔️ cpp/tests/unit_tests/batch_manager/microBatchSchedulerTest.cpp (content)
⚔️ cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (content)
⚔️ cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (content)
⚔️ cpp/tests/unit_tests/kernels/routing/routingTest.cpp (content)
⚔️ cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp (content)
⚔️ docs/source/features/kvcache.md (content)
⚔️ examples/constraints.txt (content)
⚔️ examples/models/core/glm-4-9b/README.md (content)
⚔️ requirements.txt (content)
⚔️ scripts/build_wheel.py (content)
⚔️ tensorrt_llm/_torch/attention_backend/trtllm.py (content)
⚔️ tensorrt_llm/_torch/auto_deploy/transform/library/ssm_cache.py (content)
⚔️ tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (content)
⚔️ tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (content)
⚔️ tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (content)
⚔️ tensorrt_llm/_torch/models/modeling_llava_next.py (content)
⚔️ tensorrt_llm/_torch/models/modeling_nemotron_h.py (content)
⚔️ tensorrt_llm/_torch/models/modeling_qwen2vl.py (content)
⚔️ tensorrt_llm/_torch/models/modeling_qwen3vl.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/interface.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/quantization.py (content)
⚔️ tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py (content)
⚔️ tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (content)
⚔️ tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py (content)
⚔️ tensorrt_llm/_torch/modules/mamba/ssd_chunk_state.py (content)
⚔️ tensorrt_llm/_torch/modules/mlp.py (content)
⚔️ tensorrt_llm/_torch/modules/rms_norm.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/_util.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/llm_request.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/model_engine.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/py_executor.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/request_utils.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/resource_manager.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/sampler.py (content)
⚔️ tensorrt_llm/_torch/utils.py (content)
⚔️ tensorrt_llm/_utils.py (content)
⚔️ tensorrt_llm/executor/base_worker.py (content)
⚔️ tensorrt_llm/executor/executor.py (content)
⚔️ tensorrt_llm/executor/result.py (content)
⚔️ tensorrt_llm/inputs/data.py (content)
⚔️ tensorrt_llm/inputs/multimodal.py (content)
⚔️ tensorrt_llm/inputs/registry.py (content)
⚔️ tensorrt_llm/llmapi/llm.py (content)
⚔️ tensorrt_llm/llmapi/llm_args.py (content)
⚔️ tensorrt_llm/serve/openai_server.py (content)
⚔️ tensorrt_llm/tools/layer_wise_benchmarks/runner.py (content)
⚔️ tensorrt_llm/version.py (content)
⚔️ tests/integration/defs/perf/pytorch_model_config.py (content)
⚔️ tests/integration/defs/perf/test_perf.py (content)
⚔️ tests/integration/test_lists/qa/llm_spark_perf.yml (content)
⚔️ tests/integration/test_lists/test-db/l0_dgx_b200.yml (content)
⚔️ tests/integration/test_lists/test-db/l0_dgx_h100.yml (content)
⚔️ tests/integration/test_lists/waives.txt (content)
⚔️ tests/unittest/_torch/executor/test_py_executor.py (content)
⚔️ tests/unittest/_torch/executor/test_request_utils.py (content)
⚔️ tests/unittest/_torch/modeling/test_modeling_llama.py (content)
⚔️ tests/unittest/_torch/modules/moe/quantize_utils.py (content)
⚔️ tests/unittest/_torch/modules/moe/test_moe_backend.py (content)
⚔️ tests/unittest/_torch/modules/moe/test_moe_module.py (content)
⚔️ tests/unittest/_torch/multimodal/test_find_num_image_tokens.py (content)
⚔️ tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (content)
⚔️ tests/unittest/_torch/sampler/test_torch_sampler.py (content)
⚔️ tests/unittest/_torch/thop/serial/test_moe.py (content)
⚔️ tests/unittest/api_stability/api_stability_core.py (content)
⚔️ tests/unittest/api_stability/references_committed/request_output.yaml (content)
⚔️ tests/unittest/bindings/test_executor_bindings.py (content)
⚔️ tests/unittest/llmapi/test_llm_kv_cache_events.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
  • Description check (❓ Inconclusive): The PR description is minimal and lacks key details about the refactoring approach and implementation strategy. Resolution: add more detail explaining (1) the specific refactoring techniques used to reduce compile time (e.g., split-compile, code reorganization, splitting kernels into separate files); (2) why extending top-K to 32 and the expert count to 2048 required refactoring; (3) an impact assessment (performance, memory, compatibility); (4) any breaking changes or migration notes.
✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title clearly describes the main change, extending support for expert numbers up to 2048 and K up to 32, which aligns with the substantial refactoring and parameter expansion throughout the changeset.


@coderabbitai coderabbitai bot left a comment (Contributor)

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to include 2026.

The file is being modified in 2026 but the copyright header still reads 2022-2025. As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

Proposed fix
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to include 2026.

The header currently reads 2022-2024, but this file is being modified in 2026. As per coding guidelines, "update year on modified files."

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
tests/unittest/_torch/thop/serial/test_moe.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to 2026.

The copyright header says 2022-2024, but this file is being modified in 2026. As per coding guidelines, all source files must update the year on modified files.

-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to include 2026.

The copyright header reads 2022-2024, but this file has meaningful modifications in 2026. As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Copyright year needs updating.

The copyright header says 2022-2024 but this file has been modified in 2025. Update to 2022-2025.

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (1)

42-44: ⚠️ Potential issue | 🟡 Minor

Copy-paste error in error message: says "Llama4" but this is the Renormalize path.

Line 44 reads "Llama4 routing kernel expects permuted idx..." but this is routingRenormalize::run. Should say "Renormalize routing kernel".

Proposed fix
     TLLM_CHECK_WITH_INFO(data.mPtrPermutedIdxSize != nullptr && data.mPtrCtaIdxXyToBatchIdx != nullptr
             && data.mPtrCtaIdxXyToMnLimit != nullptr && data.mPtrNumNonExitingCtas != nullptr,
-        "Llama4 routing kernel expects permuted idx and grouped Gemm launch config buffers");
+        "Renormalize routing kernel expects permuted idx and grouped Gemm launch config buffers");
🤖 Fix all issues with AI agents
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchCoopKernel.cu`:
- Around line 78-81: The stride-alignment check in launchCoopKernel.cu is using
a bitwise test `(localExpertIdx & params.mLocalExpertsStrideLog2) == 0` which is
incorrect; update the check used when computing isLocalExpert (the expression
that now references params.mLocalExpertsStrideLog2) to use a proper mask:
`(localExpertIdx & ((1u << params.mLocalExpertsStrideLog2) - 1)) == 0`, then
keep the rest of the logic that computes expertOffsets (the ternary
atomicAdd(smemExpertCount + expertIdx, 1) : 0) unchanged; ensure you apply the
same corrected mask pattern to the other identical checks in this file (and
mirror the fix in the other files cited) so that localExpertIdx,
params.mLocalExpertsStrideLog2, expertOffsets and the atomicAdd usage behave
correctly for stride log2 >= 1. (A small standalone illustration of the corrected mask appears after this fix list.)

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchHistogramKernel.cu`:
- Line 2: The copyright header in launchHistogramKernel.cu (and other new
launcher .cu files listed: launchOffsetsKernel.cu, launchInitExpertCounts.cu,
launchCoopKernel.cu, launchClusterKernel.cu, launchMainKernel.cu) stops at
"2022-2025" but these are new 2026 files—update the header to include 2026
(e.g., "2022-2026" or the appropriate range) so the file-level copyright
reflects the latest modification year; modify the top-of-file comment in each
mentioned file (launchHistogramKernel.cu and the other launcher .cu filenames)
to replace "2022-2025" with the correct year range including 2026.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchMainKernel.cu`:
- Around line 204-205: Rename the misspelled local arrays intermidiateScore and
intermidiateExpert to intermediateScore and intermediateExpert, and update every
usage/reference to these symbols (including places originally between the nearby
block where they are read/written) to match the corrected names so compilation
and semantics remain consistent.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchBlockKernel.cu`:
- Around line 167-187: The two consecutive CUB BlockScan calls using the same
TempStorage (Scan(tempStorage).ExclusiveSum(numCtaPerExpert, ctaOffsetPerExpert,
numNonExitingCtas) and Scan(tempStorage).ExclusiveSum(tmpCountPerExpert,
expertScanCountsPerExpert)) must be separated by a barrier: add a
__syncthreads() immediately after the first ExclusiveSum (after
ctaOffsetPerExpert/numNonExitingCtas are computed) and before computing
tmpCountPerExpert to prevent TempStorage reuse races; keep the existing
__syncthreads() after the second scan as well and ensure the barrier location
references the existing tempStorage, ctaOffsetPerExpert, accExpertCount,
tmpCountPerExpert, and expertScanCountsPerExpert variables.
- Around line 128-131: The check treating mLocalExpertsStrideLog2 as a bitmask
is wrong: wherever you use (localExpIdx & params.mLocalExpertsStrideLog2) == 0
(e.g., in the localExpIdx/isLocal computation and the later conditional in
launchBlockKernel.cu), replace it with a proper mask built from the log2 value,
i.e. compute mask = (1u << params.mLocalExpertsStrideLog2) - 1 and test
(localExpIdx & mask) == 0; update uses of localExpIdx,
params.mLocalExpertsStartIdx, params.mNumLocalExperts, and
params.mLocalExpertsStrideLog2 in this file (and mirror the same change in other
affected functions like routingDeepSeek, RoutingLlama4.cu, RoutingKernel.cuh) so
the stride-log2 is interpreted correctly.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchInitExpertCounts.cu`:
- Line 2: Replace the outdated copyright header string "2022-2025" with the
correct latest-year range that includes 2026 (e.g. "2022-2026" or a header that
ends with 2026) in the new launcher files (e.g. launchInitExpertCounts.cu and
the other five launchers under routingRenormalize) so the top-of-file NVIDIA
copyright reflects the 2026 modification; search for the exact literal
"2022-2025" in each new file and update it consistently.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/RoutingRenormalizeCommon.cuh`:
- Line 136: The macro name is misspelled: replace all occurrences of
LAUNCH_ROUTING_RENORNALIZE with the correctly spelled
LAUNCH_ROUTING_RENORMALIZE; update the macro definition in
RoutingRenormalizeCommon.cuh and rename every invocation where the old macro is
used so the compile-time symbol matches (ensure you update the macro token in
the six call sites that reference it), then rebuild to verify no remaining
references to LAUNCH_ROUTING_RENORNALIZE remain.
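
To make the stride-alignment fix suggested above concrete, here is a tiny standalone illustration of why AND-ing an index with the log2 of the stride is wrong and what the masked form computes instead. Variable names mirror the review comments; the snippet is illustrative, not kernel code.

// Illustrative only: an expert index is "aligned" when idx % stride == 0,
// where the stride is supplied as log2(stride).
#include <cassert>
#include <cstdint>

bool isAlignedBuggy(uint32_t localExpertIdx, uint32_t strideLog2)
{
    // ANDs with the log2 value itself; only meaningful when strideLog2 is 0.
    return (localExpertIdx & strideLog2) == 0;
}

bool isAlignedFixed(uint32_t localExpertIdx, uint32_t strideLog2)
{
    // Masks the low strideLog2 bits, i.e. idx % (1 << strideLog2) == 0.
    return (localExpertIdx & ((1u << strideLog2) - 1)) == 0;
}

int main()
{
    // stride = 4 (strideLog2 = 2): only indices 0, 4, 8, ... are aligned.
    assert(isAlignedFixed(4, 2) && !isAlignedFixed(5, 2));
    // The buggy form only inspects bit 1 (the value 2), so index 5 (binary 101)
    // is wrongly treated as aligned.
    assert(isAlignedBuggy(5, 2));
    return 0;
}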
🧹 Nitpick comments (11)
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1)

2-2: Update the copyright year to include 2026.

The header currently says 2022-2024, but this file is being modified now. As per coding guidelines: "update year on modified files."

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
tests/unittest/_torch/thop/serial/test_moe.py (2)

1095-1105: Large memory footprint — consider CI impact.

With num_experts=2048, hidden_size=1024, and intermediate_size=1024, gemm1_weights alone is (2048, 2048, 1024) in bf16 (~8 GB), plus gemm2_weights (~4 GB), plus all quantized copies and scales. Combined with the parametrize cross-product (num_tokens × intermediate_size × act_type), this generates multiple heavy test cases.

Consider either:

  • Reducing hidden_size/intermediate_size for this specific large-expert case, or
  • Limiting the cross-product (e.g., fixing intermediate_size=768 and a single act_type for this param).

1189-1199: Same memory concern as test_autotune applies here.

This test_no_autotune variant additionally cross-products with use_topk_as_input=[False, True], further doubling the large-expert test matrix. The use_topk_as_input=True path hits the DeepSeekV3-only skip at line 1357, but the False path still runs all combinations.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (1)

29-30: Commented-out code should use #if/#endif instead of comments.

Line 29 uses a comment to disable code (// static constexpr int MaxNumExperts = 128;). If this is dead code, remove it entirely. If it's intentionally kept for reference, use #if 0 / #endif per the coding guidelines.

As per coding guidelines: "Use #if / #endif to disable code, preferably with a mnemonic condition... Do not use comments to disable code."

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (1)

37-70: Good refactor to do { ... } while (0) for the LAUNCH_PDL macro.

This is a well-known best practice for multi-statement macros (a generic, self-contained example of the idiom appears after these nitpick comments). One minor coding guideline nit: the if on lines 56 and 65 should be followed by brace-delimited statements.

As per coding guidelines: "If and else should always be followed by brace-delimited statements, even if empty or a single statement."

Proposed fix (lines 56-58, 65-67)
             if (smemSize > 48 * 1024)                                                                  \
-                TLLM_CUDA_CHECK(                                                                       \
-                    cudaFuncSetAttribute(kernelTyped, cudaFuncAttributeMaxDynamicSharedMemorySize, smemSize));         \
+            {                                                                                          \
+                TLLM_CUDA_CHECK(                                                                       \
+                    cudaFuncSetAttribute(kernelTyped, cudaFuncAttributeMaxDynamicSharedMemorySize, smemSize));         \
+            }                                                                                          \

Apply similarly to both the true and false UsePdl branches.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/RoutingDeepSeekCommon.cuh (1)

26-29: std::max with initializer list may need explicit <algorithm> include.

Line 29 uses std::max({...}) with an initializer list, which requires <algorithm> (and implicitly <initializer_list>). These may be provided transitively via RoutingKernel.cuh → DevKernel.h, but relying on transitive includes is fragile. Consider a direct #include <algorithm>, or alternatively, since all three values are compile-time constants, a simpler expression like nested std::max calls or just hardcoding 512 (the actual max) would avoid the dependency entirely.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchInitExpertCounts.cu (1)

27-30: Missing /*coopLaunch=*/ inline comment for consistency.

All other launcher wrappers annotate the coopLaunch parameter with an inline comment (/*coopLaunch=*/false). This one passes bare false.

Suggested fix
-    LAUNCH_ROUTING_DEEPSEEK(data, false, routingInitExpertCounts, (2 * data.mNumExperts - 1) / numThreadsHist + 1,
+    LAUNCH_ROUTING_DEEPSEEK(data, /*coopLaunch=*/false, routingInitExpertCounts, (2 * data.mNumExperts - 1) / numThreadsHist + 1,

As per coding guidelines: "In function calls where parameters are not obvious, use inline C comments to document the parameter."

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchCoopKernel.cu (1)

216-222: TODO note: PDL visibility concern is documented but unresolved.

The comment on line 218 explicitly states: "this is not sufficient to ensure visibility in the next kernel!" This suggests the secondary kernel may observe stale data for mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas, and mPtrPermutedIdxSize.

Is there a tracking issue for this? If the dependent FC1 kernel relies on these outputs being visible, a missing memory fence before cudaTriggerProgrammaticLaunchCompletion could cause data races on SM90+.

Would you like me to open an issue to track this PDL visibility concern?

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchClusterKernel.cu (1)

44-49: Non-SM90 fallback is missing __launch_bounds__ unlike the SM90+ path.

The SM90+ variant (line 27) uses __launch_bounds__(KernelParams::MaxNumExperts) while the fallback (line 45) has no launch bounds annotation. For consistency with the Renormalize variant (which uses __launch_bounds__(NumThreads) on its fallback), consider adding it here. This is minor since the fallback only asserts false.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramScoresKernel.cu (1)

44-44: Unused variable minScore.

minScore is initialized to -INFINITY but never read or passed to any function in this kernel. Likely a copy-paste artifact from the block kernel. Remove it to avoid compiler warnings and dead code.

Suggested fix
-    BaseType minScore = BaseType{-INFINITY};
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchBlockKernel.cu (1)

100-100: Unused variable minScore.

Same as in launchHistogramScoresKernel.cu: minScore is initialized but never referenced. Remove to avoid dead code.

Suggested fix
-        BaseType minScore = BaseType{-INFINITY};
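
Since the DevKernel.h nitpick above leans on the do { ... } while (0) idiom, here is a generic, self-contained example of the failure mode it prevents. The LOG_TWICE names are hypothetical and are not taken from the PR.

// Illustrative only: why multi-statement macros are wrapped in do { ... } while (0).
#include <cstdio>

#define LOG_TWICE_BAD(msg) std::printf("%s\n", msg); std::printf("%s\n", msg)

#define LOG_TWICE_GOOD(msg)                                                    \
    do                                                                         \
    {                                                                          \
        std::printf("%s\n", msg);                                              \
        std::printf("%s\n", msg);                                              \
    } while (0)

int main()
{
    bool verbose = false;
    if (verbose)
        LOG_TWICE_BAD("hello");   // second printf escapes the if and always runs
    if (verbose)
        LOG_TWICE_GOOD("hello");  // expands to a single statement and swallows the trailing ';'
    return 0;
}

Wrapping the macro body this way makes the expansion behave as one statement and lets callers terminate it with a semicolon, which is presumably why the refactored LAUNCH_PDL macro adopts it.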

@ChristinaZ ChristinaZ assigned yweng0828 and ChristinaZ and unassigned yweng0828 Mar 3, 2026
@ChristinaZ ChristinaZ requested review from byshiue and yweng0828 March 3, 2026 14:45
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@ChristinaZ ChristinaZ changed the title from "[Draft] Add support for expert_number<=2048 and K<=32" to "[None][feat] Add support for expert_number<=2048 and K<=32" on Mar 3, 2026
@ChristinaZ (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #37524 [ run ] triggered by Bot. Commit: ce58d29 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37524 [ run ] completed with state SUCCESS. Commit: ce58d29
/LLM/main/L0_MergeRequest_PR pipeline #29032 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ChristinaZ (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #37621 [ run ] triggered by Bot. Commit: ce58d29 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37621 [ run ] completed with state SUCCESS. Commit: ce58d29
/LLM/main/L0_MergeRequest_PR pipeline #29112 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@jiahanc (Collaborator) commented Mar 4, 2026

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37662 [ run ] triggered by Bot. Commit: ce58d29 Link to invocation

@ChristinaZ (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #37668 [ run ] triggered by Bot. Commit: ce58d29 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37668 [ run ] completed with state SUCCESS. Commit: ce58d29
/LLM/main/L0_MergeRequest_PR pipeline #29153 completed with status: 'SUCCESS'

Link to invocation

@byshiue (Collaborator) left a comment

LGTM

@byshiue byshiue merged commit e01c38f into NVIDIA:main Mar 5, 2026
6 of 9 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Mar 9, 2026

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
jiahanc added a commit to flashinfer-ai/flashinfer that referenced this pull request Mar 17, 2026

## 📌 Description


- Integrate NVIDIA/TensorRT-LLM#11510 to support
2048 num of experts and 32 TopK in renormalize
- Refactor MOE cu files

## 🔍 Related Issues


## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes



## Summary by CodeRabbit

* **New Features**
* Expanded MoE routing/renormalize to support up to 2,048 experts and
top-k up to 32; backend reorganized to enable larger configurations.

* **Bug Fixes**
* Clamped token counts in kernel launches to prevent oversized grid
launches.

* **Performance**
* Reworked routing/launch paths for improved scalability and throughput
with large expert/top-k settings.

* **Tests**
* Added test scenarios covering large-expert (2,048) + top-k (32)
configurations.

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
tianyuz-nv pushed a commit to wanqian-nv/TensorRT-LLM that referenced this pull request Mar 19, 2026

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
murphymatt pushed a commit to fw-ai/flashinfer that referenced this pull request Mar 31, 2026 (same commit message as the flashinfer-ai/flashinfer commit above)