[None][feat] Add support for expert_number<=2048 and K<=32 #11510

Merged

byshiue merged 1 commit into NVIDIA/TensorRT-LLM:main from ChristinaZ/TensorRT-LLM:add_2048_support on Mar 5, 2026

Conversation

@ChristinaZ (Collaborator) commented Feb 13, 2026

Summary by CodeRabbit

  • New Features

    • Extended expert count support from 512 to 2048 experts.
    • Raised the maximum supported top-K value from 22 to 32.
    • Added multi-expert-per-thread processing capability for improved scalability.
  • Performance Improvements

    • Optimized kernel dispatch with parallel compilation support.
    • Enhanced top-K sorting with adaptive bitonic and odd-even strategies.

Description

  • Refactor the code to reduce compile time.
  • Extend top-K support to up to 2048 expert candidates and K up to 32.

Test Coverage

cd cpp/build
./tests/unit_tests/kernels/routingKernelsTest

pytest tests/unittest/_torch/thop/serial/test_moe.py -k "test_no_autotune and RoutingRenormalize_large_experts" -v
pytest tests/unittest/_torch/thop/serial/test_moe.py::TestMoeFp4::test_autotune -k "RoutingRenormalize_large_experts" -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@coderabbitai coderabbitai bot (Contributor) commented Feb 13, 2026

📝 Walkthrough

The PR refactors MOE routing kernels from monolithic designs to a split-compile launcher pattern, introduces per-thread multi-expert handling for larger expert counts, increases MaxSupportedTopExperts from 22 to 32 and total expert support to 2048, and updates validator constraints accordingly.
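
The per-thread multi-expert handling mentioned above can be pictured with a short sketch. This is an illustrative CUDA fragment, not the PR's kernel code: the names (countExpertHistogram, selectedExperts, smemExpertCount) and the exact layout are assumptions based on this walkthrough.

// Illustrative sketch only: when MaxNumExperts exceeds the block size, each
// thread owns ExpertsPerThread expert slots so one block can still cover up
// to 2048 experts. Names and layout are hypothetical.
template <int MaxNumExperts, int NumThreads>
__global__ void countExpertHistogram(int const* selectedExperts, int numSelections, int* expertCount)
{
    constexpr int ExpertsPerThread = (MaxNumExperts + NumThreads - 1) / NumThreads;
    __shared__ int smemExpertCount[MaxNumExperts];

    // Each thread clears the expert slots it owns (with a bounds check).
#pragma unroll
    for (int i = 0; i < ExpertsPerThread; ++i)
    {
        int const expert = threadIdx.x + i * NumThreads;
        if (expert < MaxNumExperts)
        {
            smemExpertCount[expert] = 0;
        }
    }
    __syncthreads();

    // Grid-stride over the selected expert indices, counting in shared memory.
    for (int k = blockIdx.x * NumThreads + threadIdx.x; k < numSelections; k += gridDim.x * NumThreads)
    {
        atomicAdd(&smemExpertCount[selectedExperts[k]], 1);
    }
    __syncthreads();

    // Each thread flushes the slots it owns to the global histogram.
#pragma unroll
    for (int i = 0; i < ExpertsPerThread; ++i)
    {
        int const expert = threadIdx.x + i * NumThreads;
        if (expert < MaxNumExperts)
        {
            atomicAdd(&expertCount[expert], smemExpertCount[expert]);
        }
    }
}

In this sketch, ExpertsPerThread is derived at compile time from MaxNumExperts and the block size, so raising the expert limit does not require wider thread blocks.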

Changes

  • Build Configuration (CMakeLists.txt): Adds the CUDA --split-compile=0 flag to parallelize ptxas compilation for the FP8 block scale MOE target.
  • Kernel Parameter Infrastructure (RoutingKernel.h, DevKernel.h): Extends KernelParamsBase with a new MaxNumTopExperts_ template parameter; updates routing macros (LAUNCH_ROUTING_LLAMA4, LAUNCH_ROUTING_WITH_NUM_EXPERTS*) to propagate the numTopExperts parameter and restructures PDL launch blocks into unified do-while control flow.
  • Routing Kernel Core Logic (RoutingKernel.cuh): Implements a multi-expert-per-thread strategy with an ExpertsPerThread constant, converts single-expert-per-thread arrays into MaxNumExperts-sized arrays, and updates histogram/offset kernels and shared memory layouts to handle larger expert counts with per-thread expert loops and bounds checks.
  • Top-K Sorting Infrastructure (RoutingKernelTopK.cuh): Increases MaxSupportedTopExperts from 22 to 32; introduces an IsPowerOf2 compile-time utility; replaces the templated Sort with a dispatch-based implementation that selects bitonic sort for power-of-two N and odd-even sort for non-power-of-two N; renames reduceTopKFunc to reduceTopK; relaxes constraints (K ≤ 32, N ≤ 64). A simplified standalone sketch of this dispatch follows after this list.
  • DeepSeek Routing Organization (RoutingDeepSeek.cu, routingDeepSeek/RoutingDeepSeekCommon.cuh): Replaces monolithic kernel implementations with forward declarations for six launch wrappers; introduces a common header defining expert count tiers, helper dispatch macros (LAUNCH_DEEPSEEK_WITH_TOPK, LAUNCH_ROUTING_DEEPSEEK), and a getMaxNumExperts() function for tier selection.
  • DeepSeek Kernel Launchers (routingDeepSeek/launchMainKernel.cu, routingDeepSeek/launchClusterKernel.cu, routingDeepSeek/launchCoopKernel.cu, routingDeepSeek/launch{Histogram,InitExpertCounts,Offsets}Kernel.cu): Implements six new kernel entry points: main (top-k routing with group support and score computation), cluster (per-expert token assignment), coop (cooperative multi-expert mapping with PDL), and three utility kernels (histogram, init expert counts, offsets), each with a dedicated launcher function.
  • Renormalize Routing Organization (RoutingRenormalize.cu, routingRenormalize/RoutingRenormalizeCommon.cuh): Replaces large kernel implementations with forward declarations and a common header containing tier constants (NumExperts128/512Experts, MaxSupportedExperts=2048), the helper function getMaxNumExperts(), and dispatch macros (LAUNCH_ROUTING_WITH_TOPK, LAUNCH_ROUTING_RENORNALIZE).
  • Renormalize Kernel Launchers (routingRenormalize/launch{Block,Cluster,Histogram,HistogramScores,InitExpertCounts,Offsets}Kernel.cu): Implements six new launcher functions coordinating block-level, cluster-level, and histogram-based routing with top-K extraction, shared memory management, expert count initialization, and offset computation, with optional softmax pre-processing.
  • Constant and Constraint Updates (RoutingLlama4.cu): Renames NumExpertsLimit to MaxSupportedExperts (value 128).
  • Host-Side Validation (thop/fp4BlockScaleMoe.cpp, thop/fp8BlockScaleMoe.cpp, thop/fp8PerTensorScaleMoe.cpp, thop/mxFp4BlockScaleMoe.cpp): Increases the top_k maximum from 10 to 32 in the Renormalize/RenormalizeNaive paths; increases the num_experts maximum from 512 to 2048; adds new validation for the Renormalize routing method in the FP8 PerTensor variant.
  • Tests (routingRenormalizeTest.cpp, routingTest.cpp, test_moe.py): Updates test parameters to exercise larger expert counts (512→2048) and top-K values (8→10, 10→32); relaxes Python test bounds for top-K from 22 to 32; removes minor whitespace.
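
To make the bitonic/odd-even dispatch in the Top-K Sorting Infrastructure entry concrete, here is a small, self-contained host-side sketch. It is illustrative only: the function names and the descending-order convention are assumptions, and the real RoutingKernelTopK.cuh code sorts per-thread register arrays inside the kernels rather than std::array values on the host.

// Illustrative sketch: compile-time dispatch between a bitonic sorting
// network (power-of-two sizes only) and odd-even transposition (any size),
// both ordering scores largest-first.
#include <array>
#include <cstdio>
#include <utility>

constexpr bool isPowerOf2(int n)
{
    return n > 0 && (n & (n - 1)) == 0;
}

template <int N, typename T>
void oddEvenSort(std::array<T, N>& a)
{
    // N rounds of alternating even/odd compare-exchange passes.
    for (int round = 0; round < N; ++round)
    {
        for (int i = round % 2; i + 1 < N; i += 2)
        {
            if (a[i] < a[i + 1])
            {
                std::swap(a[i], a[i + 1]);
            }
        }
    }
}

template <int N, typename T>
void bitonicSort(std::array<T, N>& a)
{
    static_assert(isPowerOf2(N), "bitonic network needs a power-of-two size");
    for (int k = 2; k <= N; k *= 2)
    {
        for (int j = k / 2; j > 0; j /= 2)
        {
            for (int i = 0; i < N; ++i)
            {
                int const partner = i ^ j;
                if (partner > i)
                {
                    // Segments with (i & k) == 0 are ordered largest-first.
                    bool const ascending = (i & k) != 0;
                    if ((a[i] < a[partner]) != ascending)
                    {
                        std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }
}

template <int N, typename T>
void sortTopK(std::array<T, N>& a)
{
    if constexpr (isPowerOf2(N))
    {
        bitonicSort<N, T>(a);
    }
    else
    {
        oddEvenSort<N, T>(a);
    }
}

int main()
{
    std::array<float, 6> scores{0.1f, 0.7f, 0.3f, 0.9f, 0.2f, 0.5f};  // N = 6 takes the odd-even path
    sortTopK<6, float>(scores);
    for (float s : scores)
    {
        std::printf("%.1f ", s);  // prints 0.9 0.7 0.5 0.3 0.2 0.1
    }
    return 0;
}

A bitonic network is compact and branch-free but only valid for power-of-two input sizes, while odd-even transposition works for any N, so dispatching between the two covers every K up to 32.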

Sequence Diagram(s)

sequenceDiagram
    actor Host
    participant Launcher as Launch Wrapper
    participant MainKernel as Main Kernel
    participant GridSync as PDL Grid Sync
    participant HistoKernel as Histogram Kernel
    participant OffsetKernel as Offset Kernel

    Host->>Launcher: launchMainKernel()
    Launcher->>MainKernel: Execute (compute top-k scores)
    MainKernel-->>Host: (if no PDL: complete)
    MainKernel->>GridSync: Trigger PDL
    
    Host->>Launcher: launchHistogramKernel()
    Launcher->>HistoKernel: Execute (count experts)
    HistoKernel-->>Host: (if no PDL: complete)
    HistoKernel->>GridSync: Trigger PDL
    
    Host->>Launcher: launchOffsetsKernel()
    Launcher->>OffsetKernel: Execute (compute offsets)
    OffsetKernel-->>Host: Complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • MatthiasKohl
  • syuoni
🚥 Pre-merge checks | ✅ 1 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 38.78%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Merge Conflict Detection (⚠️ Warning): ❌ Merge conflicts detected (127 files):

⚔️ README.md (content)
⚔️ cpp/CMakeLists.txt (content)
⚔️ cpp/include/tensorrt_llm/batch_manager/llmRequest.h (content)
⚔️ cpp/include/tensorrt_llm/executor/executor.h (content)
⚔️ cpp/include/tensorrt_llm/executor/serialization.h (content)
⚔️ cpp/include/tensorrt_llm/executor/types.h (content)
⚔️ cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (content)
⚔️ cpp/tensorrt_llm/common/ncclUtils.cpp (content)
⚔️ cpp/tensorrt_llm/common/opUtils.h (content)
⚔️ cpp/tensorrt_llm/executor/cache_transmission/nixl_utils/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/executor/multimodalInput.cpp (content)
⚔️ cpp/tensorrt_llm/executor/serialization.cpp (content)
⚔️ cpp/tensorrt_llm/executor/serializeUtils.h (content)
⚔️ cpp/tensorrt_llm/kernels/causalConv1d/causalConv1d.cu (content)
⚔️ cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_v2.cpp (content)
⚔️ cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplPrecompiled.cpp (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/layernorm_param.h (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/low_latency_layernorm.cuh (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm.cuh (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm.h (content)
⚔️ cpp/tensorrt_llm/kernels/fusedLayernormKernels/ws_layernorm_fp4_traits.cu (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.cuh (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernel.h (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingKernelTopK.cuh (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (content)
⚔️ cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaKernels.h (content)
⚔️ cpp/tensorrt_llm/nanobind/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (content)
⚔️ cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (content)
⚔️ cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (content)
⚔️ cpp/tensorrt_llm/nanobind/executor/bindings.cpp (content)
⚔️ cpp/tensorrt_llm/nanobind/executor/request.cpp (content)
⚔️ cpp/tensorrt_llm/plugins/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/plugins/ncclPlugin/allreducePlugin.cpp (content)
⚔️ cpp/tensorrt_llm/runtime/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/runtime/ipcNvlsMemory.cu (content)
⚔️ cpp/tensorrt_llm/thop/CMakeLists.txt (content)
⚔️ cpp/tensorrt_llm/thop/allreduceOp.cpp (content)
⚔️ cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (content)
⚔️ cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (content)
⚔️ cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (content)
⚔️ cpp/tensorrt_llm/thop/fusedAddRMSNormQuant.cpp (content)
⚔️ cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (content)
⚔️ cpp/tests/e2e_tests/batch_manager/trtGptModelTest.cpp (content)
⚔️ cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (content)
⚔️ cpp/tests/unit_tests/batch_manager/microBatchSchedulerTest.cpp (content)
⚔️ cpp/tests/unit_tests/executor/serializeUtilsTest.cpp (content)
⚔️ cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp (content)
⚔️ cpp/tests/unit_tests/kernels/routing/routingTest.cpp (content)
⚔️ cpp/tests/unit_tests/runtime/virtualMemoryTest.cpp (content)
⚔️ docs/source/features/kvcache.md (content)
⚔️ examples/constraints.txt (content)
⚔️ examples/models/core/glm-4-9b/README.md (content)
⚔️ requirements.txt (content)
⚔️ scripts/build_wheel.py (content)
⚔️ tensorrt_llm/_torch/attention_backend/trtllm.py (content)
⚔️ tensorrt_llm/_torch/auto_deploy/transform/library/ssm_cache.py (content)
⚔️ tensorrt_llm/_torch/custom_ops/cpp_custom_ops.py (content)
⚔️ tensorrt_llm/_torch/custom_ops/torch_custom_ops.py (content)
⚔️ tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (content)
⚔️ tensorrt_llm/_torch/models/modeling_llava_next.py (content)
⚔️ tensorrt_llm/_torch/models/modeling_nemotron_h.py (content)
⚔️ tensorrt_llm/_torch/models/modeling_qwen2vl.py (content)
⚔️ tensorrt_llm/_torch/models/modeling_qwen3vl.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/communication/deep_ep_low_latency.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/configurable_moe.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_cute_dsl.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_deepgemm.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/interface.py (content)
⚔️ tensorrt_llm/_torch/modules/fused_moe/quantization.py (content)
⚔️ tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py (content)
⚔️ tensorrt_llm/_torch/modules/mamba/mamba2_mixer.py (content)
⚔️ tensorrt_llm/_torch/modules/mamba/ssd_chunk_scan.py (content)
⚔️ tensorrt_llm/_torch/modules/mamba/ssd_chunk_state.py (content)
⚔️ tensorrt_llm/_torch/modules/mlp.py (content)
⚔️ tensorrt_llm/_torch/modules/rms_norm.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/_util.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/llm_request.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/mamba_cache_manager.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/model_engine.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/py_executor.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/request_utils.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/resource_manager.py (content)
⚔️ tensorrt_llm/_torch/pyexecutor/sampler.py (content)
⚔️ tensorrt_llm/_torch/utils.py (content)
⚔️ tensorrt_llm/_utils.py (content)
⚔️ tensorrt_llm/executor/base_worker.py (content)
⚔️ tensorrt_llm/executor/executor.py (content)
⚔️ tensorrt_llm/executor/result.py (content)
⚔️ tensorrt_llm/inputs/data.py (content)
⚔️ tensorrt_llm/inputs/multimodal.py (content)
⚔️ tensorrt_llm/inputs/registry.py (content)
⚔️ tensorrt_llm/llmapi/llm.py (content)
⚔️ tensorrt_llm/llmapi/llm_args.py (content)
⚔️ tensorrt_llm/serve/openai_server.py (content)
⚔️ tensorrt_llm/tools/layer_wise_benchmarks/runner.py (content)
⚔️ tensorrt_llm/version.py (content)
⚔️ tests/integration/defs/perf/pytorch_model_config.py (content)
⚔️ tests/integration/defs/perf/test_perf.py (content)
⚔️ tests/integration/test_lists/qa/llm_spark_perf.yml (content)
⚔️ tests/integration/test_lists/test-db/l0_dgx_b200.yml (content)
⚔️ tests/integration/test_lists/test-db/l0_dgx_h100.yml (content)
⚔️ tests/integration/test_lists/waives.txt (content)
⚔️ tests/unittest/_torch/executor/test_py_executor.py (content)
⚔️ tests/unittest/_torch/executor/test_request_utils.py (content)
⚔️ tests/unittest/_torch/modeling/test_modeling_llama.py (content)
⚔️ tests/unittest/_torch/modules/moe/quantize_utils.py (content)
⚔️ tests/unittest/_torch/modules/moe/test_moe_backend.py (content)
⚔️ tests/unittest/_torch/modules/moe/test_moe_module.py (content)
⚔️ tests/unittest/_torch/multimodal/test_find_num_image_tokens.py (content)
⚔️ tests/unittest/_torch/multimodal/test_mm_encoder_standalone.py (content)
⚔️ tests/unittest/_torch/sampler/test_torch_sampler.py (content)
⚔️ tests/unittest/_torch/thop/serial/test_moe.py (content)
⚔️ tests/unittest/api_stability/api_stability_core.py (content)
⚔️ tests/unittest/api_stability/references_committed/request_output.yaml (content)
⚔️ tests/unittest/bindings/test_executor_bindings.py (content)
⚔️ tests/unittest/llmapi/test_llm_kv_cache_events.py (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
  • Description check (❓ Inconclusive): The PR description is minimal and lacks key details about the refactoring approach and implementation strategy. Resolution: add more detail explaining (1) the specific refactoring techniques used to reduce compile time (e.g., split-compile, code reorganization, splitting kernels into separate files); (2) why extending top-K to 32 and the expert count to 2048 required refactoring; (3) an impact assessment (performance, memory, compatibility); (4) any breaking changes or migration notes.
✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title clearly describes the main change, extending support for expert numbers up to 2048 and K up to 32, which aligns with the substantial refactoring and parameter expansion throughout the changeset.


@coderabbitai coderabbitai bot left a comment (Contributor)

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to include 2026.

The file is being modified in 2026 but the copyright header still reads 2022-2025. As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

Proposed fix
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to include 2026.

The header currently reads 2022-2024, but this file is being modified in 2026. As per coding guidelines, "update year on modified files."

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
tests/unittest/_torch/thop/serial/test_moe.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to 2026.

The copyright header says 2022-2024, but this file is being modified in 2026. As per coding guidelines, all source files must update the year on modified files.

-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to include 2026.

The copyright header reads 2022-2024, but this file has meaningful modifications in 2026. As per coding guidelines, "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Copyright year needs updating.

The copyright header says 2022-2024 but this file has been modified in 2025. Update to 2022-2025.

As per coding guidelines: "All source files must contain an NVIDIA copyright header with the year of latest meaningful modification."

Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu (1)

42-44: ⚠️ Potential issue | 🟡 Minor

Copy-paste error in error message: says "Llama4" but this is the Renormalize path.

Line 44 reads "Llama4 routing kernel expects permuted idx..." but this is routingRenormalize::run. Should say "Renormalize routing kernel".

Proposed fix
     TLLM_CHECK_WITH_INFO(data.mPtrPermutedIdxSize != nullptr && data.mPtrCtaIdxXyToBatchIdx != nullptr
             && data.mPtrCtaIdxXyToMnLimit != nullptr && data.mPtrNumNonExitingCtas != nullptr,
-        "Llama4 routing kernel expects permuted idx and grouped Gemm launch config buffers");
+        "Renormalize routing kernel expects permuted idx and grouped Gemm launch config buffers");
🤖 Fix all issues with AI agents
In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchCoopKernel.cu`:
- Around line 78-81: The stride-alignment check in launchCoopKernel.cu is using
a bitwise test `(localExpertIdx & params.mLocalExpertsStrideLog2) == 0` which is
incorrect; update the check used when computing isLocalExpert (the expression
that now references params.mLocalExpertsStrideLog2) to use a proper mask:
`(localExpertIdx & ((1u << params.mLocalExpertsStrideLog2) - 1)) == 0`, then
keep the rest of the logic that computes expertOffsets (the ternary
atomicAdd(smemExpertCount + expertIdx, 1) : 0) unchanged; ensure you apply the
same corrected mask pattern to the other identical checks in this file (and
mirror the fix in the other files cited) so that localExpertIdx,
params.mLocalExpertsStrideLog2, expertOffsets and the atomicAdd usage behave
correctly for stride log2 >= 1. (A small standalone illustration of the corrected mask appears after this fix list.)

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchHistogramKernel.cu`:
- Line 2: The copyright header in launchHistogramKernel.cu (and other new
launcher .cu files listed: launchOffsetsKernel.cu, launchInitExpertCounts.cu,
launchCoopKernel.cu, launchClusterKernel.cu, launchMainKernel.cu) stops at
"2022-2025" but these are new 2026 files—update the header to include 2026
(e.g., "2022-2026" or the appropriate range) so the file-level copyright
reflects the latest modification year; modify the top-of-file comment in each
mentioned file (launchHistogramKernel.cu and the other launcher .cu filenames)
to replace "2022-2025" with the correct year range including 2026.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchMainKernel.cu`:
- Around line 204-205: Rename the misspelled local arrays intermidiateScore and
intermidiateExpert to intermediateScore and intermediateExpert, and update every
usage/reference to these symbols (including places originally between the nearby
block where they are read/written) to match the corrected names so compilation
and semantics remain consistent.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchBlockKernel.cu`:
- Around line 167-187: The two consecutive CUB BlockScan calls using the same
TempStorage (Scan(tempStorage).ExclusiveSum(numCtaPerExpert, ctaOffsetPerExpert,
numNonExitingCtas) and Scan(tempStorage).ExclusiveSum(tmpCountPerExpert,
expertScanCountsPerExpert)) must be separated by a barrier: add a
__syncthreads() immediately after the first ExclusiveSum (after
ctaOffsetPerExpert/numNonExitingCtas are computed) and before computing
tmpCountPerExpert to prevent TempStorage reuse races; keep the existing
__syncthreads() after the second scan as well and ensure the barrier location
references the existing tempStorage, ctaOffsetPerExpert, accExpertCount,
tmpCountPerExpert, and expertScanCountsPerExpert variables.
- Around line 128-131: The check treating mLocalExpertsStrideLog2 as a bitmask
is wrong: wherever you use (localExpIdx & params.mLocalExpertsStrideLog2) == 0
(e.g., in the localExpIdx/isLocal computation and the later conditional in
launchBlockKernel.cu), replace it with a proper mask built from the log2 value,
i.e. compute mask = (1u << params.mLocalExpertsStrideLog2) - 1 and test
(localExpIdx & mask) == 0; update uses of localExpIdx,
params.mLocalExpertsStartIdx, params.mNumLocalExperts, and
params.mLocalExpertsStrideLog2 in this file (and mirror the same change in other
affected functions like routingDeepSeek, RoutingLlama4.cu, RoutingKernel.cuh) so
the stride-log2 is interpreted correctly.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchInitExpertCounts.cu`:
- Line 2: Replace the outdated copyright header string "2022-2025" with the
correct latest-year range that includes 2026 (e.g. "2022-2026" or a header that
ends with 2026) in the new launcher files (e.g. launchInitExpertCounts.cu and
the other five launchers under routingRenormalize) so the top-of-file NVIDIA
copyright reflects the 2026 modification; search for the exact literal
"2022-2025" in each new file and update it consistently.

In
`@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/RoutingRenormalizeCommon.cuh`:
- Line 136: The macro name is misspelled: replace all occurrences of
LAUNCH_ROUTING_RENORNALIZE with the correctly spelled
LAUNCH_ROUTING_RENORMALIZE; update the macro definition in
RoutingRenormalizeCommon.cuh and rename every invocation where the old macro is
used so the compile-time symbol matches (ensure you update the macro token in
the six call sites that reference it), then rebuild to verify no remaining
references to LAUNCH_ROUTING_RENORNALIZE remain.
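
To make the stride-alignment fix suggested above concrete, here is a tiny standalone illustration of why AND-ing an index with the log2 of the stride is wrong and what the masked form computes instead. Variable names mirror the review comments; the snippet is illustrative, not kernel code.

// Illustrative only: an expert index is "aligned" when idx % stride == 0,
// where the stride is supplied as log2(stride).
#include <cassert>
#include <cstdint>

bool isAlignedBuggy(uint32_t localExpertIdx, uint32_t strideLog2)
{
    // ANDs with the log2 value itself; only meaningful when strideLog2 is 0.
    return (localExpertIdx & strideLog2) == 0;
}

bool isAlignedFixed(uint32_t localExpertIdx, uint32_t strideLog2)
{
    // Masks the low strideLog2 bits, i.e. idx % (1 << strideLog2) == 0.
    return (localExpertIdx & ((1u << strideLog2) - 1)) == 0;
}

int main()
{
    // stride = 4 (strideLog2 = 2): only indices 0, 4, 8, ... are aligned.
    assert(isAlignedFixed(4, 2) && !isAlignedFixed(5, 2));
    // The buggy form only inspects bit 1 (the value 2), so index 5 (binary 101)
    // is wrongly treated as aligned.
    assert(isAlignedBuggy(5, 2));
    return 0;
}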
🧹 Nitpick comments (11)
cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1)

2-2: Update the copyright year to include 2026.

The header currently says 2022-2024, but this file is being modified now. As per coding guidelines: "update year on modified files."

- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
tests/unittest/_torch/thop/serial/test_moe.py (2)

1095-1105: Large memory footprint — consider CI impact.

With num_experts=2048, hidden_size=1024, and intermediate_size=1024, gemm1_weights alone is (2048, 2048, 1024) in bf16 (~8 GB), plus gemm2_weights (~4 GB), plus all quantized copies and scales. Combined with the parametrize cross-product (num_tokens × intermediate_size × act_type), this generates multiple heavy test cases.

Consider either:

  • Reducing hidden_size/intermediate_size for this specific large-expert case, or
  • Limiting the cross-product (e.g., fixing intermediate_size=768 and a single act_type for this param).

1189-1199: Same memory concern as test_autotune applies here.

This test_no_autotune variant additionally cross-products with use_topk_as_input=[False, True], further doubling the large-expert test matrix. The use_topk_as_input=True path hits the DeepSeekV3-only skip at line 1357, but the False path still runs all combinations.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingLlama4.cu (1)

29-30: Commented-out code should use #if/#endif instead of comments.

Line 29 uses a comment to disable code (// static constexpr int MaxNumExperts = 128;). If this is dead code, remove it entirely. If it's intentionally kept for reference, use #if 0 / #endif per the coding guidelines.

As per coding guidelines: "Use #if / #endif to disable code, preferably with a mnemonic condition... Do not use comments to disable code."

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h (1)

37-70: Good refactor to do { ... } while (0) for the LAUNCH_PDL macro.

This is a well-known best practice for multi-statement macros (a generic, self-contained example of the idiom appears after these nitpick comments). One minor coding guideline nit: the if on lines 56 and 65 should be followed by brace-delimited statements.

As per coding guidelines: "If and else should always be followed by brace-delimited statements, even if empty or a single statement."

Proposed fix (lines 56-58, 65-67)
             if (smemSize > 48 * 1024)                                                                  \
-                TLLM_CUDA_CHECK(                                                                       \
-                    cudaFuncSetAttribute(kernelTyped, cudaFuncAttributeMaxDynamicSharedMemorySize, smemSize));         \
+            {                                                                                          \
+                TLLM_CUDA_CHECK(                                                                       \
+                    cudaFuncSetAttribute(kernelTyped, cudaFuncAttributeMaxDynamicSharedMemorySize, smemSize));         \
+            }                                                                                          \

Apply similarly to both the true and false UsePdl branches.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/RoutingDeepSeekCommon.cuh (1)

26-29: std::max with initializer list may need explicit <algorithm> include.

Line 29 uses std::max({...}) with an initializer list, which requires <algorithm> (and implicitly <initializer_list>). These may be provided transitively via RoutingKernel.cuh → DevKernel.h, but relying on transitive includes is fragile. Consider a direct #include <algorithm>, or alternatively, since all three values are compile-time constants, a simpler expression like nested std::max calls or just hardcoding 512 (the actual max) would avoid the dependency entirely.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchInitExpertCounts.cu (1)

27-30: Missing /*coopLaunch=*/ inline comment for consistency.

All other launcher wrappers annotate the coopLaunch parameter with an inline comment (/*coopLaunch=*/false). This one passes bare false.

Suggested fix
-    LAUNCH_ROUTING_DEEPSEEK(data, false, routingInitExpertCounts, (2 * data.mNumExperts - 1) / numThreadsHist + 1,
+    LAUNCH_ROUTING_DEEPSEEK(data, /*coopLaunch=*/false, routingInitExpertCounts, (2 * data.mNumExperts - 1) / numThreadsHist + 1,

As per coding guidelines: "In function calls where parameters are not obvious, use inline C comments to document the parameter."

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchCoopKernel.cu (1)

216-222: TODO note: PDL visibility concern is documented but unresolved.

The comment on line 218 explicitly states: "this is not sufficient to ensure visibility in the next kernel!" This suggests the secondary kernel may observe stale data for mPtrCtaIdxXyToBatchIdx, mPtrCtaIdxXyToMnLimit, mPtrNumNonExitingCtas, and mPtrPermutedIdxSize.

Is there a tracking issue for this? If the dependent FC1 kernel relies on these outputs being visible, a missing memory fence before cudaTriggerProgrammaticLaunchCompletion could cause data races on SM90+.

Would you like me to open an issue to track this PDL visibility concern?

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchClusterKernel.cu (1)

44-49: Non-SM90 fallback is missing __launch_bounds__ unlike the SM90+ path.

The SM90+ variant (line 27) uses __launch_bounds__(KernelParams::MaxNumExperts) while the fallback (line 45) has no launch bounds annotation. For consistency with the Renormalize variant (which uses __launch_bounds__(NumThreads) on its fallback), consider adding it here. This is minor since the fallback only asserts false.

cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramScoresKernel.cu (1)

44-44: Unused variable minScore.

minScore is initialized to -INFINITY but never read or passed to any function in this kernel. Likely a copy-paste artifact from the block kernel. Remove it to avoid compiler warnings and dead code.

Suggested fix
-    BaseType minScore = BaseType{-INFINITY};
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchBlockKernel.cu (1)

100-100: Unused variable minScore.

Same as in launchHistogramScoresKernel.cu: minScore is initialized but never referenced. Remove to avoid dead code.

Suggested fix
-        BaseType minScore = BaseType{-INFINITY};
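
Since the DevKernel.h nitpick above leans on the do { ... } while (0) idiom, here is a generic, self-contained example of the failure mode it prevents. The LOG_TWICE names are hypothetical and are not taken from the PR.

// Illustrative only: why multi-statement macros are wrapped in do { ... } while (0).
#include <cstdio>

#define LOG_TWICE_BAD(msg) std::printf("%s\n", msg); std::printf("%s\n", msg)

#define LOG_TWICE_GOOD(msg)                                                    \
    do                                                                         \
    {                                                                          \
        std::printf("%s\n", msg);                                              \
        std::printf("%s\n", msg);                                              \
    } while (0)

int main()
{
    bool verbose = false;
    if (verbose)
        LOG_TWICE_BAD("hello");   // second printf escapes the if and always runs
    if (verbose)
        LOG_TWICE_GOOD("hello");  // expands to a single statement and swallows the trailing ';'
    return 0;
}

Wrapping the macro body this way makes the expansion behave as one statement and lets callers terminate it with a semicolon, which is presumably why the refactored LAUNCH_PDL macro adopts it.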

@ChristinaZ ChristinaZ assigned yweng0828 and ChristinaZ and unassigned yweng0828 Mar 3, 2026
@ChristinaZ ChristinaZ requested review from byshiue and yweng0828 March 3, 2026 14:45
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
@ChristinaZ ChristinaZ changed the title from "[Draft] Add support for expert_number<=2048 and K<=32" to "[None][feat] Add support for expert_number<=2048 and K<=32" on Mar 3, 2026
@ChristinaZ (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #37524 [ run ] triggered by Bot. Commit: ce58d29 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37524 [ run ] completed with state SUCCESS. Commit: ce58d29
/LLM/main/L0_MergeRequest_PR pipeline #29032 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@ChristinaZ (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #37621 [ run ] triggered by Bot. Commit: ce58d29 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37621 [ run ] completed with state SUCCESS. Commit: ce58d29
/LLM/main/L0_MergeRequest_PR pipeline #29112 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@jiahanc (Collaborator) commented Mar 4, 2026

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator):

PR_Github #37662 [ run ] triggered by Bot. Commit: ce58d29 Link to invocation

@ChristinaZ (Collaborator, Author):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #37668 [ run ] triggered by Bot. Commit: ce58d29 Link to invocation

@tensorrt-cicd (Collaborator):

PR_Github #37668 [ run ] completed with state SUCCESS. Commit: ce58d29
/LLM/main/L0_MergeRequest_PR pipeline #29153 completed with status: 'SUCCESS'

Link to invocation

@byshiue (Collaborator) left a comment

LGTM

@byshiue byshiue merged commit e01c38f into NVIDIA:main Mar 5, 2026
6 of 9 checks passed
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Mar 9, 2026

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
jiahanc added a commit to flashinfer-ai/flashinfer that referenced this pull request Mar 17, 2026

## 📌 Description


- Integrate NVIDIA/TensorRT-LLM#11510 to support
2048 num of experts and 32 TopK in renormalize
- Refactor MOE cu files

## 🔍 Related Issues


## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes



## Summary by CodeRabbit

* **New Features**
* Expanded MoE routing/renormalize to support up to 2,048 experts and
top-k up to 32; backend reorganized to enable larger configurations.

* **Bug Fixes**
* Clamped token counts in kernel launches to prevent oversized grid
launches.

* **Performance**
* Reworked routing/launch paths for improved scalability and throughput
with large expert/top-k settings.

* **Tests**
* Added test scenarios covering large-expert (2,048) + top-k (32)
configurations.

---------

Signed-off-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
tianyuz-nv pushed a commit to wanqian-nv/TensorRT-LLM that referenced this pull request Mar 19, 2026

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026

Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
murphymatt pushed a commit to fw-ai/flashinfer that referenced this pull request Mar 31, 2026 (same commit message as the flashinfer-ai/flashinfer commit above)