[#12288][feat] Add Mistral 4-small support to AutoDeploy #12266
bmarimuthu-nv merged 14 commits into NVIDIA/TensorRT-LLM:main from nv-auto-deploy/TensorRT-LLM:bala/mistral4-small
Conversation
📝 Walkthrough
Adds support for Mistral Small 3.2 and Mistral Small 4 models in AutoDeploy with custom implementations including MLA attention, MoE, and multimodal capabilities. Introduces configuration files, custom model classes with FP8 quantization and checkpoint hooks, utility updates for RoPE deinterleaving and graph operations, quantization infrastructure enhancements, and comprehensive test coverage plus an example notebook.
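For context on the RoPE deinterleaving mentioned above: interleaved RoPE rotates adjacent dimension pairs, while most fused kernels expect the half-split layout (dimension i paired with i + d/2). Below is a minimal, hypothetical sketch of such a conversion on a projection weight; it is not the PR's actual `mla_rope_utils` implementation, and the function name and signature are assumptions.

```python
import torch


def deinterleave_rope_weight(weight: torch.Tensor, head_dim: int) -> torch.Tensor:
    """Reorder rows so rotary pairs (0,1),(2,3),... land at positions (0, d/2),(1, d/2+1),...

    weight: [num_heads * head_dim, hidden_size] projection weight in interleaved layout.
    Returns the same weight reordered into the half-split layout.
    """
    out_features, hidden_size = weight.shape
    num_heads = out_features // head_dim
    # View as (heads, pairs, 2, hidden), then move the even/odd axis ahead of the pair axis.
    w = weight.view(num_heads, head_dim // 2, 2, hidden_size)
    w = w.permute(0, 2, 1, 3)  # even rows of each head first, then odd rows
    return w.reshape(out_features, hidden_size)
```

A typical use of this kind of helper would be rewriting q/k projection weights once at checkpoint-load time so the runtime kernel can assume the half-split convention.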
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
🧹 Nitpick comments (5)
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)
1660-1675: Add `strict=True` to `zip` calls in recursive arg merge.
This merge logic assumes aligned lengths; explicit strictness makes mismatches fail fast and addresses the Ruff B905 findings in this block.
♻️ Proposed patch
```diff
-        return tuple(_merge_arg(cur, old) for cur, old in zip(current_arg, stored_arg))
+        return tuple(
+            _merge_arg(cur, old)
+            for cur, old in zip(current_arg, stored_arg, strict=True)
+        )
@@
-        return [_merge_arg(cur, old) for cur, old in zip(current_arg, stored_arg)]
+        return [
+            _merge_arg(cur, old)
+            for cur, old in zip(current_arg, stored_arg, strict=True)
+        ]
@@
-    new_args = [
-        _merge_arg(current_arg, stored_arg) for current_arg, stored_arg in zip(node.args, args)
-    ]
+    new_args = [
+        _merge_arg(current_arg, stored_arg)
+        for current_arg, stored_arg in zip(node.args, args, strict=True)
+    ]
```

Based on learnings: In TensorRT-LLM (Python requires >=3.10 and <4 as per setup.py), you can use Python 3.10+ features (e.g., PEP 585 generics), so `zip(..., strict=True)` is available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines 1660 - 1675: the zip usage in the recursive merge (_merge_arg) and when building new_args assumes equal-length iterables but doesn't fail on mismatch; update the zip(...) calls in the tuple/list handling branches and the final new_args construction to use zip(..., strict=True) so mismatched lengths raise immediately. Locate the _merge_arg function and replace the two zip(...) usages inside the tuple and list branches and the zip(...) in the new_args list comprehension with zip(..., strict=True) to enforce strict alignment.
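As a quick, generic illustration of the behavior difference this suggestion targets (plain Python, not code from this PR): without `strict=True`, `zip` silently truncates to the shorter input; with it, a mismatch raises immediately.

```python
a = [1, 2, 3]
b = ["x", "y"]

# Plain zip truncates silently, so a length bug can slip through.
print(list(zip(a, b)))  # [(1, 'x'), (2, 'y')]

# strict=True (Python 3.10+) surfaces the mismatch right away.
try:
    list(zip(a, b, strict=True))
except ValueError as err:
    print(f"caught: {err}")  # zip() argument 2 is shorter than argument 1
```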
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py (1)
7-10: Use module-namespace imports in this test.
This file introduces direct symbol imports; repository guidance prefers module imports with namespaced usage.
♻️ Proposed patch
```diff
 import operator
 import torch
-from torch import nn
-from torch.fx import symbolic_trace
+import torch.fx as fx
+import torch.nn as nn
-from tensorrt_llm._torch.auto_deploy.models.hf import TextModelExportInfo
+import tensorrt_llm._torch.auto_deploy.models.hf as hf_models
@@
-    gm = symbolic_trace(model)
+    gm = fx.symbolic_trace(model)
@@
-    export_info = TextModelExportInfo("dummy")
+    export_info = hf_models.TextModelExportInfo("dummy")
```

As per coding guidelines: "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions..."
Also applies to: 27-30
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py` around lines 7 - 10: the test file uses direct symbol imports (nn, symbolic_trace, TextModelExportInfo); change these to module-namespace imports and update usages accordingly. For example, replace "from torch import nn" with "import torch.nn as nn" (or "import torch" and use "torch.nn"), replace "from torch.fx import symbolic_trace" with "import torch.fx as fx" and use "fx.symbolic_trace", and replace "from tensorrt_llm._torch.auto_deploy.models.hf import TextModelExportInfo" with "import tensorrt_llm._torch.auto_deploy.models.hf as hf" and use "hf.TextModelExportInfo"; also update the other occurrences noted around lines 27-30 similarly.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py (1)
47-54: Consider adding `strict=True` to `zip()` for defensive programming.
While `batch_indices` and `positions` are constructed with the same length in `_make_append_metadata`, adding `strict=True` would catch any future mismatches early.
♻️ Suggested fix
```diff
     for token_idx, (batch_idx, position) in enumerate(
-        zip(batch_indices.tolist(), positions.tolist())
+        zip(batch_indices.tolist(), positions.tolist(), strict=True)
     ):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py` around lines 47 - 54: the loop that iterates "for token_idx, (batch_idx, position) in enumerate(zip(batch_indices.tolist(), positions.tolist()))" should use zip(..., strict=True) to defensively ensure batch_indices and positions are the same length; update that zip call to zip(batch_indices.tolist(), positions.tolist(), strict=True) so any future mismatch (despite _make_append_metadata currently producing equal lengths) raises immediately and surfaces the bug during tests.
tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py (2)
148-149: Same assertion pattern in processor - consider consistent error handling.
Apply the same improvement as suggested for the tokenizer class for consistency.
♻️ Suggested improvement
```diff
-    assert source_processor_config_path is not None
+    if source_processor_config_path is None:
+        raise FileNotFoundError(
+            f"Could not find {_PROCESSOR_CONFIG_FILE} for {source_model_name_or_path}"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py` around lines 148 - 149, Replace the bare assert in processing_mistral_small_4.py that checks source_processor_config_path with explicit error handling similar to the tokenizer class: check if source_processor_config_path is None and raise a clear ValueError (or RuntimeError) with a descriptive message identifying source_processor_config_path and the expected config; then call _load_json(Path(source_processor_config_path)) as before. This change should be applied around the source_processor_config_path usage in the processing logic to ensure consistent, explicit error reporting.
94-98: Consider replacing assertions with informative exceptions for user-facing errors.
The `assert` statements will raise `AssertionError` without context if the config files are missing. For better user experience, consider raising `FileNotFoundError` or `ValueError` with a descriptive message.
♻️ Suggested improvement
```diff
-    assert source_tokenizer_config_path is not None
+    if source_tokenizer_config_path is None:
+        raise FileNotFoundError(
+            f"Could not find {_TOKENIZER_CONFIG_FILE} for {source_model_name_or_path}"
+        )
     source_tokenizer_config = _load_json(Path(source_tokenizer_config_path))
     tokenizer_file = cached_file(source_model_name_or_path, _TOKENIZER_FILE, **kwargs)
-    assert tokenizer_file is not None
+    if tokenizer_file is None:
+        raise FileNotFoundError(
+            f"Could not find {_TOKENIZER_FILE} for {source_model_name_or_path}"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py` around lines 94 - 98, Replace the bare assert checks for missing tokenizer files/configs with explicit exceptions: instead of "assert source_tokenizer_config_path is not None" raise a FileNotFoundError or ValueError that includes the variable value (source_tokenizer_config_path) and a clear message before calling _load_json; likewise, after obtaining tokenizer_file from cached_file(source_model_name_or_path, _TOKENIZER_FILE, **kwargs) replace "assert tokenizer_file is not None" with a FileNotFoundError that includes source_model_name_or_path and _TOKENIZER_FILE so callers get a descriptive error rather than an AssertionError.
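Beyond the clearer messages shown in the suggested diffs, a related reason to prefer explicit exceptions for these checks is that assertions are stripped when Python runs with `-O`. A small, generic sketch of the pattern (the helper name and messages here are illustrative, not code from this PR):

```python
# `python -O -c "assert False, 'never raised'"` exits cleanly: the assert guard disappears.
def require_found(path, description):
    """Raise a descriptive error instead of an opaque AssertionError."""
    if path is None:
        raise FileNotFoundError(f"Could not find {description}")
    return path


print(require_found("tokenizer_config.json", "tokenizer config for my-model"))
# require_found(None, "tokenizer config for my-model")  # -> FileNotFoundError
```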
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 440c06b4-d3d4-46d6-8a0a-7c056879adc4
📒 Files selected for processing (23)
- examples/auto_deploy/model_registry/configs/mistral_small_4_119b.yaml
- examples/auto_deploy/model_registry/configs/mistral_small_4_119b_lite.yaml
- examples/auto_deploy/model_registry/models.yaml
- tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py
- tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
- tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral3.py
- tensorrt_llm/_torch/auto_deploy/models/hf.py
- tensorrt_llm/_torch/auto_deploy/tokenizers/__init__.py
- tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/__init__.py
- tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py
- tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processor_config.json
- tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/tokenizer_config.json
- tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_modeling.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mla_rope_utils.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_graph_canonicalize.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_mistral_small_4_tokenizer_bridge.py
- tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py
445d807 to 3b02e76 (Compare)
aad2fbe to 990e3be (Compare)
|
Are you planning to add it to the model support matrix in a separate PR? |
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
6b86a11 to a36507c (Compare)
|
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
@coderabbitai help |
|
|
@coderabbitai review |
|
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
|
PR_Github #40189 [ run ] triggered by Bot. Commit: |
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
|
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
@coderabbitai summary |
✅ Actions performed: Summary regeneration triggered. |
|
PR_Github #40190 [ run ] triggered by Bot. Commit: |
|
PR_Github #40190 [ run ] completed with state
|
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
|
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
PR_Github #40334 [ run ] triggered by Bot. Commit: |
|
PR_Github #40334 [ run ] completed with state |
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --reuse-test |
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast |
|
PR_Github #40780 [ run ] triggered by Bot. Commit: |
|
PR_Github #40780 [ run ] completed with state
|
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --reuse-test |
|
PR_Github #40890 [ run ] triggered by Bot. Commit: |
|
PR_Github #40890 [ run ] completed with state
|
|
/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1" |
|
PR_Github #40964 [ run ] triggered by Bot. Commit: |
|
PR_Github #40964 [ run ] completed with state
|
|
Latest CI summary on the current head: the AutoDeploy signal for this PR is clean:
The remaining blocker is in an unrelated non-AutoDeploy area:
Observed variants on the current head:
So the PR is currently blocked by a flaky/infra failure family outside the AutoDeploy scope of this change, not by an AutoDeploy regression on this branch. |
|
/bot skip --comment "DGX_B200-8_GPUs-PyTorch-1 accuracy/test_disaggregated_serving.py failure is a known unrelated failure, others are passing" |
|
PR_Github #40999 [ skip ] triggered by Bot. Commit: |
|
PR_Github #40999 [ skip ] completed with state |
…A#12266) Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Summary
Validation
```bash
pytest -p no:cacheprovider -q tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_modeling.py tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mla_rope_utils.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_graph_canonicalize.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_mistral_small_4_tokenizer_bridge.py tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py -k "mistral or rope or flashinfer or graph or export_info or tokenizer_bridge or update_node_args_preserves_nested_symbolic_shape_nodes"

HF_HOME=/tmp/trtllm-hf TMPDIR=/tmp/trtllm-ad PYTHONDONTWRITEBYTECODE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py --model mistralai/Mistral-Small-4-119B-2603 --args.yaml-extra examples/auto_deploy/model_registry/configs/mistral_small_4_119b_lite.yaml --prompt.batch-size 1
```

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
Summary by CodeRabbit
New Features
Improvements