[#12288][feat] Add Mistral 4-small support to AutoDeploy #12266

Merged
bmarimuthu-nv merged 14 commits into NVIDIA/TensorRT-LLM:main from nv-auto-deploy/TensorRT-LLM:bala/mistral4-small
Mar 31, 2026

Conversation

@bmarimuthu-nv (Collaborator) commented Mar 17, 2026

Summary

  • add AutoDeploy custom modeling support for Mistral Small 4 and related Mistral 3 multimodal wrappers
  • add temporary tokenizer/processor bridge for the upstream TokenizersBackend checkpoint metadata until TRT-LLM upgrades transformers
  • fix the supporting AutoDeploy issues exercised by this model family, including MLA sharding shape updates, FlashInfer MLA cache append fallback, FP8 MLA RoPE checkpoint reordering, and graph/export cleanup regressions

Validation

  • pytest -p no:cacheprovider -q tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_modeling.py tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mla_rope_utils.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_graph_canonicalize.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_mistral_small_4_tokenizer_bridge.py tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py -k "mistral or rope or flashinfer or graph or export_info or tokenizer_bridge or update_node_args_preserves_nested_symbolic_shape_nodes"
  • HF_HOME=/tmp/trtllm-hf TMPDIR=/tmp/trtllm-ad PYTHONDONTWRITEBYTECODE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py --model mistralai/Mistral-Small-4-119B-2603 --args.yaml-extra examples/auto_deploy/model_registry/configs/mistral_small_4_119b_lite.yaml --prompt.batch-size 1

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

Summary by CodeRabbit

  • New Features

    • Support for Mistral Small 4 (119B) model deployment.
    • Mistral Small 3.2 (24B-Instruct) with multimodal capabilities.
    • End-to-end deployment cookbook with serving instructions.
  • Improvements

    • Enhanced FP8 quantization and tensor handling.
    • Optimized export pipeline with improved pattern matching.
    • Better checkpoint loading for pre-quantized weights.

coderabbitai Bot commented Mar 17, 2026

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 05ddf48f-83ec-488c-94a1-79f1065c3f77

📥 Commits

Reviewing files that changed from the base of the PR and between 7110a7e and 28e6cfd.

📒 Files selected for processing (23)
  • examples/auto_deploy/cookbooks/mistral_small_4_trtllm_cookbook.ipynb
  • examples/auto_deploy/model_registry/configs/mistral_small_4_119b.yaml
  • examples/auto_deploy/model_registry/models.yaml
  • tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral3.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
  • tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
  • tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py
  • tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_modeling.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mla_rope_utils.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_graph_canonicalize.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_mistral_small_4_tokenizer_bridge.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_pattern_matcher.py
  • tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py
  • tests/unittest/auto_deploy/singlegpu/transformations/library/test_moe_fusion.py
  • tests/unittest/auto_deploy/singlegpu/utils/test_quantization_utils.py

📝 Walkthrough


Adds support for Mistral Small 3.2 and Mistral Small 4 models in AutoDeploy with custom implementations including MLA attention, MoE, and multimodal capabilities. Introduces configuration files, custom model classes with FP8 quantization and checkpoint hooks, utility updates for RoPE deinterleaving and graph operations, quantization infrastructure enhancements, and comprehensive test coverage plus an example notebook.

Changes

Model Registry Configuration — examples/auto_deploy/model_registry/configs/mistral_small_4_119b.yaml, examples/auto_deploy/model_registry/models.yaml
Added a standalone YAML config for Mistral Small 4 119B with the TRTLLM runtime, cached MLA attention, and world_size 8. Registered two new model entries with yaml_extra references to dashboard, world_size, multimodal, and custom config overlays.

Custom Model Implementation — tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral3.py, tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
Implemented Mistral4TextConfig, Mistral4Model, Mistral4ForCausalLM, and Mistral3ForConditionalGenerationAD with MLA-style attention using torch.ops.auto_deploy.torch_mla, MoE support with fused expert checkpoint expansion, RoPE with YARN parameterization, and FP8 quantization load hooks. Added tokenizer/processor wrappers and factory registration, and exported the new classes in the custom module init. The MLA idea is sketched below.

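For orientation, here is a minimal, self-contained sketch of the MLA (multi-head latent attention) idea these classes build on: keys and values are compressed through a shared low-rank latent projection before attention. This is illustrative only; the real implementation dispatches to torch.ops.auto_deploy.torch_mla and additionally handles RoPE, KV caching, and quantization, and every dimension below is made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMLA(nn.Module):
    """Toy MLA block: K/V flow through a small shared latent (sketch only)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, d_latent: int = 64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)  # only this latent needs caching
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        def split(t):
            return t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        latent = self.kv_down(x)
        q, k, v = split(self.q_proj(x)), split(self.k_up(latent)), split(self.v_up(latent))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, s, -1))
```
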
RoPE & MLA Utilities — tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py
Added _index_select_with_float8_cpu_workaround() to handle FP8 tensors on CPU by viewing them as uint8, performing the index selection, and viewing the result back to the original dtype. Updated _rope_deinterleave_load_hook to use this helper for FP8 CPU tensor indexing. The round-trip is sketched below.

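A minimal sketch of that uint8 round-trip, reconstructed from the summary above (the repo's helper is _index_select_with_float8_cpu_workaround; this body is an approximation, not the actual code):

```python
import torch

def index_select_float8_cpu(t: torch.Tensor, dim: int, index: torch.Tensor) -> torch.Tensor:
    # index_select is not implemented for float8 on CPU, so reinterpret the
    # 1-byte elements as uint8, select, and view the result back.
    if t.device.type == "cpu" and t.dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        return t.view(torch.uint8).index_select(dim, index).view(t.dtype)
    return t.index_select(dim, index)
```
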
Graph Export & HuggingFace Integration — tensorrt_llm/_torch/auto_deploy/models/hf.py
Modified TextModelExportInfo.post_process to insert an embedding keepalive assertion before the FX graph output node using a symbolic shape comparison (sym_size.int and operator.ge) instead of a direct tensor reference, improving compatibility with symbolic shape propagation. A rough FX sketch follows below.

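In FX terms, such an insertion can look roughly like the following; this is a sketch under assumed names (insert_embedding_keepalive is hypothetical), and the actual post_process logic differs in detail:

```python
import operator
import torch
import torch.fx as fx

def insert_embedding_keepalive(gm: fx.GraphModule, emb_node: fx.Node) -> None:
    graph = gm.graph
    output_node = next(n for n in graph.nodes if n.op == "output")
    with graph.inserting_before(output_node):
        # Keep the embedding node alive via a scalar symbolic size rather than
        # a direct tensor reference, which plays well with symbolic shapes.
        dim0 = graph.call_function(torch.ops.aten.sym_size.int, (emb_node, 0))
        check = graph.call_function(operator.ge, (dim0, 0))
        graph.call_function(torch._assert, (check, "embedding keepalive"))
    gm.recompile()
```
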
Node & Layer Analysis Utilities — tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
Updated get_layer_after_linear_node to exclude linear nodes that feed into downstream MLA operations from embedding-dimension boundaries. Added logic to identify and preserve the deepest linear sinks when multiple candidates exist with exactly one MLA node in the forward slice. The reachability test is sketched below.

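The "feeds into a downstream MLA op" exclusion boils down to a forward reachability check over node users; a simplified version might look like this (the real helper and the set of MLA targets differ):

```python
import torch.fx as fx

def feeds_into_mla(node: fx.Node, mla_targets: set) -> bool:
    """Return True if any transitive user of `node` is an MLA op (sketch)."""
    seen, stack = set(), [node]
    while stack:
        for user in stack.pop().users:
            if user in seen:
                continue
            seen.add(user)
            if user.target in mla_targets:
                return True
            stack.append(user)
    return False
```
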
Graph Transformation Infrastructure — tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py, tensorrt_llm/_torch/auto_deploy/utils/pattern_matcher.py
Replaced positional argument updates with a recursive _merge_arg() helper supporting nested tuple/list merging and Node-typed arguments. Introduced ADReplacementPatternEntry for multi-output pattern replacements with topological sort repair, and added _register_replacement_with_safe_insertion() with fake-mode tracing and dynamic pattern detection. A condensed reading of the merge helper follows below.

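One plausible reading of the recursive merge, reduced to its container-walking core (the leaf semantics of the real _merge_arg are more nuanced; this is an interpretation, not the repo's code):

```python
import torch.fx as fx

def merge_arg(current_arg, stored_arg):
    # Walk nested tuples/lists in lockstep so fx.Node replacements buried in
    # containers are applied; strict=True makes length mismatches fail fast.
    if isinstance(current_arg, tuple) and isinstance(stored_arg, tuple):
        return tuple(merge_arg(c, s) for c, s in zip(current_arg, stored_arg, strict=True))
    if isinstance(current_arg, list) and isinstance(stored_arg, list):
        return [merge_arg(c, s) for c, s in zip(current_arg, stored_arg, strict=True)]
    # At the leaves, prefer the stored replacement when it is a graph Node.
    return stored_arg if isinstance(stored_arg, fx.Node) else current_arg
```
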
Quantization Framework Enhancements — tensorrt_llm/_torch/auto_deploy/transform/library/quantization.py, tensorrt_llm/_torch/auto_deploy/transform/library/quantize_moe.py, tensorrt_llm/_torch/auto_deploy/utils/quantization_utils.py
Added a FLOAT8_DTYPES constant for the available FP8 dtypes. Enhanced FP8LinearQuantizationFromConfig with prefix-aware key handling and pre-quantized FP8 checkpoint remapping (activation_scale/weight_scale_inv → input_scale/weight_scale). Added early-exit guards when weight_block_size is None, and updated QuantizeFP8MOE to check both quant_algo and quant_method. The shape of the key remapping is sketched below.

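The checkpoint remapping amounts to rewriting the final key component while preserving any module prefix; a hedged sketch (value-side handling, e.g. inverting weight_scale_inv into a plain scale, is deliberately omitted):

```python
_FP8_KEY_MAP = {
    "activation_scale": "input_scale",
    "weight_scale_inv": "weight_scale",
}

def remap_fp8_keys(state_dict: dict) -> dict:
    """Rename pre-quantized FP8 scale keys, prefix-aware (sketch only)."""
    remapped = {}
    for key, value in state_dict.items():
        prefix, _, leaf = key.rpartition(".")
        leaf = _FP8_KEY_MAP.get(leaf, leaf)
        remapped[f"{prefix}.{leaf}" if prefix else leaf] = value
    return remapped
```
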
MoE Fusion & Stacking — tensorrt_llm/_torch/auto_deploy/transform/library/fused_moe.py
Normalized FP8 MoE weight/scale tensor stacking to a consistent 2D layout via reshape. Scalar and single-element scales now become uniformly shaped [E, S] tensors, with special handling for empty scales to match the normalized second dimension. A minimal normalization sketch follows below.

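Normalizing heterogeneous per-expert scales before stacking can be sketched as follows (assumed shapes; the repo's handling of empty scales and dtypes is more involved):

```python
import torch

def stack_expert_scales(scales: list[torch.Tensor]) -> torch.Tensor:
    """Stack per-expert scales into a uniform [E, S] tensor (sketch)."""
    flat = [s.reshape(-1) for s in scales]  # scalars become 1-element vectors
    width = max(f.numel() for f in flat)
    flat = [f.expand(width).contiguous() if f.numel() == 1 else f for f in flat]
    return torch.stack(flat, dim=0)  # shape [E, S]
```
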
Comprehensive Unit Tests — tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_modeling.py
Added an extensive test suite for the Mistral4/Mistral3 custom implementation covering RMSNorm, rotary embeddings, MLA attention, MoE with fused checkpoint expansion, FP8 dequantization hooks, decoder layers, full model equivalence, export validation, and factory registration.

Specialized Utility Tests — tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mla_rope_utils.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_pattern_matcher.py
Added RoPE deinterleave hook tests for weight permutation and FP8 byte-level stability, plus pattern matcher tests validating multi-output graph topology repair with a stable topological sort. The deinterleave idea is sketched below.

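The permutation under test can be illustrated with a toy version (the actual hook operates on checkpoint weights and FP8 bytes, and its exact layout convention may differ; names here are hypothetical):

```python
import torch

def rope_deinterleave(weight: torch.Tensor) -> torch.Tensor:
    # Reorder interleaved RoPE pairs [x0, y0, x1, y1, ...] into the
    # half-split layout [x0, x1, ..., y0, y1, ...] along the last dim.
    return torch.cat([weight[..., 0::2], weight[..., 1::2]], dim=-1)

def test_rope_deinterleave_toy():
    w = torch.arange(8.0)
    expected = torch.tensor([0.0, 2.0, 4.0, 6.0, 1.0, 3.0, 5.0, 7.0])
    assert torch.equal(rope_deinterleave(w), expected)
```
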
Graph & Export Tests — tests/unittest/_torch/auto_deploy/unit/singlegpu/test_graph_canonicalize.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py
Added tests for graph canonicalization restoring topological order and for the TextModelExportInfo embedding keepalive assertion using scalar symbolic shapes. A topological-order check is sketched below.

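The invariant such tests assert, that every node appears after all of its inputs, can be checked directly on an FX graph (simplified sketch; the helper name is hypothetical):

```python
import torch.fx as fx

def is_topologically_ordered(graph: fx.Graph) -> bool:
    """True if every node's inputs precede it in graph order (sketch)."""
    seen = set()
    for node in graph.nodes:
        if any(inp not in seen for inp in node.all_input_nodes):
            return False
        seen.add(node)
    return True
```
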
Tokenizer & Model Integration Tests — tests/unittest/_torch/auto_deploy/unit/singlegpu/test_mistral_small_4_tokenizer_bridge.py, tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py
Added conditional tests for Mistral Small 4 tokenizer/processor wrapper loading, and a symbolic shape preservation test for _update_node_args during nested view operations.

Quantization & Fusion Regression Tests — tests/unittest/auto_deploy/singlegpu/transformations/library/test_moe_fusion.py, tests/unittest/auto_deploy/singlegpu/utils/test_quantization_utils.py
Added an FP8 MoE scalar input scale handling regression test, and FP8 checkpoint remapping tests covering prefix-aware and prefix-less key conversions.

Example & Documentation — examples/auto_deploy/cookbooks/mistral_small_4_trtllm_cookbook.ipynb
Added a Jupyter notebook demonstrating end-to-end Mistral Small 4 deployment: NVIDIA container setup, pip/package installation, an OpenAI-compatible server launch with AutoDeploy, and client usage examples with streaming and non-streaming chat completions. A hedged client sketch follows below.
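
For flavor, a client call against such an OpenAI-compatible endpoint typically looks like this; the base URL, port, and API key are placeholders, and the notebook is the canonical walkthrough:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
stream = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "Summarize MLA attention in one sentence."}],
    stream=True,  # switch to False for a single non-streaming response
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```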

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 11.89%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: the title clearly and specifically describes the main feature addition, support for Mistral Small 4 in AutoDeploy. It is concise, focused, and directly reflects the primary objective of this changeset.
  • Description check — ✅ Passed: the PR description provides a clear summary of changes, lists validation steps with specific test commands and E2E examples, and includes a completed PR checklist addressing coding guidelines, test coverage, dependencies, and documentation requirements.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai Bot left a comment

🧹 Nitpick comments (5)
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)

1660-1675: Add strict=True to zip calls in recursive arg merge.

This merge logic assumes aligned lengths; explicit strictness makes mismatches fail fast and addresses the Ruff B905 findings in this block.

♻️ Proposed patch
-                return tuple(_merge_arg(cur, old) for cur, old in zip(current_arg, stored_arg))
+                return tuple(
+                    _merge_arg(cur, old)
+                    for cur, old in zip(current_arg, stored_arg, strict=True)
+                )
@@
-                return [_merge_arg(cur, old) for cur, old in zip(current_arg, stored_arg)]
+                return [
+                    _merge_arg(cur, old)
+                    for cur, old in zip(current_arg, stored_arg, strict=True)
+                ]
@@
-    new_args = [
-        _merge_arg(current_arg, stored_arg) for current_arg, stored_arg in zip(node.args, args)
-    ]
+    new_args = [
+        _merge_arg(current_arg, stored_arg)
+        for current_arg, stored_arg in zip(node.args, args, strict=True)
+    ]

Based on learnings: In TensorRT-LLM (Python requires >=3.10 and <4 as per setup.py), you can use Python 3.10+ features (e.g., PEP 585 generics), so zip(..., strict=True) is available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines
1660 - 1675, The zip usage in the recursive merge (_merge_arg) and when building
new_args assumes equal-length iterables but doesn't fail on mismatch; update the
zip(...) calls in the tuple/list handling branches and the final new_args
construction to use zip(..., strict=True) so mismatched lengths raise
immediately. Locate the _merge_arg function and replace the two zip(...) usages
inside the tuple and list branches and the zip(...) in the new_args list
comprehension with zip(..., strict=True) to enforce strict alignment.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py (1)

7-10: Use module-namespace imports in this test.

This file introduces direct symbol imports; repository guidance prefers module imports with namespaced usage.

♻️ Proposed patch
 import operator

 import torch
-from torch import nn
-from torch.fx import symbolic_trace
+import torch.fx as fx
+import torch.nn as nn

-from tensorrt_llm._torch.auto_deploy.models.hf import TextModelExportInfo
+import tensorrt_llm._torch.auto_deploy.models.hf as hf_models
@@
-    gm = symbolic_trace(model)
+    gm = fx.symbolic_trace(model)
@@
-    export_info = TextModelExportInfo("dummy")
+    export_info = hf_models.TextModelExportInfo("dummy")

As per coding guidelines: "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions..."

Also applies to: 27-30

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py`
around lines 7 - 10, The test file uses direct symbol imports (nn,
symbolic_trace, TextModelExportInfo); change these to module-namespace imports
and update usages accordingly—e.g., replace "from torch import nn" with "import
torch.nn as nn" or "import torch" and use "torch.nn", replace "from torch.fx
import symbolic_trace" with "import torch.fx as fx" and use "fx.symbolic_trace",
and replace "from tensorrt_llm._torch.auto_deploy.models.hf import
TextModelExportInfo" with "import tensorrt_llm._torch.auto_deploy.models.hf as
hf" and use "hf.TextModelExportInfo"; also update the other occurrences noted
around lines 27-30 similarly.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py (1)

47-54: Consider adding strict=True to zip() for defensive programming.

While batch_indices and positions are constructed with the same length in _make_append_metadata, adding strict=True would catch any future mismatches early.

♻️ Suggested fix
     for token_idx, (batch_idx, position) in enumerate(
-        zip(batch_indices.tolist(), positions.tolist())
+        zip(batch_indices.tolist(), positions.tolist(), strict=True)
     ):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py`
around lines 47 - 54, The loop in test_flashinfer_mla_cache_append.py that
iterates "for token_idx, (batch_idx, position) in
enumerate(zip(batch_indices.tolist(), positions.tolist()))" should use zip(...,
strict=True) to defensively ensure batch_indices and positions are the same
length; update that zip call to zip(batch_indices.tolist(), positions.tolist(),
strict=True) so any future mismatch (despite _make_append_metadata currently
producing equal lengths) raises immediately and surfaces the bug during tests.
tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py (2)

148-149: Same assertion pattern in processor - consider consistent error handling.

Apply the same improvement as suggested for the tokenizer class for consistency.

♻️ Suggested improvement
-        assert source_processor_config_path is not None
+        if source_processor_config_path is None:
+            raise FileNotFoundError(
+                f"Could not find {_PROCESSOR_CONFIG_FILE} for {source_model_name_or_path}"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py`
around lines 148 - 149, Replace the bare assert in processing_mistral_small_4.py
that checks source_processor_config_path with explicit error handling similar to
the tokenizer class: check if source_processor_config_path is None and raise a
clear ValueError (or RuntimeError) with a descriptive message identifying
source_processor_config_path and the expected config; then call
_load_json(Path(source_processor_config_path)) as before. This change should be
applied around the source_processor_config_path usage in the processing logic to
ensure consistent, explicit error reporting.

94-98: Consider replacing assertions with informative exceptions for user-facing errors.

The assert statements will raise AssertionError without context if the config files are missing. For better user experience, consider raising FileNotFoundError or ValueError with a descriptive message.

♻️ Suggested improvement
-        assert source_tokenizer_config_path is not None
+        if source_tokenizer_config_path is None:
+            raise FileNotFoundError(
+                f"Could not find {_TOKENIZER_CONFIG_FILE} for {source_model_name_or_path}"
+            )
         source_tokenizer_config = _load_json(Path(source_tokenizer_config_path))
 
         tokenizer_file = cached_file(source_model_name_or_path, _TOKENIZER_FILE, **kwargs)
-        assert tokenizer_file is not None
+        if tokenizer_file is None:
+            raise FileNotFoundError(
+                f"Could not find {_TOKENIZER_FILE} for {source_model_name_or_path}"
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py`
around lines 94 - 98, Replace the bare assert checks for missing tokenizer
files/configs with explicit exceptions: instead of "assert
source_tokenizer_config_path is not None" raise a FileNotFoundError or
ValueError that includes the variable value (source_tokenizer_config_path) and a
clear message before calling _load_json; likewise, after obtaining
tokenizer_file from cached_file(source_model_name_or_path, _TOKENIZER_FILE,
**kwargs) replace "assert tokenizer_file is not None" with a FileNotFoundError
that includes source_model_name_or_path and _TOKENIZER_FILE so callers get a
descriptive error rather than an AssertionError.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 440c06b4-d3d4-46d6-8a0a-7c056879adc4

📥 Commits

Reviewing files that changed from the base of the PR and between 5003d38 and bdb7238.

📒 Files selected for processing (23)
  • examples/auto_deploy/model_registry/configs/mistral_small_4_119b.yaml
  • examples/auto_deploy/model_registry/configs/mistral_small_4_119b_lite.yaml
  • examples/auto_deploy/model_registry/models.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral3.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/tokenizers/__init__.py
  • tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/__init__.py
  • tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py
  • tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processor_config.json
  • tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/tokenizer_config.json
  • tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
  • tensorrt_llm/_torch/auto_deploy/utils/_graph.py
  • tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_modeling.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mla_rope_utils.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_graph_canonicalize.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_mistral_small_4_tokenizer_bridge.py
  • tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py

@bmarimuthu-nv force-pushed the bala/mistral4-small branch 2 times, most recently from 445d807 to 3b02e76, on March 17, 2026 at 16:50
@bmarimuthu-nv changed the title from "[None][feat] add Mistral 4 support to AutoDeploy" to "[#12288][feat] Add Mistral 4 support to AutoDeploy" on Mar 17, 2026
@bmarimuthu-nv changed the title from "[#12288][feat] Add Mistral 4 support to AutoDeploy" to "[#12288][feat] Add Mistral 4-small support to AutoDeploy" on Mar 17, 2026

arysef commented Mar 17, 2026

Are you planning to add it to the model support matrix in a separate PR?

Review comment threads (some outdated) on: tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py, tensorrt_llm/_torch/auto_deploy/models/hf.py, tensorrt_llm/_torch/auto_deploy/tokenizers/__init__.py, tensorrt_llm/_torch/auto_deploy/utils/_graph.py
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@bmarimuthu-nv (Author):

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@bmarimuthu-nv (Author):

@coderabbitai help


coderabbitai Bot commented Mar 24, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve merge conflict to automatically resolve merge conflicts.
  • @coderabbitai autofix to automatically fix issues identified in unresolved review comments.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@bmarimuthu-nv (Author):

@coderabbitai review

@bmarimuthu-nv (Author):

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@tensorrt-cicd:

PR_Github #40189 [ run ] triggered by Bot. Commit: 6a3584e Link to invocation

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@bmarimuthu-nv (Author):

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@bmarimuthu-nv (Author):

@coderabbitai summary


coderabbitai Bot commented Mar 25, 2026

✅ Actions performed

Summary regeneration triggered.

@tensorrt-cicd:

PR_Github #40190 [ run ] triggered by Bot. Commit: 28e6cfd Link to invocation

@tensorrt-cicd:

PR_Github #40190 [ run ] completed with state SUCCESS. Commit: 28e6cfd
/LLM/main/L0_MergeRequest_PR pipeline #31333 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
@bmarimuthu-nv (Author):

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd:

PR_Github #40334 [ run ] triggered by Bot. Commit: 3a3130f Link to invocation

@tensorrt-cicd:

PR_Github #40334 [ run ] completed with state ABORTED. Commit: 3a3130f
LLM/main/L0_MergeRequest_PR #31440 (Blue Ocean) completed with status: ABORTED

Link to invocation

@bmarimuthu-nv (Author):

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --reuse-test

@bmarimuthu-nv (Author):

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd:

PR_Github #40780 [ run ] triggered by Bot. Commit: 3a3130f Link to invocation

@tensorrt-cicd:

PR_Github #40780 [ run ] completed with state SUCCESS. Commit: 3a3130f
/LLM/main/L0_MergeRequest_PR pipeline #31797 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bmarimuthu-nv (Author):

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --reuse-test

@tensorrt-cicd:

PR_Github #40890 [ run ] triggered by Bot. Commit: 3a3130f Link to invocation

@tensorrt-cicd:

PR_Github #40890 [ run ] completed with state SUCCESS. Commit: 3a3130f
/LLM/main/L0_MergeRequest_PR pipeline #31893 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bmarimuthu-nv (Author):

/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1"

@tensorrt-cicd:

PR_Github #40964 [ run ] triggered by Bot. Commit: 3a3130f Link to invocation

@tensorrt-cicd:

PR_Github #40964 [ run ] completed with state SUCCESS. Commit: 3a3130f
/LLM/main/L0_MergeRequest_PR pipeline #31950 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@bmarimuthu-nv (Author):

Latest CI summary on head 3a3130f603:

AutoDeploy signal for this PR is clean:

  • DGX_B200-4_GPUs-AutoDeploy-1 passed in LLM/main/L0_MergeRequest_PR #31797
  • no AutoDeploy stage has failed on the current head

The remaining blocker is in an unrelated non-AutoDeploy area:

  • stage: DGX_B200-8_GPUs-PyTorch-1
  • suite: accuracy/test_disaggregated_serving.py
  • class: TestQwen3_8B
  • failure mode: Test terminated unexpectedly

Observed variants on the current head:

  1. fifo_v2-cudagraph:with_padding-pp1tp1cp4
     • seen in #31797
     • already known/waived
     • matches historical run #31772
     • waiver: nvbugs/6007201
  2. fifo_v2-cudagraph:with_padding-pp1tp2cp2
     • seen in #31893
     • reproduced again in targeted rerun #31950 (/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1")
     • same issue family in B200 PyTorch disaggregated serving, but this parameterization is not currently waived in the CI report payload

So the PR is currently blocked by a flaky/infra failure family outside the AutoDeploy scope of this change, not by an AutoDeploy regression on this branch.

@bmarimuthu-nv (Author):

/bot skip --comment "DGX_B200-8_GPUs-PyTorch-1 accuracy/test_disaggregated_serving.py failure is a known unrelated failure, others are passing"

@tensorrt-cicd:

PR_Github #40999 [ skip ] triggered by Bot. Commit: 3a3130f Link to invocation

@tensorrt-cicd:

PR_Github #40999 [ skip ] completed with state SUCCESS. Commit: 3a3130f
Skipping testing for commit 3a3130f

Link to invocation

@bmarimuthu-nv bmarimuthu-nv merged commit 6ac5c15 into NVIDIA:main Mar 31, 2026
6 checks passed
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request on Apr 7, 2026:
[#12288][feat] Add Mistral 4-small support to AutoDeploy (NVIDIA#12266)

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>