[#12288][feat] Add Mistral 4-small support to AutoDeploy #12266
bmarimuthu-nv merged 14 commits into NVIDIA/TensorRT-LLM:main from nv-auto-deploy/TensorRT-LLM:bala/mistral4-small
Conversation
📝 Walkthrough
Adds support for Mistral Small 3.2 and Mistral Small 4 models in AutoDeploy with custom implementations including MLA attention, MoE, and multimodal capabilities. Introduces configuration files, custom model classes with FP8 quantization and checkpoint hooks, utility updates for RoPE deinterleaving and graph operations, quantization infrastructure enhancements, and comprehensive test coverage plus an example notebook.
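For context on the RoPE deinterleaving mentioned above: interleaved RoPE rotates adjacent dimension pairs, while most fused kernels expect the half-split layout (dimension i paired with i + d/2). Below is a minimal, hypothetical sketch of such a conversion on a projection weight; it is not the PR's actual `mla_rope_utils` implementation, and the function name and signature are assumptions.

```python
import torch


def deinterleave_rope_weight(weight: torch.Tensor, head_dim: int) -> torch.Tensor:
    """Reorder rows so rotary pairs (0,1),(2,3),... land at positions (0, d/2),(1, d/2+1),...

    weight: [num_heads * head_dim, hidden_size] projection weight in interleaved layout.
    Returns the same weight reordered into the half-split layout.
    """
    out_features, hidden_size = weight.shape
    num_heads = out_features // head_dim
    # View as (heads, pairs, 2, hidden), then move the even/odd axis ahead of the pair axis.
    w = weight.view(num_heads, head_dim // 2, 2, hidden_size)
    w = w.permute(0, 2, 1, 3)  # even rows of each head first, then odd rows
    return w.reshape(out_features, hidden_size)
```

A typical use of this kind of helper would be rewriting q/k projection weights once at checkpoint-load time so the runtime kernel can assume the half-split convention.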
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
🧹 Nitpick comments (5)
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)
1660-1675: Add `strict=True` to `zip` calls in recursive arg merge.
This merge logic assumes aligned lengths; explicit strictness makes mismatches fail fast and addresses the Ruff B905 findings in this block.
♻️ Proposed patch
```diff
-        return tuple(_merge_arg(cur, old) for cur, old in zip(current_arg, stored_arg))
+        return tuple(
+            _merge_arg(cur, old)
+            for cur, old in zip(current_arg, stored_arg, strict=True)
+        )
@@
-        return [_merge_arg(cur, old) for cur, old in zip(current_arg, stored_arg)]
+        return [
+            _merge_arg(cur, old)
+            for cur, old in zip(current_arg, stored_arg, strict=True)
+        ]
@@
-    new_args = [
-        _merge_arg(current_arg, stored_arg) for current_arg, stored_arg in zip(node.args, args)
-    ]
+    new_args = [
+        _merge_arg(current_arg, stored_arg)
+        for current_arg, stored_arg in zip(node.args, args, strict=True)
+    ]
```

Based on learnings: In TensorRT-LLM (Python requires >=3.10 and <4 as per setup.py), you can use Python 3.10+ features (e.g., PEP 585 generics), so `zip(..., strict=True)` is available.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` around lines 1660 - 1675: the zip usage in the recursive merge (_merge_arg) and when building new_args assumes equal-length iterables but doesn't fail on mismatch; update the zip(...) calls in the tuple/list handling branches and the final new_args construction to use zip(..., strict=True) so mismatched lengths raise immediately. Locate the _merge_arg function and replace the two zip(...) usages inside the tuple and list branches and the zip(...) in the new_args list comprehension with zip(..., strict=True) to enforce strict alignment.
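As a quick, generic illustration of the behavior difference this suggestion targets (plain Python, not code from this PR): without `strict=True`, `zip` silently truncates to the shorter input; with it, a mismatch raises immediately.

```python
a = [1, 2, 3]
b = ["x", "y"]

# Plain zip truncates silently, so a length bug can slip through.
print(list(zip(a, b)))  # [(1, 'x'), (2, 'y')]

# strict=True (Python 3.10+) surfaces the mismatch right away.
try:
    list(zip(a, b, strict=True))
except ValueError as err:
    print(f"caught: {err}")  # zip() argument 2 is shorter than argument 1
```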
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py (1)
7-10: Use module-namespace imports in this test.
This file introduces direct symbol imports; repository guidance prefers module imports with namespaced usage.
♻️ Proposed patch
```diff
 import operator
 import torch
-from torch import nn
-from torch.fx import symbolic_trace
+import torch.fx as fx
+import torch.nn as nn
-from tensorrt_llm._torch.auto_deploy.models.hf import TextModelExportInfo
+import tensorrt_llm._torch.auto_deploy.models.hf as hf_models
@@
-    gm = symbolic_trace(model)
+    gm = fx.symbolic_trace(model)
@@
-    export_info = TextModelExportInfo("dummy")
+    export_info = hf_models.TextModelExportInfo("dummy")
```

As per coding guidelines: "When importing in Python, always maintain the namespace. Import the module, not individual classes or functions..."
Also applies to: 27-30
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py` around lines 7 - 10: the test file uses direct symbol imports (nn, symbolic_trace, TextModelExportInfo); change these to module-namespace imports and update usages accordingly. For example, replace "from torch import nn" with "import torch.nn as nn" (or "import torch" and use "torch.nn"), replace "from torch.fx import symbolic_trace" with "import torch.fx as fx" and use "fx.symbolic_trace", and replace "from tensorrt_llm._torch.auto_deploy.models.hf import TextModelExportInfo" with "import tensorrt_llm._torch.auto_deploy.models.hf as hf" and use "hf.TextModelExportInfo"; also update the other occurrences noted around lines 27-30 similarly.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py (1)
47-54: Consider adding `strict=True` to `zip()` for defensive programming.
While `batch_indices` and `positions` are constructed with the same length in `_make_append_metadata`, adding `strict=True` would catch any future mismatches early.
♻️ Suggested fix
```diff
     for token_idx, (batch_idx, position) in enumerate(
-        zip(batch_indices.tolist(), positions.tolist())
+        zip(batch_indices.tolist(), positions.tolist(), strict=True)
     ):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py` around lines 47 - 54: the loop that iterates "for token_idx, (batch_idx, position) in enumerate(zip(batch_indices.tolist(), positions.tolist()))" should use zip(..., strict=True) to defensively ensure batch_indices and positions are the same length; update that zip call to zip(batch_indices.tolist(), positions.tolist(), strict=True) so any future mismatch (despite _make_append_metadata currently producing equal lengths) raises immediately and surfaces the bug during tests.
tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py (2)
148-149: Same assertion pattern in processor - consider consistent error handling.
Apply the same improvement as suggested for the tokenizer class for consistency.
♻️ Suggested improvement
```diff
-    assert source_processor_config_path is not None
+    if source_processor_config_path is None:
+        raise FileNotFoundError(
+            f"Could not find {_PROCESSOR_CONFIG_FILE} for {source_model_name_or_path}"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py` around lines 148 - 149, Replace the bare assert in processing_mistral_small_4.py that checks source_processor_config_path with explicit error handling similar to the tokenizer class: check if source_processor_config_path is None and raise a clear ValueError (or RuntimeError) with a descriptive message identifying source_processor_config_path and the expected config; then call _load_json(Path(source_processor_config_path)) as before. This change should be applied around the source_processor_config_path usage in the processing logic to ensure consistent, explicit error reporting.
94-98: Consider replacing assertions with informative exceptions for user-facing errors.
The `assert` statements will raise `AssertionError` without context if the config files are missing. For better user experience, consider raising `FileNotFoundError` or `ValueError` with a descriptive message.
♻️ Suggested improvement
```diff
-    assert source_tokenizer_config_path is not None
+    if source_tokenizer_config_path is None:
+        raise FileNotFoundError(
+            f"Could not find {_TOKENIZER_CONFIG_FILE} for {source_model_name_or_path}"
+        )
     source_tokenizer_config = _load_json(Path(source_tokenizer_config_path))
     tokenizer_file = cached_file(source_model_name_or_path, _TOKENIZER_FILE, **kwargs)
-    assert tokenizer_file is not None
+    if tokenizer_file is None:
+        raise FileNotFoundError(
+            f"Could not find {_TOKENIZER_FILE} for {source_model_name_or_path}"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py` around lines 94 - 98, Replace the bare assert checks for missing tokenizer files/configs with explicit exceptions: instead of "assert source_tokenizer_config_path is not None" raise a FileNotFoundError or ValueError that includes the variable value (source_tokenizer_config_path) and a clear message before calling _load_json; likewise, after obtaining tokenizer_file from cached_file(source_model_name_or_path, _TOKENIZER_FILE, **kwargs) replace "assert tokenizer_file is not None" with a FileNotFoundError that includes source_model_name_or_path and _TOKENIZER_FILE so callers get a descriptive error rather than an AssertionError.
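Beyond the clearer messages shown in the suggested diffs, a related reason to prefer explicit exceptions for these checks is that assertions are stripped when Python runs with `-O`. A small, generic sketch of the pattern (the helper name and messages here are illustrative, not code from this PR):

```python
# `python -O -c "assert False, 'never raised'"` exits cleanly: the assert guard disappears.
def require_found(path, description):
    """Raise a descriptive error instead of an opaque AssertionError."""
    if path is None:
        raise FileNotFoundError(f"Could not find {description}")
    return path


print(require_found("tokenizer_config.json", "tokenizer config for my-model"))
# require_found(None, "tokenizer config for my-model")  # -> FileNotFoundError
```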
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 440c06b4-d3d4-46d6-8a0a-7c056879adc4
📒 Files selected for processing (23)
- examples/auto_deploy/model_registry/configs/mistral_small_4_119b.yaml
- examples/auto_deploy/model_registry/configs/mistral_small_4_119b_lite.yaml
- examples/auto_deploy/model_registry/models.yaml
- tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py
- tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
- tensorrt_llm/_torch/auto_deploy/models/custom/mla_rope_utils.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_mistral3.py
- tensorrt_llm/_torch/auto_deploy/models/hf.py
- tensorrt_llm/_torch/auto_deploy/tokenizers/__init__.py
- tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/__init__.py
- tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processing_mistral_small_4.py
- tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/processor_config.json
- tensorrt_llm/_torch/auto_deploy/tokenizers/mistral_small_4_119b/tokenizer_config.json
- tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
- tensorrt_llm/_torch/auto_deploy/utils/_graph.py
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_modeling.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mla_rope_utils.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_graph_canonicalize.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_mistral_small_4_tokenizer_bridge.py
- tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py
445d807 to 3b02e76 (Compare)
aad2fbe to 990e3be (Compare)
|
Are you planning to add it to the model support matrix in a separate PR? |
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
6b86a11 to a36507c (Compare)
|
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
@coderabbitai help |
|
|
@coderabbitai review |
|
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
|
PR_Github #40189 [ run ] triggered by Bot. Commit: |
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
|
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
@coderabbitai summary |
✅ Actions performed: Summary regeneration triggered. |
|
PR_Github #40190 [ run ] triggered by Bot. Commit: |
|
PR_Github #40190 [ run ] completed with state
|
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
|
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
PR_Github #40334 [ run ] triggered by Bot. Commit: |
|
PR_Github #40334 [ run ] completed with state |
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --reuse-test |
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast |
|
PR_Github #40780 [ run ] triggered by Bot. Commit: |
|
PR_Github #40780 [ run ] completed with state
|
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --reuse-test |
|
PR_Github #40890 [ run ] triggered by Bot. Commit: |
|
PR_Github #40890 [ run ] completed with state
|
|
/bot run --stage-list "DGX_B200-8_GPUs-PyTorch-1" |
|
PR_Github #40964 [ run ] triggered by Bot. Commit: |
|
PR_Github #40964 [ run ] completed with state
|
|
Latest CI summary on the current head: the AutoDeploy signal for this PR is clean:
The remaining blocker is in an unrelated non-AutoDeploy area:
Observed variants on the current head:
So the PR is currently blocked by a flaky/infra failure family outside the AutoDeploy scope of this change, not by an AutoDeploy regression on this branch. |
|
/bot skip --comment "DGX_B200-8_GPUs-PyTorch-1 accuracy/test_disaggregated_serving.py failure is a known unrelated failure, others are passing" |
|
PR_Github #40999 [ skip ] triggered by Bot. Commit: |
|
PR_Github #40999 [ skip ] completed with state |
…A#12266) Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
Summary
Validation
```bash
pytest -p no:cacheprovider -q tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mistral3_modeling.py tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_mla_rope_utils.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_flashinfer_mla_cache_append.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_graph_canonicalize.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_hf_export_info.py tests/unittest/_torch/auto_deploy/unit/singlegpu/test_mistral_small_4_tokenizer_bridge.py tests/unittest/auto_deploy/multigpu/transformations/library/test_tp_sharding.py -k "mistral or rope or flashinfer or graph or export_info or tokenizer_bridge or update_node_args_preserves_nested_symbolic_shape_nodes"

HF_HOME=/tmp/trtllm-hf TMPDIR=/tmp/trtllm-ad PYTHONDONTWRITEBYTECODE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python examples/auto_deploy/build_and_run_ad.py --model mistralai/Mistral-Small-4-119B-2603 --args.yaml-extra examples/auto_deploy/model_registry/configs/mistral_small_4_119b_lite.yaml --prompt.batch-size 1
```

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
Summary by CodeRabbit
New Features
Improvements