[None][fix] Fix VLM guided decoding startup crash due to missing vocab_size_padded property #12284
pengbowang-nv merged 3 commits into NVIDIA:main from stefanpantic:users/stefan/fix-vllm-vocab-size
Conversation
📝 Walkthrough: The changes add a new public `vocab_size_padded` property to the VLM wrapper classes, delegating to the wrapped language model.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/unittest/_torch/models/test_vlm_vocab_size_padded.py`:
- Around line 1-14: Update the file header copyright year range from "2022-2024"
to "2022-2026" in the top-of-file license block so it reads "Copyright (c)
2022-2026 NVIDIA CORPORATION & AFFILIATES"; locate the SPDX and license comment
block at the start of the file (the header containing "SPDX-FileCopyrightText"
and "SPDX-License-Identifier") and change the year range accordingly without
altering any other license text.
In `@tests/unittest/llmapi/apps/_test_openai_chat_vlm_guided_decoding.py`:
- Around line 1-18: Update the file header: change the copyright year range from
"2022-2024" to "2022-2026" and replace the placeholder issue reference
"https://github.com/NVIDIA/TensorRT-LLM/issues/XXXX" with the actual PR number
"https://github.com/NVIDIA/TensorRT-LLM/issues/12284"; ensure the updated lines
(the SPDX header and the regression-test comment containing the issue link)
reflect these exact substitutions.
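For reference, a sketch of the header shape those two comments ask for, assuming the repository's usual SPDX block and Apache-2.0 license (the license text in the files themselves governs):

```python
# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```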
📒 Files selected for processing (11)

- tensorrt_llm/_torch/models/modeling_gemma3vl.py
- tensorrt_llm/_torch/models/modeling_hyperclovax.py
- tensorrt_llm/_torch/models/modeling_llava_next.py
- tensorrt_llm/_torch/models/modeling_mistral.py
- tensorrt_llm/_torch/models/modeling_nemotron_nano.py
- tensorrt_llm/_torch/models/modeling_phi4mm.py
- tensorrt_llm/_torch/models/modeling_qwen2vl.py
- tensorrt_llm/_torch/models/modeling_qwen3vl.py
- tensorrt_llm/_torch/models/modeling_vila.py
- tests/unittest/_torch/models/test_vlm_vocab_size_padded.py
- tests/unittest/llmapi/apps/_test_openai_chat_vlm_guided_decoding.py
Force-pushed from ee0ec72 to e6a6e29
Thank you for your contribution! Hi @NVIDIA/trt-llm-torch-models-vlm-devs (also cc @yechank-nvidia and @chang-l), could you please take a look at this PR? I have managed to confirm both the problem and the fix from this PR.

/bot run --disable-fail-fast

PR_Github #40921 [ run ] triggered by Bot.
yechank-nvidia left a comment:

LGTM. Thanks for the work!
PR_Github #40921 [ run ] completed with state

Force-pushed from 1f8ad5e to afad089

/bot run --disable-fail-fast

PR_Github #41097 [ run ] triggered by Bot.

PR_Github #41097 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #41162 [ run ] triggered by Bot.

PR_Github #41162 [ run ] completed with state
Commits: "Fix VLM guided decoding startup crash", "Code review update". Signed-off-by: Stefan Pantic <stefanpantic13@gmail.com>
Force-pushed from afad089 to 0c4d84a

/bot run --disable-fail-fast

PR_Github #41324 [ run ] triggered by Bot.

PR_Github #41324 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #41451 [ run ] triggered by Bot.

PR_Github #41451 [ run ] completed with state
Hi @stefanpantic, I found tests/unittest/_torch/models/test_vlm_vocab_size_padded.py rather trivial and would like to remove the file, what do you think? Thanks!

@pengbowang-nv No objections on my end. Should I remove or will you do it?

Hi @stefanpantic, could you please remove it and I'll continue with CI and merge after that? Thanks!

Signed-off-by: Stefan Pantic <stefanpantic13@gmail.com>

@pengbowang-nv done ✔️

/bot run --disable-fail-fast
PR_Github #42074 [ run ] triggered by Bot.

PR_Github #42074 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #42233 [ run ] triggered by Bot.

PR_Github #42233 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #42304 [ run ] triggered by Bot.

PR_Github #42304 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #42353 [ run ] triggered by Bot.

PR_Github #42353 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #42434 [ run ] triggered by Bot.

PR_Github #42434 [ run ] completed with state
Summary by CodeRabbit

Release Notes

New Features
- Added a `vocab_size_padded` property to vision-language models for accessing padded vocabulary size information.

Tests
- Added tests covering the `vocab_size_padded` property across all vision-language models.

Description
VLM wrapper classes (`Qwen3VLModel`, `Qwen2VLModel`, `LlavaNextModel`, etc.) extend HuggingFace's `PreTrainedModel` rather than TRT-LLM's `DecoderModelForCausalLM`, so they do not inherit the `vocab_size_padded` property. `py_executor_creator.py` unconditionally reads `model_engine.model.vocab_size_padded` when initialising the `GuidedDecoder`, causing an `AttributeError` at server startup whenever guided decoding is configured with any VLM model, regardless of GPU or request.
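A minimal reduction of that failure mode (stand-in class name; not the actual executor code):

```python
# Stand-in for a VLM wrapper before the fix: extending PreTrainedModel
# means vocab_size_padded is simply absent. The class name is illustrative.
class VlmWrapperBeforeFix:
    pass

model = VlmWrapperBeforeFix()
try:
    # What GuidedDecoder initialisation effectively does at server startup:
    vocab_size = model.vocab_size_padded
except AttributeError as err:
    print(f"startup crash: {err}")
```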
Fix: add `vocab_size_padded` as a `@property` to all 9 affected VLM wrapper classes, delegating to `self.llm.vocab_size_padded`. This follows the same pattern as the existing `infer_max_seq_len` delegation already present in every one of these classes.

Affected classes: `Qwen3VLModelBase`, `Qwen2VLModelBase`, `LlavaNextModel`, `Gemma3VLM`, `Phi4MMForCausalLM`, `Mistral3VLM`, `VilaModel`, `HCXVisionForCausalLM`, `NemotronH_Nano_VL_V2`.
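A minimal sketch of that delegation pattern (stub classes stand in for the real wrappers and the wrapped decoder; the vocab size is an illustrative value):

```python
class _StubDecoder:
    """Stand-in for the wrapped TRT-LLM language model (self.llm)."""
    def __init__(self, vocab_size_padded: int):
        self.vocab_size_padded = vocab_size_padded

class VlmWrapperWithFix:
    def __init__(self, llm: _StubDecoder):
        self.llm = llm

    @property
    def vocab_size_padded(self) -> int:
        # Read-only delegation, mirroring the existing infer_max_seq_len
        # pattern in each wrapper class.
        return self.llm.vocab_size_padded

wrapper = VlmWrapperWithFix(_StubDecoder(vocab_size_padded=151936))
assert wrapper.vocab_size_padded == 151936  # GuidedDecoder init can now read it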
Test Coverage

tests/unittest/_torch/models/test_vlm_vocab_size_padded.py (no GPU required): verifies `vocab_size_padded` is defined as a `@property` on each class, delegates to `self.llm`, and reflects live updates (27 parametrized cases across 9 classes × 3 tests).
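A hedged sketch of the kind of parametrized check that test performed (the file was later dropped during review, per the conversation above); stub classes stand in for the real wrappers:

```python
import pytest

class _StubLLM:
    vocab_size_padded = 32000  # illustrative value

class _StubVLM:
    def __init__(self):
        self.llm = _StubLLM()

    @property
    def vocab_size_padded(self):
        return self.llm.vocab_size_padded

@pytest.mark.parametrize("cls", [_StubVLM])  # the real test ranged over 9 classes
def test_vocab_size_padded_delegates(cls):
    model = cls()
    # Defined as a property on the class, not a plain attribute:
    assert isinstance(type(model).__dict__["vocab_size_padded"], property)
    # Delegates to self.llm and reflects live updates:
    model.llm.vocab_size_padded = 1234
    assert model.vocab_size_padded == 1234
```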
tests/unittest/llmapi/apps/_test_openai_chat_vlm_guided_decoding.py (E2E, requires GPU): starts `trtllm-serve` with Qwen3-VL-8B-Instruct and `guided_decoding_backend: xgrammar`, sends a multimodal request with `response_format: json_schema`, and validates the response against the schema. Without the fix the server crashes at startup and the fixture itself fails.
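A hedged sketch of the kind of request that E2E test sends once `trtllm-serve` is up; the endpoint address, image URL, and schema here are illustrative, not the test's literal payload:

```python
import requests

schema = {
    "type": "object",
    "properties": {"description": {"type": "string"}},
    "required": ["description"],
}

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed default serve address
    json={
        "model": "Qwen3-VL-8B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image as JSON."},
                # Placeholder image URL; the real test supplies its own input.
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            ],
        }],
        # OpenAI-style guided decoding: constrain output to the JSON schema.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "image_description", "schema": schema},
        },
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])  # should validate against schema
```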
PR Checklist
Please review the following before submitting your PR: