[TRTLLM-11163][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt #11708

schetlur-nv merged 11 commits into NVIDIA/TensorRT-LLM:main from moraxu/TensorRT-LLM:dev-mguzek-optimize-generate-async-for-vlms
Conversation
Force-pushed from 89ee815 to 1a6f189
Force-pushed from 22a36dc to 8dca856
/bot run

PR_Github #38077 [ run ] triggered by Bot. Commit:
📝 Walkthrough

This pull request introduces a fast path for processing tokenized prompts with multimodal data in the LLaVA-Next model pipeline. New methods handle image placeholder expansion in token IDs, registry utilities orchestrate the fast-path routing, and the LLM API detects and activates the fast path when compatible processors are available.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant Client
participant LLM API
participant InputRegistry
participant LlavaNextProcessor
participant Tokenizer
Client->>LLM API: _preprocess(inputs with prompt_token_ids + mm_data)
LLM API->>LLM API: Check vlm_fast_path_for_token_ids_and_mm_data_available
alt Fast Path Available
LLM API->>InputRegistry: input_processor_wrapper(prompt_token_ids, mm_data)
InputRegistry->>InputRegistry: Detect tokenized+MM path
InputRegistry->>LlavaNextProcessor: expand_prompt_token_ids_for_mm()
LlavaNextProcessor->>LlavaNextProcessor: _expand_image_placeholders_in_token_ids()
LlavaNextProcessor-->>InputRegistry: expanded_ids, mm_token_length, mm_token_offsets
InputRegistry->>InputRegistry: tokenized_multimodal_process()
InputRegistry-->>LLM API: Processed output (skips detokenization)
else Fast Path Unavailable
LLM API->>Tokenizer: decode(prompt_token_ids)
Tokenizer-->>LLM API: text_prompt
LLM API->>InputRegistry: input_processor_wrapper(text_prompt, mm_data)
InputRegistry->>LlavaNextProcessor: Standard text+MM processing
LlavaNextProcessor-->>InputRegistry: Processed output
InputRegistry-->>LLM API: Processed output
end
LLM API-->>Client: Preprocessed inputs
```
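For intuition, the placeholder-expansion step named in the walkthrough could look roughly like the sketch below. This is a minimal illustration, not the actual TensorRT-LLM implementation: the function signature, the placeholder token ID, and the per-image token counts are assumed inputs; only the method's role and the `(expanded_ids, mm_token_length, mm_token_offsets)` return shape come from the walkthrough above.

```python
from typing import List, Tuple


def expand_image_placeholders_in_token_ids(
        token_ids: List[int],
        image_token_id: int,             # assumed: the single placeholder token ID
        mm_tokens_per_image: List[int],  # assumed: precomputed token count per image
) -> Tuple[List[int], int, List[int]]:
    """Sketch of _expand_image_placeholders_in_token_ids: replace each image
    placeholder with the number of multimodal tokens that image will occupy,
    recording where each image's tokens start."""
    expanded_ids: List[int] = []
    mm_token_offsets: List[int] = []
    image_idx = 0
    for tok in token_ids:
        if tok == image_token_id:
            mm_token_offsets.append(len(expanded_ids))
            expanded_ids.extend([image_token_id] * mm_tokens_per_image[image_idx])
            image_idx += 1
        else:
            expanded_ids.append(tok)
    mm_token_length = sum(mm_tokens_per_image[:image_idx])
    return expanded_ids, mm_token_length, mm_token_offsets
```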
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tensorrt_llm/llmapi/llm.py (2)
491-503: ⚠️ Potential issue | 🔴 Critical

Preserve the non-text multimodal fields when rewriting `inputs`.

This fallback rebuilds the request as `TextPrompt(...)`, which drops `multi_modal_embeddings`/`multi_modal_uuids` and also materializes `multi_modal_data=None`. On non-fast-path VLMs, `prompt_token_ids + multi_modal_embeddings` then falls into the `multi_modal_data` branch and dies on `.keys()`, while UUID-based cache IDs are silently lost.

🐛 Suggested fix
```diff
-        inputs = TextPrompt(
-            prompt=prompt,
-            multi_modal_data=inputs.get("multi_modal_data"),
-            mm_processor_kwargs=inputs.get("mm_processor_kwargs") or {})
+        fallback_inputs: dict[str, Any] = {"prompt": prompt}
+        for key in (
+                "multi_modal_data",
+                "multi_modal_embeddings",
+                "multi_modal_uuids",
+                "mm_processor_kwargs",
+                "query",
+                "query_token_ids",
+        ):
+            value = inputs.get(key)
+            if value is not None:
+                fallback_inputs[key] = value
+        inputs = fallback_inputs
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/llmapi/llm.py` around lines 491 - 503, When rebuilding inputs from prompt_token_ids in the VLM fallback, preserve all non-text multimodal fields instead of dropping them: use the existing inputs.get("multi_modal_data"), inputs.get("multi_modal_embeddings"), and inputs.get("multi_modal_uuids") when constructing the TextPrompt so embeddings/UUID cache IDs aren’t lost and multi_modal_data isn’t materialized to None; update the block that calls tokenizer.decode and constructs TextPrompt (referencing inputs, prompt_token_ids, tokenizer.decode, TextPrompt, mm_processor_kwargs, DefaultInputProcessor, and vlm_fast_path_for_token_ids_and_mm_data_available) to forward multi_modal_embeddings and multi_modal_uuids and keep mm_processor_kwargs as before.
554-578: ⚠️ Potential issue | 🟠 Major

Treat empty multimodal payloads as absent.

These conditions use `is None` / key presence, so `{"prompt_token_ids": ..., "multi_modal_data": {}}` is routed into the multimodal branch instead of the plain token-id branch. Downstream that means either a fast-path MM length lookup on an empty map or calling the VLM processor with `prompt=None` and no media.

🐛 Suggested fix
```diff
+        has_multi_modal_data = bool(inputs.get("multi_modal_data"))
+        has_multi_modal_embeddings = bool(
+            inputs.get("multi_modal_embeddings"))
+
-        elif ("prompt_token_ids" in inputs
-              and inputs.get("multi_modal_data") is None
-              and inputs.get("multi_modal_embeddings") is None):
+        elif ("prompt_token_ids" in inputs
+              and not has_multi_modal_data
+              and not has_multi_modal_embeddings):
             prompt_token_ids = inputs['prompt_token_ids']
@@
-        elif "prompt" in inputs or ("prompt_token_ids" in inputs and
-                                    (("multi_modal_data" in inputs
-                                      or "multi_modal_embeddings" in inputs))):
+        elif "prompt" in inputs or ("prompt_token_ids" in inputs and
+                                    (has_multi_modal_data
+                                     or has_multi_modal_embeddings)):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/llmapi/llm.py` around lines 554 - 578, The multimodal branch currently triggers on key presence or is None checks and treats empty dicts as present; change the checks to treat empty multimodal payloads as absent by using truthiness/emptiness checks instead of is None or key presence. Specifically, when inspecting inputs in the initial branch with "prompt_token_ids", use something like checking inputs.get("multi_modal_data") and inputs.get("multi_modal_embeddings") are non-empty (truthy) before routing into multimodal handling and before constructing multimodal_data/MultimodalParams; likewise adjust the subsequent elif condition that looks for ("multi_modal_data" in inputs or "multi_modal_embeddings" in inputs) to require non-empty values so plain token-id paths (prompt_token_ids, prompt_token_ids + empty multimodal) follow the token-only logic. Ensure mrope handling (disaggregated_params.mrope_position_ids_handle / mrope_position_deltas_handle) logic remains the same but only runs when multimodal payloads are actually present.
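As a side note on why the suggested truthiness check works: an empty dict is falsy in Python, so `bool(inputs.get(...))` collapses "missing" and "present but empty" into the same branch:

```python
# None (missing), {} (empty payload), and a populated payload:
for payload in (None, {}, {"image": ["<handle>"]}):
    print(bool(payload))  # False, False, True
```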
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/llmapi/llm.py`:
- Around line 526-529: The MM disaggregation path currently assumes
inputs["prompt_token_ids"] exists when
vlm_fast_path_for_token_ids_and_mm_data_available is true; update the call site
around input_processor.get_prompt_token_ids to first detect whether
prompt_token_ids are present and, if not, pass the raw prompt (or invoke the
tokenizer) so tokenization happens before preprocessing; alternatively make
get_prompt_token_ids backward-compatible by accepting raw prompt text and
mm_handles and producing token ids when prompt_token_ids is absent — adjust the
logic using vlm_fast_path_for_token_ids_and_mm_data_available,
inputs["prompt_token_ids"], input_processor.get_prompt_token_ids, and mm_handles
accordingly so raw-text disagg requests no longer crash.
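A minimal sketch of that guard, assuming the names used in the comment (`get_prompt_token_ids`, `mm_handles`) and a standard `tokenizer.encode` API; this is illustrative only, not the reviewed code:

```python
from typing import Any, Dict, List, Optional


def resolve_prompt_token_ids(inputs: Dict[str, Any], tokenizer: Any,
                             input_processor: Any,
                             mm_handles: Optional[List[Any]]) -> List[int]:
    """Tokenize raw-text disagg requests before MM preprocessing, so
    get_prompt_token_ids never receives a request without token IDs."""
    token_ids = inputs.get("prompt_token_ids")
    if token_ids is None:
        # Raw-text request: tokenize first (assumed tokenizer API).
        token_ids = tokenizer.encode(inputs["prompt"])
    return input_processor.get_prompt_token_ids(token_ids, mm_handles)
```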
---
Outside diff comments:
In `@tensorrt_llm/llmapi/llm.py`:
- Around line 491-503: When rebuilding inputs from prompt_token_ids in the VLM
fallback, preserve all non-text multimodal fields instead of dropping them: use
the existing inputs.get("multi_modal_data"),
inputs.get("multi_modal_embeddings"), and inputs.get("multi_modal_uuids") when
constructing the TextPrompt so embeddings/UUID cache IDs aren’t lost and
multi_modal_data isn’t materialized to None; update the block that calls
tokenizer.decode and constructs TextPrompt (referencing inputs,
prompt_token_ids, tokenizer.decode, TextPrompt, mm_processor_kwargs,
DefaultInputProcessor, and vlm_fast_path_for_token_ids_and_mm_data_available) to
forward multi_modal_embeddings and multi_modal_uuids and keep
mm_processor_kwargs as before.
- Around line 554-578: The multimodal branch currently triggers on key presence
or is None checks and treats empty dicts as present; change the checks to treat
empty multimodal payloads as absent by using truthiness/emptiness checks instead
of is None or key presence. Specifically, when inspecting inputs in the initial
branch with "prompt_token_ids", use something like checking
inputs.get("multi_modal_data") and inputs.get("multi_modal_embeddings") are
non-empty (truthy) before routing into multimodal handling and before
constructing multimodal_data/MultimodalParams; likewise adjust the subsequent
elif condition that looks for ("multi_modal_data" in inputs or
"multi_modal_embeddings" in inputs) to require non-empty values so plain
token-id paths (prompt_token_ids, prompt_token_ids + empty multimodal) follow
the token-only logic. Ensure mrope handling
(disaggregated_params.mrope_position_ids_handle / mrope_position_deltas_handle)
logic remains the same but only runs when multimodal payloads are actually
present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7d81eef8-9b02-4737-9132-c6f88766cf8d
📒 Files selected for processing (3)
- tensorrt_llm/_torch/models/modeling_llava_next.py
- tensorrt_llm/inputs/registry.py
- tensorrt_llm/llmapi/llm.py
PR_Github #38077 [ run ] completed with state
/bot run

PR_Github #38195 [ run ] triggered by Bot. Commit:

PR_Github #38195 [ run ] completed with state

/bot run

PR_Github #38301 [ run ] triggered by Bot. Commit:

PR_Github #38301 [ run ] completed with state

Branch:

/bot run

PR_Github #39640 [ run ] triggered by Bot. Commit:

PR_Github #39640 [ run ] completed with state

/bot run

PR_Github #39655 [ run ] triggered by Bot. Commit:
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
PR_Github #39655 [ run ] completed with state
Force-pushed from 3a45d9c to 3607934
/bot run

PR_Github #39701 [ run ] triggered by Bot. Commit:

PR_Github #39701 [ run ] completed with state
…instead of de-tokenizing already encoded prompt (NVIDIA#11708) Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Summary by CodeRabbit
Release Notes
New Features
Refactor
Description
Process token IDs + MM data without de-tokenizing: instead, expand the image placeholders directly in the token-ID sequence and skip the decode/re-encode round trip.
See: https://docs.vllm.ai/en/latest/design/mm_processing/#multi-modal-data-processing
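Sketched as a routing check (hedged: the helper name below is illustrative; `expand_prompt_token_ids_for_mm` is the processor hook named in the review walkthrough above):

```python
from typing import Any, Dict


def vlm_fast_path_available(inputs: Dict[str, Any], processor: Any) -> bool:
    """Illustrative only: take the fast path when the request already carries
    token IDs plus multimodal data and the model's input processor exposes
    the placeholder-expansion hook, so no tokenizer decode round-trip is
    needed before preprocessing."""
    return ("prompt_token_ids" in inputs
            and bool(inputs.get("multi_modal_data"))
            and hasattr(processor, "expand_prompt_token_ids_for_mm"))
```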
Currently implemented only for tensorrt_llm/_torch/models/modeling_llava_next.py.

Test Coverage
Tested for:

- pytest -sv tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py::TestLlava_V1_6_Mistral_7B::test_auto_dtype
- pytest -sv tests/integration/defs/accuracy/test_llm_api_pytorch_multimodal.py::TestQwen3VL::test_auto_dtype
- llava-v1.6-mistral-7b-hf in Dynamo with fix: TRT-LLM multimodal preprocessor - fix the dictionary naming for an embeddings case (ai-dynamo/dynamo#6567)
- Qwen/Qwen3-VL-2B-Instruct in Dynamo with fix: TRT-LLM multimodal preprocessor - fix the dictionary naming for an embeddings case (ai-dynamo/dynamo#6567)

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]

Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
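For example, a pre-merge run limited to a single stage with fail-fast disabled (stage name taken from the examples above) would be:

```
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
```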
docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill : Kill all running builds associated with the pull request.
skip
skip --comment COMMENT : Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline : Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.