[TRTLLM-11163][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt #11708

Merged

schetlur-nv merged 11 commits into NVIDIA:main from moraxu:dev-mguzek-optimize-generate-async-for-vlms on Apr 2, 2026

Conversation

@moraxu (Collaborator) commented Feb 25, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added fast-path support for processing pre-tokenized prompts with multimodal data, eliminating unnecessary detokenization steps for improved performance.
    • Enhanced image placeholder expansion and handling for multimodal inputs.
  • Refactor

    • Improved input processing pipeline to seamlessly support both text-based and pre-tokenized prompt paths with unified multimodal handling.

Description

Process token IDs + MM data without de-tokenizing. Instead (see the sketch below):

  1. Process the multi-modal inputs with dummy prompts, so that the number of media placeholder tokens matches the number of multi-modal inputs,
  2. Replace the placeholder token IDs with the actual multi-modal feature token IDs.

See: https://docs.vllm.ai/en/latest/design/mm_processing/#multi-modal-data-processing

Currently implemented only for tensorrt_llm/_torch/models/modeling_llava_next.py
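
For illustration only, here is a minimal sketch of the placeholder-expansion step described above. The helper name and the <image> token ID (32000) are hypothetical; in the PR the actual logic lives in the new _expand_image_placeholders_in_token_ids() / expand_prompt_token_ids_for_mm() methods of modeling_llava_next.py:

from typing import List

def expand_image_placeholders(prompt_token_ids: List[int],
                              image_token_id: int,
                              feature_lengths: List[int]) -> List[int]:
    # Expand each single image placeholder token into as many copies as the
    # corresponding multi-modal input contributes feature tokens, so the token
    # sequence reserves one position per image feature.
    expanded: List[int] = []
    image_idx = 0
    for tok in prompt_token_ids:
        if tok == image_token_id and image_idx < len(feature_lengths):
            expanded.extend([image_token_id] * feature_lengths[image_idx])
            image_idx += 1
        else:
            expanded.append(tok)
    return expanded

# One image whose vision-encoder output occupies 4 feature slots:
print(expand_image_placeholders([1, 32000, 15, 42],  # 32000 = hypothetical <image> id
                                image_token_id=32000,
                                feature_lengths=[4]))
# -> [1, 32000, 32000, 32000, 32000, 15, 42]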

Test Coverage

Tested for:

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
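
For example, a typical invocation combining the options above (the stage name is taken from the examples given) would be:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"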

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing without careful validation can break the top of tree.

@moraxu moraxu changed the title [None][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt [TRTLLM-11163][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt Feb 27, 2026
@moraxu moraxu force-pushed the dev-mguzek-optimize-generate-async-for-vlms branch from 89ee815 to 1a6f189 Compare March 4, 2026 11:41
@moraxu moraxu force-pushed the dev-mguzek-optimize-generate-async-for-vlms branch from 22a36dc to 8dca856 Compare March 7, 2026 01:50
@moraxu moraxu marked this pull request as ready for review March 7, 2026 01:50
@moraxu moraxu requested review from a team as code owners March 7, 2026 01:50
@moraxu (Collaborator, Author) commented Mar 7, 2026

/bot run

@moraxu moraxu requested a review from 2ez4bz March 7, 2026 01:50
@tensorrt-cicd (Collaborator)

PR_Github #38077 [ run ] triggered by Bot. Commit: 8dca856 Link to invocation

@coderabbitai (Contributor, Bot) commented Mar 7, 2026

📝 Walkthrough

This pull request introduces a fast path for processing tokenized prompts with multimodal data in the LLaVA-Next model pipeline. New methods handle image placeholder expansion in token IDs, registry utilities orchestrate the fast-path routing, and the LLM API detects and activates the fast path when compatible processors are available.

Changes

LLaVA Next Input Processor Methods (tensorrt_llm/_torch/models/modeling_llava_next.py):
Added get_text_with_mm_placeholders(), _expand_image_placeholders_in_token_ids(), and expand_prompt_token_ids_for_mm() to support placeholder expansion in tokenized prompts. Modified the get_prompt_token_ids() signature to accept prompt_token_ids directly instead of text prompts. Updated attach_multimodal_embeddings() to handle the tokenized+MM path with assertion checks.

Multimodal Registry Fast Path (tensorrt_llm/inputs/registry.py):
Added helper utilities for normalizing MM counts, generating dummy placeholders, and fetching MM token lengths. Implemented tokenized_multimodal_process() and updated multimodal_hashing_process() to support precomputed token IDs and extra multimodal inputs. Enhanced input_processor_wrapper() to detect and route the tokenized+MM fast path.

LLM API Fast Path Detection (tensorrt_llm/llmapi/llm.py):
Introduced the vlm_fast_path_for_token_ids_and_mm_data_available feature flag. Added a detokenization fallback when the fast path is unavailable. Updated _preprocess() to route tokenized+MM inputs appropriately and modified key access patterns for safer handling.
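
Roughly, the routing added around _preprocess() can be pictured as follows. Apart from the vlm_fast_path_for_token_ids_and_mm_data_available flag and the tokenizer.decode() fallback described above, the function and method names here are placeholders, not the actual TensorRT-LLM API:

def preprocess_route(inputs: dict, input_processor, tokenizer,
                     fast_path_available: bool) -> dict:
    has_tokens = "prompt_token_ids" in inputs
    has_mm = bool(inputs.get("multi_modal_data"))

    if has_tokens and has_mm and fast_path_available:
        # Fast path: expand placeholders directly on the token IDs and attach
        # the multimodal features, with no detokenization round-trip.
        return input_processor.process_token_ids(
            inputs["prompt_token_ids"], inputs["multi_modal_data"])

    if has_tokens and has_mm:
        # Fallback: recover a text prompt and reuse the existing text+MM path.
        text = tokenizer.decode(inputs["prompt_token_ids"])
        return input_processor.process_text(text, inputs["multi_modal_data"])

    # Plain token-ID or text-only requests keep their existing paths.
    return inputs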

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant LLM API
    participant InputRegistry
    participant LlavaNextProcessor
    participant Tokenizer

    Client->>LLM API: _preprocess(inputs with prompt_token_ids + mm_data)
    
    LLM API->>LLM API: Check vlm_fast_path_for_token_ids_and_mm_data_available
    
    alt Fast Path Available
        LLM API->>InputRegistry: input_processor_wrapper(prompt_token_ids, mm_data)
        InputRegistry->>InputRegistry: Detect tokenized+MM path
        InputRegistry->>LlavaNextProcessor: expand_prompt_token_ids_for_mm()
        LlavaNextProcessor->>LlavaNextProcessor: _expand_image_placeholders_in_token_ids()
        LlavaNextProcessor-->>InputRegistry: expanded_ids, mm_token_length, mm_token_offsets
        InputRegistry->>InputRegistry: tokenized_multimodal_process()
        InputRegistry-->>LLM API: Processed output (skips detokenization)
    else Fast Path Unavailable
        LLM API->>Tokenizer: decode(prompt_token_ids)
        Tokenizer-->>LLM API: text_prompt
        LLM API->>InputRegistry: input_processor_wrapper(text_prompt, mm_data)
        InputRegistry->>LlavaNextProcessor: Standard text+MM processing
        LlavaNextProcessor-->>InputRegistry: Processed output
        InputRegistry-->>LLM API: Processed output
    end
    
    LLM API-->>Client: Preprocessed inputs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 76.47%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title clearly and specifically identifies the main change: introducing a fast path for processing token IDs with multimodal data without de-tokenization.
  • Description check: ✅ Passed. The PR description provides a clear explanation of what is being implemented (a fast path for processing token IDs + MM data without de-tokenizing), includes test coverage details, and confirms the PR checklist is completed.


@coderabbitai (Contributor, Bot) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/llmapi/llm.py (2)

491-503: ⚠️ Potential issue | 🔴 Critical

Preserve the non-text multimodal fields when rewriting inputs.

This fallback rebuilds the request as TextPrompt(...), which drops multi_modal_embeddings/multi_modal_uuids and also materializes multi_modal_data=None. On non-fast-path VLMs, prompt_token_ids + multi_modal_embeddings then falls into the multi_modal_data branch and dies on .keys(), while UUID-based cache IDs are silently lost.

🐛 Suggested fix
-            inputs = TextPrompt(
-                prompt=prompt,
-                multi_modal_data=inputs.get("multi_modal_data"),
-                mm_processor_kwargs=inputs.get("mm_processor_kwargs") or {})
+            fallback_inputs: dict[str, Any] = {"prompt": prompt}
+            for key in (
+                    "multi_modal_data",
+                    "multi_modal_embeddings",
+                    "multi_modal_uuids",
+                    "mm_processor_kwargs",
+                    "query",
+                    "query_token_ids",
+            ):
+                value = inputs.get(key)
+                if value is not None:
+                    fallback_inputs[key] = value
+            inputs = fallback_inputs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/llmapi/llm.py` around lines 491 - 503, When rebuilding inputs
from prompt_token_ids in the VLM fallback, preserve all non-text multimodal
fields instead of dropping them: use the existing
inputs.get("multi_modal_data"), inputs.get("multi_modal_embeddings"), and
inputs.get("multi_modal_uuids") when constructing the TextPrompt so
embeddings/UUID cache IDs aren’t lost and multi_modal_data isn’t materialized to
None; update the block that calls tokenizer.decode and constructs TextPrompt
(referencing inputs, prompt_token_ids, tokenizer.decode, TextPrompt,
mm_processor_kwargs, DefaultInputProcessor, and
vlm_fast_path_for_token_ids_and_mm_data_available) to forward
multi_modal_embeddings and multi_modal_uuids and keep mm_processor_kwargs as
before.

554-578: ⚠️ Potential issue | 🟠 Major

Treat empty multimodal payloads as absent.

These conditions use is None / key presence, so {"prompt_token_ids": ..., "multi_modal_data": {}} is routed into the multimodal branch instead of the plain token-id branch. Downstream that means either a fast-path MM length lookup on an empty map or calling the VLM processor with prompt=None and no media.

🐛 Suggested fix
+        has_multi_modal_data = bool(inputs.get("multi_modal_data"))
+        has_multi_modal_embeddings = bool(
+            inputs.get("multi_modal_embeddings"))
+
-        elif ("prompt_token_ids" in inputs
-              and inputs.get("multi_modal_data") is None
-              and inputs.get("multi_modal_embeddings") is None):
+        elif ("prompt_token_ids" in inputs
+              and not has_multi_modal_data
+              and not has_multi_modal_embeddings):
             prompt_token_ids = inputs['prompt_token_ids']
@@
-        elif "prompt" in inputs or ("prompt_token_ids" in inputs and
-                                    (("multi_modal_data" in inputs
-                                      or "multi_modal_embeddings" in inputs))):
+        elif "prompt" in inputs or ("prompt_token_ids" in inputs and
+                                    (has_multi_modal_data
+                                     or has_multi_modal_embeddings)):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/llmapi/llm.py` around lines 554 - 578, The multimodal branch
currently triggers on key presence or is None checks and treats empty dicts as
present; change the checks to treat empty multimodal payloads as absent by using
truthiness/emptiness checks instead of is None or key presence. Specifically,
when inspecting inputs in the initial branch with "prompt_token_ids", use
something like checking inputs.get("multi_modal_data") and
inputs.get("multi_modal_embeddings") are non-empty (truthy) before routing into
multimodal handling and before constructing multimodal_data/MultimodalParams;
likewise adjust the subsequent elif condition that looks for ("multi_modal_data"
in inputs or "multi_modal_embeddings" in inputs) to require non-empty values so
plain token-id paths (prompt_token_ids, prompt_token_ids + empty multimodal)
follow the token-only logic. Ensure mrope handling
(disaggregated_params.mrope_position_ids_handle / mrope_position_deltas_handle)
logic remains the same but only runs when multimodal payloads are actually
present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/llmapi/llm.py`:
- Around line 526-529: The MM disaggregation path currently assumes
inputs["prompt_token_ids"] exists when
vlm_fast_path_for_token_ids_and_mm_data_available is true; update the call site
around input_processor.get_prompt_token_ids to first detect whether
prompt_token_ids are present and, if not, pass the raw prompt (or invoke the
tokenizer) so tokenization happens before preprocessing; alternatively make
get_prompt_token_ids backward-compatible by accepting raw prompt text and
mm_handles and producing token ids when prompt_token_ids is absent — adjust the
logic using vlm_fast_path_for_token_ids_and_mm_data_available,
inputs["prompt_token_ids"], input_processor.get_prompt_token_ids, and mm_handles
accordingly so raw-text disagg requests no longer crash.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d81eef8-9b02-4737-9132-c6f88766cf8d

📥 Commits

Reviewing files that changed from the base of the PR and between cc16289 and 8dca856.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/inputs/registry.py
  • tensorrt_llm/llmapi/llm.py

@tensorrt-cicd (Collaborator)

PR_Github #38077 [ run ] completed with state SUCCESS. Commit: 8dca856
/LLM/main/L0_MergeRequest_PR pipeline #29504 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@moraxu (Collaborator, Author) commented Mar 9, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #38195 [ run ] triggered by Bot. Commit: 9545eff Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #38195 [ run ] completed with state SUCCESS. Commit: 9545eff
/LLM/main/L0_MergeRequest_PR pipeline #29588 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@moraxu (Collaborator, Author) commented Mar 9, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #38301 [ run ] triggered by Bot. Commit: 9545eff Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #38301 [ run ] completed with state SUCCESS. Commit: 9545eff
/LLM/main/L0_MergeRequest_PR pipeline #29678 completed with status: 'SUCCESS'

Link to invocation

@moraxu moraxu requested a review from 2ez4bz March 11, 2026 00:28
@moraxu (Collaborator, Author) commented Mar 16, 2026

TODO: for due diligence, run before/after perf tests with aiperf for a model that doesn't support the fast path.

Branch:

                                               NVIDIA AIPerf | LLM Metrics                                               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃                               Metric ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p50 ┃    std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│             Time to First Token (ms) │  1,655.27 │    339.27 │  3,694.30 │  3,487.93 │  2,258.40 │  1,773.55 │ 659.40 │
│            Time to Second Token (ms) │    129.52 │      0.00 │    306.08 │    173.52 │    142.89 │    134.74 │  29.94 │
│      Time to First Output Token (ms) │  1,655.27 │    339.27 │  3,694.30 │  3,487.93 │  2,258.40 │  1,773.55 │ 659.40 │
│                 Request Latency (ms) │  9,028.55 │  7,100.27 │ 10,726.06 │ 10,468.85 │  9,386.16 │  9,028.99 │ 436.75 │
│             Inter Token Latency (ms) │     15.72 │     10.43 │     18.96 │     18.64 │     17.92 │     15.37 │   1.52 │
│     Output Token Throughput Per User │     64.25 │     52.75 │     95.87 │     91.38 │     68.32 │     65.08 │   6.82 │
│                    (tokens/sec/user) │           │           │           │           │           │           │        │
│      Output Sequence Length (tokens) │    469.97 │    468.00 │    470.00 │    470.00 │    470.00 │    470.00 │   0.20 │
│       Input Sequence Length (tokens) │ 13,451.73 │ 13,449.00 │ 13,452.00 │ 13,452.00 │ 13,452.00 │ 13,452.00 │   0.50 │
│ Output Token Throughput (tokens/sec) │  1,637.57 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
│        Image Throughput (images/sec) │      0.33 │      0.28 │      0.42 │      0.41 │      0.34 │      0.33 │   0.02 │
│             Image Latency (ms/image) │  3,009.52 │  2,366.76 │  3,575.35 │  3,489.62 │  3,128.72 │  3,009.66 │ 145.58 │
│    Request Throughput (requests/sec) │      3.48 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
│             Request Count (requests) │    224.00 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
└──────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴────────┘

CLI Command: aiperf profile --model 'Qwen/Qwen3-VL-30B-A3B-Instruct-FP8' --url 'http://localhost:8123' --shared-system-prompt-length 8600 --user-context-prompt-length 4300 --num-dataset-entries 500 --endpoint-type 'chat' --streaming 
--warmup-request-count 5 --request-rate 32 --request-rate-mode 'constant' --concurrency 32 --benchmark-duration 60 --benchmark-grace-period 30 --extra-inputs 'max_tokens:470' --extra-inputs 'min_tokens:470' --extra-inputs 
'ignore_eos:true' --extra-inputs 'skip_special_tokens:false' --image-batch-size 3 --image-width-mean 512 --image-height-mean 512 --artifact-dir 'trtllm_fp8_kv_cache_fp8/conc32' --no-server-metrics
Benchmark Duration: 64.29 sec

main:

                                               NVIDIA AIPerf | LLM Metrics                                               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃                               Metric ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p50 ┃    std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│             Time to First Token (ms) │  1,687.13 │    349.43 │  3,651.55 │  3,490.32 │  2,261.78 │  1,777.97 │ 661.54 │
│            Time to Second Token (ms) │    131.93 │      9.74 │    295.12 │    206.96 │    145.92 │    135.45 │  29.77 │
│      Time to First Output Token (ms) │  1,687.13 │    349.43 │  3,651.55 │  3,490.32 │  2,261.78 │  1,777.97 │ 661.54 │
│                 Request Latency (ms) │  9,058.80 │  6,797.73 │ 10,733.29 │ 10,516.75 │  9,591.21 │  9,064.43 │ 504.39 │
│             Inter Token Latency (ms) │     15.72 │     10.27 │     18.90 │     18.81 │     18.00 │     15.34 │   1.55 │
│     Output Token Throughput Per User │     64.29 │     52.91 │     97.36 │     92.82 │     68.18 │     65.18 │   7.04 │
│                    (tokens/sec/user) │           │           │           │           │           │           │        │
│      Output Sequence Length (tokens) │    469.93 │    462.00 │    471.00 │    470.00 │    470.00 │    470.00 │   0.59 │
│       Input Sequence Length (tokens) │ 13,451.84 │ 13,450.00 │ 13,453.00 │ 13,452.00 │ 13,452.00 │ 13,452.00 │   0.40 │
│ Output Token Throughput (tokens/sec) │  1,630.84 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
│        Image Throughput (images/sec) │      0.33 │      0.28 │      0.44 │      0.42 │      0.34 │      0.33 │   0.02 │
│             Image Latency (ms/image) │  3,019.60 │  2,265.91 │  3,577.76 │  3,505.58 │  3,197.07 │  3,021.48 │ 168.13 │
│    Request Throughput (requests/sec) │      3.47 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
│             Request Count (requests) │    224.00 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
└──────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴────────┘

CLI Command: aiperf profile --model 'Qwen/Qwen3-VL-30B-A3B-Instruct-FP8' --url 'http://localhost:8123' --shared-system-prompt-length 8600 --user-context-prompt-length 4300 --num-dataset-entries 500 --endpoint-type 'chat' --streaming 
--warmup-request-count 5 --request-rate 32 --request-rate-mode 'constant' --concurrency 32 --benchmark-duration 60 --benchmark-grace-period 30 --extra-inputs 'max_tokens:470' --extra-inputs 'min_tokens:470' --extra-inputs 
'ignore_eos:true' --extra-inputs 'skip_special_tokens:false' --image-batch-size 3 --image-width-mean 512 --image-height-mean 512 --artifact-dir 'trtllm_fp8_kv_cache_fp8/conc32' --no-server-metrics
Benchmark Duration: 64.55 sec

@moraxu moraxu requested a review from 2ez4bz March 16, 2026 08:38
@moraxu moraxu requested a review from pcastonguay March 17, 2026 20:45
@moraxu (Collaborator, Author) commented Mar 19, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #39640 [ run ] triggered by Bot. Commit: 3a45d9c Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #39640 [ run ] completed with state SUCCESS. Commit: 3a45d9c
/LLM/main/L0_MergeRequest_PR pipeline #30845 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@moraxu (Collaborator, Author) commented Mar 20, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #39655 [ run ] triggered by Bot. Commit: 3a45d9c Link to invocation

moraxu added 11 commits March 19, 2026 18:33 (each signed off by: Michal Guzek <mguzek@nvidia.com>)
@tensorrt-cicd (Collaborator)

PR_Github #39655 [ run ] completed with state SUCCESS. Commit: 3a45d9c
/LLM/main/L0_MergeRequest_PR pipeline #30859 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@moraxu moraxu force-pushed the dev-mguzek-optimize-generate-async-for-vlms branch from 3a45d9c to 3607934 Compare March 20, 2026 05:54
@moraxu (Collaborator, Author) commented Mar 20, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #39701 [ run ] triggered by Bot. Commit: 3607934 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #39701 [ run ] completed with state SUCCESS. Commit: 3607934
/LLM/main/L0_MergeRequest_PR pipeline #30898 completed with status: 'SUCCESS'

CI Report

Link to invocation

@schetlur-nv schetlur-nv merged commit b9ba730 into NVIDIA:main Apr 2, 2026
5 checks passed
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
…instead of de-tokenizing already encoded prompt (NVIDIA#11708)

Signed-off-by: Michal Guzek <mguzek@nvidia.com>
