[TRTLLM-11163][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt #11708

Merged

schetlur-nv merged 11 commits into NVIDIA:main from moraxu:dev-mguzek-optimize-generate-async-for-vlms on Apr 2, 2026

Conversation

@moraxu (Collaborator) commented Feb 25, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added fast-path support for processing pre-tokenized prompts with multimodal data, eliminating unnecessary detokenization steps for improved performance.
    • Enhanced image placeholder expansion and handling for multimodal inputs.
  • Refactor

    • Improved input processing pipeline to seamlessly support both text-based and pre-tokenized prompt paths with unified multimodal handling.

Description

Process token IDs + MM data without de-tokenizing. Instead (see the sketch below):

  1. Process the multi-modal inputs with dummy prompts, so that the number of media placeholder tokens matches the number of multi-modal inputs,
  2. Replace the placeholder token IDs with the actual multi-modal feature token IDs.

See: https://docs.vllm.ai/en/latest/design/mm_processing/#multi-modal-data-processing

Currently implemented only for tensorrt_llm/_torch/models/modeling_llava_next.py
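
For illustration only, here is a minimal sketch of the placeholder-expansion step described above. The helper name and the <image> token ID (32000) are hypothetical; in the PR the actual logic lives in the new _expand_image_placeholders_in_token_ids() / expand_prompt_token_ids_for_mm() methods of modeling_llava_next.py:

from typing import List

def expand_image_placeholders(prompt_token_ids: List[int],
                              image_token_id: int,
                              feature_lengths: List[int]) -> List[int]:
    # Expand each single image placeholder token into as many copies as the
    # corresponding multi-modal input contributes feature tokens, so the token
    # sequence reserves one position per image feature.
    expanded: List[int] = []
    image_idx = 0
    for tok in prompt_token_ids:
        if tok == image_token_id and image_idx < len(feature_lengths):
            expanded.extend([image_token_id] * feature_lengths[image_idx])
            image_idx += 1
        else:
            expanded.append(tok)
    return expanded

# One image whose vision-encoder output occupies 4 feature slots:
print(expand_image_placeholders([1, 32000, 15, 42],  # 32000 = hypothetical <image> id
                                image_token_id=32000,
                                feature_lengths=[4]))
# -> [1, 32000, 32000, 32000, 32000, 15, 42]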

Test Coverage

Tested for:

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
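
For example, a typical invocation combining the options above (the stage name is taken from the examples given) would be:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"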

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous; skipping without careful validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous; reusing without careful validation can break the top of tree.

@moraxu moraxu changed the title [None][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt [TRTLLM-11163][feat] Introduce a fast path (token IDs + MM) for VLMs instead of de-tokenizing already encoded prompt Feb 27, 2026
@moraxu moraxu force-pushed the dev-mguzek-optimize-generate-async-for-vlms branch from 89ee815 to 1a6f189 Compare March 4, 2026 11:41
@moraxu moraxu force-pushed the dev-mguzek-optimize-generate-async-for-vlms branch from 22a36dc to 8dca856 Compare March 7, 2026 01:50
@moraxu moraxu marked this pull request as ready for review March 7, 2026 01:50
@moraxu moraxu requested review from a team as code owners March 7, 2026 01:50
@moraxu (Collaborator, Author) commented Mar 7, 2026

/bot run

@moraxu moraxu requested a review from 2ez4bz March 7, 2026 01:50
@tensorrt-cicd (Collaborator)

PR_Github #38077 [ run ] triggered by Bot. Commit: 8dca856 Link to invocation

@coderabbitai (Contributor, Bot) commented Mar 7, 2026

📝 Walkthrough

This pull request introduces a fast path for processing tokenized prompts with multimodal data in the LLaVA-Next model pipeline. New methods handle image placeholder expansion in token IDs, registry utilities orchestrate the fast-path routing, and the LLM API detects and activates the fast path when compatible processors are available.

Changes

LLaVA Next Input Processor Methods (tensorrt_llm/_torch/models/modeling_llava_next.py):
Added get_text_with_mm_placeholders(), _expand_image_placeholders_in_token_ids(), and expand_prompt_token_ids_for_mm() to support placeholder expansion in tokenized prompts. Modified the get_prompt_token_ids() signature to accept prompt_token_ids directly instead of text prompts. Updated attach_multimodal_embeddings() to handle the tokenized+MM path with assertion checks.

Multimodal Registry Fast Path (tensorrt_llm/inputs/registry.py):
Added helper utilities for normalizing MM counts, generating dummy placeholders, and fetching MM token lengths. Implemented tokenized_multimodal_process() and updated multimodal_hashing_process() to support precomputed token IDs and extra multimodal inputs. Enhanced input_processor_wrapper() to detect and route the tokenized+MM fast path.

LLM API Fast Path Detection (tensorrt_llm/llmapi/llm.py):
Introduced the vlm_fast_path_for_token_ids_and_mm_data_available feature flag. Added a detokenization fallback when the fast path is unavailable. Updated _preprocess() to route tokenized+MM inputs appropriately and modified key access patterns for safer handling.
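
Roughly, the routing added around _preprocess() can be pictured as follows. Apart from the vlm_fast_path_for_token_ids_and_mm_data_available flag and the tokenizer.decode() fallback described above, the function and method names here are placeholders, not the actual TensorRT-LLM API:

def preprocess_route(inputs: dict, input_processor, tokenizer,
                     fast_path_available: bool) -> dict:
    has_tokens = "prompt_token_ids" in inputs
    has_mm = bool(inputs.get("multi_modal_data"))

    if has_tokens and has_mm and fast_path_available:
        # Fast path: expand placeholders directly on the token IDs and attach
        # the multimodal features, with no detokenization round-trip.
        return input_processor.process_token_ids(
            inputs["prompt_token_ids"], inputs["multi_modal_data"])

    if has_tokens and has_mm:
        # Fallback: recover a text prompt and reuse the existing text+MM path.
        text = tokenizer.decode(inputs["prompt_token_ids"])
        return input_processor.process_text(text, inputs["multi_modal_data"])

    # Plain token-ID or text-only requests keep their existing paths.
    return inputs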

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant LLM API
    participant InputRegistry
    participant LlavaNextProcessor
    participant Tokenizer

    Client->>LLM API: _preprocess(inputs with prompt_token_ids + mm_data)
    
    LLM API->>LLM API: Check vlm_fast_path_for_token_ids_and_mm_data_available
    
    alt Fast Path Available
        LLM API->>InputRegistry: input_processor_wrapper(prompt_token_ids, mm_data)
        InputRegistry->>InputRegistry: Detect tokenized+MM path
        InputRegistry->>LlavaNextProcessor: expand_prompt_token_ids_for_mm()
        LlavaNextProcessor->>LlavaNextProcessor: _expand_image_placeholders_in_token_ids()
        LlavaNextProcessor-->>InputRegistry: expanded_ids, mm_token_length, mm_token_offsets
        InputRegistry->>InputRegistry: tokenized_multimodal_process()
        InputRegistry-->>LLM API: Processed output (skips detokenization)
    else Fast Path Unavailable
        LLM API->>Tokenizer: decode(prompt_token_ids)
        Tokenizer-->>LLM API: text_prompt
        LLM API->>InputRegistry: input_processor_wrapper(text_prompt, mm_data)
        InputRegistry->>LlavaNextProcessor: Standard text+MM processing
        LlavaNextProcessor-->>InputRegistry: Processed output
        InputRegistry-->>LLM API: Processed output
    end
    
    LLM API-->>Client: Preprocessed inputs

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 76.47%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The title clearly and specifically identifies the main change: introducing a fast path for processing token IDs with multimodal data without de-tokenization.
  • Description check: ✅ Passed. The PR description provides a clear explanation of what is being implemented (a fast path for processing token IDs + MM data without de-tokenizing), includes test coverage details, and confirms the PR checklist is completed.


@coderabbitai (Contributor, Bot) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/llmapi/llm.py (2)

491-503: ⚠️ Potential issue | 🔴 Critical

Preserve the non-text multimodal fields when rewriting inputs.

This fallback rebuilds the request as TextPrompt(...), which drops multi_modal_embeddings/multi_modal_uuids and also materializes multi_modal_data=None. On non-fast-path VLMs, prompt_token_ids + multi_modal_embeddings then falls into the multi_modal_data branch and dies on .keys(), while UUID-based cache IDs are silently lost.

🐛 Suggested fix
-            inputs = TextPrompt(
-                prompt=prompt,
-                multi_modal_data=inputs.get("multi_modal_data"),
-                mm_processor_kwargs=inputs.get("mm_processor_kwargs") or {})
+            fallback_inputs: dict[str, Any] = {"prompt": prompt}
+            for key in (
+                    "multi_modal_data",
+                    "multi_modal_embeddings",
+                    "multi_modal_uuids",
+                    "mm_processor_kwargs",
+                    "query",
+                    "query_token_ids",
+            ):
+                value = inputs.get(key)
+                if value is not None:
+                    fallback_inputs[key] = value
+            inputs = fallback_inputs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/llmapi/llm.py` around lines 491 - 503, When rebuilding inputs
from prompt_token_ids in the VLM fallback, preserve all non-text multimodal
fields instead of dropping them: use the existing
inputs.get("multi_modal_data"), inputs.get("multi_modal_embeddings"), and
inputs.get("multi_modal_uuids") when constructing the TextPrompt so
embeddings/UUID cache IDs aren’t lost and multi_modal_data isn’t materialized to
None; update the block that calls tokenizer.decode and constructs TextPrompt
(referencing inputs, prompt_token_ids, tokenizer.decode, TextPrompt,
mm_processor_kwargs, DefaultInputProcessor, and
vlm_fast_path_for_token_ids_and_mm_data_available) to forward
multi_modal_embeddings and multi_modal_uuids and keep mm_processor_kwargs as
before.

554-578: ⚠️ Potential issue | 🟠 Major

Treat empty multimodal payloads as absent.

These conditions use is None / key presence, so {"prompt_token_ids": ..., "multi_modal_data": {}} is routed into the multimodal branch instead of the plain token-id branch. Downstream that means either a fast-path MM length lookup on an empty map or calling the VLM processor with prompt=None and no media.

🐛 Suggested fix
+        has_multi_modal_data = bool(inputs.get("multi_modal_data"))
+        has_multi_modal_embeddings = bool(
+            inputs.get("multi_modal_embeddings"))
+
-        elif ("prompt_token_ids" in inputs
-              and inputs.get("multi_modal_data") is None
-              and inputs.get("multi_modal_embeddings") is None):
+        elif ("prompt_token_ids" in inputs
+              and not has_multi_modal_data
+              and not has_multi_modal_embeddings):
             prompt_token_ids = inputs['prompt_token_ids']
@@
-        elif "prompt" in inputs or ("prompt_token_ids" in inputs and
-                                    (("multi_modal_data" in inputs
-                                      or "multi_modal_embeddings" in inputs))):
+        elif "prompt" in inputs or ("prompt_token_ids" in inputs and
+                                    (has_multi_modal_data
+                                     or has_multi_modal_embeddings)):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/llmapi/llm.py` around lines 554 - 578, The multimodal branch
currently triggers on key presence or is None checks and treats empty dicts as
present; change the checks to treat empty multimodal payloads as absent by using
truthiness/emptiness checks instead of is None or key presence. Specifically,
when inspecting inputs in the initial branch with "prompt_token_ids", use
something like checking inputs.get("multi_modal_data") and
inputs.get("multi_modal_embeddings") are non-empty (truthy) before routing into
multimodal handling and before constructing multimodal_data/MultimodalParams;
likewise adjust the subsequent elif condition that looks for ("multi_modal_data"
in inputs or "multi_modal_embeddings" in inputs) to require non-empty values so
plain token-id paths (prompt_token_ids, prompt_token_ids + empty multimodal)
follow the token-only logic. Ensure mrope handling
(disaggregated_params.mrope_position_ids_handle / mrope_position_deltas_handle)
logic remains the same but only runs when multimodal payloads are actually
present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/llmapi/llm.py`:
- Around line 526-529: The MM disaggregation path currently assumes
inputs["prompt_token_ids"] exists when
vlm_fast_path_for_token_ids_and_mm_data_available is true; update the call site
around input_processor.get_prompt_token_ids to first detect whether
prompt_token_ids are present and, if not, pass the raw prompt (or invoke the
tokenizer) so tokenization happens before preprocessing; alternatively make
get_prompt_token_ids backward-compatible by accepting raw prompt text and
mm_handles and producing token ids when prompt_token_ids is absent — adjust the
logic using vlm_fast_path_for_token_ids_and_mm_data_available,
inputs["prompt_token_ids"], input_processor.get_prompt_token_ids, and mm_handles
accordingly so raw-text disagg requests no longer crash.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d81eef8-9b02-4737-9132-c6f88766cf8d

📥 Commits

Reviewing files that changed from the base of the PR and between cc16289 and 8dca856.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/models/modeling_llava_next.py
  • tensorrt_llm/inputs/registry.py
  • tensorrt_llm/llmapi/llm.py

@tensorrt-cicd (Collaborator)

PR_Github #38077 [ run ] completed with state SUCCESS. Commit: 8dca856
/LLM/main/L0_MergeRequest_PR pipeline #29504 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@moraxu (Collaborator, Author) commented Mar 9, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #38195 [ run ] triggered by Bot. Commit: 9545eff Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #38195 [ run ] completed with state SUCCESS. Commit: 9545eff
/LLM/main/L0_MergeRequest_PR pipeline #29588 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@moraxu (Collaborator, Author) commented Mar 9, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #38301 [ run ] triggered by Bot. Commit: 9545eff Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #38301 [ run ] completed with state SUCCESS. Commit: 9545eff
/LLM/main/L0_MergeRequest_PR pipeline #29678 completed with status: 'SUCCESS'

Link to invocation

@moraxu moraxu requested a review from 2ez4bz March 11, 2026 00:28
@moraxu (Collaborator, Author) commented Mar 16, 2026

TODO: for due diligence, run before/after perf tests with aiperf for a model that doesn't support the fast path.

Branch:

                                               NVIDIA AIPerf | LLM Metrics                                               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃                               Metric ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p50 ┃    std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│             Time to First Token (ms) │  1,655.27 │    339.27 │  3,694.30 │  3,487.93 │  2,258.40 │  1,773.55 │ 659.40 │
│            Time to Second Token (ms) │    129.52 │      0.00 │    306.08 │    173.52 │    142.89 │    134.74 │  29.94 │
│      Time to First Output Token (ms) │  1,655.27 │    339.27 │  3,694.30 │  3,487.93 │  2,258.40 │  1,773.55 │ 659.40 │
│                 Request Latency (ms) │  9,028.55 │  7,100.27 │ 10,726.06 │ 10,468.85 │  9,386.16 │  9,028.99 │ 436.75 │
│             Inter Token Latency (ms) │     15.72 │     10.43 │     18.96 │     18.64 │     17.92 │     15.37 │   1.52 │
│     Output Token Throughput Per User │     64.25 │     52.75 │     95.87 │     91.38 │     68.32 │     65.08 │   6.82 │
│                    (tokens/sec/user) │           │           │           │           │           │           │        │
│      Output Sequence Length (tokens) │    469.97 │    468.00 │    470.00 │    470.00 │    470.00 │    470.00 │   0.20 │
│       Input Sequence Length (tokens) │ 13,451.73 │ 13,449.00 │ 13,452.00 │ 13,452.00 │ 13,452.00 │ 13,452.00 │   0.50 │
│ Output Token Throughput (tokens/sec) │  1,637.57 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
│        Image Throughput (images/sec) │      0.33 │      0.28 │      0.42 │      0.41 │      0.34 │      0.33 │   0.02 │
│             Image Latency (ms/image) │  3,009.52 │  2,366.76 │  3,575.35 │  3,489.62 │  3,128.72 │  3,009.66 │ 145.58 │
│    Request Throughput (requests/sec) │      3.48 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
│             Request Count (requests) │    224.00 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
└──────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴────────┘

CLI Command: aiperf profile --model 'Qwen/Qwen3-VL-30B-A3B-Instruct-FP8' --url 'http://localhost:8123' --shared-system-prompt-length 8600 --user-context-prompt-length 4300 --num-dataset-entries 500 --endpoint-type 'chat' --streaming 
--warmup-request-count 5 --request-rate 32 --request-rate-mode 'constant' --concurrency 32 --benchmark-duration 60 --benchmark-grace-period 30 --extra-inputs 'max_tokens:470' --extra-inputs 'min_tokens:470' --extra-inputs 
'ignore_eos:true' --extra-inputs 'skip_special_tokens:false' --image-batch-size 3 --image-width-mean 512 --image-height-mean 512 --artifact-dir 'trtllm_fp8_kv_cache_fp8/conc32' --no-server-metrics
Benchmark Duration: 64.29 sec

main:

                                               NVIDIA AIPerf | LLM Metrics                                               
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┓
┃                               Metric ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p50 ┃    std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━┩
│             Time to First Token (ms) │  1,687.13 │    349.43 │  3,651.55 │  3,490.32 │  2,261.78 │  1,777.97 │ 661.54 │
│            Time to Second Token (ms) │    131.93 │      9.74 │    295.12 │    206.96 │    145.92 │    135.45 │  29.77 │
│      Time to First Output Token (ms) │  1,687.13 │    349.43 │  3,651.55 │  3,490.32 │  2,261.78 │  1,777.97 │ 661.54 │
│                 Request Latency (ms) │  9,058.80 │  6,797.73 │ 10,733.29 │ 10,516.75 │  9,591.21 │  9,064.43 │ 504.39 │
│             Inter Token Latency (ms) │     15.72 │     10.27 │     18.90 │     18.81 │     18.00 │     15.34 │   1.55 │
│     Output Token Throughput Per User │     64.29 │     52.91 │     97.36 │     92.82 │     68.18 │     65.18 │   7.04 │
│                    (tokens/sec/user) │           │           │           │           │           │           │        │
│      Output Sequence Length (tokens) │    469.93 │    462.00 │    471.00 │    470.00 │    470.00 │    470.00 │   0.59 │
│       Input Sequence Length (tokens) │ 13,451.84 │ 13,450.00 │ 13,453.00 │ 13,452.00 │ 13,452.00 │ 13,452.00 │   0.40 │
│ Output Token Throughput (tokens/sec) │  1,630.84 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
│        Image Throughput (images/sec) │      0.33 │      0.28 │      0.44 │      0.42 │      0.34 │      0.33 │   0.02 │
│             Image Latency (ms/image) │  3,019.60 │  2,265.91 │  3,577.76 │  3,505.58 │  3,197.07 │  3,021.48 │ 168.13 │
│    Request Throughput (requests/sec) │      3.47 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
│             Request Count (requests) │    224.00 │       N/A │       N/A │       N/A │       N/A │       N/A │    N/A │
└──────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴────────┘

CLI Command: aiperf profile --model 'Qwen/Qwen3-VL-30B-A3B-Instruct-FP8' --url 'http://localhost:8123' --shared-system-prompt-length 8600 --user-context-prompt-length 4300 --num-dataset-entries 500 --endpoint-type 'chat' --streaming 
--warmup-request-count 5 --request-rate 32 --request-rate-mode 'constant' --concurrency 32 --benchmark-duration 60 --benchmark-grace-period 30 --extra-inputs 'max_tokens:470' --extra-inputs 'min_tokens:470' --extra-inputs 
'ignore_eos:true' --extra-inputs 'skip_special_tokens:false' --image-batch-size 3 --image-width-mean 512 --image-height-mean 512 --artifact-dir 'trtllm_fp8_kv_cache_fp8/conc32' --no-server-metrics
Benchmark Duration: 64.55 sec

@moraxu moraxu requested a review from 2ez4bz March 16, 2026 08:38
@moraxu moraxu requested a review from pcastonguay March 17, 2026 20:45
@moraxu (Collaborator, Author) commented Mar 19, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #39640 [ run ] triggered by Bot. Commit: 3a45d9c Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #39640 [ run ] completed with state SUCCESS. Commit: 3a45d9c
/LLM/main/L0_MergeRequest_PR pipeline #30845 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@moraxu (Collaborator, Author) commented Mar 20, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #39655 [ run ] triggered by Bot. Commit: 3a45d9c Link to invocation

moraxu added 11 commits March 19, 2026 18:33 (each signed off by: Michal Guzek <mguzek@nvidia.com>)
@tensorrt-cicd (Collaborator)

PR_Github #39655 [ run ] completed with state SUCCESS. Commit: 3a45d9c
/LLM/main/L0_MergeRequest_PR pipeline #30859 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@moraxu moraxu force-pushed the dev-mguzek-optimize-generate-async-for-vlms branch from 3a45d9c to 3607934 Compare March 20, 2026 05:54
@moraxu (Collaborator, Author) commented Mar 20, 2026

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #39701 [ run ] triggered by Bot. Commit: 3607934 Link to invocation

@tensorrt-cicd (Collaborator)

PR_Github #39701 [ run ] completed with state SUCCESS. Commit: 3607934
/LLM/main/L0_MergeRequest_PR pipeline #30898 completed with status: 'SUCCESS'

CI Report

Link to invocation

@schetlur-nv schetlur-nv merged commit b9ba730 into NVIDIA:main Apr 2, 2026
5 checks passed
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
…instead of de-tokenizing already encoded prompt (NVIDIA#11708)

Signed-off-by: Michal Guzek <mguzek@nvidia.com>
