[None][fix] Cherry-pick Python-side remaining_budget guard for KV reuse budget overflow #13601

Closed
nv-yna wants to merge 4 commits into NVIDIA/TensorRT-LLM:main from nv-yna/TensorRT-LLM:dyn-2868-fix

Conversation

nv-yna (Collaborator) commented Apr 29, 2026

Summary

Cherry-picks the Python-side fix from feat/bench_y PRs #12682 and #12806 to main.

The C++ scheduler fix (reuse_adjusted_compute() in microBatchScheduler.cpp) was already ported to main via #12976. What's missing is the Python safety net that catches residual estimation-vs-reality gaps when KV cache eviction occurs between scheduling and prepare_resources().

Root cause: The C++ MicroBatchScheduler uses estimatedReusableTokens to discount KV reuse from the token budget. Under heavy cache eviction pressure, actual reuse can be lower than estimated, causing total_num_tokens in _prepare_tp_inputs() to exceed max_num_tokens.
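
In numbers (a hypothetical two-request batch; only the 514 > 512 outcome is taken from the reproduction below):

```python
# Hypothetical per-request values chosen to land on the reported 514 > 512.
max_num_tokens = 512
chunk_size = 384                     # scheduled context chunk per request
estimated_reuse = [128, 128]         # what the scheduler discounts per request
actual_reuse = [128, 126]            # eviction dropped 2 reusable tokens for req 1

scheduled = sum(chunk_size - e for e in estimated_reuse)  # 512 -> batch admitted
actual = sum(chunk_size - a for a in actual_reuse)        # 514 -> assertion fires
assert scheduled <= max_num_tokens < actual
```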

What this PR adds:

  • remaining_budget re-validation guard in KVCacheManager.prepare_resources() — re-probes the radix tree for actual reusable blocks and skips requests that would overflow the budget (see the sketch after this list)
  • _estimate_post_reuse_compute() helper — correctly handles non-last-chunk vs last-chunk asymmetry in setPrepopulatedPromptLen
  • Pre-subtraction of non-first-chunk context costs so first-chunk budget checks are accurate
  • _compute_scheduled_tokens() and _maybe_log_batch_wait_decision() for compute-aware batch-wait decisions
  • Diagnostic logging before the max_num_tokens assertion in model_engine.py
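
A minimal sketch of the guard loop the first three bullets describe. `Req`, its fields, and `estimate_post_reuse_compute` here are illustrative stand-ins, not the real TensorRT-LLM types or API, and the helper omits the chunk-alignment handling mentioned above:

```python
from dataclasses import dataclass

@dataclass
class Req:
    is_context: bool
    is_first_context_chunk: bool
    context_chunk_size: int
    actual_reusable_tokens: int  # stand-in for a fresh radix-tree probe

def estimate_post_reuse_compute(req: Req, reuse: int) -> int:
    # Simplified: charge at least one token after reuse. The real helper also
    # handles the non-last-chunk vs last-chunk asymmetry around
    # setPrepopulatedPromptLen, which this sketch omits.
    return max(1, req.context_chunk_size - reuse)

def admit_with_budget_guard(reqs: list[Req], max_num_tokens: int) -> list[Req]:
    remaining = max_num_tokens
    # Pre-subtract non-first-chunk context costs so first-chunk checks are accurate.
    for r in reqs:
        if r.is_context and not r.is_first_context_chunk:
            remaining -= r.context_chunk_size
    admitted = []
    for r in reqs:
        if r.is_context and r.is_first_context_chunk:
            # Re-probe: actual reuse may be lower than the scheduler's estimate.
            cost = estimate_post_reuse_compute(r, r.actual_reusable_tokens)
            if cost > remaining:
                continue  # skip: admitting this request would overflow the budget
            remaining -= cost
        admitted.append(r)
    return admitted

# With the toy numbers above (costs 258 and 256 against a 512 budget),
# the guard admits the first request and defers the second.
```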

Files changed:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py — budget guard + _estimate_post_reuse_compute()
  • tensorrt_llm/_torch/pyexecutor/py_executor.py — compute-aware scheduling helpers
  • tensorrt_llm/_torch/pyexecutor/model_engine.py — overflow diagnostic logging

Reproduction: AssertionError: total_num_tokens (514) > max_num_tokens (512) on rc11/rc12/main with enable_chunked_prefill=true, enable_block_reuse=true, free_gpu_memory_fraction=0.15, and concurrent multi-prefix multi-turn load (8 prefix variants, 32 sessions).
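
Expressed through the LLM API, the reproducing configuration looks roughly like this (a sketch: the model path is a placeholder, and the option names assume the current tensorrt_llm.llmapi surface):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Sketch of the reproduction settings; drive it with 8 prefix variants
# across 32 concurrent multi-turn sessions to create eviction pressure.
llm = LLM(
    model="/path/to/model",              # placeholder; the report names no model
    max_num_tokens=512,
    enable_chunked_prefill=True,
    kv_cache_config=KvCacheConfig(
        enable_block_reuse=True,
        free_gpu_memory_fraction=0.15,   # tight cache to force eviction
    ),
)
```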

Verification: a Python overlay of this fix on a stock rc11 container ran 3500+ iterations and processed 413+ requests with zero assertion errors under the same conditions that reproduce the bug on unpatched rc11.

Fixes #13318.

Related PRs

Summary by CodeRabbit

  • Refactor

    • Enhanced diagnostic logging with detailed per-request breakdown when token capacity limits are approached
    • Improved batch-wait decision logic with more granular diagnostic reporting and resource utilization insights
  • Chores

    • Optimized resource allocation with preflight validation for block reuse to improve inference efficiency

nv-yna marked this pull request as ready for review April 29, 2026 06:53
nv-yna requested review from a team as code owners April 29, 2026 06:53
nv-yna requested a review from byshiue April 29, 2026 06:53
nv-yna (Collaborator, Author) commented Apr 29, 2026

/bot run --disable-fail-fast

coderabbitai (bot, Contributor) commented Apr 29, 2026

📝 Walkthrough

Adds diagnostic logging enhancements and optimization logic across the executor pipeline. Token overflow detection now logs per-request breakdowns in model_engine.py, batch-wait decisions are refactored with explicit thresholds and diagnostic reporting in py_executor.py, and resource_manager.py implements a preflight budgeting pass for block reuse with compute cost re-estimation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Diagnostic Logging Enhancements: tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/pyexecutor/py_executor.py | Adds detailed error and diagnostic logging: token overflow breakdowns by context request in model_engine.py; batch-wait decision logging with generation vs. context token formulas and per-iteration counters in py_executor.py. |
| Resource Management Optimization: tensorrt_llm/_torch/pyexecutor/resource_manager.py | Implements a preflight budgeting pass for block reuse that estimates the remaining compute budget, probes reusable blocks for actual cost, and filters requests that exceed post-reuse compute limits; adds the _estimate_post_reuse_compute helper for alignment-aware estimation. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies this as a fix cherry-picking a Python-side KV cache reuse budget guard, directly matching the substantive changes in resource_manager.py, py_executor.py, and model_engine.py. |
| Description check | ✅ Passed | The description comprehensively explains the root cause, what the PR adds, files changed, reproduction steps, and verification results; it follows the template structure with Summary, Test Coverage (implicitly through verification), and Related PRs sections. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai (bot, Contributor) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

3052-3143: ⚠️ Potential issue | 🔴 Critical

Fix the broken logging call and add the missing environment variable gate.

The logger.info(...) call uses multiple positional arguments without %s format placeholders. This causes a TypeError: not all arguments converted during string formatting when the log record is emitted on rank 0. Additionally, the docstring documents that diagnostics require TLLM_LOG_BATCH_WAIT=1, but the code never checks this environment variable—the function will log unconditionally on rank 0.

Suggested fix
     def _maybe_log_batch_wait_decision(
         self,
         context_requests: list[LlmRequest],
         generation_requests: list[LlmRequest],
         num_scheduled_tokens: int,
         wait_threshold: float,
         should_waiting: bool,
     ) -> None:
         """Diagnostics for batch_wait: set TLLM_LOG_BATCH_WAIT=1 (rank 0 only)."""
         if self.dist.rank != 0:
             return
+        if os.getenv("TLLM_LOG_BATCH_WAIT") != "1":
+            return

         num_scheduled_gen_tokens = sum(1 + gen_req.num_draft_tokens
                                        for gen_req in generation_requests)
         num_scheduled_ctx_formula = num_scheduled_tokens - num_scheduled_gen_tokens

         chunk_ctx_sum = 0
-        ctx_summaries: List[str] = []
+        ctx_summaries: list[str] = []
         max_detail = 4
         for i, ctx_req in enumerate(context_requests):
             full_len = len(ctx_req.get_tokens(0))
             begin = ctx_req.context_current_position
             chunk_sz = ctx_req.context_chunk_size
             this_chunk = min(chunk_sz, max(0, full_len - begin))
             chunk_ctx_sum += this_chunk
             reusable = (ctx_req.estimated_reusable_tokens
                         if ctx_req.is_first_context_chunk else 0)
             reusable_in_chunk = max(0, reusable - begin)
             remaining = ctx_req.context_remaining_length
             if (reusable_in_chunk > 0
                     and reusable_in_chunk + chunk_sz < remaining):
                 formula_contrib = chunk_sz
             else:
                 formula_contrib = max(1, chunk_sz - reusable_in_chunk)
             if i < max_detail:
                 ctx_summaries.append(
                     f"rid={ctx_req.py_request_id} full={full_len} pos={begin} "
                     f"chunk_sz={chunk_sz} this_chunk={this_chunk} "
                     f"reusable={reusable} formula_contrib={formula_contrib}")
         n_ctx = len(context_requests)
         if n_ctx > max_detail:
             ctx_summaries.append(f"... +{n_ctx - max_detail} more ctx req(s)")

         logger.info(
-            "batch_wait: formula_total=",
-            num_scheduled_tokens,
-            " formula_ctx=",
-            num_scheduled_ctx_formula,
-            " formula_gen=",
-            num_scheduled_gen_tokens,
-            " chunk_ctx_sum=",
-            chunk_ctx_sum,
-            " threshold=",
-            wait_threshold,
-            " wait_iter=",
-            self.batch_wait_iters_count,
-            "/",
-            self.batch_wait_timeout_iters,
-            " should_defer_ctx=",
-            should_waiting,
-            " num_gen=",
-            len(generation_requests),
-            " ctx_detail=[",
-            "; ".join(ctx_summaries),
-            "]",
+            "batch_wait: formula_total=%s formula_ctx=%s formula_gen=%s "
+            "chunk_ctx_sum=%s threshold=%s wait_iter=%s/%s "
+            "should_defer_ctx=%s num_gen=%s ctx_detail=[%s]",
+            num_scheduled_tokens,
+            num_scheduled_ctx_formula,
+            num_scheduled_gen_tokens,
+            chunk_ctx_sum,
+            wait_threshold,
+            self.batch_wait_iters_count,
+            self.batch_wait_timeout_iters,
+            should_waiting,
+            len(generation_requests),
+            "; ".join(ctx_summaries),
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3052 - 3143, The
logging call in _maybe_log_batch_wait_decision currently passes many positional
args to logger.info causing a formatting TypeError and it never obeys the
documented TLLM_LOG_BATCH_WAIT gate; update the function to first check the env
var (e.g. os.environ.get("TLLM_LOG_BATCH_WAIT") == "1") in addition to
self.dist.rank, and replace the multi-argument logger.info(...) call with a
single formatted message (built with an f-string or using .format) that embeds
num_scheduled_tokens, num_scheduled_ctx_formula, num_scheduled_gen_tokens,
chunk_ctx_sum, wait_threshold, self.batch_wait_iters_count,
self.batch_wait_timeout_iters, should_waiting, len(generation_requests), and the
ctx_detail string so the logger receives one string and no extra arguments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7d54a4b6-7256-4ae0-84fa-db5cd6aa844b

📥 Commits

Reviewing files that changed from the base of the PR and between 0c86dda and c39c1c0.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py

tensorrt-cicd (Collaborator) commented

PR_Github #46083 [ run ] triggered by Bot. Commit: c39c1c0 Link to invocation

…t overflow

Cherry-pick of Python portions from feat/bench_y PRs NVIDIA#12682 and NVIDIA#12806
that were not included in NVIDIA#12976 (which ported only the C++ fix to main).

Adds a remaining_budget re-validation guard in KVCacheManager.prepare_resources()
that re-probes the radix tree for actual reusable blocks after KV cache allocation
and skips requests whose forward cost exceeds the remaining budget. This catches
the estimation-vs-reality gap when cache eviction between scheduling and
prepare_resources() reduces actual reuse below the scheduler's estimate.

Original authors: Liao Lanyu (@lancelly), Jin Li (@liji-nv)

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
nv-yna (Collaborator, Author) commented Apr 29, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #46175 [ run ] triggered by Bot. Commit: f0a16e8 Link to invocation

Adds three pure-Python test files (no GPU, no model weights) covering
the budget guard introduced in the previous commit:

- test_post_reuse_compute.py (Tier 1):
  Branch-coverage table for KVCacheManager._estimate_post_reuse_compute()
  plus parametrized "production-relevant divergence" cases proving the
  helper undercharges when chunk_size is not block-aligned (legal per
  test_kv_cache_v2_scheduler.py:243-247). Includes a precise property
  test asserting helper <= actual in the reuse branch (0 < P < prompt_len)
  and documents the short-circuit overcharge case.

- test_resource_manager_v1_budget.py (Tier 2 + Tier 3):
  Mock-based integration tests driving the real prepare_resources() loop
  via KVCacheManager.__new__ to skip C++ init. Covers skip path, no-skip
  path, non-first-chunk pre-subtraction, gen+draft tokens consuming
  budget, no-op when reuse off / is_draft, reset_context_requests with
  is_last_context_chunk mutation, kv_connector elif branch, connector
  callbacks (update_state_after_alloc, build_scheduler_output),
  count_reusable_blocks call shape, and non-first-chunk fallthrough.

  Tier 3 contains the deterministic DYN-2868 regression pair:
  - test_m: post-prepare total <= max_num_tokens with the guard
  - test_n: negative control patches the helper to 0 and verifies
    overshoot via an INDEPENDENT helper (model_engine_total_tokens)
    that mirrors model_engine.py:2293-2297 + 2615-2616 token counting.

- test_batch_wait_log.py (Tier 4):
  Smoke test for PyExecutor._maybe_log_batch_wait_decision rank-0 gate.

Local verification:
- Tier 1 arithmetic verified standalone (no TRT-LLM deps).
- Tier 3 logic simulated independently (Test M = 512, Test N = 1024).
- All three files pass py_compile.
- Full pytest run requires tensorrt_llm.bindings (C++ build) and
  was not run locally; intended for CI.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
tensorrt-cicd (Collaborator) commented

PR_Github #46175 [ run ] completed with state SUCCESS. Commit: f0a16e8
/LLM/main/L0_MergeRequest_PR pipeline #36294 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yifjiang (Contributor) commented

This PR doesn't apply cleanly to current main because of an API rename introduced by #13029 (which merged 16 minutes after #12976 on 2026-04-18 and consolidated findNewContextBlock() + countReusableBlocks(true) + countReusableBlocks(false) into a single analyze_prefix_reuse() returning PrefixReuseSummary).

tensorrt_llm/_torch/pyexecutor/resource_manager.py line 718 in this PR calls:

reusable_blocks = self.impl.count_reusable_blocks(unique_tokens, req, False)

The False (= onlyAllocated=False) returns all matched reusable blocks per the pre-#13029 implementation in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (the conditional was if (!onlyAllocated || matchingBlock->hasRefs())). On main after #13029, count_reusable_blocks no longer exists; the equivalent is:

reusable_blocks = self.impl.analyze_prefix_reuse(unique_tokens, req).reusable_blocks_all

Without this fix, the worker dies on the first first-chunk admission with AttributeError: 'KVCacheManager' object has no attribute 'count_reusable_blocks'.

(Side note for the maintainers: py_executor.py and model_engine.py in this PR also have downstream-API dependencies — strip_mm_data_for_generation, ModelConfig.get_num_attention_layers(is_disagg=...) — but those are observability-only changes, so dropping them on a strict-main base is harmless if you want a minimal-surface fix.)

I tested this empirically with the issue #13318 reproducer (Qwen3-4B / 1×GB200 / 1 shared system prompt / max_num_tokens=4096 / fgmf=0.80 / qps=18 trigger) on three images:

| Image | Has #12976 | Has #13029 | Has this PR | Result |
|---|---|---|---|---|
| Built from ea117de (= #12976 merge, parent of #13029) | ✅ | ❌ | ❌ | crashes: total_num_tokens (6946) > max=4096 |
| same + this PR's gate | ✅ | ❌ | ✅ | still crashes: total_num_tokens (4242) > max=4096, 0 "Reuse budget: skip req" events |
| rc13 (= ea117de + #13029) | ✅ | ✅ | ❌ | clean |

It looks like #13029's two-phase batch claim closes the within-batch eviction race that produces the over-admission, while this PR's gate doesn't catch that race: the gate's reuse re-probe runs before add_sequence, but the race fires inside add_sequence for sequences later in the batch, so the gate never sees it. I haven't been able to construct a workload where the gate's "Reuse budget: skip req" message actually emits on a main+#13029 build. It would be useful to know what workload class the "heavy KV-cache eviction" scenario in this PR's description is intended to catch on main, since I can't trigger it on #13029-equipped main with multi-prefix, tight-cache, high-qps reproducers either.
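
A toy model of that ordering (illustrative Python, treating the guard as a whole-batch preflight per the walkthrough's description; not the actual call graph):

```python
# Toy model: every request is probed before any allocation, so a later
# allocation can evict blocks the preflight already credited.
reusable = {"req_a": 2, "req_b": 2}   # blocks currently matched in the radix tree

# Preflight pass: the gate re-probes and budgets with these counts.
credited = dict(reusable)

# Allocation pass: req_a's add_sequence evicts under memory pressure,
# invalidating the credit the preflight gave req_b.
reusable["req_b"] = 0

# req_b now recomputes more tokens than budgeted, and the gate never saw it.
assert credited["req_b"] > reusable["req_b"]
```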

nv-yna (Collaborator, Author) commented Apr 30, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #46431 [ run ] triggered by Bot. Commit: e06999e Link to invocation

tensorrt-cicd (Collaborator) commented

PR_Github #46431 [ run ] completed with state SUCCESS. Commit: e06999e
/LLM/main/L0_MergeRequest_PR pipeline #36502 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

nv-yna and others added 2 commits May 4, 2026 10:53
Replace the count_reusable_blocks() call in the prepare_resources budget
guard with analyze_prefix_reuse(...).reusable_blocks_all, matching the
public KVCacheManager binding API. Update tests to mock the new return
shape via a small Mock(reusable_blocks_all=N) helper.

Tighten comments and docstrings to describe current behavior without
referencing PRs, issues, or specific line numbers; reformat with
ruff format to satisfy the pre-commit hook.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
nv-yna (Collaborator, Author) commented May 4, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #46674 [ run ] triggered by Bot. Commit: 8322e01 Link to invocation

nv-yna closed this May 4, 2026
tensorrt-cicd (Collaborator) commented

PR_Github #46674 [ run ] completed with state SUCCESS. Commit: 8322e01
/LLM/main/L0_MergeRequest_PR pipeline #36715 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation
