[None][fix] Cherry-pick Python-side remaining_budget guard for KV reuse budget overflow #13601

Closed
nv-yna wants to merge 4 commits into NVIDIA/TensorRT-LLM:main from nv-yna/TensorRT-LLM:dyn-2868-fix

Conversation

nv-yna (Collaborator) commented Apr 29, 2026

Summary

Cherry-picks the Python-side fix from feat/bench_y PRs #12682 and #12806 to main.

The C++ scheduler fix (reuse_adjusted_compute() in microBatchScheduler.cpp) was already ported to main via #12976. What's missing is the Python safety net that catches residual estimation-vs-reality gaps when KV cache eviction occurs between scheduling and prepare_resources().

Root cause: The C++ MicroBatchScheduler uses estimatedReusableTokens to discount KV reuse from the token budget. Under heavy cache eviction pressure, actual reuse can be lower than estimated, causing total_num_tokens in _prepare_tp_inputs() to exceed max_num_tokens.
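
In numbers (a hypothetical two-request batch; only the 514 > 512 outcome is taken from the reproduction below):

```python
# Hypothetical per-request values chosen to land on the reported 514 > 512.
max_num_tokens = 512
chunk_size = 384                     # scheduled context chunk per request
estimated_reuse = [128, 128]         # what the scheduler discounts per request
actual_reuse = [128, 126]            # eviction dropped 2 reusable tokens for req 1

scheduled = sum(chunk_size - e for e in estimated_reuse)  # 512 -> batch admitted
actual = sum(chunk_size - a for a in actual_reuse)        # 514 -> assertion fires
assert scheduled <= max_num_tokens < actual
```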

What this PR adds:

  • remaining_budget re-validation guard in KVCacheManager.prepare_resources() — re-probes the radix tree for actual reusable blocks and skips requests that would overflow the budget (see the sketch after this list)
  • _estimate_post_reuse_compute() helper — correctly handles non-last-chunk vs last-chunk asymmetry in setPrepopulatedPromptLen
  • Pre-subtraction of non-first-chunk context costs so first-chunk budget checks are accurate
  • _compute_scheduled_tokens() and _maybe_log_batch_wait_decision() for compute-aware batch-wait decisions
  • Diagnostic logging before the max_num_tokens assertion in model_engine.py
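
A minimal sketch of the guard loop the first three bullets describe. `Req`, its fields, and `estimate_post_reuse_compute` here are illustrative stand-ins, not the real TensorRT-LLM types or API, and the helper omits the chunk-alignment handling mentioned above:

```python
from dataclasses import dataclass

@dataclass
class Req:
    is_context: bool
    is_first_context_chunk: bool
    context_chunk_size: int
    actual_reusable_tokens: int  # stand-in for a fresh radix-tree probe

def estimate_post_reuse_compute(req: Req, reuse: int) -> int:
    # Simplified: charge at least one token after reuse. The real helper also
    # handles the non-last-chunk vs last-chunk asymmetry around
    # setPrepopulatedPromptLen, which this sketch omits.
    return max(1, req.context_chunk_size - reuse)

def admit_with_budget_guard(reqs: list[Req], max_num_tokens: int) -> list[Req]:
    remaining = max_num_tokens
    # Pre-subtract non-first-chunk context costs so first-chunk checks are accurate.
    for r in reqs:
        if r.is_context and not r.is_first_context_chunk:
            remaining -= r.context_chunk_size
    admitted = []
    for r in reqs:
        if r.is_context and r.is_first_context_chunk:
            # Re-probe: actual reuse may be lower than the scheduler's estimate.
            cost = estimate_post_reuse_compute(r, r.actual_reusable_tokens)
            if cost > remaining:
                continue  # skip: admitting this request would overflow the budget
            remaining -= cost
        admitted.append(r)
    return admitted

# With the toy numbers above (costs 258 and 256 against a 512 budget),
# the guard admits the first request and defers the second.
```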

Files changed:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py — budget guard + _estimate_post_reuse_compute()
  • tensorrt_llm/_torch/pyexecutor/py_executor.py — compute-aware scheduling helpers
  • tensorrt_llm/_torch/pyexecutor/model_engine.py — overflow diagnostic logging

Reproduction: AssertionError: total_num_tokens (514) > max_num_tokens (512) on rc11/rc12/main with enable_chunked_prefill=true, enable_block_reuse=true, free_gpu_memory_fraction=0.15, and concurrent multi-prefix multi-turn load (8 prefix variants, 32 sessions).
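
Expressed through the LLM API, the reproducing configuration looks roughly like this (a sketch: the model path is a placeholder, and the option names assume the current tensorrt_llm.llmapi surface):

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Sketch of the reproduction settings; drive it with 8 prefix variants
# across 32 concurrent multi-turn sessions to create eviction pressure.
llm = LLM(
    model="/path/to/model",              # placeholder; the report names no model
    max_num_tokens=512,
    enable_chunked_prefill=True,
    kv_cache_config=KvCacheConfig(
        enable_block_reuse=True,
        free_gpu_memory_fraction=0.15,   # tight cache to force eviction
    ),
)
```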

Verification: a Python overlay of this fix on a stock rc11 container ran 3500+ iterations and processed 413+ requests with zero assertion errors under the same conditions that reproduce the bug on unpatched rc11.

Fixes #13318.

Related PRs

Summary by CodeRabbit

  • Refactor

    • Enhanced diagnostic logging with detailed per-request breakdown when token capacity limits are approached
    • Improved batch-wait decision logic with more granular diagnostic reporting and resource utilization insights
  • Chores

    • Optimized resource allocation with preflight validation for block reuse to improve inference efficiency

nv-yna marked this pull request as ready for review April 29, 2026 06:53
nv-yna requested review from a team as code owners April 29, 2026 06:53
nv-yna requested a review from byshiue April 29, 2026 06:53
nv-yna (Collaborator, Author) commented Apr 29, 2026

/bot run --disable-fail-fast

coderabbitai (bot, Contributor) commented Apr 29, 2026

📝 Walkthrough

Adds diagnostic logging enhancements and optimization logic across the executor pipeline. Token overflow detection now logs per-request breakdowns in model_engine.py, batch-wait decisions are refactored with explicit thresholds and diagnostic reporting in py_executor.py, and resource_manager.py implements a preflight budgeting pass for block reuse with compute cost re-estimation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Diagnostic Logging Enhancements: tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/_torch/pyexecutor/py_executor.py | Adds detailed error and diagnostic logging: token overflow breakdowns by context request in model_engine.py; batch-wait decision logging with generation vs. context token formulas and per-iteration counters in py_executor.py. |
| Resource Management Optimization: tensorrt_llm/_torch/pyexecutor/resource_manager.py | Implements a preflight budgeting pass for block reuse that estimates the remaining compute budget, probes reusable blocks for actual cost, and filters requests that exceed post-reuse compute limits; adds the _estimate_post_reuse_compute helper for alignment-aware estimation. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies this as a fix cherry-picking a Python-side KV cache reuse budget guard, directly matching the substantive changes in resource_manager.py, py_executor.py, and model_engine.py. |
| Description check | ✅ Passed | The description comprehensively explains the root cause, what the PR adds, files changed, reproduction steps, and verification results; it follows the template structure with Summary, Test Coverage (implicitly through verification), and Related PRs sections. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


coderabbitai (bot, Contributor) left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

3052-3143: ⚠️ Potential issue | 🔴 Critical

Fix the broken logging call and add the missing environment variable gate.

The logger.info(...) call uses multiple positional arguments without %s format placeholders. This causes a TypeError: not all arguments converted during string formatting when the log record is emitted on rank 0. Additionally, the docstring documents that diagnostics require TLLM_LOG_BATCH_WAIT=1, but the code never checks this environment variable—the function will log unconditionally on rank 0.

Suggested fix
     def _maybe_log_batch_wait_decision(
         self,
         context_requests: list[LlmRequest],
         generation_requests: list[LlmRequest],
         num_scheduled_tokens: int,
         wait_threshold: float,
         should_waiting: bool,
     ) -> None:
         """Diagnostics for batch_wait: set TLLM_LOG_BATCH_WAIT=1 (rank 0 only)."""
         if self.dist.rank != 0:
             return
+        if os.getenv("TLLM_LOG_BATCH_WAIT") != "1":
+            return

         num_scheduled_gen_tokens = sum(1 + gen_req.num_draft_tokens
                                        for gen_req in generation_requests)
         num_scheduled_ctx_formula = num_scheduled_tokens - num_scheduled_gen_tokens

         chunk_ctx_sum = 0
-        ctx_summaries: List[str] = []
+        ctx_summaries: list[str] = []
         max_detail = 4
         for i, ctx_req in enumerate(context_requests):
             full_len = len(ctx_req.get_tokens(0))
             begin = ctx_req.context_current_position
             chunk_sz = ctx_req.context_chunk_size
             this_chunk = min(chunk_sz, max(0, full_len - begin))
             chunk_ctx_sum += this_chunk
             reusable = (ctx_req.estimated_reusable_tokens
                         if ctx_req.is_first_context_chunk else 0)
             reusable_in_chunk = max(0, reusable - begin)
             remaining = ctx_req.context_remaining_length
             if (reusable_in_chunk > 0
                     and reusable_in_chunk + chunk_sz < remaining):
                 formula_contrib = chunk_sz
             else:
                 formula_contrib = max(1, chunk_sz - reusable_in_chunk)
             if i < max_detail:
                 ctx_summaries.append(
                     f"rid={ctx_req.py_request_id} full={full_len} pos={begin} "
                     f"chunk_sz={chunk_sz} this_chunk={this_chunk} "
                     f"reusable={reusable} formula_contrib={formula_contrib}")
         n_ctx = len(context_requests)
         if n_ctx > max_detail:
             ctx_summaries.append(f"... +{n_ctx - max_detail} more ctx req(s)")

         logger.info(
-            "batch_wait: formula_total=",
-            num_scheduled_tokens,
-            " formula_ctx=",
-            num_scheduled_ctx_formula,
-            " formula_gen=",
-            num_scheduled_gen_tokens,
-            " chunk_ctx_sum=",
-            chunk_ctx_sum,
-            " threshold=",
-            wait_threshold,
-            " wait_iter=",
-            self.batch_wait_iters_count,
-            "/",
-            self.batch_wait_timeout_iters,
-            " should_defer_ctx=",
-            should_waiting,
-            " num_gen=",
-            len(generation_requests),
-            " ctx_detail=[",
-            "; ".join(ctx_summaries),
-            "]",
+            "batch_wait: formula_total=%s formula_ctx=%s formula_gen=%s "
+            "chunk_ctx_sum=%s threshold=%s wait_iter=%s/%s "
+            "should_defer_ctx=%s num_gen=%s ctx_detail=[%s]",
+            num_scheduled_tokens,
+            num_scheduled_ctx_formula,
+            num_scheduled_gen_tokens,
+            chunk_ctx_sum,
+            wait_threshold,
+            self.batch_wait_iters_count,
+            self.batch_wait_timeout_iters,
+            should_waiting,
+            len(generation_requests),
+            "; ".join(ctx_summaries),
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3052 - 3143, The
logging call in _maybe_log_batch_wait_decision currently passes many positional
args to logger.info causing a formatting TypeError and it never obeys the
documented TLLM_LOG_BATCH_WAIT gate; update the function to first check the env
var (e.g. os.environ.get("TLLM_LOG_BATCH_WAIT") == "1") in addition to
self.dist.rank, and replace the multi-argument logger.info(...) call with a
single formatted message (built with an f-string or using .format) that embeds
num_scheduled_tokens, num_scheduled_ctx_formula, num_scheduled_gen_tokens,
chunk_ctx_sum, wait_threshold, self.batch_wait_iters_count,
self.batch_wait_timeout_iters, should_waiting, len(generation_requests), and the
ctx_detail string so the logger receives one string and no extra arguments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7d54a4b6-7256-4ae0-84fa-db5cd6aa844b

📥 Commits

Reviewing files that changed from the base of the PR and between 0c86dda and c39c1c0.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py

tensorrt-cicd (Collaborator) commented

PR_Github #46083 [ run ] triggered by Bot. Commit: c39c1c0 Link to invocation

…t overflow

Cherry-pick of Python portions from feat/bench_y PRs NVIDIA#12682 and NVIDIA#12806
that were not included in NVIDIA#12976 (which ported only the C++ fix to main).

Adds a remaining_budget re-validation guard in KVCacheManager.prepare_resources()
that re-probes the radix tree for actual reusable blocks after KV cache allocation
and skips requests whose forward cost exceeds the remaining budget. This catches
the estimation-vs-reality gap when cache eviction between scheduling and
prepare_resources() reduces actual reuse below the scheduler's estimate.

Original authors: Liao Lanyu (@lancelly), Jin Li (@liji-nv)

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
nv-yna (Collaborator, Author) commented Apr 29, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #46175 [ run ] triggered by Bot. Commit: f0a16e8 Link to invocation

Adds three pure-Python test files (no GPU, no model weights) covering
the budget guard introduced in the previous commit:

- test_post_reuse_compute.py (Tier 1):
  Branch-coverage table for KVCacheManager._estimate_post_reuse_compute()
  plus parametrized "production-relevant divergence" cases proving the
  helper undercharges when chunk_size is not block-aligned (legal per
  test_kv_cache_v2_scheduler.py:243-247). Includes a precise property
  test asserting helper <= actual in the reuse branch (0 < P < prompt_len)
  and documents the short-circuit overcharge case.

- test_resource_manager_v1_budget.py (Tier 2 + Tier 3):
  Mock-based integration tests driving the real prepare_resources() loop
  via KVCacheManager.__new__ to skip C++ init. Covers skip path, no-skip
  path, non-first-chunk pre-subtraction, gen+draft tokens consuming
  budget, no-op when reuse off / is_draft, reset_context_requests with
  is_last_context_chunk mutation, kv_connector elif branch, connector
  callbacks (update_state_after_alloc, build_scheduler_output),
  count_reusable_blocks call shape, and non-first-chunk fallthrough.

  Tier 3 contains the deterministic DYN-2868 regression pair:
  - test_m: post-prepare total <= max_num_tokens with the guard
  - test_n: negative control patches the helper to 0 and verifies
    overshoot via an INDEPENDENT helper (model_engine_total_tokens)
    that mirrors model_engine.py:2293-2297 + 2615-2616 token counting.

- test_batch_wait_log.py (Tier 4):
  Smoke test for PyExecutor._maybe_log_batch_wait_decision rank-0 gate.

Local verification:
- Tier 1 arithmetic verified standalone (no TRT-LLM deps).
- Tier 3 logic simulated independently (Test M = 512, Test N = 1024).
- All three files pass py_compile.
- Full pytest run requires tensorrt_llm.bindings (C++ build) and
  was not run locally; intended for CI.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
tensorrt-cicd (Collaborator) commented

PR_Github #46175 [ run ] completed with state SUCCESS. Commit: f0a16e8
/LLM/main/L0_MergeRequest_PR pipeline #36294 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yifjiang (Contributor) commented

This PR doesn't apply cleanly to current main because of an API rename introduced by #13029 (which merged 16 minutes after #12976 on 2026-04-18 and consolidated findNewContextBlock() + countReusableBlocks(true) + countReusableBlocks(false) into a single analyze_prefix_reuse() returning PrefixReuseSummary).

tensorrt_llm/_torch/pyexecutor/resource_manager.py line 718 in this PR calls:

reusable_blocks = self.impl.count_reusable_blocks(unique_tokens, req, False)

The False (= onlyAllocated=False) returns all matched reusable blocks per the pre-#13029 implementation in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (the conditional was if (!onlyAllocated || matchingBlock->hasRefs())). On main after #13029, count_reusable_blocks no longer exists; the equivalent is:

reusable_blocks = self.impl.analyze_prefix_reuse(unique_tokens, req).reusable_blocks_all

Without this fix, the worker dies on the first first-chunk admission with AttributeError: 'KVCacheManager' object has no attribute 'count_reusable_blocks'.

(Side note for the maintainers: py_executor.py and model_engine.py in this PR also have downstream-API dependencies — strip_mm_data_for_generation, ModelConfig.get_num_attention_layers(is_disagg=...) — but those are observability-only changes, so dropping them on a strict-main base is harmless if you want a minimal-surface fix.)

I tested this empirically with the issue #13318 reproducer (Qwen3-4B / 1×GB200 / 1 shared system prompt / max_num_tokens=4096 / fgmf=0.80 / qps=18 trigger) on three images:

| Image | Has #12976 | Has #13029 | Has this PR | Result |
|---|---|---|---|---|
| Built from ea117de (= #12976 merge, parent of #13029) | ✅ | ❌ | ❌ | crashes: total_num_tokens (6946) > max=4096 |
| same + this PR's gate | ✅ | ❌ | ✅ | still crashes: total_num_tokens (4242) > max=4096, 0 "Reuse budget: skip req" events |
| rc13 (= ea117de + #13029) | ✅ | ✅ | ❌ | clean |

It looks like #13029's two-phase batch claim closes the within-batch eviction race that produces the over-admission, while this PR's gate doesn't catch that race: the gate's reuse re-probe runs before add_sequence, but the race fires inside add_sequence for sequences later in the batch, so the gate never sees it. I haven't been able to construct a workload where the gate's "Reuse budget: skip req" message actually emits on a main+#13029 build. It would be useful to know what workload class the "heavy KV-cache eviction" scenario in this PR's description is intended to catch on main, since I can't trigger it on #13029-equipped main with multi-prefix, tight-cache, high-qps reproducers either.
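
A toy model of that ordering (illustrative Python, treating the guard as a whole-batch preflight per the walkthrough's description; not the actual call graph):

```python
# Toy model: every request is probed before any allocation, so a later
# allocation can evict blocks the preflight already credited.
reusable = {"req_a": 2, "req_b": 2}   # blocks currently matched in the radix tree

# Preflight pass: the gate re-probes and budgets with these counts.
credited = dict(reusable)

# Allocation pass: req_a's add_sequence evicts under memory pressure,
# invalidating the credit the preflight gave req_b.
reusable["req_b"] = 0

# req_b now recomputes more tokens than budgeted, and the gate never saw it.
assert credited["req_b"] > reusable["req_b"]
```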

nv-yna (Collaborator, Author) commented Apr 30, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #46431 [ run ] triggered by Bot. Commit: e06999e Link to invocation

tensorrt-cicd (Collaborator) commented

PR_Github #46431 [ run ] completed with state SUCCESS. Commit: e06999e
/LLM/main/L0_MergeRequest_PR pipeline #36502 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

nv-yna and others added 2 commits May 4, 2026 10:53
Replace the count_reusable_blocks() call in the prepare_resources budget
guard with analyze_prefix_reuse(...).reusable_blocks_all, matching the
public KVCacheManager binding API. Update tests to mock the new return
shape via a small Mock(reusable_blocks_all=N) helper.

Tighten comments and docstrings to describe current behavior without
referencing PRs, issues, or specific line numbers; reformat with
ruff format to satisfy the pre-commit hook.

Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
nv-yna (Collaborator, Author) commented May 4, 2026

/bot run --disable-fail-fast

tensorrt-cicd (Collaborator) commented

PR_Github #46674 [ run ] triggered by Bot. Commit: 8322e01 Link to invocation

nv-yna closed this May 4, 2026
tensorrt-cicd (Collaborator) commented

PR_Github #46674 [ run ] completed with state SUCCESS. Commit: 8322e01
/LLM/main/L0_MergeRequest_PR pipeline #36715 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation
