[fix] Cherry-pick Python-side remaining_budget guard for KV reuse budget overflow #13601
nv-yna wants to merge 4 commits into NVIDIA/TensorRT-LLM:main
Conversation
/bot run --disable-fail-fast
📝 Walkthrough

Adds diagnostic logging enhancements and optimization logic across the executor pipeline. Token overflow detection now logs per-request breakdowns.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. Pre-merge checks: ✅ 5 passed.
Caution: some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Lines 3052-3143: ⚠️ Potential issue | 🔴 Critical. Fix the broken logging call and add the missing environment variable gate.
The `logger.info(...)` call uses multiple positional arguments without `%s` format placeholders. This causes a `TypeError: not all arguments converted during string formatting` when the log record is emitted on rank 0. Additionally, the docstring documents that diagnostics require `TLLM_LOG_BATCH_WAIT=1`, but the code never checks this environment variable, so the function will log unconditionally on rank 0.

Suggested fix:
```diff
 def _maybe_log_batch_wait_decision(
         self,
         context_requests: list[LlmRequest],
         generation_requests: list[LlmRequest],
         num_scheduled_tokens: int,
         wait_threshold: float,
         should_waiting: bool,
 ) -> None:
     """Diagnostics for batch_wait: set TLLM_LOG_BATCH_WAIT=1 (rank 0 only)."""
     if self.dist.rank != 0:
         return
+    if os.getenv("TLLM_LOG_BATCH_WAIT") != "1":
+        return
     num_scheduled_gen_tokens = sum(1 + gen_req.num_draft_tokens
                                    for gen_req in generation_requests)
     num_scheduled_ctx_formula = num_scheduled_tokens - num_scheduled_gen_tokens
     chunk_ctx_sum = 0
-    ctx_summaries: List[str] = []
+    ctx_summaries: list[str] = []
     max_detail = 4
     for i, ctx_req in enumerate(context_requests):
         full_len = len(ctx_req.get_tokens(0))
         begin = ctx_req.context_current_position
         chunk_sz = ctx_req.context_chunk_size
         this_chunk = min(chunk_sz, max(0, full_len - begin))
         chunk_ctx_sum += this_chunk
         reusable = (ctx_req.estimated_reusable_tokens
                     if ctx_req.is_first_context_chunk else 0)
         reusable_in_chunk = max(0, reusable - begin)
         remaining = ctx_req.context_remaining_length
         if (reusable_in_chunk > 0
                 and reusable_in_chunk + chunk_sz < remaining):
             formula_contrib = chunk_sz
         else:
             formula_contrib = max(1, chunk_sz - reusable_in_chunk)
         if i < max_detail:
             ctx_summaries.append(
                 f"rid={ctx_req.py_request_id} full={full_len} pos={begin} "
                 f"chunk_sz={chunk_sz} this_chunk={this_chunk} "
                 f"reusable={reusable} formula_contrib={formula_contrib}")
     n_ctx = len(context_requests)
     if n_ctx > max_detail:
         ctx_summaries.append(f"... +{n_ctx - max_detail} more ctx req(s)")
     logger.info(
-        "batch_wait: formula_total=",
-        num_scheduled_tokens,
-        " formula_ctx=",
-        num_scheduled_ctx_formula,
-        " formula_gen=",
-        num_scheduled_gen_tokens,
-        " chunk_ctx_sum=",
-        chunk_ctx_sum,
-        " threshold=",
-        wait_threshold,
-        " wait_iter=",
-        self.batch_wait_iters_count,
-        "/",
-        self.batch_wait_timeout_iters,
-        " should_defer_ctx=",
-        should_waiting,
-        " num_gen=",
-        len(generation_requests),
-        " ctx_detail=[",
-        "; ".join(ctx_summaries),
-        "]",
+        "batch_wait: formula_total=%s formula_ctx=%s formula_gen=%s "
+        "chunk_ctx_sum=%s threshold=%s wait_iter=%s/%s "
+        "should_defer_ctx=%s num_gen=%s ctx_detail=[%s]",
+        num_scheduled_tokens,
+        num_scheduled_ctx_formula,
+        num_scheduled_gen_tokens,
+        chunk_ctx_sum,
+        wait_threshold,
+        self.batch_wait_iters_count,
+        self.batch_wait_timeout_iters,
+        should_waiting,
+        len(generation_requests),
+        "; ".join(ctx_summaries),
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 3052 - 3143, The logging call in _maybe_log_batch_wait_decision currently passes many positional args to logger.info causing a formatting TypeError and it never obeys the documented TLLM_LOG_BATCH_WAIT gate; update the function to first check the env var (e.g. os.environ.get("TLLM_LOG_BATCH_WAIT") == "1") in addition to self.dist.rank, and replace the multi-argument logger.info(...) call with a single formatted message (built with an f-string or using .format) that embeds num_scheduled_tokens, num_scheduled_ctx_formula, num_scheduled_gen_tokens, chunk_ctx_sum, wait_threshold, self.batch_wait_iters_count, self.batch_wait_timeout_iters, should_waiting, len(generation_requests), and the ctx_detail string so the logger receives one string and no extra arguments.
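The `TypeError` the reviewer describes is easy to reproduce with the standard library alone: `logging` defers `%`-style formatting until the record is rendered, so extra positional arguments without matching placeholders fail only at emit time. A minimal sketch (plain stdlib, not TensorRT-LLM code):

```python
import logging

def render(msg, *args):
    """Format a log call the way logging.Logger does when the record is emitted."""
    record = logging.LogRecord("demo", logging.INFO, __file__, 1, msg, args, None)
    return record.getMessage()

# Broken pattern from the review: positional args but no %s placeholders.
try:
    render("batch_wait: formula_total=", 512, " formula_ctx=", 300)
except TypeError as exc:
    print(exc)  # not all arguments converted during string formatting

# Fixed pattern: one format string with matching %s placeholders.
print(render("batch_wait: formula_total=%s formula_ctx=%s", 512, 300))
```

In production handlers the `TypeError` is swallowed by `Handler.handleError` and printed to stderr rather than raised, which is why the bug can go unnoticed until rank 0 actually emits the record.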
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
PR_Github #46083 [ run ] triggered by Bot. Commit:
…t overflow

Cherry-pick of Python portions from feat/bench_y PRs NVIDIA#12682 and NVIDIA#12806 that were not included in NVIDIA#12976 (which ported only the C++ fix to main). Adds a remaining_budget re-validation guard in KVCacheManager.prepare_resources() that re-probes the radix tree for actual reusable blocks after KV cache allocation and skips requests whose forward cost exceeds the remaining budget. This catches the estimation-vs-reality gap that arises when cache eviction between scheduling and prepare_resources() reduces actual reuse below the scheduler's estimate.

Original authors: Liao Lanyu (@lancelly), Jin Li (@liji-nv)
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
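The estimation-vs-reality gap this commit guards against can be shown with toy numbers. Everything below is made up for illustration except the 514 vs 512 pair, which echoes the assertion reported in #13318:

```python
# Scheduler view: cost is discounted by *estimated* reusable tokens.
max_num_tokens = 512
chunk_size = 514
estimated_reusable = 256            # estimate at schedule time

scheduled_cost = max(1, chunk_size - estimated_reusable)
print(scheduled_cost)               # 258 <= 512: request is admitted

# Reality at prepare_resources(): the reusable blocks were evicted.
actual_reusable = 0
actual_cost = max(1, chunk_size - actual_reusable)
print(actual_cost)                  # 514 > 512: the budget overflows
```

The guard closes this gap by recomputing the cost with freshly probed reuse and skipping the request when the recomputed cost no longer fits.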
/bot run --disable-fail-fast
PR_Github #46175 [ run ] triggered by Bot. Commit:
Adds three pure-Python test files (no GPU, no model weights) covering
the budget guard introduced in the previous commit:
- test_post_reuse_compute.py (Tier 1):
Branch-coverage table for KVCacheManager._estimate_post_reuse_compute()
plus parametrized "production-relevant divergence" cases proving the
helper undercharges when chunk_size is not block-aligned (legal per
test_kv_cache_v2_scheduler.py:243-247). Includes a precise property
test asserting helper <= actual in the reuse branch (0 < P < prompt_len)
and documents the short-circuit overcharge case.
- test_resource_manager_v1_budget.py (Tier 2 + Tier 3):
Mock-based integration tests driving the real prepare_resources() loop
via KVCacheManager.__new__ to skip C++ init. Covers skip path, no-skip
path, non-first-chunk pre-subtraction, gen+draft tokens consuming
budget, no-op when reuse off / is_draft, reset_context_requests with
is_last_context_chunk mutation, kv_connector elif branch, connector
callbacks (update_state_after_alloc, build_scheduler_output),
count_reusable_blocks call shape, and non-first-chunk fallthrough.
Tier 3 contains the deterministic DYN-2868 regression pair:
- test_m: post-prepare total <= max_num_tokens with the guard
- test_n: negative control patches the helper to 0 and verifies
overshoot via an INDEPENDENT helper (model_engine_total_tokens)
that mirrors model_engine.py:2293-2297 + 2615-2616 token counting.
- test_batch_wait_log.py (Tier 4):
Smoke test for PyExecutor._maybe_log_batch_wait_decision rank-0 gate.
Local verification:
- Tier 1 arithmetic verified standalone (no TRT-LLM deps).
- Tier 3 logic simulated independently (Test M = 512, Test N = 1024).
- All three files pass py_compile.
- Full pytest run requires tensorrt_llm.bindings (C++ build) and
was not run locally; intended for CI.
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
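As a rough illustration of the block-alignment effect the Tier 1 tests probe: if reuse is only effective in whole KV blocks, a reusable prefix that is not block-aligned gets partially recomputed. The helper below is a hypothetical mirror for illustration only; the block size and exact formula are assumptions, not the real `_estimate_post_reuse_compute()`:

```python
BLOCK = 32  # assumed KV cache block size, not the real configuration

def post_reuse_compute(prompt_len: int, reusable_tokens: int) -> int:
    """Tokens still to compute after reuse, counting reuse in whole blocks
    only and always charging at least one token (hypothetical mirror)."""
    prepopulated = (reusable_tokens // BLOCK) * BLOCK  # round down to a block
    return max(1, prompt_len - prepopulated)

print(post_reuse_compute(100, 64))  # block-aligned reuse: 100 - 64 = 36
print(post_reuse_compute(100, 65))  # 65 rounds down to 64: still 36
print(post_reuse_compute(32, 32))   # full reuse floors at 1 token
```

This is the shape of divergence the parametrized cases target: when `chunk_size` (or the reusable prefix) is not block-aligned, a naive subtraction and the block-aligned reality disagree, so the tests assert `helper <= actual` rather than equality.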
PR_Github #46175 [ run ] completed with state
This PR doesn't apply cleanly to current `main`:

```python
reusable_blocks = self.impl.count_reusable_blocks(unique_tokens, req, False)
```

The `count_reusable_blocks` binding is no longer available; the equivalent call on current `main` is:

```python
reusable_blocks = self.impl.analyze_prefix_reuse(unique_tokens, req).reusable_blocks_all
```

Without this fix, the worker dies on the first first-chunk admission.

(Side note for the maintainers: I tested this empirically with the issue #13318 reproducer (Qwen3-4B / 1×GB200 / 1 shared system prompt / max_num_tokens=4096 / fgmf=0.80 / qps=18 trigger) on three images.
It looks like #13029's two-phase batch claim closes the within-batch eviction race that produces the over-admission, while this PR's gate doesn't catch that same race, since the gate's reuse re-probe runs before the race window opens.)
/bot run --disable-fail-fast

PR_Github #46431 [ run ] triggered by Bot. Commit:
PR_Github #46431 [ run ] completed with state
Replace the count_reusable_blocks() call in the prepare_resources budget guard with analyze_prefix_reuse(...).reusable_blocks_all, matching the public KVCacheManager binding API. Update tests to mock the new return shape via a small Mock(reusable_blocks_all=N) helper. Tighten comments and docstrings to describe current behavior without referencing PRs, issues, or specific line numbers; reformat with ruff format to satisfy the pre-commit hook. Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
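The new return shape is straightforward to mock with `unittest.mock`. A sketch of the kind of helper the commit describes; the helper name and call shape here are guesses for illustration, not the actual test code:

```python
from unittest.mock import Mock

def make_impl(reusable_blocks_all: int) -> Mock:
    """Build a mock binding whose analyze_prefix_reuse(...) returns an
    object exposing .reusable_blocks_all, matching the updated API shape."""
    impl = Mock()
    impl.analyze_prefix_reuse.return_value = Mock(
        reusable_blocks_all=reusable_blocks_all)
    return impl

impl = make_impl(3)
result = impl.analyze_prefix_reuse(["tok"], object())
print(result.reusable_blocks_all)  # 3
```

Because `Mock(reusable_blocks_all=N)` sets the attribute directly, the tests stay decoupled from the concrete C++ result type while still exercising the attribute access the production code performs.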
/bot run --disable-fail-fast

PR_Github #46674 [ run ] triggered by Bot. Commit:
PR_Github #46674 [ run ] completed with state
Summary
Cherry-picks the Python-side fix from `feat/bench_y` PRs #12682 and #12806 to `main`. The C++ scheduler fix (`reuse_adjusted_compute()` in `microBatchScheduler.cpp`) was already ported to `main` via #12976. What's missing is the Python safety net that catches residual estimation-vs-reality gaps when KV cache eviction occurs between scheduling and `prepare_resources()`.

Root cause: The C++ MicroBatchScheduler uses `estimatedReusableTokens` to discount KV reuse from the token budget. Under heavy cache eviction pressure, actual reuse can be lower than estimated, causing `total_num_tokens` in `_prepare_tp_inputs()` to exceed `max_num_tokens`.

What this PR adds:
- `remaining_budget` re-validation guard in `KVCacheManager.prepare_resources()`: re-probes the radix tree for actual reusable blocks and skips requests that would overflow the budget
- `_estimate_post_reuse_compute()` helper: correctly handles non-last-chunk vs last-chunk asymmetry in `setPrepopulatedPromptLen`
- `_compute_scheduled_tokens()` and `_maybe_log_batch_wait_decision()` for compute-aware batch-wait decisions
- `max_num_tokens` assertion in `model_engine.py`

Files changed:
- `tensorrt_llm/_torch/pyexecutor/resource_manager.py`: budget guard + `_estimate_post_reuse_compute()`
- `tensorrt_llm/_torch/pyexecutor/py_executor.py`: compute-aware scheduling helpers
- `tensorrt_llm/_torch/pyexecutor/model_engine.py`: overflow diagnostic logging

Reproduction: `AssertionError: total_num_tokens (514) > max_num_tokens (512)` on rc11/rc12/main with `enable_chunked_prefill=true`, `enable_block_reuse=true`, `free_gpu_memory_fraction=0.15`, and concurrent multi-prefix multi-turn load (8 prefix variants, 32 sessions).

Verification: with a Python overlay of this fix on a stock rc11 container, the server ran 3500+ iterations and processed 413+ requests with zero assertion errors under the same conditions that reproduce the bug on unpatched rc11.
Fixes #13318.
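In outline, the guard described above behaves like this simplified sketch. It is a plain-Python stand-in: the request shape, probe API, and cost formula are all simplified assumptions, not the actual `KVCacheManager` code:

```python
def apply_budget_guard(context_requests, max_num_tokens, probe_reusable_tokens):
    """Re-validate each context request against the remaining token budget,
    using freshly probed (actual) reuse instead of the scheduler's estimate."""
    remaining_budget = max_num_tokens
    admitted, skipped = [], []
    for req in context_requests:
        actual_reuse = probe_reusable_tokens(req)        # re-probe the radix tree now
        forward_cost = max(1, req["chunk_size"] - actual_reuse)
        if forward_cost > remaining_budget:
            skipped.append(req)                          # would overflow: defer
        else:
            remaining_budget -= forward_cost
            admitted.append(req)
    return admitted, skipped

# With all reuse evicted, only the first request fits a 512-token budget.
reqs = [{"id": 1, "chunk_size": 300}, {"id": 2, "chunk_size": 300}]
admitted, skipped = apply_budget_guard(reqs, 512, lambda r: 0)
print([r["id"] for r in admitted], [r["id"] for r in skipped])  # [1] [2]
```

The key design point is the order of operations: the probe happens after scheduling, so a request whose estimated reuse evaporated between scheduling and resource preparation is skipped rather than allowed to push the batch past `max_num_tokens`.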
Related PRs
- #12682 and #12806 on `feat/bench_y` (C++ + Python)
- #12976 on `main`