[None][fix] Fix compute token accounting for KV cache reuse with context chunking #12976
longlee0622 merged 10 commits into NVIDIA/TensorRT-LLM:main
Conversation
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 2
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1737-1737: Consider whether a 1-second sleep is appropriate for the benchmark fill loop. This sleep is in a tight loop that fills benchmark requests. While 1 second is more reasonable than some other sleep changes in this PR, it could still slow down benchmark initialization if many iterations are needed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` at line 1737, The tight benchmark fill loop currently uses time.sleep(1), which can unnecessarily slow benchmark initialization; change this to a small, configurable sleep and/or remove the long fixed delay: replace the literal time.sleep(1) in the benchmark fill loop with a configurable constant (e.g., BENCH_FILL_SLEEP_SEC) set to a small value like 0.01 by default, or use time.sleep(0) to only yield the thread, and expose that constant as a parameter or config option in the surrounding function/class so callers can tune it for their workload (locate the literal time.sleep(1) in the benchmark fill loop and update it and the surrounding signature/config accordingly).
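The suggestion above can be sketched as follows. This is illustrative only: the constant name `BENCH_FILL_SLEEP_SEC` comes from the review suggestion, and `fill_benchmark_requests` is a stand-in for the real loop in py_executor.py, not the actual code.

```python
import time

# Illustrative only: the constant name is taken from the review suggestion;
# this function stands in for the real benchmark fill loop in py_executor.py.
BENCH_FILL_SLEEP_SEC = 0.01  # small default; time.sleep(0) would merely yield the thread


def fill_benchmark_requests(requests, submit, sleep_sec=BENCH_FILL_SLEEP_SEC):
    """Submit benchmark requests, pausing briefly between iterations.

    Exposing sleep_sec lets callers tune (or zero out) the delay instead of
    paying a fixed 1-second cost on every iteration.
    """
    for req in requests:
        submit(req)
        time.sleep(sleep_sec)
```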
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp`:
- Around line 27-44: reuse_adjusted_compute currently returns 0 when a reusable
prefix fully covers the last chunk, letting scheduled (non-zero) chunks become
free; update the logic so any non-zero scheduled chunk is floored to at least 1
token like the no-chunk path does. Modify reuse_adjusted_compute (and the
chunked call sites that consume its result around lines referenced) to: compute
the adjusted cost as now but then if the original chunkSize was > 0 and the
computed value is 0, return 1 instead of 0; ensure all callers that use
reuse_adjusted_compute (including the other chunked sections noted) rely on that
floored value so batchNumTokens stays consistent with the no-chunking path.
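A minimal Python sketch of the floored helper this comment asks for. The real helper lives in C++ in microBatchScheduler.cpp; the signature and parameter names here are illustrative assumptions, not the actual code.

```python
def reuse_adjusted_compute(chunk_size: int, reusable_in_chunk: int) -> int:
    """Forward-pass token cost of a chunk whose prefix may be reused.

    Any non-zero scheduled chunk is floored to at least 1 token, matching the
    no-chunking path, so a fully reused chunk is never scheduled for free.
    """
    cost = max(0, chunk_size - reusable_in_chunk)
    if chunk_size > 0 and cost == 0:
        cost = 1  # floor: keeps batchNumTokens consistent with the no-chunk path
    return cost
```

With the floor in place, a last chunk that is fully covered by a reusable prefix still contributes one token to the batch budget instead of zero.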
In `@tests/unittest/_torch/executor/test_py_executor.py`:
- Around line 251-370: Run the project formatter (ruff format) on the added test
block containing the helpers _make_ctx_request and _make_gen_request and the
TestComputeScheduledTokens class so the new tests (methods like test_no_reuse,
test_last_chunk_with_reuse, etc.) conform to the repo's ruff formatting rules;
re-stage the file after formatting and push the changes to fix the failing
ruff-format CI hook.
ℹ️ Review info: configuration path .coderabbit.yaml, review profile CHILL, plan Pro, run ID 84f31d79-e8ec-4ea6-b627-3e2929ba4f5c
📒 Files selected for processing (4)
- cpp/tensorrt_llm/batch_manager/microBatchScheduler.cpp
- cpp/tests/unit_tests/batch_manager/microBatchSchedulerTest.cpp
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/executor/test_py_executor.py
/bot run --disable-fail-fast

@liji-nv @SimengLiu-nv Hi, could you please help review this PR? Combined with #12878, it should be able to fix the MNT errors we encountered.

PR_Github #42946 [ run ] triggered by Bot. Commit:
…ext chunking

When KV cache reuse is combined with context chunking, setPrepopulatedPromptLen shifts the chunk window right by the reused amount rather than shrinking it. Non-last chunks still process ~chunkSize tokens in the forward pass. The old formula (chunkSize - reusable) underestimated compute cost for non-last chunks, causing the scheduler to over-pack batches beyond the token budget. This adds reuse_adjusted_compute() to correctly account for chunk-shift behavior, and updates _waiting_requests in PyExecutor to subtract reusable tokens from scheduled token counts.

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
07e89ae to 4655598
/bot run --disable-fail-fast

PR_Github #43063 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #43140 [ run ] triggered by Bot. Commit:

PR_Github #43140 [ run ] completed with state
…ext chunking

Introduce _reuse_adjusted_compute() to correctly calculate forward-pass cost when setPrepopulatedPromptLen shifts the chunk window. Non-last chunks cost chunkSize; last chunks cost contextRemaining - reusable. Update C++ and Python tests to match the corrected behavior.

Signed-off-by: Lanyu Liao <lancelly@users.noreply.github.com>
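The corrected accounting can be illustrated with a short sketch (based on the commit description, not the actual scheduler code): because the reused prefix shifts the starting position of the first chunk, every non-last chunk still computes a full chunk_size, and only the last chunk shrinks.

```python
def per_chunk_costs(prompt_len: int, chunk_size: int, reusable: int) -> list[int]:
    """Forward-pass cost of each context chunk under the chunk-shift behavior."""
    costs = []
    pos = reusable  # the window starts after the reused prefix, shifted right
    while pos < prompt_len:
        step = min(chunk_size, prompt_len - pos)
        costs.append(step)  # non-last chunks are full; only the final chunk is short
        pos += step
    return costs
```

For example, a 100-token prompt with chunk_size 40 and 30 reusable tokens costs [40, 30] (70 tokens total), whereas the old per-chunk formula max(0, chunk_size - reusable) would have charged only 10 for each full chunk, under-counting the work and letting the scheduler over-pack the batch.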
…into fix_resue_compute
/bot run --disable-fail-fast

PR_Github #43249 [ run ] triggered by Bot. Commit:

PR_Github #43249 [ run ] completed with state

/bot run --disable-fail-fast

1 similar comment

/bot run --disable-fail-fast

PR_Github #43322 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #43482 [ run ] triggered by Bot. Commit:

PR_Github #43482 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #43599 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #43740 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #43760 [ run ] triggered by Bot. Commit:

PR_Github #43760 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #43888 [ run ] triggered by Bot. Commit:

PR_Github #43888 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #43918 [ run ] triggered by Bot. Commit:

PR_Github #43918 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #44095 [ run ] triggered by Bot. Commit:
…ext chunking

Squash-merge of upstream PR NVIDIA#12976 (author: @lancelly). setPrepopulatedPromptLen shifts the chunk window right by the reused amount rather than shrinking it, so non-last chunks still process ~chunk_size tokens; both microBatchScheduler.cpp and the Python PyMicroBatchScheduler are updated to go through a single reuse_adjusted_compute helper, and py_executor.py gets a new _compute_scheduled_tokens used by the batch-waiting heuristic.

Fixes the 'total_num_tokens should be less than or equal to max_num_tokens' AssertionError at model_engine.py:2678 that appeared sporadically (2-7 h to reproduce at production QPS) on native-offload configs with enable_chunked_prefill + enable_block_reuse + host_cache_size.

Upstream PR: NVIDIA#12976
Signed-off-by: Saddss <2872669061@qq.com>
Made-with: Cursor
PR_Github #44095 [ run ] completed with state
…t overflow

Cherry-pick of Python portions from feat/bench_y PRs NVIDIA#12682 and NVIDIA#12806 that were not included in NVIDIA#12976 (which ported only the C++ fix to main). Adds a remaining_budget re-validation guard in KVCacheManager.prepare_resources() that re-probes the radix tree for actual reusable blocks after KV cache allocation and skips requests whose forward cost exceeds the remaining budget. This catches the estimation-vs-reality gap when cache eviction between scheduling and prepare_resources() reduces actual reuse below the scheduler's estimate.

Original authors: Liao Lanyu (@lancelly), Jin Li (@liji-nv)
Signed-off-by: Yuewei Na <nv-yna@users.noreply.github.com>
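The re-validation guard described in this commit can be sketched as follows. All names here (Request, revalidate_budget, probe_actual_reuse) are hypothetical stand-ins, not the actual KVCacheManager API; the point is the shape of the check, which re-probes reuse after allocation because eviction may have shrunk the reusable prefix since scheduling.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int
    estimated_reuse: int  # what the scheduler assumed at schedule time


def revalidate_budget(requests, probe_actual_reuse, remaining_budget):
    """Admit only requests whose *actual* forward cost fits the token budget.

    Re-probing after KV cache allocation catches the case where eviction
    between scheduling and resource preparation reduced the reusable prefix
    below the scheduler's estimate.
    """
    admitted = []
    for req in requests:
        actual_reuse = probe_actual_reuse(req)  # may be < req.estimated_reuse
        cost = max(1, req.prompt_len - actual_reuse)
        if cost <= remaining_budget:
            admitted.append(req)
            remaining_budget -= cost
    return admitted
```

In the failure mode the commit targets, the scheduler admitted a request assuming heavy reuse; by preparation time the reuse was gone, so the real cost blew past the token budget. The guard skips such requests instead of overflowing.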
Summary
- setPrepopulatedPromptLen shifts the chunk window right by the reused amount rather than shrinking it, so non-last chunks still process ~chunkSize tokens
- Add a reuse_adjusted_compute() helper and replace all 6 manual max(0, chunkSize - reusable) formulas in microBatchScheduler.cpp
- Add _compute_scheduled_tokens with V1/V2 scheduler awareness and correct non-last-chunk handling in py_executor.py
- Add C++ (ReusableTokensChunkShiftNonLastChunk) and Python (TestComputeScheduledTokens) tests covering the chunk-shift scenario