[None][fix] Fix errors in KV cache manager V2 and scheduler V2 #13104
jiaganc merged 8 commits into NVIDIA/TensorRT-LLM:main
Conversation
Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
When attention DP causes can_queue=False after scheduling, the forward pass is skipped but the V2 scheduler already grew each generation request's KV cache capacity. Add revert_allocate_generation() to undo that spurious growth so it does not accumulate across skipped iterations and overflow the host page-index buffer. Partial cherry-pick of d8ff758 from tekit (revert_allocate_generation parts only). Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
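A minimal sketch of the revert idea, not the actual TensorRT-LLM implementation: the manager remembers the capacity a request had before the generation-phase growth so a skipped forward pass can restore it. The class, attributes, and method internals below are hypothetical stand-ins.

```python
# Sketch only: KvCacheManagerV2Sketch and its attributes are illustrative,
# not TensorRT-LLM's real KV cache manager.
class KvCacheManagerV2Sketch:
    def __init__(self):
        self._capacity = {}             # current KV capacity (tokens) per request id
        self._pre_growth_capacity = {}  # capacity recorded before the last growth

    def allocate_generation(self, req_id: int, growth: int) -> None:
        # Grow capacity for the next decode step, remembering the old value.
        before = self._capacity.get(req_id, 0)
        self._pre_growth_capacity[req_id] = before
        self._capacity[req_id] = before + growth

    def revert_allocate_generation(self, req_id: int) -> None:
        # Undo the last growth when the forward pass is skipped, so repeated
        # skips cannot accumulate capacity and overflow downstream buffers.
        if req_id in self._pre_growth_capacity:
            self._capacity[req_id] = self._pre_growth_capacity.pop(req_id)


mgr = KvCacheManagerV2Sketch()
mgr.allocate_generation(req_id=7, growth=32)   # scheduler grows capacity
mgr.revert_allocate_generation(req_id=7)       # step skipped: roll it back
assert mgr._capacity[7] == 0
```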
📝 Walkthrough
Updated the KV cache management system to revert generation-phase allocations when certain rank scheduling conditions occur. Modified the executor loop to detect when only some ranks can queue, and added a new method to reverse KV cache capacity growth allocated during generation attempts.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 2064-2069: The PP executor loop _executor_loop_pp() currently
drops scheduled batches when ADP sets can_queue=False without undoing the
temporary KV-cache capacity growth; replicate the same revert logic used in the
main executor loop: when you detect the skip condition (can_queue is False and
can_queue_this_rank is True and self._scheduler_manages_kv_suspend) iterate
scheduled_batch.generation_requests and call
self.kv_cache_manager.revert_allocate_generation(req) for each request (same as
the block around scheduled_batch.generation_requests in the non-PP executor),
ensuring skipped iterations do not leave allocated KV capacity; apply the same
change in the other similar block referenced (around the other occurrence).
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 1741-1743: The current allreduce on `quota` (using
`Distributed.get(mapping).allreduce(quota, op=ReduceOp.MIN)`) is operating over
world ranks and incorrectly syncs a PP-rank-dependent byte quota; instead either
perform the reduction only within the TP group or convert to a TP-consistent
metric before reducing: locate the reduction near
`mapping`/`Distributed.get(mapping)` and change it to use the appropriate TP
group (pass the group parameter to `allreduce`) or compute a token-capacity
value by dividing `quota` by the local `bytes_per_token` (from
`get_cache_bytes_per_token()` / `mapping.pp_layers()`), allreduce that
TP-consistent token count with `ReduceOp.MIN`, then convert back to per-rank
bytes using each rank’s `bytes_per_token`; ensure `ReduceOp.MIN` remains correct
for the chosen normalization.
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/_torch/pyexecutor/resource_manager.py
Check the return value of kv_cache.resize and raise RuntimeError with request ID and capacity details if it fails. Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
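As a rough illustration of what that check could look like; `kv_cache` and its boolean-returning `resize()` are assumptions here, not the real TensorRT-LLM API, and the real call site and message differ.

```python
# Illustrative only: fail loudly if the resize cannot be honored instead of
# continuing with a silently under-sized cache.
def grow_request_capacity(kv_cache, request_id: int, new_capacity: int) -> None:
    if not kv_cache.resize(new_capacity):
        raise RuntimeError(
            f"KV cache resize failed for request {request_id}: "
            f"requested capacity of {new_capacity} tokens could not be allocated")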
Add the same KV cache capacity revert logic to _executor_loop_pp() so that PP+ADP scenarios also undo spurious growth when can_queue is False but this rank had a non-empty batch. Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
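A hedged sketch of that guard shape, with the executor state passed in explicitly; the real _executor_loop_pp() carries far more state and the parameter names are simplified stand-ins.

```python
# Sketch only: scheduled_batch and kv_cache_manager stand in for the executor's
# real objects.
def revert_if_step_skipped(scheduled_batch, kv_cache_manager,
                           can_queue: bool, can_queue_this_rank: bool) -> None:
    # ADP decided the step is skipped globally, but this rank already grew KV
    # capacity for its generation requests when the batch was scheduled.
    if not can_queue and can_queue_this_rank:
        for req in scheduled_batch.generation_requests:
            kv_cache_manager.revert_allocate_generation(req)
```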
bytes_per_token varies across PP ranks (different local layers), so allreducing raw byte quotas with MIN would under-size stages with fewer layers. Convert to token capacity first (matching V1 behavior), take MIN, then convert back to bytes per rank. Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
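A sketch of that normalization under stated assumptions: allreduce_min is a placeholder for the real MIN-allreduce collective (not TensorRT-LLM's Distributed API), and bytes_per_token is this rank's local value, which differs across PP stages.

```python
# Sketch of the quota sync described above.
def sync_kv_quota_bytes(local_quota_bytes: int, bytes_per_token: int,
                        allreduce_min) -> int:
    # Token capacity is comparable across PP ranks even though each stage holds
    # a different number of layers (and thus a different bytes_per_token).
    local_tokens = local_quota_bytes // bytes_per_token
    # MIN over the group so every rank schedules the same batch.
    global_tokens = allreduce_min(local_tokens)
    # Convert back to this rank's own byte budget.
    return global_tokens * bytes_per_token


# Single-"rank" example: the collective degenerates to the identity.
print(sync_kv_quota_bytes(1_000_000, bytes_per_token=96, allreduce_min=lambda x: x))
```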
/bot run
PR_Github #43698 [ run ] triggered by Bot. Commit:
PR_Github #43698 [ run ] completed with state
lancelly left a comment
Under the scheduler architecture right now, we can work around it like this. The fix looks good to me, thanks~
The revert_allocate_generation guard was duplicated across three executor loops (PP, non-overlap, overlap). Extract into a single helper method _maybe_revert_kv_growth to reduce repetition. Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Extract the repeated revert_allocate_generation guard into a single helper method. Drop the redundant can_queue_this_rank check (when it is False the batch has no generation requests so the loop is a no-op). Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
/bot run
PR_Github #43988 [ run ] triggered by Bot. Commit:
Move the 'not can_queue' guard out of the helper into the call sites so the helper is unconditional; the V1/V2 check remains inside. Rename from _maybe_revert_gen_alloc to _revert_gen_alloc to reflect the new signature. Drop can_queue_this_rank captures where it's no longer used. Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
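The resulting arrangement might look roughly like the sketch below; the class, the uses_v2_kv_manager flag, and the method bodies are simplified stand-ins for the real executor, not its actual code.

```python
# Sketch of the final shape: the helper is unconditional, and the `not can_queue`
# guard lives at each call site.
class ExecutorSketch:
    def __init__(self, kv_cache_manager, uses_v2_kv_manager: bool):
        self.kv_cache_manager = kv_cache_manager
        self.uses_v2_kv_manager = uses_v2_kv_manager

    def _revert_gen_alloc(self, scheduled_batch) -> None:
        # Only the V2 manager grows capacity at schedule time, so V1 is a no-op.
        if not self.uses_v2_kv_manager:
            return
        for req in scheduled_batch.generation_requests:
            self.kv_cache_manager.revert_allocate_generation(req)

    def _run_step(self, scheduled_batch, can_queue: bool) -> None:
        if not can_queue:
            # Call-site guard: undo schedule-time growth before skipping the step.
            self._revert_gen_alloc(scheduled_batch)
            return
        # ... forward pass would run here ...
```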
PR_Github #43988 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #44281 [ run ] triggered by Bot. Commit:
PR_Github #44281 [ run ] completed with state
@lowsfer could you review this? thanks
/bot run --disable-fail-fast
PR_Github #44404 [ run ] triggered by Bot. Commit:
PR_Github #44404 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #44593 [ run ] triggered by Bot. Commit:
PR_Github #44593 [ run ] completed with state
/bot run
PR_Github #44669 [ run ] triggered by Bot. Commit:
PR_Github #44669 [ run ] completed with state
…A#13104) Signed-off-by: Jiagan Cheng <jiaganc@nvidia.com>
Description
Two fixes for KV cache manager V2 in multi-rank scenarios:
1. Sync KV cache quota across TP ranks: When mapping.world_size > 1, use allreduce(MIN) to ensure all TP ranks allocate the same KV cache quota. Without this, ranks with different available memory would get different quotas, causing the scheduler to produce different batches across ranks.
2. Revert spurious KV cache capacity growth: When attention DP causes can_queue=False after scheduling (another rank has an empty batch), the forward pass is skipped but the V2 scheduler already grew each generation request's KV cache capacity. Add revert_allocate_generation() to undo that growth so it does not accumulate across skipped iterations and overflow the host page-index buffer. Applied in both the non-overlap and overlap executor loops.

Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.