[None][fix] Fix OOM issue/dummy request allocation/chunked prefill/pp for KV Cache Manager V2 #11710

yizhang-nv merged 3 commits into NVIDIA:main from yizhang-nv:minor-fix-for-kv-cache-manager-v2
Conversation
/bot run

PR_Github #36767 [ run ] triggered by Bot.
📝 Walkthrough

The changes make the KVCacheStats.allocated_bytes property writable from Python and refactor resource management in KVCacheManager and KVCacheManagerV2: switching to a unified KvCacheStats return type, adding resource cleanup helpers, and adjusting shutdown and resource quota logic.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
2079-2091: ⚠️ Potential issue | 🟠 Major

Resource leak on resize() failure in the is_gen block, inconsistent with other error paths.

When kv_cache.resize(new_capacity) fails (line 2081) or draft_kv_cache.resize(new_capacity) fails (line 2087), a ValueError is raised. Unlike every other failure path in this function, release_resources is not called before raising, leaving the already-allocated kv_cache (and draft_kv_cache, if applicable) in kv_cache_map without being freed.

🐛 Proposed fix
```diff
 if prepare_resource:
     new_capacity = kv_cache.capacity + max_num_draft_tokens + 1
     success = kv_cache.resize(new_capacity)
     if not success:
-        raise ValueError(
-            f"Failed to resize capacity of KV cache for request {req.py_request_id} to {new_capacity} tokens for dummy request"
-        )
+        release_resources(req)
+        return None
     if draft_kv_cache is not None:
         success = draft_kv_cache.resize(new_capacity)
         if not success:
-            raise ValueError(
-                f"Failed to resize capacity of draft KV cache for request {req.py_request_id} to {new_capacity} tokens for dummy request"
-            )
+            release_resources(req, free_draft_resources=True)
+            return None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 2079 - 2091, In the is_gen branch where prepare_resource is true, ensure you free already-allocated resources before raising on resize failure: when kv_cache.resize(new_capacity) or draft_kv_cache.resize(new_capacity) returns False, call release_resources(req.py_request_id) (or the existing release_resources function that cleans up entries in kv_cache_map for this request) to remove and free kv_cache and draft_kv_cache, then raise the ValueError; update the error paths in the block that contains prepare_resource, kv_cache.resize, draft_kv_cache.resize and the existing raise statements so they always call release_resources(req.py_request_id) prior to raising.
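The "free what you already allocated before bailing out" pattern this comment asks for can be sketched in isolation. The KvCache, release_resources, and prepare_dummy_resources names below are illustrative stand-ins, not the actual resource_manager.py code; the point is only that a failed resize removes the request's entry from the cache map instead of leaking it.

```python
# Hedged sketch of the cleanup-on-failure pattern from the review.
# KvCache, release_resources, and prepare_dummy_resources are invented
# stand-ins for the real TensorRT-LLM classes, which are not shown here.

class KvCache:
    def __init__(self, capacity: int, limit: int):
        self.capacity = capacity
        self.limit = limit
        self.freed = False

    def resize(self, new_capacity: int) -> bool:
        # Returns False instead of raising when the quota would be exceeded.
        if new_capacity > self.limit:
            return False
        self.capacity = new_capacity
        return True


def release_resources(kv_cache_map: dict, request_id: int) -> None:
    # Remove and free everything already allocated for this request.
    for cache in kv_cache_map.pop(request_id, []):
        cache.freed = True


def prepare_dummy_resources(kv_cache_map, request_id, extra_tokens, limit):
    kv_cache = KvCache(capacity=4, limit=limit)
    kv_cache_map[request_id] = [kv_cache]
    if not kv_cache.resize(kv_cache.capacity + extra_tokens + 1):
        # On failure, free what was allocated instead of leaking it.
        release_resources(kv_cache_map, request_id)
        return None
    return kv_cache


kv_cache_map = {}
# Resize beyond the limit: the allocation is rolled back, nothing leaks.
assert prepare_dummy_resources(kv_cache_map, 1, extra_tokens=100, limit=8) is None
assert 1 not in kv_cache_map
# Resize within the limit succeeds and the cache stays registered.
cache = prepare_dummy_resources(kv_cache_map, 2, extra_tokens=2, limit=8)
assert cache is not None and cache.capacity == 7
```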
2264-2266: ⚠️ Potential issue | 🟡 Minor

get_max_resource_count() returning 1 is still a placeholder and may cause scheduling issues.

Returning 1 prevents division-by-zero, but any scheduler that uses this value to compute maximum capacity or admission thresholds will severely underestimate the available resources for KVCacheManagerV2, potentially admitting far fewer requests than the hardware supports. The V1 equivalent returns self.impl.max_num_blocks.

Consider wiring through the actual page count, e.g.:
```python
def get_max_resource_count(self) -> int:
    return max(
        self.impl.get_page_index_upper_bound(layer_id, Role.KEY)
        for layer_id in typed_range(LayerId(self.num_local_layers))
    ) // self.kv_factor
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 2264 - 2266, get_max_resource_count currently returns a placeholder 1 which underestimates capacity; replace it to compute the real max resource count by iterating local layers (use typed_range(LayerId(self.num_local_layers))) and calling self.impl.get_page_index_upper_bound(layer_id, Role.KEY) for each, take the maximum of those values and divide by self.kv_factor (use integer division if intended) to mirror the V1 behavior (self.impl.max_num_blocks) so schedulers get an accurate capacity for KVCacheManagerV2.
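A runnable sketch of the suggested computation, using invented StubImpl/StubManager classes in place of the real KVCacheManagerV2 internals (the actual impl, Role, typed_range, and LayerId come from the TensorRT-LLM bindings and are not reproduced here):

```python
# Hedged sketch: derive a real max resource count from per-layer page
# index upper bounds, instead of the placeholder 1. All classes here are
# invented for illustration; only the arithmetic mirrors the suggestion.

class StubImpl:
    def __init__(self, upper_bounds):
        # Per-layer page index upper bounds, indexed by layer id.
        self._upper_bounds = upper_bounds

    def get_page_index_upper_bound(self, layer_id, role):
        return self._upper_bounds[layer_id]


class StubManager:
    def __init__(self, upper_bounds, kv_factor):
        self.impl = StubImpl(upper_bounds)
        self.num_local_layers = len(upper_bounds)
        self.kv_factor = kv_factor  # e.g. 2 when K and V share the pool

    def get_max_resource_count(self) -> int:
        # Take the largest per-layer page bound and fold in the K/V factor
        # with integer division, mirroring V1's self.impl.max_num_blocks.
        return max(
            self.impl.get_page_index_upper_bound(layer_id, "KEY")
            for layer_id in range(self.num_local_layers)
        ) // self.kv_factor


manager = StubManager(upper_bounds=[96, 128, 128], kv_factor=2)
assert manager.get_max_resource_count() == 64  # 128 // 2
```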
2046-2067: ⚠️ Potential issue | 🔴 Critical

Replace ValueError with the release_resources() pattern to prevent a resource leak on resize failures.

At lines 2082-2091, when kv_cache.resize() or draft_kv_cache.resize() fails in the is_gen block, the code raises ValueError directly without freeing the allocated resources. Unlike the error paths at lines 2051-2072, which correctly call release_resources(), these failures leak both the primary and draft KV cache objects. Use release_resources() followed by return None instead of raising.
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py` around lines 2046 - 2067, In the is_gen branch, replace the direct raising of ValueError when kv_cache.resize(...) or draft_kv_cache.resize(...) fails with the same cleanup pattern used elsewhere: call release_resources(req) (and release_resources(req, free_draft_resources=True) where the draft cache was allocated) and then return None; update the failure paths around kw functions kv_cache.resize and draft_kv_cache.resize so both primary and draft KV cache objects are freed via release_resources() instead of raising an exception.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Line 19: Remove the direct class import "KvCacheStats" and instead reference
it through the module namespace
(tensorrt_llm.bindings.internal.batch_manager.KvCacheStats) wherever it's used;
follow the existing alias/import pattern used for KVCacheManagerCpp and other
bindings (reuse the module-level access rather than adding a new class-level
import) so all usages call the class as
tensorrt_llm.bindings.internal.batch_manager.KvCacheStats().
- Around line 1971-1975: KVCacheManagerV2.get_kv_cache_stats() currently only
sets allocated_bytes, causing callers (py_executor.py and scheduler.py) to read
zeros/empty dicts; update KVCacheManagerV2.get_kv_cache_stats to fully populate
the KvCacheStats object (set max_num_blocks, free_num_blocks, used_num_blocks,
tokens_per_block, alloc_total_blocks, alloc_new_blocks, reused_blocks,
missed_blocks, cache_hit_rate and num_free_blocks_per_window_size) by querying
the underlying implementation (self.impl) or the manager's internal
counters/metrics (e.g., use existing impl methods like get_quota, get_usage,
get_metrics or maintain/update counters on allocation/free events) so the
returned KvCacheStats contains correct values for all fields expected by
py_executor.py and scheduler.py.
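To illustrate what "fully populate the KvCacheStats object" might look like, here is a hedged sketch with an invented dataclass standing in for the real bindings class. The field names mirror those listed above, but the counter sources (block counts, reuse/miss tallies, bytes per block) are assumptions about where the manager would get its numbers:

```python
# Hedged sketch: populate every stats field rather than only
# allocated_bytes. KvCacheStatsSketch is an invented stand-in for the
# KvCacheStats class exposed by the C++ bindings.

from dataclasses import dataclass


@dataclass
class KvCacheStatsSketch:
    max_num_blocks: int = 0
    free_num_blocks: int = 0
    used_num_blocks: int = 0
    tokens_per_block: int = 0
    alloc_total_blocks: int = 0
    reused_blocks: int = 0
    missed_blocks: int = 0
    cache_hit_rate: float = 0.0
    allocated_bytes: int = 0


def build_stats(max_blocks, used_blocks, tokens_per_block,
                reused, missed, bytes_per_block):
    # Derive every field from the manager's counters so callers such as
    # py_executor.py and scheduler.py never read zeros.
    stats = KvCacheStatsSketch()
    stats.max_num_blocks = max_blocks
    stats.used_num_blocks = used_blocks
    stats.free_num_blocks = max_blocks - used_blocks
    stats.tokens_per_block = tokens_per_block
    stats.alloc_total_blocks = used_blocks
    stats.reused_blocks = reused
    stats.missed_blocks = missed
    lookups = reused + missed
    stats.cache_hit_rate = reused / lookups if lookups else 0.0
    stats.allocated_bytes = used_blocks * bytes_per_block
    return stats


s = build_stats(max_blocks=100, used_blocks=40, tokens_per_block=32,
                reused=30, missed=10, bytes_per_block=4096)
assert s.free_num_blocks == 60
assert s.cache_hit_rate == 0.75
assert s.allocated_bytes == 40 * 4096
```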
---
Outside diff comments:
In `@tensorrt_llm/_torch/pyexecutor/resource_manager.py`:
- Around line 2079-2091: In the is_gen branch where prepare_resource is true,
ensure you free already-allocated resources before raising on resize failure:
when kv_cache.resize(new_capacity) or draft_kv_cache.resize(new_capacity)
returns False, call release_resources(req.py_request_id) (or the existing
release_resources function that cleans up entries in kv_cache_map for this
request) to remove and free kv_cache and draft_kv_cache, then raise the
ValueError; update the error paths in the block that contains prepare_resource,
kv_cache.resize, draft_kv_cache.resize and the existing raise statements so they
always call release_resources(req.py_request_id) prior to raising.
- Around line 2264-2266: get_max_resource_count currently returns a placeholder
1 which underestimates capacity; replace it to compute the real max resource
count by iterating local layers (use
typed_range(LayerId(self.num_local_layers))) and calling
self.impl.get_page_index_upper_bound(layer_id, Role.KEY) for each, take the
maximum of those values and divide by self.kv_factor (use integer division if
intended) to mirror the V1 behavior (self.impl.max_num_blocks) so schedulers get
an accurate capacity for KVCacheManagerV2.
- Around line 2046-2067: In the is_gen branch, replace the direct raising of
ValueError when kv_cache.resize(...) or draft_kv_cache.resize(...) fails with
the same cleanup pattern used elsewhere: call release_resources(req) (and
release_resources(req, free_draft_resources=True) where the draft cache was
allocated) and then return None; update the failure paths around kw functions
kv_cache.resize and draft_kv_cache.resize so both primary and draft KV cache
objects are freed via release_resources() instead of raising an exception.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
Force-push: 1f96043 to 461552e
/bot run --disable-fail-fast

PR_Github #36810 [ run ] triggered by Bot.

PR_Github #36810 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #36932 [ run ] triggered by Bot.

PR_Github #36932 [ run ] completed with state
Force-push: 8917c62 to ad9335f
/bot run --disable-fail-fast

PR_Github #37016 [ run ] triggered by Bot.

/bot run --disable-fail-fast

PR_Github #37037 [ run ] triggered by Bot.

PR_Github #37037 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #37147 [ run ] triggered by Bot.

PR_Github #37147 [ run ] completed with state
Force-push: 444b782 to 455b8b4
@lowsfer could you review this? Thanks!

/bot run --disable-fail-fast

PR_Github #37268 [ run ] triggered by Bot.

PR_Github #37268 [ run ] completed with state
Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Force-push: c1f1e35 to 4f1bd5e
/bot run --disable-fail-fast

PR_Github #37334 [ run ] triggered by Bot.

PR_Github #37334 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #37409 [ run ] triggered by Bot.

PR_Github #37409 [ run ] completed with state
… for KV Cache Manager V2 (NVIDIA#11710) Signed-off-by: Yi Zhang <187001205+yizhang-nv@users.noreply.github.com>
Summary by CodeRabbit
Bug Fixes
Improvements
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail-fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages, and sanity check stages. Note: does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

kill: Kill all running builds associated with the pull request.

skip

skip --comment COMMENT: Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline: Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.