[https://nvbugs/5969206][fix] BREAKING: Setting default value of KV cache transfer timeout to 60s#12249
pcastonguay merged 3 commits into NVIDIA/TensorRT-LLM:main from pcastonguay:default_kv_transfer_timeout
Conversation
/bot run --disable-fail-fast
PR_Github #39111 [ run ] triggered by Bot. Commit:
PR_Github #39111 [ run ] completed with state
/bot run --disable-fail-fast
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
f03a210 to 94efdf6
PR_Github #39209 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #39252 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #39270 [ run ] triggered by Bot. Commit:
PR_Github #39270 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #39465 [ run ] triggered by Bot. Commit:
PR_Github #39465 [ run ] completed with state
…ache transfer timeout to 60s (NVIDIA#12249) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
… side checkContextTransferStatus retries stuck prefill-side KV cache transfers indefinitely, using only the per-iteration kv_transfer_sender_future_timeout_ms. The per-request total timeout kv_transfer_timeout_ms is plumbed through config (CacheTransceiverConfig::getKvTransferTimeoutMs) but never read in batch_manager code; it is dead code. Under concurrent load with a constrained cache, stuck transfers hold KV blocks forever, exhausting the pool. The prefill worker becomes permanently unresponsive while health probes continue returning 200 OK.

Fix: after each per-iteration timeout in checkContextTransferStatus, check the total elapsed time (via LlmRequest::getKvCacheTransferStart, already set by sendAsync) against kv_transfer_timeout_ms. When exceeded, mark the request as DISAGG_TRANS_ERROR, best-effort cancel via CacheSender, and remove it from mSenderFutures so its blocks can be freed.

Reproducer: 1P1D disagg with Qwen3-0.6B, free_gpu_memory_fraction=0.2, NIXL over TCP, concurrency 16 with ISL 8000. The server hangs after ~2 minutes and never recovers.

Related: NVIDIA#12249 (set default kv_transfer_timeout_ms=60s; config only)
Related: NVIDIA#12313, NVIDIA#12314 (Python-level fixes; they cannot fire because of a race condition in which C++ transfer completion removes requests from Python tracking before the 60s timeout elapses)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.