[https://nvbugs/5969206][fix] BREAKING: Setting default value of KV cache transfer timeout to 60s#12249
pcastonguay merged 3 commits into NVIDIA/TensorRT-LLM:main from pcastonguay:default_kv_transfer_timeout
Conversation
/bot run --disable-fail-fast
PR_Github #39111 [ run ] triggered by Bot. Commit:
PR_Github #39111 [ run ] completed with state
/bot run --disable-fail-fast
Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
f03a210 to 94efdf6
PR_Github #39209 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #39252 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #39270 [ run ] triggered by Bot. Commit:
PR_Github #39270 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #39465 [ run ] triggered by Bot. Commit:
PR_Github #39465 [ run ] completed with state
…ache transfer timeout to 60s (NVIDIA#12249) Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
… side checkContextTransferStatus retries stuck prefill-side KV cache transfers indefinitely, using only the per-iteration kv_transfer_sender_future_timeout_ms. The per-request total timeout kv_transfer_timeout_ms is plumbed through config (CacheTransceiverConfig::getKvTransferTimeoutMs) but never read in batch_manager code; it is dead code. Under concurrent load with a constrained cache, stuck transfers hold KV blocks forever, exhausting the pool. The prefill worker becomes permanently unresponsive while health probes continue returning 200 OK.

Fix: after each per-iteration timeout in checkContextTransferStatus, check the total elapsed time (via LlmRequest::getKvCacheTransferStart, already set by sendAsync) against kv_transfer_timeout_ms. When exceeded, mark the request as DISAGG_TRANS_ERROR, best-effort cancel via CacheSender, and remove it from mSenderFutures so its blocks can be freed.

Reproducer: 1P1D disagg with Qwen3-0.6B, free_gpu_memory_fraction=0.2, NIXL over TCP, concurrency 16 with ISL 8000. The server hangs after ~2 minutes and never recovers.

Related: NVIDIA#12249 (set default kv_transfer_timeout_ms=60s; config only)
Related: NVIDIA#12313, NVIDIA#12314 (Python-level fixes; they cannot fire because of a race condition in which C++ transfer completion removes requests from Python tracking before the 60s timeout elapses)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.