Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests#12618

Merged
reasonsolo merged 6 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
reasonsolo:fix/nvbug-6018051-port-conflictreasonsolo/TensorRT-LLM:fix/nvbug-6018051-port-conflictCopy head branch name to clipboard
Apr 5, 2026
Merged

[https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests#12618
reasonsolo merged 6 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
reasonsolo:fix/nvbug-6018051-port-conflictreasonsolo/TensorRT-LLM:fix/nvbug-6018051-port-conflictCopy head branch name to clipboard

Conversation

@reasonsolo
Copy link
Copy Markdown
Collaborator

@reasonsolo reasonsolo commented Mar 31, 2026

The disaggregated multi-process tests (test_py_cache_transceiver_mp.py and test_kv_transfer_mp.py) use find_free_port() followed by mp.spawn() with gloo backend. There is a TOCTOU race where find_free_port() closes the socket after discovering a free port, but before gloo's TCPStore can bind to it, another CI process may claim that port, causing EADDRINUSE failures.

Add retry logic around mp.spawn() calls that detects port conflict errors (EADDRINUSE) and retries with a fresh port up to 5 times. This matches the existing retry pattern in auto_deploy/distributed/ common.py's spawn_multiprocess_job().

Summary by CodeRabbit

  • Chores
    • Enhanced test infrastructure reliability by implementing automatic retry logic for handling port-conflict errors during parallel test process execution.
    • Improved test process spawning with configurable retry attempts (default up to 5 attempts) and dynamic port selection to prevent address binding conflicts.
    • Cleaned up test waivers list by removing obsolete skip entries.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

The disaggregated multi-process tests (test_py_cache_transceiver_mp.py
and test_kv_transfer_mp.py) use find_free_port() followed by mp.spawn()
with gloo backend. There is a TOCTOU race where find_free_port() closes
the socket after discovering a free port, but before gloo's TCPStore
can bind to it, another CI process may claim that port, causing
EADDRINUSE failures.

Add retry logic around mp.spawn() calls that detects port conflict
errors (EADDRINUSE) and retries with a fresh port up to 5 times.
This matches the existing retry pattern in auto_deploy/distributed/
common.py's spawn_multiprocess_job().

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
@reasonsolo reasonsolo changed the title [NVBug/6018051][fix] Add port conflict retry for disaggregated MP tests [https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests Mar 31, 2026
…rt conflict fix

Remove the waive entry for test_v2_transceiver_mp[ctx_first-v2_mp_tp2_pp1_to_tp2_pp1]
since the EADDRINUSE port conflict has been fixed with retry logic.

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
@reasonsolo reasonsolo marked this pull request as ready for review March 31, 2026 08:16
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

The PR removes a waiver entry for a failing test and adds port-conflict retry logic to two multiprocessing test utilities, enabling automatic re-spawning of processes on address-in-use errors up to a configurable retry limit.

Changes

Cohort / File(s) Summary
Waiver Removal
tests/integration/test_lists/waives.txt
Removed waiver for unittest/disaggregated/test_py_cache_transceiver_mp.py::test_v2_transceiver_mp[ctx_first-v2_mp_tp2_pp1_to_tp2_pp1] test case.
Port-Conflict Retry Logic
tests/unittest/disaggregated/test_kv_transfer_mp.py, tests/unittest/disaggregated/test_py_cache_transceiver_mp.py
Added _is_port_conflict_error() helper function to detect EADDRINUSE exceptions. Updated run_transfer_worker_mp() and run_v2_transceiver_mp() to accept max_port_retries parameter (default 5) and implement retry loops that select fresh ports and re-spawn processes on port-conflict failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning PR description is incomplete; missing Test Coverage and Description sections required by the template. Fill in the 'Description' section explaining the issue and solution, and 'Test Coverage' section listing relevant tests that safeguard the changes.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding port conflict retry logic for disaggregated multiprocess tests, following the proper format with ticket ID and fix type.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/unittest/disaggregated/test_kv_transfer_mp.py (1)

532-536: Consider extracting shared utility to reduce duplication.

This _is_port_conflict_error function is identical to the one in test_py_cache_transceiver_mp.py. While acceptable for test files, consider extracting it to a shared test utility module (e.g., tests/unittest/disaggregated/_test_utils.py) if more disaggregated tests need similar retry logic in the future.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/disaggregated/test_kv_transfer_mp.py` around lines 532 - 536,
The helper function _is_port_conflict_error currently duplicated in
test_kv_transfer_mp.py should be extracted into a shared test utility (e.g.,
create tests/unittest/disaggregated/_test_utils.py) and imported where needed;
move the function definition for _is_port_conflict_error into that module,
export it, then replace the local definition in test_kv_transfer_mp.py (and
remove the duplicate in test_py_cache_transceiver_mp.py if present) with an
import of _is_port_conflict_error from the new _test_utils module and run tests
to ensure no import or name changes are missed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/unittest/disaggregated/test_kv_transfer_mp.py`:
- Around line 532-536: The helper function _is_port_conflict_error currently
duplicated in test_kv_transfer_mp.py should be extracted into a shared test
utility (e.g., create tests/unittest/disaggregated/_test_utils.py) and imported
where needed; move the function definition for _is_port_conflict_error into that
module, export it, then replace the local definition in test_kv_transfer_mp.py
(and remove the duplicate in test_py_cache_transceiver_mp.py if present) with an
import of _is_port_conflict_error from the new _test_utils module and run tests
to ensure no import or name changes are missed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ba1f2a58-fa86-44e3-8582-f27de2e804ce

📥 Commits

Reviewing files that changed from the base of the PR and between f6db7e3 and a6239ad.

📒 Files selected for processing (3)
  • tests/integration/test_lists/waives.txt
  • tests/unittest/disaggregated/test_kv_transfer_mp.py
  • tests/unittest/disaggregated/test_py_cache_transceiver_mp.py
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt

@reasonsolo reasonsolo requested a review from chuangz0 April 1, 2026 05:19
@reasonsolo
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41114 [ run ] triggered by Bot. Commit: a6239ad Link to invocation

Remove waive entries for two additional test_v2_transceiver_mp variants
that also fail due to the same EADDRINUSE port conflict, now fixed by
the retry logic added in the previous commit:
- test_v2_transceiver_mp[ctx_first-v2_mp_tp1_pp2_to_tp1_pp2] (nvbugs/5989923)
- test_v2_transceiver_mp[ctx_first-v2_mp_tp1_pp1_to_tp1_pp2] (nvbugs/5996642)

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41114 [ run ] completed with state SUCCESS. Commit: a6239ad
/LLM/main/L0_MergeRequest_PR pipeline #32087 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
@reasonsolo
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@reasonsolo reasonsolo closed this Apr 2, 2026
@reasonsolo reasonsolo reopened this Apr 2, 2026
@reasonsolo
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41361 [ run ] triggered by Bot. Commit: 1608255 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41362 [ run ] triggered by Bot. Commit: 1608255 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41361 [ run ] completed with state ABORTED. Commit: 1608255

Link to invocation

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
@reasonsolo reasonsolo enabled auto-merge (squash) April 2, 2026 09:38
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41362 [ run ] completed with state SUCCESS. Commit: 1608255
/LLM/main/L0_MergeRequest_PR pipeline #32306 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41431 [ run ] triggered by Bot. Commit: eb4fc22 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41431 [ run ] completed with state SUCCESS. Commit: eb4fc22
/LLM/main/L0_MergeRequest_PR pipeline #32363 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41680 [ run ] triggered by Bot. Commit: fb8c890 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41680 [ run ] completed with state SUCCESS. Commit: fb8c890
/LLM/main/L0_MergeRequest_PR pipeline #32585 completed with status: 'ABORTED'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@reasonsolo
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41868 [ run ] triggered by Bot. Commit: fb8c890 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41868 [ run ] completed with state SUCCESS. Commit: fb8c890
/LLM/main/L0_MergeRequest_PR pipeline #32734 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

@reasonsolo reasonsolo merged commit 73a476a into NVIDIA:main Apr 5, 2026
5 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026
…ed MP tests (NVIDIA#12618)

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
…ed MP tests (NVIDIA#12618)

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.