[https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests by reasonsolo · Pull Request #12618 · NVIDIA/TensorRT-LLM

reasonsolo · Mar 31, 2026

The disaggregated multi-process tests (test_py_cache_transceiver_mp.py and test_kv_transfer_mp.py) use find_free_port() followed by mp.spawn() with gloo backend. There is a TOCTOU race where find_free_port() closes the socket after discovering a free port, but before gloo's TCPStore can bind to it, another CI process may claim that port, causing EADDRINUSE failures.

Add retry logic around mp.spawn() calls that detects port conflict errors (EADDRINUSE) and retries with a fresh port up to 5 times. This matches the existing retry pattern in auto_deploy/distributed/ common.py's spawn_multiprocess_job().

Summary by CodeRabbit

Chores
- Enhanced test infrastructure reliability by implementing automatic retry logic for handling port-conflict errors during parallel test process execution.
- Improved test process spawning with configurable retry attempts (default up to 5 attempts) and dynamic port selection to prevent address binding conflicts.
- Cleaned up test waivers list by removing obsolete skip entries.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

The disaggregated multi-process tests (test_py_cache_transceiver_mp.py and test_kv_transfer_mp.py) use find_free_port() followed by mp.spawn() with gloo backend. There is a TOCTOU race where find_free_port() closes the socket after discovering a free port, but before gloo's TCPStore can bind to it, another CI process may claim that port, causing EADDRINUSE failures. Add retry logic around mp.spawn() calls that detects port conflict errors (EADDRINUSE) and retries with a fresh port up to 5 times. This matches the existing retry pattern in auto_deploy/distributed/ common.py's spawn_multiprocess_job(). Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

…rt conflict fix Remove the waive entry for test_v2_transceiver_mp[ctx_first-v2_mp_tp2_pp1_to_tp2_pp1] since the EADDRINUSE port conflict has been fixed with retry logic. Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

coderabbitai · Mar 31, 2026

📝 Walkthrough

Walkthrough

The PR removes a waiver entry for a failing test and adds port-conflict retry logic to two multiprocessing test utilities, enabling automatic re-spawning of processes on address-in-use errors up to a configurable retry limit.

Changes

Cohort / File(s)	Summary
Waiver Removal `tests/integration/test_lists/waives.txt`	Removed waiver for `unittest/disaggregated/test_py_cache_transceiver_mp.py::test_v2_transceiver_mp[ctx_first-v2_mp_tp2_pp1_to_tp2_pp1]` test case.
Port-Conflict Retry Logic `tests/unittest/disaggregated/test_kv_transfer_mp.py`, `tests/unittest/disaggregated/test_py_cache_transceiver_mp.py`	Added `_is_port_conflict_error()` helper function to detect EADDRINUSE exceptions. Updated `run_transfer_worker_mp()` and `run_v2_transceiver_mp()` to accept `max_port_retries` parameter (default 5) and implement retry loops that select fresh ports and re-spawn processes on port-conflict failures.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	PR description is incomplete; missing Test Coverage and Description sections required by the template.	Fill in the 'Description' section explaining the issue and solution, and 'Test Coverage' section listing relevant tests that safeguard the changes.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding port conflict retry logic for disaggregated multiprocess tests, following the proper format with ticket ID and fix type.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

tests/unittest/disaggregated/test_kv_transfer_mp.py (1)
532-536: Consider extracting shared utility to reduce duplication.

This _is_port_conflict_error function is identical to the one in test_py_cache_transceiver_mp.py. While acceptable for test files, consider extracting it to a shared test utility module (e.g., tests/unittest/disaggregated/_test_utils.py) if more disaggregated tests need similar retry logic in the future.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/disaggregated/test_kv_transfer_mp.py` around lines 532 - 536,
The helper function _is_port_conflict_error currently duplicated in
test_kv_transfer_mp.py should be extracted into a shared test utility (e.g.,
create tests/unittest/disaggregated/_test_utils.py) and imported where needed;
move the function definition for _is_port_conflict_error into that module,
export it, then replace the local definition in test_kv_transfer_mp.py (and
remove the duplicate in test_py_cache_transceiver_mp.py if present) with an
import of _is_port_conflict_error from the new _test_utils module and run tests
to ensure no import or name changes are missed.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/unittest/disaggregated/test_kv_transfer_mp.py`:
- Around line 532-536: The helper function _is_port_conflict_error currently
duplicated in test_kv_transfer_mp.py should be extracted into a shared test
utility (e.g., create tests/unittest/disaggregated/_test_utils.py) and imported
where needed; move the function definition for _is_port_conflict_error into that
module, export it, then replace the local definition in test_kv_transfer_mp.py
(and remove the duplicate in test_py_cache_transceiver_mp.py if present) with an
import of _is_port_conflict_error from the new _test_utils module and run tests
to ensure no import or name changes are missed.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ba1f2a58-fa86-44e3-8582-f27de2e804ce

📥 Commits

Reviewing files that changed from the base of the PR and between f6db7e3 and a6239ad.

📒 Files selected for processing (3)

tests/integration/test_lists/waives.txt
tests/unittest/disaggregated/test_kv_transfer_mp.py
tests/unittest/disaggregated/test_py_cache_transceiver_mp.py

💤 Files with no reviewable changes (1)

tests/integration/test_lists/waives.txt

reasonsolo · Apr 1, 2026

/bot run --disable-fail-fast

tensorrt-cicd · Apr 1, 2026

PR_Github #41114 [ run ] triggered by Bot. Commit: a6239ad Link to invocation

Remove waive entries for two additional test_v2_transceiver_mp variants that also fail due to the same EADDRINUSE port conflict, now fixed by the retry logic added in the previous commit: - test_v2_transceiver_mp[ctx_first-v2_mp_tp1_pp2_to_tp1_pp2] (nvbugs/5989923) - test_v2_transceiver_mp[ctx_first-v2_mp_tp1_pp1_to_tp1_pp2] (nvbugs/5996642) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

tensorrt-cicd · Apr 1, 2026

PR_Github #41114 [ run ] completed with state SUCCESS. Commit: a6239ad
/LLM/main/L0_MergeRequest_PR pipeline #32087 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo · Apr 2, 2026

/bot run --disable-fail-fast

reasonsolo · Apr 2, 2026

/bot run --disable-fail-fast

tensorrt-cicd · Apr 2, 2026

PR_Github #41361 [ run ] triggered by Bot. Commit: 1608255 Link to invocation

tensorrt-cicd · Apr 2, 2026

PR_Github #41362 [ run ] triggered by Bot. Commit: 1608255 Link to invocation

tensorrt-cicd · Apr 2, 2026

PR_Github #41361 [ run ] completed with state ABORTED. Commit: 1608255

Link to invocation

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

tensorrt-cicd · Apr 2, 2026

PR_Github #41362 [ run ] completed with state SUCCESS. Commit: 1608255
/LLM/main/L0_MergeRequest_PR pipeline #32306 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

reasonsolo · Apr 2, 2026

/bot run --disable-fail-fast

tensorrt-cicd · Apr 2, 2026

PR_Github #41431 [ run ] triggered by Bot. Commit: eb4fc22 Link to invocation

tensorrt-cicd · Apr 2, 2026

PR_Github #41431 [ run ] completed with state SUCCESS. Commit: eb4fc22
/LLM/main/L0_MergeRequest_PR pipeline #32363 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

reasonsolo · Apr 3, 2026

/bot run --disable-fail-fast

tensorrt-cicd · Apr 3, 2026

PR_Github #41680 [ run ] triggered by Bot. Commit: fb8c890 Link to invocation

tensorrt-cicd · Apr 4, 2026

PR_Github #41680 [ run ] completed with state SUCCESS. Commit: fb8c890
/LLM/main/L0_MergeRequest_PR pipeline #32585 completed with status: 'ABORTED'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

reasonsolo · Apr 5, 2026

/bot run --disable-fail-fast

tensorrt-cicd · Apr 5, 2026

PR_Github #41868 [ run ] triggered by Bot. Commit: fb8c890 Link to invocation

tensorrt-cicd · Apr 5, 2026

PR_Github #41868 [ run ] completed with state SUCCESS. Commit: fb8c890
/LLM/main/L0_MergeRequest_PR pipeline #32734 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

…ed MP tests (NVIDIA#12618) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

github-actions Bot assigned reasonsolo Mar 31, 2026

reasonsolo changed the title ~~[NVBug/6018051][fix] Add port conflict retry for disaggregated MP tests~~ [https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests Mar 31, 2026

reasonsolo marked this pull request as ready for review March 31, 2026 08:16

coderabbitai Bot reviewed Mar 31, 2026

View reviewed changes

reasonsolo requested a review from chuangz0 April 1, 2026 05:19

Shixiaowei02 approved these changes Apr 2, 2026

View reviewed changes

Merge branch 'main' into fix/nvbug-6018051-port-conflict

1608255

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo closed this Apr 2, 2026

reasonsolo reopened this Apr 2, 2026

Merge branch 'main' into fix/nvbug-6018051-port-conflict

eb4fc22

Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

reasonsolo enabled auto-merge (squash) April 2, 2026 09:38

Merge branch 'main' into fix/nvbug-6018051-port-conflict

fb8c890

reasonsolo merged commit 73a476a into NVIDIA:main Apr 5, 2026
5 checks passed

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026

[https://nvbugs/6018051][fix] Add port conflict retry for disaggregat…

e9cddcc

…ed MP tests (NVIDIA#12618) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026

[https://nvbugs/6018051][fix] Add port conflict retry for disaggregat…

b042cd6

…ed MP tests (NVIDIA#12618) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>

Search code, repositories, users, issues, pull requests...

Conversation

reasonsolo commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

coderabbitai Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

reasonsolo commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

reasonsolo commented Apr 2, 2026

Uh oh!

reasonsolo commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

reasonsolo commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

reasonsolo commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 3, 2026

Uh oh!

tensorrt-cicd commented Apr 4, 2026

Uh oh!

reasonsolo commented Apr 5, 2026

Uh oh!

tensorrt-cicd commented Apr 5, 2026

Uh oh!

tensorrt-cicd commented Apr 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

reasonsolo commented Mar 31, 2026 •

edited

Loading

coderabbitai Bot commented Mar 31, 2026 •

edited

Loading