[https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests#12618
[https://nvbugs/6018051][fix] Add port conflict retry for disaggregated MP tests#12618reasonsolo merged 6 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom reasonsolo:fix/nvbug-6018051-port-conflictreasonsolo/TensorRT-LLM:fix/nvbug-6018051-port-conflictCopy head branch name to clipboard
Conversation
The disaggregated multi-process tests (test_py_cache_transceiver_mp.py and test_kv_transfer_mp.py) use find_free_port() followed by mp.spawn() with gloo backend. There is a TOCTOU race where find_free_port() closes the socket after discovering a free port, but before gloo's TCPStore can bind to it, another CI process may claim that port, causing EADDRINUSE failures. Add retry logic around mp.spawn() calls that detects port conflict errors (EADDRINUSE) and retries with a fresh port up to 5 times. This matches the existing retry pattern in auto_deploy/distributed/ common.py's spawn_multiprocess_job(). Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…rt conflict fix Remove the waive entry for test_v2_transceiver_mp[ctx_first-v2_mp_tp2_pp1_to_tp2_pp1] since the EADDRINUSE port conflict has been fixed with retry logic. Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
📝 WalkthroughWalkthroughThe PR removes a waiver entry for a failing test and adds port-conflict retry logic to two multiprocessing test utilities, enabling automatic re-spawning of processes on address-in-use errors up to a configurable retry limit. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unittest/disaggregated/test_kv_transfer_mp.py (1)
532-536: Consider extracting shared utility to reduce duplication.This
_is_port_conflict_errorfunction is identical to the one intest_py_cache_transceiver_mp.py. While acceptable for test files, consider extracting it to a shared test utility module (e.g.,tests/unittest/disaggregated/_test_utils.py) if more disaggregated tests need similar retry logic in the future.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/disaggregated/test_kv_transfer_mp.py` around lines 532 - 536, The helper function _is_port_conflict_error currently duplicated in test_kv_transfer_mp.py should be extracted into a shared test utility (e.g., create tests/unittest/disaggregated/_test_utils.py) and imported where needed; move the function definition for _is_port_conflict_error into that module, export it, then replace the local definition in test_kv_transfer_mp.py (and remove the duplicate in test_py_cache_transceiver_mp.py if present) with an import of _is_port_conflict_error from the new _test_utils module and run tests to ensure no import or name changes are missed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/unittest/disaggregated/test_kv_transfer_mp.py`:
- Around line 532-536: The helper function _is_port_conflict_error currently
duplicated in test_kv_transfer_mp.py should be extracted into a shared test
utility (e.g., create tests/unittest/disaggregated/_test_utils.py) and imported
where needed; move the function definition for _is_port_conflict_error into that
module, export it, then replace the local definition in test_kv_transfer_mp.py
(and remove the duplicate in test_py_cache_transceiver_mp.py if present) with an
import of _is_port_conflict_error from the new _test_utils module and run tests
to ensure no import or name changes are missed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ba1f2a58-fa86-44e3-8582-f27de2e804ce
📒 Files selected for processing (3)
tests/integration/test_lists/waives.txttests/unittest/disaggregated/test_kv_transfer_mp.pytests/unittest/disaggregated/test_py_cache_transceiver_mp.py
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
|
/bot run --disable-fail-fast |
|
PR_Github #41114 [ run ] triggered by Bot. Commit: |
Remove waive entries for two additional test_v2_transceiver_mp variants that also fail due to the same EADDRINUSE port conflict, now fixed by the retry logic added in the previous commit: - test_v2_transceiver_mp[ctx_first-v2_mp_tp1_pp2_to_tp1_pp2] (nvbugs/5989923) - test_v2_transceiver_mp[ctx_first-v2_mp_tp1_pp1_to_tp1_pp2] (nvbugs/5996642) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
|
PR_Github #41114 [ run ] completed with state
|
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
|
/bot run --disable-fail-fast |
|
/bot run --disable-fail-fast |
|
PR_Github #41361 [ run ] triggered by Bot. Commit: |
|
PR_Github #41362 [ run ] triggered by Bot. Commit: |
|
PR_Github #41361 [ run ] completed with state |
Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
|
PR_Github #41362 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #41431 [ run ] triggered by Bot. Commit: |
|
PR_Github #41431 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #41680 [ run ] triggered by Bot. Commit: |
|
PR_Github #41680 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #41868 [ run ] triggered by Bot. Commit: |
|
PR_Github #41868 [ run ] completed with state |
…ed MP tests (NVIDIA#12618) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
…ed MP tests (NVIDIA#12618) Signed-off-by: Lizhi Zhou <1432185+reasonsolo@users.noreply.github.com>
The disaggregated multi-process tests (test_py_cache_transceiver_mp.py and test_kv_transfer_mp.py) use find_free_port() followed by mp.spawn() with gloo backend. There is a TOCTOU race where find_free_port() closes the socket after discovering a free port, but before gloo's TCPStore can bind to it, another CI process may claim that port, causing EADDRINUSE failures.
Add retry logic around mp.spawn() calls that detects port conflict errors (EADDRINUSE) and retries with a fresh port up to 5 times. This matches the existing retry pattern in auto_deploy/distributed/ common.py's spawn_multiprocess_job().
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.