[https://nvbugs/5979673][fix] improve NIXL agent import error diagnostics#12446
[https://nvbugs/5979673][fix] improve NIXL agent import error diagnostics#12446Shixiaowei02 merged 2 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom Shixiaowei02:fix/nixl_import_errShixiaowei02/TensorRT-LLM:fix/nixl_import_errCopy head branch name to clipboard
Conversation
3333fe1 to
127c4ae
Compare
|
/bot run |
|
PR_Github #39904 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughUpdated Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py (2)
70-78: Consider validating all descriptor elements, not just the first.The check only validates
descs.descs[0]. If the list contains heterogeneous tuples (e.g., first has 4 elements, later ones have 3), malformed entries will pass validation and fail deeper inget_reg_descs.♻️ Proposed fix to validate all elements
- if isinstance(descs.descs[0], tuple) and len(descs.descs[0]) != 4: - raise ValueError( - f"Expected 4 elements per desc, got {len(descs.descs[0])}: {descs.descs[0]}" - ) + for i, desc in enumerate(descs.descs): + if isinstance(desc, tuple) and len(desc) != 4: + raise ValueError( + f"Expected 4 elements per desc at index {i}, got {len(desc)}: {desc}" + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py` around lines 70 - 78, The validation currently only checks descs.descs[0], which allows other malformed descriptor tuples to slip through; update the validation to iterate over all entries in descs.descs and ensure each is a tuple of length 4 (raise ValueError including the offending index and value if any fail), before calling self.agent.get_reg_descs(descs.descs, descs.type) so get_reg_descs always receives uniformly validated descriptors.
81-93: Consider extracting duplicated validation into a helper method.The validation logic in
deregister_memory(lines 81-93) is nearly identical toregister_memory(lines 67-79). A private helper could reduce duplication.♻️ Example helper extraction
def _get_validated_reg_descs(self, descs: RegMemoryDescs): """Validate and return registration descriptors.""" if not descs.descs: raise ValueError("descs.descs must not be empty") for i, desc in enumerate(descs.descs): if isinstance(desc, tuple) and len(desc) != 4: raise ValueError( f"Expected 4 elements per desc at index {i}, got {len(desc)}: {desc}" ) reg_descs = self.agent.get_reg_descs(descs.descs, descs.type) if reg_descs is None: raise RuntimeError( f"nixl get_reg_descs returned None for type={descs.type}, count={len(descs.descs)}" ) return reg_descs🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py` around lines 81 - 93, Extract the duplicated validation and conversion logic used in register_memory and deregister_memory into a private helper (e.g., _get_validated_reg_descs) that accepts a RegMemoryDescs, validates descs.descs is non-empty, iterates through each desc and ensures tuple entries have length 4 (including index in error messages), calls self.agent.get_reg_descs(descs.descs, descs.type) and raises the same RuntimeError if None, and returns reg_descs; then replace the validation block in both register_memory and deregister_memory to call this helper and use its return value when calling self.agent.register_memory or self.agent.deregister_memory.tensorrt_llm/_torch/disaggregation/nixl/agent.py (1)
36-42: Optional: Consider cleaner f-string formatting for the error message.The conditional concatenation works but could be slightly more readable.
♻️ Alternative formatting
if _py_agent is None: - raise ImportError( - "Failed to load pure Python NIXL Transfer Agent." - + (f" Caused by: {_py_agent_err}" if _py_agent_err else "") - ) + cause = f" Caused by: {_py_agent_err}" if _py_agent_err else "" + raise ImportError(f"Failed to load pure Python NIXL Transfer Agent.{cause}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/disaggregation/nixl/agent.py` around lines 36 - 42, The ImportError construction using string concatenation around _py_agent and _py_agent_err is readable but can be simplified to a single f-string; update the raise ImportError to use one f-string that conditionally includes the "Caused by: {_py_agent_err}" segment (e.g. f"Failed to load pure Python NIXL Transfer Agent{f' Caused by: {_py_agent_err}' if _py_agent_err else ''}"), keeping the same variables (_py_agent, _py_agent_err) and preserving the current behavior and spacing, then leave the subsequent references to NixlTransferStatus and NixlTransferAgent unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py`:
- Around line 70-78: The validation currently only checks descs.descs[0], which
allows other malformed descriptor tuples to slip through; update the validation
to iterate over all entries in descs.descs and ensure each is a tuple of length
4 (raise ValueError including the offending index and value if any fail), before
calling self.agent.get_reg_descs(descs.descs, descs.type) so get_reg_descs
always receives uniformly validated descriptors.
- Around line 81-93: Extract the duplicated validation and conversion logic used
in register_memory and deregister_memory into a private helper (e.g.,
_get_validated_reg_descs) that accepts a RegMemoryDescs, validates descs.descs
is non-empty, iterates through each desc and ensures tuple entries have length 4
(including index in error messages), calls self.agent.get_reg_descs(descs.descs,
descs.type) and raises the same RuntimeError if None, and returns reg_descs;
then replace the validation block in both register_memory and deregister_memory
to call this helper and use its return value when calling
self.agent.register_memory or self.agent.deregister_memory.
In `@tensorrt_llm/_torch/disaggregation/nixl/agent.py`:
- Around line 36-42: The ImportError construction using string concatenation
around _py_agent and _py_agent_err is readable but can be simplified to a single
f-string; update the raise ImportError to use one f-string that conditionally
includes the "Caused by: {_py_agent_err}" segment (e.g. f"Failed to load pure
Python NIXL Transfer Agent{f' Caused by: {_py_agent_err}' if _py_agent_err else
''}"), keeping the same variables (_py_agent, _py_agent_err) and preserving the
current behavior and spacing, then leave the subsequent references to
NixlTransferStatus and NixlTransferAgent unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a67221e4-cefe-492a-9000-adaaaf88b4c5
📒 Files selected for processing (4)
requirements-dev.txttensorrt_llm/_torch/disaggregation/nixl/_agent_py.pytensorrt_llm/_torch/disaggregation/nixl/agent.pytests/unittest/disaggregated/test_agent_multi_backends.py
|
PR_Github #39904 [ run ] completed with state
|
242f3e5 to
e437dc6
Compare
|
/bot run |
|
PR_Github #40018 [ run ] triggered by Bot. Commit: |
8123890 to
6ad68b9
Compare
|
/bot run |
|
PR_Github #40031 [ run ] triggered by Bot. Commit: |
|
PR_Github #40031 [ run ] completed with state
|
6ad68b9 to
e3971e9
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #40068 [ run ] triggered by Bot. Commit: |
e3971e9 to
5088a76
Compare
|
/bot run --disable-fail-fast |
|
/bot run --disable-fail-fast |
|
PR_Github #40714 [ run ] triggered by Bot. Commit: |
|
PR_Github #40714 [ run ] completed with state
|
42ba387 to
652c65a
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #40837 [ run ] triggered by Bot. Commit: |
|
PR_Github #40837 [ run ] completed with state
|
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
652c65a to
7f92c96
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #41168 [ run ] triggered by Bot. Commit: |
|
PR_Github #41168 [ run ] completed with state
|
|
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1, DGX_H100-2_GPUs-PyTorch-Others-1, DGX_H100-2_GPUs-PyTorch-GptOss-1" --disable-fail-fast |
|
PR_Github #41360 [ run ] triggered by Bot. Commit: |
|
PR_Github #41360 [ run ] completed with state
|
|
/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1" --disable-fail-fast |
|
PR_Github #41428 [ run ] triggered by Bot. Commit: |
|
PR_Github #41428 [ run ] completed with state |
|
/bot skip --comment "CI passed in the assembled form." |
|
PR_Github #41438 [ skip ] triggered by Bot. Commit: |
|
PR_Github #41438 [ skip ] completed with state |
…tics (NVIDIA#12446) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
…tics (NVIDIA#12446) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
…tics (NVIDIA#12446) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
Summary by CodeRabbit
Chores
Bug Fixes
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.