Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[https://nvbugs/5979673][fix] improve NIXL agent import error diagnostics#12446

Merged
Shixiaowei02 merged 2 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
Shixiaowei02:fix/nixl_import_errShixiaowei02/TensorRT-LLM:fix/nixl_import_errCopy head branch name to clipboard
Apr 2, 2026
Merged

[https://nvbugs/5979673][fix] improve NIXL agent import error diagnostics#12446
Shixiaowei02 merged 2 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
Shixiaowei02:fix/nixl_import_errShixiaowei02/TensorRT-LLM:fix/nixl_import_errCopy head branch name to clipboard

Conversation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator

@Shixiaowei02 Shixiaowei02 commented Mar 23, 2026

Summary by CodeRabbit

  • Chores

    • Updated development dependency to version 0.9.0.
  • Bug Fixes

    • Enhanced error handling with explicit exceptions and improved error messages for transfer operations and module loading.
    • Added structured logging for error states and timeout conditions.
  • Tests

    • Added unit tests to verify module loading behavior and error handling.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run

@Shixiaowei02 Shixiaowei02 requested a review from bo-nv March 23, 2026 08:33
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39904 [ run ] triggered by Bot. Commit: 127c4ae Link to invocation

@Shixiaowei02 Shixiaowei02 marked this pull request as ready for review March 23, 2026 12:43
@Shixiaowei02 Shixiaowei02 requested review from a team as code owners March 23, 2026 12:43
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 23, 2026

📝 Walkthrough

Walkthrough

Updated nixl dependency from 0.8.0 to 0.9.0. Replaced assertions with explicit exceptions in agent modules, modified _load_agent to return a tuple containing module and error information, and added structured logging to transfer status monitoring. New unit tests validate the updated _load_agent behavior.

Changes

Cohort / File(s) Summary
Dependency Update
requirements-dev.txt
Upgraded nixl from 0.8.0 to 0.9.0.
Error Handling & Logging
tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py
Replaced assertions with explicit ValueError and RuntimeError exceptions in register_memory, deregister_memory, and submit_transfer_requests. Added structured logging to NixlTransferStatus.wait for error and timeout conditions with contextual details in exception messages.
Agent Loading Logic
tensorrt_llm/_torch/disaggregation/nixl/agent.py
Modified _load_agent to return tuple (module, error) instead of module or None. Replaced assertion-based failure detection with explicit ImportError raising. Added logging for missing attributes and import failures, propagating underlying errors upstream.
Unit Tests
tests/unittest/disaggregated/test_agent_multi_backends.py
Added three new tests validating _load_agent behavior: non-existent module returns error, missing attributes on valid module, and successful loading with attribute presence verification.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description only contains the template structure with no actual content filled in the Description and Test Coverage sections, making it incomplete and failing to explain the issue, solution, or test coverage. Fill in the Description section explaining the issue and solution, and the Test Coverage section listing the relevant tests that safeguard the changes.
Docstring Coverage ⚠️ Warning Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: improving NIXL agent import error diagnostics, which aligns with the code modifications across multiple files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py (2)

70-78: Consider validating all descriptor elements, not just the first.

The check only validates descs.descs[0]. If the list contains heterogeneous tuples (e.g., first has 4 elements, later ones have 3), malformed entries will pass validation and fail deeper in get_reg_descs.

♻️ Proposed fix to validate all elements
-        if isinstance(descs.descs[0], tuple) and len(descs.descs[0]) != 4:
-            raise ValueError(
-                f"Expected 4 elements per desc, got {len(descs.descs[0])}: {descs.descs[0]}"
-            )
+        for i, desc in enumerate(descs.descs):
+            if isinstance(desc, tuple) and len(desc) != 4:
+                raise ValueError(
+                    f"Expected 4 elements per desc at index {i}, got {len(desc)}: {desc}"
+                )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py` around lines 70 - 78,
The validation currently only checks descs.descs[0], which allows other
malformed descriptor tuples to slip through; update the validation to iterate
over all entries in descs.descs and ensure each is a tuple of length 4 (raise
ValueError including the offending index and value if any fail), before calling
self.agent.get_reg_descs(descs.descs, descs.type) so get_reg_descs always
receives uniformly validated descriptors.

81-93: Consider extracting duplicated validation into a helper method.

The validation logic in deregister_memory (lines 81-93) is nearly identical to register_memory (lines 67-79). A private helper could reduce duplication.

♻️ Example helper extraction
def _get_validated_reg_descs(self, descs: RegMemoryDescs):
    """Validate and return registration descriptors."""
    if not descs.descs:
        raise ValueError("descs.descs must not be empty")
    for i, desc in enumerate(descs.descs):
        if isinstance(desc, tuple) and len(desc) != 4:
            raise ValueError(
                f"Expected 4 elements per desc at index {i}, got {len(desc)}: {desc}"
            )
    reg_descs = self.agent.get_reg_descs(descs.descs, descs.type)
    if reg_descs is None:
        raise RuntimeError(
            f"nixl get_reg_descs returned None for type={descs.type}, count={len(descs.descs)}"
        )
    return reg_descs
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py` around lines 81 - 93,
Extract the duplicated validation and conversion logic used in register_memory
and deregister_memory into a private helper (e.g., _get_validated_reg_descs)
that accepts a RegMemoryDescs, validates descs.descs is non-empty, iterates
through each desc and ensures tuple entries have length 4 (including index in
error messages), calls self.agent.get_reg_descs(descs.descs, descs.type) and
raises the same RuntimeError if None, and returns reg_descs; then replace the
validation block in both register_memory and deregister_memory to call this
helper and use its return value when calling self.agent.register_memory or
self.agent.deregister_memory.
tensorrt_llm/_torch/disaggregation/nixl/agent.py (1)

36-42: Optional: Consider cleaner f-string formatting for the error message.

The conditional concatenation works but could be slightly more readable.

♻️ Alternative formatting
     if _py_agent is None:
-        raise ImportError(
-            "Failed to load pure Python NIXL Transfer Agent."
-            + (f" Caused by: {_py_agent_err}" if _py_agent_err else "")
-        )
+        cause = f" Caused by: {_py_agent_err}" if _py_agent_err else ""
+        raise ImportError(f"Failed to load pure Python NIXL Transfer Agent.{cause}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/nixl/agent.py` around lines 36 - 42, The
ImportError construction using string concatenation around _py_agent and
_py_agent_err is readable but can be simplified to a single f-string; update the
raise ImportError to use one f-string that conditionally includes the "Caused
by: {_py_agent_err}" segment (e.g. f"Failed to load pure Python NIXL Transfer
Agent{f' Caused by: {_py_agent_err}' if _py_agent_err else ''}"), keeping the
same variables (_py_agent, _py_agent_err) and preserving the current behavior
and spacing, then leave the subsequent references to NixlTransferStatus and
NixlTransferAgent unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py`:
- Around line 70-78: The validation currently only checks descs.descs[0], which
allows other malformed descriptor tuples to slip through; update the validation
to iterate over all entries in descs.descs and ensure each is a tuple of length
4 (raise ValueError including the offending index and value if any fail), before
calling self.agent.get_reg_descs(descs.descs, descs.type) so get_reg_descs
always receives uniformly validated descriptors.
- Around line 81-93: Extract the duplicated validation and conversion logic used
in register_memory and deregister_memory into a private helper (e.g.,
_get_validated_reg_descs) that accepts a RegMemoryDescs, validates descs.descs
is non-empty, iterates through each desc and ensures tuple entries have length 4
(including index in error messages), calls self.agent.get_reg_descs(descs.descs,
descs.type) and raises the same RuntimeError if None, and returns reg_descs;
then replace the validation block in both register_memory and deregister_memory
to call this helper and use its return value when calling
self.agent.register_memory or self.agent.deregister_memory.

In `@tensorrt_llm/_torch/disaggregation/nixl/agent.py`:
- Around line 36-42: The ImportError construction using string concatenation
around _py_agent and _py_agent_err is readable but can be simplified to a single
f-string; update the raise ImportError to use one f-string that conditionally
includes the "Caused by: {_py_agent_err}" segment (e.g. f"Failed to load pure
Python NIXL Transfer Agent{f' Caused by: {_py_agent_err}' if _py_agent_err else
''}"), keeping the same variables (_py_agent, _py_agent_err) and preserving the
current behavior and spacing, then leave the subsequent references to
NixlTransferStatus and NixlTransferAgent unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a67221e4-cefe-492a-9000-adaaaf88b4c5

📥 Commits

Reviewing files that changed from the base of the PR and between 7aa1383 and 127c4ae.

📒 Files selected for processing (4)
  • requirements-dev.txt
  • tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py
  • tensorrt_llm/_torch/disaggregation/nixl/agent.py
  • tests/unittest/disaggregated/test_agent_multi_backends.py

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #39904 [ run ] completed with state SUCCESS. Commit: 127c4ae
/LLM/main/L0_MergeRequest_PR pipeline #31073 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Comment thread requirements-dev.txt
Comment thread tensorrt_llm/_torch/disaggregation/nixl/agent.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/nixl/agent.py Outdated
Comment thread tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py
@Shixiaowei02 Shixiaowei02 force-pushed the fix/nixl_import_err branch 2 times, most recently from 242f3e5 to e437dc6 Compare March 24, 2026 02:25
@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40018 [ run ] triggered by Bot. Commit: e437dc6 Link to invocation

@Shixiaowei02 Shixiaowei02 force-pushed the fix/nixl_import_err branch 2 times, most recently from 8123890 to 6ad68b9 Compare March 24, 2026 02:58
@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40031 [ run ] triggered by Bot. Commit: 6ad68b9 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40031 [ run ] completed with state SUCCESS. Commit: 6ad68b9
/LLM/main/L0_MergeRequest_PR pipeline #31186 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40068 [ run ] triggered by Bot. Commit: e3971e9 Link to invocation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40714 [ run ] triggered by Bot. Commit: 42ba387 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40714 [ run ] completed with state SUCCESS. Commit: 42ba387
/LLM/main/L0_MergeRequest_PR pipeline #31741 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40837 [ run ] triggered by Bot. Commit: 652c65a Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #40837 [ run ] completed with state SUCCESS. Commit: 652c65a
/LLM/main/L0_MergeRequest_PR pipeline #31847 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>
@Shixiaowei02 Shixiaowei02 force-pushed the fix/nixl_import_err branch from 652c65a to 7f92c96 Compare April 1, 2026 09:08
@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41168 [ run ] triggered by Bot. Commit: 7f92c96 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41168 [ run ] completed with state SUCCESS. Commit: 7f92c96
/LLM/main/L0_MergeRequest_PR pipeline #32134 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1, DGX_H100-2_GPUs-PyTorch-Others-1, DGX_H100-2_GPUs-PyTorch-GptOss-1" --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41360 [ run ] triggered by Bot. Commit: 7f92c96 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41360 [ run ] completed with state SUCCESS. Commit: 7f92c96
/LLM/main/L0_MergeRequest_PR pipeline #32304 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1" --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41428 [ run ] triggered by Bot. Commit: 7f92c96 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41428 [ run ] completed with state SUCCESS. Commit: 7f92c96
/LLM/main/L0_MergeRequest_PR pipeline #32359 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@Shixiaowei02
Copy link
Copy Markdown
Collaborator Author

/bot skip --comment "CI passed in the assembled form."

@Shixiaowei02 Shixiaowei02 enabled auto-merge (squash) April 2, 2026 13:49
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41438 [ skip ] triggered by Bot. Commit: 7f92c96 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #41438 [ skip ] completed with state SUCCESS. Commit: 7f92c96
Skipping testing for commit 7f92c96

Link to invocation

@Shixiaowei02 Shixiaowei02 merged commit 4c97a03 into NVIDIA:main Apr 2, 2026
5 checks passed
@Shixiaowei02 Shixiaowei02 deleted the fix/nixl_import_err branch April 3, 2026 02:08
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
yuanjingx87 pushed a commit to yuanjingx87/TensorRT-LLM that referenced this pull request Apr 8, 2026
…tics (NVIDIA#12446)

Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
yuanjingx87 pushed a commit to yuanjingx87/TensorRT-LLM that referenced this pull request Apr 10, 2026
…tics (NVIDIA#12446)

Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
yuanjingx87 pushed a commit to yuanjingx87/TensorRT-LLM that referenced this pull request Apr 10, 2026
…tics (NVIDIA#12446)

Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.