[https://nvbugs/5979673][fix] improve NIXL agent import error diagnostics by Shixiaowei02 · Pull Request #12446 · NVIDIA/TensorRT-LLM

Shixiaowei02 · Mar 23, 2026

Summary by CodeRabbit

Chores
- Updated development dependency to version 0.9.0.
Bug Fixes
- Enhanced error handling with explicit exceptions and improved error messages for transfer operations and module loading.
- Added structured logging for error states and timeout conditions.
Tests
- Added unit tests to verify module loading behavior and error handling.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Shixiaowei02 · Mar 23, 2026

/bot run

tensorrt-cicd · Mar 23, 2026

PR_Github #39904 [ run ] triggered by Bot. Commit: 127c4ae Link to invocation

coderabbitai · Mar 23, 2026

📝 Walkthrough

Walkthrough

Updated nixl dependency from 0.8.0 to 0.9.0. Replaced assertions with explicit exceptions in agent modules, modified _load_agent to return a tuple containing module and error information, and added structured logging to transfer status monitoring. New unit tests validate the updated _load_agent behavior.

Changes

Cohort / File(s)	Summary
Dependency Update `requirements-dev.txt`	Upgraded `nixl` from 0.8.0 to 0.9.0.
Error Handling & Logging `tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py`	Replaced assertions with explicit `ValueError` and `RuntimeError` exceptions in `register_memory`, `deregister_memory`, and `submit_transfer_requests`. Added structured logging to `NixlTransferStatus.wait` for error and timeout conditions with contextual details in exception messages.
Agent Loading Logic `tensorrt_llm/_torch/disaggregation/nixl/agent.py`	Modified `_load_agent` to return tuple `(module, error)` instead of module or `None`. Replaced assertion-based failure detection with explicit `ImportError` raising. Added logging for missing attributes and import failures, propagating underlying errors upstream.
Unit Tests `tests/unittest/disaggregated/test_agent_multi_backends.py`	Added three new tests validating `_load_agent` behavior: non-existent module returns error, missing attributes on valid module, and successful loading with attribute presence verification.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description only contains the template structure with no actual content filled in the Description and Test Coverage sections, making it incomplete and failing to explain the issue, solution, or test coverage.	Fill in the Description section explaining the issue and solution, and the Test Coverage section listing the relevant tests that safeguard the changes.
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: improving NIXL agent import error diagnostics, which aligns with the code modifications across multiple files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (3)

tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py (2)

70-78: Consider validating all descriptor elements, not just the first.

The check only validates descs.descs[0]. If the list contains heterogeneous tuples (e.g., first has 4 elements, later ones have 3), malformed entries will pass validation and fail deeper in get_reg_descs.

♻️ Proposed fix to validate all elements

-        if isinstance(descs.descs[0], tuple) and len(descs.descs[0]) != 4:
-            raise ValueError(
-                f"Expected 4 elements per desc, got {len(descs.descs[0])}: {descs.descs[0]}"
-            )
+        for i, desc in enumerate(descs.descs):
+            if isinstance(desc, tuple) and len(desc) != 4:
+                raise ValueError(
+                    f"Expected 4 elements per desc at index {i}, got {len(desc)}: {desc}"
+                )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py` around lines 70 - 78,
The validation currently only checks descs.descs[0], which allows other
malformed descriptor tuples to slip through; update the validation to iterate
over all entries in descs.descs and ensure each is a tuple of length 4 (raise
ValueError including the offending index and value if any fail), before calling
self.agent.get_reg_descs(descs.descs, descs.type) so get_reg_descs always
receives uniformly validated descriptors.

81-93: Consider extracting duplicated validation into a helper method.

The validation logic in deregister_memory (lines 81-93) is nearly identical to register_memory (lines 67-79). A private helper could reduce duplication.

♻️ Example helper extraction

def _get_validated_reg_descs(self, descs: RegMemoryDescs):
    """Validate and return registration descriptors."""
    if not descs.descs:
        raise ValueError("descs.descs must not be empty")
    for i, desc in enumerate(descs.descs):
        if isinstance(desc, tuple) and len(desc) != 4:
            raise ValueError(
                f"Expected 4 elements per desc at index {i}, got {len(desc)}: {desc}"
            )
    reg_descs = self.agent.get_reg_descs(descs.descs, descs.type)
    if reg_descs is None:
        raise RuntimeError(
            f"nixl get_reg_descs returned None for type={descs.type}, count={len(descs.descs)}"
        )
    return reg_descs

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py` around lines 81 - 93,
Extract the duplicated validation and conversion logic used in register_memory
and deregister_memory into a private helper (e.g., _get_validated_reg_descs)
that accepts a RegMemoryDescs, validates descs.descs is non-empty, iterates
through each desc and ensures tuple entries have length 4 (including index in
error messages), calls self.agent.get_reg_descs(descs.descs, descs.type) and
raises the same RuntimeError if None, and returns reg_descs; then replace the
validation block in both register_memory and deregister_memory to call this
helper and use its return value when calling self.agent.register_memory or
self.agent.deregister_memory.

tensorrt_llm/_torch/disaggregation/nixl/agent.py (1)

36-42: Optional: Consider cleaner f-string formatting for the error message.

The conditional concatenation works but could be slightly more readable.

♻️ Alternative formatting

     if _py_agent is None:
-        raise ImportError(
-            "Failed to load pure Python NIXL Transfer Agent."
-            + (f" Caused by: {_py_agent_err}" if _py_agent_err else "")
-        )
+        cause = f" Caused by: {_py_agent_err}" if _py_agent_err else ""
+        raise ImportError(f"Failed to load pure Python NIXL Transfer Agent.{cause}")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/disaggregation/nixl/agent.py` around lines 36 - 42, The
ImportError construction using string concatenation around _py_agent and
_py_agent_err is readable but can be simplified to a single f-string; update the
raise ImportError to use one f-string that conditionally includes the "Caused
by: {_py_agent_err}" segment (e.g. f"Failed to load pure Python NIXL Transfer
Agent{f' Caused by: {_py_agent_err}' if _py_agent_err else ''}"), keeping the
same variables (_py_agent, _py_agent_err) and preserving the current behavior
and spacing, then leave the subsequent references to NixlTransferStatus and
NixlTransferAgent unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py`:
- Around line 70-78: The validation currently only checks descs.descs[0], which
allows other malformed descriptor tuples to slip through; update the validation
to iterate over all entries in descs.descs and ensure each is a tuple of length
4 (raise ValueError including the offending index and value if any fail), before
calling self.agent.get_reg_descs(descs.descs, descs.type) so get_reg_descs
always receives uniformly validated descriptors.
- Around line 81-93: Extract the duplicated validation and conversion logic used
in register_memory and deregister_memory into a private helper (e.g.,
_get_validated_reg_descs) that accepts a RegMemoryDescs, validates descs.descs
is non-empty, iterates through each desc and ensures tuple entries have length 4
(including index in error messages), calls self.agent.get_reg_descs(descs.descs,
descs.type) and raises the same RuntimeError if None, and returns reg_descs;
then replace the validation block in both register_memory and deregister_memory
to call this helper and use its return value when calling
self.agent.register_memory or self.agent.deregister_memory.

In `@tensorrt_llm/_torch/disaggregation/nixl/agent.py`:
- Around line 36-42: The ImportError construction using string concatenation
around _py_agent and _py_agent_err is readable but can be simplified to a single
f-string; update the raise ImportError to use one f-string that conditionally
includes the "Caused by: {_py_agent_err}" segment (e.g. f"Failed to load pure
Python NIXL Transfer Agent{f' Caused by: {_py_agent_err}' if _py_agent_err else
''}"), keeping the same variables (_py_agent, _py_agent_err) and preserving the
current behavior and spacing, then leave the subsequent references to
NixlTransferStatus and NixlTransferAgent unchanged.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a67221e4-cefe-492a-9000-adaaaf88b4c5

📥 Commits

Reviewing files that changed from the base of the PR and between 7aa1383 and 127c4ae.

📒 Files selected for processing (4)

requirements-dev.txt
tensorrt_llm/_torch/disaggregation/nixl/_agent_py.py
tensorrt_llm/_torch/disaggregation/nixl/agent.py
tests/unittest/disaggregated/test_agent_multi_backends.py

tensorrt-cicd · Mar 23, 2026

PR_Github #39904 [ run ] completed with state SUCCESS. Commit: 127c4ae
/LLM/main/L0_MergeRequest_PR pipeline #31073 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Shixiaowei02 · Mar 24, 2026

/bot run

tensorrt-cicd · Mar 24, 2026

PR_Github #40018 [ run ] triggered by Bot. Commit: e437dc6 Link to invocation

Shixiaowei02 · Mar 24, 2026

/bot run

tensorrt-cicd · Mar 24, 2026

PR_Github #40031 [ run ] triggered by Bot. Commit: 6ad68b9 Link to invocation

tensorrt-cicd · Mar 24, 2026

PR_Github #40031 [ run ] completed with state SUCCESS. Commit: 6ad68b9
/LLM/main/L0_MergeRequest_PR pipeline #31186 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Shixiaowei02 · Mar 24, 2026

/bot run --disable-fail-fast

tensorrt-cicd · Mar 24, 2026

PR_Github #40068 [ run ] triggered by Bot. Commit: e3971e9 Link to invocation

Shixiaowei02 · Mar 24, 2026

/bot run --disable-fail-fast

Shixiaowei02 · Mar 30, 2026

/bot run --disable-fail-fast

tensorrt-cicd · Mar 30, 2026

PR_Github #40714 [ run ] triggered by Bot. Commit: 42ba387 Link to invocation

tensorrt-cicd · Mar 30, 2026

PR_Github #40714 [ run ] completed with state SUCCESS. Commit: 42ba387
/LLM/main/L0_MergeRequest_PR pipeline #31741 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Shixiaowei02 · Mar 31, 2026

/bot run --disable-fail-fast

tensorrt-cicd · Mar 31, 2026

PR_Github #40837 [ run ] triggered by Bot. Commit: 652c65a Link to invocation

tensorrt-cicd · Mar 31, 2026

PR_Github #40837 [ run ] completed with state SUCCESS. Commit: 652c65a
/LLM/main/L0_MergeRequest_PR pipeline #31847 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Shixiaowei02 · Apr 1, 2026

/bot run --disable-fail-fast

tensorrt-cicd · Apr 1, 2026

PR_Github #41168 [ run ] triggered by Bot. Commit: 7f92c96 Link to invocation

tensorrt-cicd · Apr 1, 2026

PR_Github #41168 [ run ] completed with state SUCCESS. Commit: 7f92c96
/LLM/main/L0_MergeRequest_PR pipeline #32134 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Shixiaowei02 · Apr 2, 2026

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1, DGX_H100-2_GPUs-PyTorch-Others-1, DGX_H100-2_GPUs-PyTorch-GptOss-1" --disable-fail-fast

tensorrt-cicd · Apr 2, 2026

PR_Github #41360 [ run ] triggered by Bot. Commit: 7f92c96 Link to invocation

tensorrt-cicd · Apr 2, 2026

PR_Github #41360 [ run ] completed with state SUCCESS. Commit: 7f92c96
/LLM/main/L0_MergeRequest_PR pipeline #32304 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Shixiaowei02 · Apr 2, 2026

/bot run --stage-list "DGX_H100-4_GPUs-PyTorch-Others-1" --disable-fail-fast

tensorrt-cicd · Apr 2, 2026

PR_Github #41428 [ run ] triggered by Bot. Commit: 7f92c96 Link to invocation

tensorrt-cicd · Apr 2, 2026

PR_Github #41428 [ run ] completed with state SUCCESS. Commit: 7f92c96
/LLM/main/L0_MergeRequest_PR pipeline #32359 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

Shixiaowei02 · Apr 2, 2026

/bot skip --comment "CI passed in the assembled form."

tensorrt-cicd · Apr 2, 2026

PR_Github #41438 [ skip ] triggered by Bot. Commit: 7f92c96 Link to invocation

tensorrt-cicd · Apr 2, 2026

PR_Github #41438 [ skip ] completed with state SUCCESS. Commit: 7f92c96
Skipping testing for commit 7f92c96

Link to invocation

…tics (NVIDIA#12446)

…tics (NVIDIA#12446) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>

github-actions Bot assigned Shixiaowei02 Mar 23, 2026

Shixiaowei02 force-pushed the fix/nixl_import_err branch from 3333fe1 to 127c4ae Compare March 23, 2026 08:32

Shixiaowei02 requested a review from chuangz0 March 23, 2026 08:33

Shixiaowei02 requested a review from bo-nv March 23, 2026 08:33

Shixiaowei02 marked this pull request as ready for review March 23, 2026 12:43

Shixiaowei02 requested review from a team as code owners March 23, 2026 12:43

Shixiaowei02 requested review from byshiue, pcastonguay and shaharmor98 March 23, 2026 12:43

coderabbitai Bot reviewed Mar 23, 2026

View reviewed changes

pcastonguay reviewed Mar 23, 2026

View reviewed changes

Shixiaowei02 force-pushed the fix/nixl_import_err branch 2 times, most recently from 242f3e5 to e437dc6 Compare March 24, 2026 02:25

Shixiaowei02 force-pushed the fix/nixl_import_err branch 2 times, most recently from 8123890 to 6ad68b9 Compare March 24, 2026 02:58

Shixiaowei02 force-pushed the fix/nixl_import_err branch from 6ad68b9 to e3971e9 Compare March 24, 2026 06:08

Shixiaowei02 force-pushed the fix/nixl_import_err branch from e3971e9 to 5088a76 Compare March 24, 2026 10:53

Shixiaowei02 force-pushed the fix/nixl_import_err branch from 42ba387 to 652c65a Compare March 31, 2026 03:15

Shixiaowei02 added 2 commits April 1, 2026 17:08

fix: improve NIXL agent import error diagnostics

3ab65cb

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

fix according to the comments

7f92c96

Signed-off-by: Shixiaowei02 <39303645+Shixiaowei02@users.noreply.github.com>

Shixiaowei02 force-pushed the fix/nixl_import_err branch from 652c65a to 7f92c96 Compare April 1, 2026 09:08

Shixiaowei02 enabled auto-merge (squash) April 2, 2026 13:49

atrifex approved these changes Apr 2, 2026

View reviewed changes

Shixiaowei02 merged commit 4c97a03 into NVIDIA:main Apr 2, 2026
5 checks passed

Shixiaowei02 deleted the fix/nixl_import_err branch April 3, 2026 02:08

karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026

[https://nvbugs/5979673][fix] improve NIXL agent import error diagnos…

06b3c23

…tics (NVIDIA#12446)

yuanjingx87 pushed a commit to yuanjingx87/TensorRT-LLM that referenced this pull request Apr 8, 2026

[https://nvbugs/5979673][fix] improve NIXL agent import error diagnos…

2f5a980

…tics (NVIDIA#12446) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>

yuanjingx87 pushed a commit to yuanjingx87/TensorRT-LLM that referenced this pull request Apr 10, 2026

[https://nvbugs/5979673][fix] improve NIXL agent import error diagnos…

346272b

…tics (NVIDIA#12446) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>

yuanjingx87 pushed a commit to yuanjingx87/TensorRT-LLM that referenced this pull request Apr 10, 2026

[https://nvbugs/5979673][fix] improve NIXL agent import error diagnos…

ac9ad67

…tics (NVIDIA#12446) Signed-off-by: Yuanjing Xue <197832395+yuanjingx87@users.noreply.github.com>

Search code, repositories, users, issues, pull requests...

Conversation

Shixiaowei02 commented Mar 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

Shixiaowei02 commented Mar 23, 2026

Uh oh!

tensorrt-cicd commented Mar 23, 2026

Uh oh!

coderabbitai Bot commented Mar 23, 2026

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

tensorrt-cicd commented Mar 23, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Shixiaowei02 commented Mar 24, 2026

Uh oh!

tensorrt-cicd commented Mar 24, 2026

Uh oh!

Shixiaowei02 commented Mar 24, 2026

Uh oh!

tensorrt-cicd commented Mar 24, 2026

Uh oh!

tensorrt-cicd commented Mar 24, 2026

Uh oh!

Shixiaowei02 commented Mar 24, 2026

Uh oh!

tensorrt-cicd commented Mar 24, 2026

Uh oh!

Shixiaowei02 commented Mar 24, 2026

Uh oh!

Shixiaowei02 commented Mar 30, 2026

Uh oh!

tensorrt-cicd commented Mar 30, 2026

Uh oh!

tensorrt-cicd commented Mar 30, 2026

Uh oh!

Shixiaowei02 commented Mar 31, 2026

Uh oh!

tensorrt-cicd commented Mar 31, 2026

Uh oh!

tensorrt-cicd commented Mar 31, 2026

Uh oh!

Shixiaowei02 commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

tensorrt-cicd commented Apr 1, 2026

Uh oh!

Shixiaowei02 commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

Shixiaowei02 commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

Shixiaowei02 commented Apr 2, 2026

Uh oh!

tensorrt-cicd commented Apr 2, 2026

Uh oh!

Shixiaowei02 commented Mar 23, 2026 •

edited by coderabbitai Bot

Loading