Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[TRTLLMINF-43][feat] Update SLURM job submission logic to retry up to…#12778

Merged
dpitman-nvda merged 9 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
dpitman-nvda:feat/restart-on-node-crashesdpitman-nvda/TensorRT-LLM:feat/restart-on-node-crashesCopy head branch name to clipboard
Apr 24, 2026
Merged

[TRTLLMINF-43][feat] Update SLURM job submission logic to retry up to…#12778
dpitman-nvda merged 9 commits intoNVIDIA:mainNVIDIA/TensorRT-LLM:mainfrom
dpitman-nvda:feat/restart-on-node-crashesdpitman-nvda/TensorRT-LLM:feat/restart-on-node-crashesCopy head branch name to clipboard

Conversation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator

@dpitman-nvda dpitman-nvda commented Apr 6, 2026

… 3 times if we detect a failure that appears to be related to a test machine/cluster issue.

Summary by CodeRabbit

  • Chores
    • Enhanced test execution reliability in the CI/CD pipeline by implementing automatic retry logic for infrastructure-related failures. Transient infrastructure issues now trigger automatic retries with configurable limits and 60-second delays between attempts, reducing false test failures.

Description

Our test cases which run on more exotic hardware setups can fail for reasons outside of our control (eg - job preemption), but Jenkins doesn't recognize this and fails the entire test suite.

This PR adds a set of messages believed to be associated with node failures/preemptions and sets up retry logic in the event we hit one of these failures.

Test Coverage

N/A, this is a CI change

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

… 3 times if we detect a failure that appears to be related to a test machine/cluster issue.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 6, 2026

📝 Walkthrough

Walkthrough

Modified runLLMTestlistOnSlurm function to implement retry infrastructure for Slurm job execution. Introduced classifyInfraFailure function and three new constants to enable pattern-based failure classification, attempt tracking, and configurable retry budgets with 60-second backoff intervals.

Changes

Cohort / File(s) Summary
Retry Infrastructure for Slurm Jobs
jenkins/L0_Test.groovy
Added classifyInfraFailure() function and three constants (SLURM_INFRA_FAILURE_PATTERNS, SLURM_INFRA_SINGLE_RETRY_PATTERNS, SLURM_INFRA_RETRY_MAX) to enable case-insensitive pattern-based infrastructure failure detection. Replaced single-pass control flow with a while(true) retry loop that manages attempt counting, enforces per-failure-type retry budgets, prevents retries on InterruptedException and abort markers, and implements 60-second backoff between attempts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Title check ❓ Inconclusive The title is partially related to the changeset, referring to the SLURM retry logic enhancement but is incomplete (truncated with 'up to…'). Complete the title to fully convey the main change, e.g., '[TRTLLMINF-43][feat] Update SLURM job submission logic to retry on infrastructure failures'.
✅ Passed checks (2 passed)
Check name Status Explanation
Description check ✅ Passed The description covers the main changes and rationale, with clear explanation of the problem and solution, though test coverage is marked as N/A for a CI change.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)

1453-1457: Consider adding cleanup state awareness for retry scenarios.

The retry mechanism delegates cleanup to the inner functions (runLLMTestlistWithSbatch and runLLMTestlistWithAgent), which have their own finally blocks. This is a good design.

One consideration: if the cleanup in the inner function fails or is incomplete (e.g., network issue during cleanup), the retry might encounter stale resources. The 60-second cooldown helps mitigate this, but you may want to add defensive logging or verification that resources from the previous attempt are cleaned up before starting a new attempt.

For the sbatch path, the scriptSubmit (lines 1279-1304) already handles this with:

if [ -f "${jobWorkspace}/slurm_job_id.txt" ]; then
    previous_job_id=$(cat "${jobWorkspace}/slurm_job_id.txt")
    scancel "${previous_job_id}" || true

This is a good defensive pattern already in place.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` around lines 1453 - 1457, The retry block calling
runLLMTestlistWithSbatch or runLLMTestlistWithAgent can start a new attempt
while stale resources remain if inner cleanup failed; before re-invoking those
functions add defensive verification and logging to ensure previous attempt
artifacts are cleared (for sbatch mirror the existing scriptSubmit check: look
for "${jobWorkspace}/slurm_job_id.txt", read and scancel the previous job id,
and log failures but continue), and for agent path implement equivalent checks
(e.g., detect leftover agent processes, temp dirs or lock files, attempt safe
teardown, log outcomes) so each retry confirms cleanup or records warnings
before proceeding.

102-134: Pattern list looks comprehensive; consider potential false positives.

The infrastructure failure patterns are well-documented. One observation:

  • The pattern "is no longer active" (line 116) is generic and could potentially match test output that contains this phrase. Consider using a more specific pattern like "SLURM job is no longer active" or "Slurm job.*is no longer active" if regex matching is supported.

  • "Evicted" (line 121) is also broad and might match test logs mentioning Kubernetes eviction in contexts other than the actual pod running the test.

The current implementation uses case-insensitive substring matching (via .toLowerCase().contains()), so these patterns will match anywhere in the exception chain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/L0_Test.groovy` around lines 102 - 134, The generic substrings "is no
longer active" and "Evicted" in SLURM_INFRA_FAILURE_PATTERNS (and duplicate
"Permission denied, please try again" in SLURM_INFRA_SINGLE_RETRY_PATTERNS) can
produce false positives; update SLURM_INFRA_FAILURE_PATTERNS and
SLURM_INFRA_SINGLE_RETRY_PATTERNS to use more specific patterns (e.g., "Slurm
job.*is no longer active" or "SLURM job is no longer active", and "Pod
.*Evicted" or "Evicted:") or convert the matching logic to use case-insensitive
regex matching so you can anchor or require surrounding context, and remove the
duplicate "Permission denied, please try again" entry from
SLURM_INFRA_SINGLE_RETRY_PATTERNS; ensure the matching code (where .contains()
is used) is updated to apply regex matching when evaluating these arrays.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@jenkins/L0_Test.groovy`:
- Around line 1453-1457: The retry block calling runLLMTestlistWithSbatch or
runLLMTestlistWithAgent can start a new attempt while stale resources remain if
inner cleanup failed; before re-invoking those functions add defensive
verification and logging to ensure previous attempt artifacts are cleared (for
sbatch mirror the existing scriptSubmit check: look for
"${jobWorkspace}/slurm_job_id.txt", read and scancel the previous job id, and
log failures but continue), and for agent path implement equivalent checks
(e.g., detect leftover agent processes, temp dirs or lock files, attempt safe
teardown, log outcomes) so each retry confirms cleanup or records warnings
before proceeding.
- Around line 102-134: The generic substrings "is no longer active" and
"Evicted" in SLURM_INFRA_FAILURE_PATTERNS (and duplicate "Permission denied,
please try again" in SLURM_INFRA_SINGLE_RETRY_PATTERNS) can produce false
positives; update SLURM_INFRA_FAILURE_PATTERNS and
SLURM_INFRA_SINGLE_RETRY_PATTERNS to use more specific patterns (e.g., "Slurm
job.*is no longer active" or "SLURM job is no longer active", and "Pod
.*Evicted" or "Evicted:") or convert the matching logic to use case-insensitive
regex matching so you can anchor or require surrounding context, and remove the
duplicate "Permission denied, please try again" entry from
SLURM_INFRA_SINGLE_RETRY_PATTERNS; ensure the matching code (where .contains()
is used) is updated to apply regex matching when evaluating these arrays.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 249d5970-ea39-4b9e-84f0-330b0a953269

📥 Commits

Reviewing files that changed from the base of the PR and between 9457816 and d1ed7d3.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43780 [ run ] triggered by Bot. Commit: 165a766 Link to invocation

Copy link
Copy Markdown
Collaborator

@yuanjingx87 yuanjingx87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change LGTM, but the logic looks very similar to llmExecStepWithRetry defined here: https://gitlab-master.nvidia.com/ftp/infra/trtllm-jenkins-shared-lib/-/blob/main/vars/trtllm_utils.groovy?ref_type=heads#L164
. As a follow-up, it might be worth consolidating the two to reduce duplication.

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #43780 [ run ] completed with state FAILURE. Commit: 165a766
/LLM/main/L0_MergeRequest_PR pipeline #34261 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44037 [ run ] triggered by Bot. Commit: 165a766 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44037 [ run ] completed with state SUCCESS. Commit: 165a766
/LLM/main/L0_MergeRequest_PR pipeline #34475 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44462 [ run ] triggered by Bot. Commit: 165a766 Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

So the latest CI run actually did encounter failures it believed were infra-related and auto-retried them.

One small bug: the upload results stage fails on the re-run because "it has already been uploaded".

I will work on a fix.

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44462 [ run ] completed with state SUCCESS. Commit: 165a766
/LLM/main/L0_MergeRequest_PR pipeline #34867 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44515 [ run ] triggered by Bot. Commit: 6508ce2 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44515 [ run ] completed with state SUCCESS. Commit: 6508ce2
/LLM/main/L0_MergeRequest_PR pipeline #34915 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44737 [ run ] triggered by Bot. Commit: 6508ce2 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44737 [ run ] completed with state SUCCESS. Commit: 6508ce2
/LLM/main/L0_MergeRequest_PR pipeline #35100 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44747 [ run ] triggered by Bot. Commit: 6508ce2 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44747 [ run ] completed with state SUCCESS. Commit: 6508ce2
/LLM/main/L0_MergeRequest_PR pipeline #35108 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44759 [ run ] triggered by Bot. Commit: 5d54913 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44759 [ run ] completed with state SUCCESS. Commit: 5d54913
/LLM/main/L0_MergeRequest_PR pipeline #35118 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44774 [ run ] triggered by Bot. Commit: 5d54913 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44774 [ run ] completed with state ABORTED. Commit: 5d54913

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44790 [ run ] triggered by Bot. Commit: b4aa508 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44790 [ run ] completed with state SUCCESS. Commit: b4aa508
/LLM/main/L0_MergeRequest_PR pipeline #35142 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44801 [ run ] triggered by Bot. Commit: b4aa508 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44801 [ run ] completed with state SUCCESS. Commit: b4aa508
/LLM/main/L0_MergeRequest_PR pipeline #35151 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44968 [ run ] triggered by Bot. Commit: b4aa508 Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44999 [ run ] triggered by Bot. Commit: 12a0cdd Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #44999 [ run ] completed with state SUCCESS. Commit: 12a0cdd
/LLM/main/L0_MergeRequest_PR pipeline #35318 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@dpitman-nvda
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45196 [ run ] triggered by Bot. Commit: 12a0cdd Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45196 [ run ] completed with state SUCCESS. Commit: 12a0cdd
/LLM/main/L0_MergeRequest_PR pipeline #35464 completed with status: 'SUCCESS'

CI Report

Link to invocation

@dpitman-nvda dpitman-nvda merged commit 5efc816 into NVIDIA:main Apr 24, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.