[TRTLLMINF-43][feat] Update SLURM job submission logic to retry up to… #12778

dpitman-nvda merged 9 commits into NVIDIA:main from dpitman-nvda:feat/restart-on-node-crashes
Conversation
… 3 times if we detect a failure that appears to be related to a test machine/cluster issue. Signed-off-by: Derek Pitman <dpitman@nvidia.com>
📝 Walkthrough

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 inconclusive
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)
1453-1457: Consider adding cleanup state awareness for retry scenarios.

The retry mechanism delegates cleanup to the inner functions (runLLMTestlistWithSbatch and runLLMTestlistWithAgent), which have their own finally blocks. This is a good design.

One consideration: if the cleanup in the inner function fails or is incomplete (e.g., a network issue during cleanup), the retry might encounter stale resources. The 60-second cooldown helps mitigate this, but you may want to add defensive logging or verification that resources from the previous attempt are cleaned up before starting a new attempt.

For the sbatch path, scriptSubmit (lines 1279-1304) already handles this:

```bash
if [ -f "${jobWorkspace}/slurm_job_id.txt" ]; then
    previous_job_id=$(cat "${jobWorkspace}/slurm_job_id.txt")
    scancel "${previous_job_id}" || true
fi
```

This is a good defensive pattern already in place.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/L0_Test.groovy` around lines 1453 - 1457, The retry block calling runLLMTestlistWithSbatch or runLLMTestlistWithAgent can start a new attempt while stale resources remain if inner cleanup failed; before re-invoking those functions add defensive verification and logging to ensure previous attempt artifacts are cleared (for sbatch mirror the existing scriptSubmit check: look for "${jobWorkspace}/slurm_job_id.txt", read and scancel the previous job id, and log failures but continue), and for agent path implement equivalent checks (e.g., detect leftover agent processes, temp dirs or lock files, attempt safe teardown, log outcomes) so each retry confirms cleanup or records warnings before proceeding.
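The pre-retry cleanup the comment asks for could look like the following. This is an illustrative Python sketch, not the actual Groovy or shell code in the PR; the function name `cancel_stale_job` and the injected `cancel_fn` callable are hypothetical stand-ins for an `scancel` invocation.

```python
import os

def cancel_stale_job(job_workspace, cancel_fn, log=print):
    """Before retrying, cancel any SLURM job left over from the previous
    attempt, mirroring the shell-side scriptSubmit check: read
    slurm_job_id.txt if present, cancel that job id, and log (but
    tolerate) cancellation failures."""
    marker = os.path.join(job_workspace, "slurm_job_id.txt")
    if not os.path.exists(marker):
        return None                      # nothing to clean up
    with open(marker) as f:
        job_id = f.read().strip()
    try:
        cancel_fn(job_id)                # e.g. would run `scancel <job_id>`
    except Exception as exc:             # mirror `|| true`: warn, don't fail
        log(f"scancel {job_id} failed, continuing: {exc}")
    return job_id
```

Tolerating cancellation failures matches the existing `|| true` pattern: a stale job that cannot be cancelled should produce a warning, not abort the retry.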
102-134: Pattern list looks comprehensive; consider potential false positives.

The infrastructure failure patterns are well-documented. Two observations:

- The pattern "is no longer active" (line 116) is generic and could match test output that happens to contain this phrase. Consider a more specific pattern like "SLURM job is no longer active", or "Slurm job.*is no longer active" if regex matching is supported.
- "Evicted" (line 121) is also broad and might match test logs mentioning Kubernetes eviction in contexts other than the actual pod running the test.

The current implementation uses case-insensitive substring matching (via .toLowerCase().contains()), so these patterns will match anywhere in the exception chain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/L0_Test.groovy` around lines 102 - 134, The generic substrings "is no longer active" and "Evicted" in SLURM_INFRA_FAILURE_PATTERNS (and duplicate "Permission denied, please try again" in SLURM_INFRA_SINGLE_RETRY_PATTERNS) can produce false positives; update SLURM_INFRA_FAILURE_PATTERNS and SLURM_INFRA_SINGLE_RETRY_PATTERNS to use more specific patterns (e.g., "Slurm job.*is no longer active" or "SLURM job is no longer active", and "Pod .*Evicted" or "Evicted:") or convert the matching logic to use case-insensitive regex matching so you can anchor or require surrounding context, and remove the duplicate "Permission denied, please try again" entry from SLURM_INFRA_SINGLE_RETRY_PATTERNS; ensure the matching code (where .contains() is used) is updated to apply regex matching when evaluating these arrays.
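The substring-vs-regex distinction the review draws can be sketched as follows. This is a hedged Python illustration, not the Groovy in jenkins/L0_Test.groovy; the regex list contents are hypothetical examples of the "more specific patterns" the comment suggests.

```python
import re

# Hypothetical, more specific replacements for the broad substrings the
# review flags ("is no longer active", "Evicted").
INFRA_FAILURE_REGEXES = [
    r"slurm job\b.*\bis no longer active",
    r"pod\b.*\bevicted",
]

def looks_like_infra_failure(text):
    """Case-insensitive regex search over the exception-chain text."""
    return any(re.search(p, text, re.IGNORECASE) for p in INFRA_FAILURE_REGEXES)

# Broad substring matching would flag this test log as an infra failure;
# the anchored regexes do not:
test_log = "assert 'Evicted' in pod.status.reason"
infra_log = "Pod trtllm-worker-3 was Evicted due to node pressure"
```

Anchoring the keyword to surrounding context ("slurm job …", "pod … evicted") keeps the match case-insensitive while requiring the phrase to appear in an infra-shaped sentence rather than anywhere in test output.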
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 249d5970-ea39-4b9e-84f0-330b0a953269
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
Signed-off-by: Derek Pitman <dpitman@nvidia.com>
/bot run

/bot run

PR_Github #43780 [ run ] triggered by Bot. Commit:
yuanjingx87 left a comment:

The change LGTM, but the logic looks very similar to llmExecStepWithRetry defined here: https://gitlab-master.nvidia.com/ftp/infra/trtllm-jenkins-shared-lib/-/blob/main/vars/trtllm_utils.groovy?ref_type=heads#L164. As a follow-up, it might be worth consolidating the two to reduce duplication.
PR_Github #43780 [ run ] completed with state

/bot run

PR_Github #44037 [ run ] triggered by Bot. Commit:

PR_Github #44037 [ run ] completed with state

/bot run

PR_Github #44462 [ run ] triggered by Bot. Commit:
So the latest CI run actually did encounter failures it believed were infra-related and auto-retried them. One small bug: the upload results stage fails on the re-run because "it has already been uploaded". I will work on a fix.
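One plausible shape for that fix is to make the upload step idempotent rather than letting a duplicate upload fail the stage. This is purely a hypothetical Python sketch of the idea; the actual fix landed in the Groovy pipeline and the names here (`upload_results_once`, `stage_key`, `upload_fn`) are invented for illustration.

```python
def upload_results_once(stage_key, uploaded, upload_fn):
    """Upload a stage's results only if a previous attempt hasn't already
    done so, instead of letting the re-run fail on a duplicate upload."""
    if stage_key in uploaded:
        return False                 # already uploaded by an earlier attempt
    upload_fn(stage_key)
    uploaded.add(stage_key)
    return True
```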
PR_Github #44462 [ run ] completed with state
…rious stage failures Signed-off-by: Derek Pitman <dpitman@nvidia.com>
/bot run

PR_Github #44515 [ run ] triggered by Bot. Commit:

PR_Github #44515 [ run ] completed with state

/bot run

PR_Github #44737 [ run ] triggered by Bot. Commit:

PR_Github #44737 [ run ] completed with state

/bot run

PR_Github #44747 [ run ] triggered by Bot. Commit:

PR_Github #44747 [ run ] completed with state

/bot run

PR_Github #44759 [ run ] triggered by Bot. Commit:

PR_Github #44759 [ run ] completed with state

/bot run

PR_Github #44774 [ run ] triggered by Bot. Commit:

PR_Github #44774 [ run ] completed with state

/bot run

PR_Github #44790 [ run ] triggered by Bot. Commit:

PR_Github #44790 [ run ] completed with state

/bot run

PR_Github #44801 [ run ] triggered by Bot. Commit:

PR_Github #44801 [ run ] completed with state

/bot run

PR_Github #44968 [ run ] triggered by Bot. Commit:

/bot run

PR_Github #44999 [ run ] triggered by Bot. Commit:

PR_Github #44999 [ run ] completed with state

/bot run

PR_Github #45196 [ run ] triggered by Bot. Commit:

PR_Github #45196 [ run ] completed with state
Summary by CodeRabbit
Description
Our test cases that run on more exotic hardware setups can fail for reasons outside our control (e.g., job preemption), but Jenkins doesn't recognize this and fails the entire test suite.
This PR adds a set of messages believed to be associated with node failures/preemptions and sets up retry logic in the event we hit one of these failures.
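The scheme described above can be sketched in Python (the real implementation is in jenkins/L0_Test.groovy; the pattern strings and function names below are illustrative assumptions, while the 3-attempt limit comes from the PR title and the 60-second cooldown from the review discussion):

```python
import time

MAX_ATTEMPTS = 3        # per the PR title: retry up to 3 times
COOLDOWN_SECONDS = 60   # cooldown between attempts noted in review

# Illustrative stand-ins for the infra-failure message list the PR adds.
INFRA_PATTERNS = ["node failure", "job preempted", "is no longer active"]

def is_infra_failure(exc):
    """Case-insensitive substring match over the failure text."""
    text = str(exc).lower()
    return any(p in text for p in INFRA_PATTERNS)

def run_with_infra_retry(stage_fn, sleep_fn=time.sleep):
    """Re-run stage_fn only for infra-looking failures; genuine test
    failures (or exhausted retries) propagate immediately."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return stage_fn()
        except RuntimeError as exc:
            if attempt == MAX_ATTEMPTS or not is_infra_failure(exc):
                raise
            sleep_fn(COOLDOWN_SECONDS)   # let the cluster settle
```

The key design point is the asymmetry: only failures matching a known infra pattern are retried, so a real test regression still fails the suite on the first attempt.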
Test Coverage
N/A, this is a CI change
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.