[TRTLLMINF-43][feat] Update SLURM job submission logic to retry up to… #12778

dpitman-nvda merged 9 commits into NVIDIA:main from dpitman-nvda:feat/restart-on-node-crashes
Conversation
… 3 times if we detect a failure that appears to be related to a test machine/cluster issue. Signed-off-by: Derek Pitman <dpitman@nvidia.com>
📝 Walkthrough

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 inconclusive
🧹 Nitpick comments (2)
jenkins/L0_Test.groovy (2)
1453-1457: Consider adding cleanup state awareness for retry scenarios.

The retry mechanism delegates cleanup to the inner functions (runLLMTestlistWithSbatch and runLLMTestlistWithAgent), which have their own finally blocks. This is a good design.

One consideration: if the cleanup in the inner function fails or is incomplete (e.g., a network issue during cleanup), the retry might encounter stale resources. The 60-second cooldown helps mitigate this, but you may want to add defensive logging or verification that resources from the previous attempt are cleaned up before starting a new attempt.

For the sbatch path, scriptSubmit (lines 1279-1304) already handles this:

```bash
if [ -f "${jobWorkspace}/slurm_job_id.txt" ]; then
    previous_job_id=$(cat "${jobWorkspace}/slurm_job_id.txt")
    scancel "${previous_job_id}" || true
fi
```

This is a good defensive pattern already in place.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/L0_Test.groovy` around lines 1453 - 1457, The retry block calling runLLMTestlistWithSbatch or runLLMTestlistWithAgent can start a new attempt while stale resources remain if inner cleanup failed; before re-invoking those functions add defensive verification and logging to ensure previous attempt artifacts are cleared (for sbatch mirror the existing scriptSubmit check: look for "${jobWorkspace}/slurm_job_id.txt", read and scancel the previous job id, and log failures but continue), and for agent path implement equivalent checks (e.g., detect leftover agent processes, temp dirs or lock files, attempt safe teardown, log outcomes) so each retry confirms cleanup or records warnings before proceeding.
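The pre-retry cleanup the comment asks for could look like the following. This is an illustrative Python sketch, not the actual Groovy or shell code in the PR; the function name `cancel_stale_job` and the injected `cancel_fn` callable are hypothetical stand-ins for an `scancel` invocation.

```python
import os

def cancel_stale_job(job_workspace, cancel_fn, log=print):
    """Before retrying, cancel any SLURM job left over from the previous
    attempt, mirroring the shell-side scriptSubmit check: read
    slurm_job_id.txt if present, cancel that job id, and log (but
    tolerate) cancellation failures."""
    marker = os.path.join(job_workspace, "slurm_job_id.txt")
    if not os.path.exists(marker):
        return None                      # nothing to clean up
    with open(marker) as f:
        job_id = f.read().strip()
    try:
        cancel_fn(job_id)                # e.g. would run `scancel <job_id>`
    except Exception as exc:             # mirror `|| true`: warn, don't fail
        log(f"scancel {job_id} failed, continuing: {exc}")
    return job_id
```

Tolerating cancellation failures matches the existing `|| true` pattern: a stale job that cannot be cancelled should produce a warning, not abort the retry.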
102-134: Pattern list looks comprehensive; consider potential false positives.

The infrastructure failure patterns are well-documented. Two observations:

- The pattern "is no longer active" (line 116) is generic and could match test output that happens to contain this phrase. Consider a more specific pattern like "SLURM job is no longer active", or "Slurm job.*is no longer active" if regex matching is supported.
- "Evicted" (line 121) is also broad and might match test logs mentioning Kubernetes eviction in contexts other than the actual pod running the test.

The current implementation uses case-insensitive substring matching (via .toLowerCase().contains()), so these patterns will match anywhere in the exception chain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@jenkins/L0_Test.groovy` around lines 102 - 134, The generic substrings "is no longer active" and "Evicted" in SLURM_INFRA_FAILURE_PATTERNS (and duplicate "Permission denied, please try again" in SLURM_INFRA_SINGLE_RETRY_PATTERNS) can produce false positives; update SLURM_INFRA_FAILURE_PATTERNS and SLURM_INFRA_SINGLE_RETRY_PATTERNS to use more specific patterns (e.g., "Slurm job.*is no longer active" or "SLURM job is no longer active", and "Pod .*Evicted" or "Evicted:") or convert the matching logic to use case-insensitive regex matching so you can anchor or require surrounding context, and remove the duplicate "Permission denied, please try again" entry from SLURM_INFRA_SINGLE_RETRY_PATTERNS; ensure the matching code (where .contains() is used) is updated to apply regex matching when evaluating these arrays.
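The substring-vs-regex distinction the review draws can be sketched as follows. This is a hedged Python illustration, not the Groovy in jenkins/L0_Test.groovy; the regex list contents are hypothetical examples of the "more specific patterns" the comment suggests.

```python
import re

# Hypothetical, more specific replacements for the broad substrings the
# review flags ("is no longer active", "Evicted").
INFRA_FAILURE_REGEXES = [
    r"slurm job\b.*\bis no longer active",
    r"pod\b.*\bevicted",
]

def looks_like_infra_failure(text):
    """Case-insensitive regex search over the exception-chain text."""
    return any(re.search(p, text, re.IGNORECASE) for p in INFRA_FAILURE_REGEXES)

# Broad substring matching would flag this test log as an infra failure;
# the anchored regexes do not:
test_log = "assert 'Evicted' in pod.status.reason"
infra_log = "Pod trtllm-worker-3 was Evicted due to node pressure"
```

Anchoring the keyword to surrounding context ("slurm job …", "pod … evicted") keeps the match case-insensitive while requiring the phrase to appear in an infra-shaped sentence rather than anywhere in test output.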
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 249d5970-ea39-4b9e-84f0-330b0a953269
📒 Files selected for processing (1)
jenkins/L0_Test.groovy
Signed-off-by: Derek Pitman <dpitman@nvidia.com>
/bot run

/bot run

PR_Github #43780 [ run ] triggered by Bot. Commit:
yuanjingx87 left a comment:

The change LGTM, but the logic looks very similar to llmExecStepWithRetry defined here: https://gitlab-master.nvidia.com/ftp/infra/trtllm-jenkins-shared-lib/-/blob/main/vars/trtllm_utils.groovy?ref_type=heads#L164. As a follow-up, it might be worth consolidating the two to reduce duplication.
PR_Github #43780 [ run ] completed with state

/bot run

PR_Github #44037 [ run ] triggered by Bot. Commit:

PR_Github #44037 [ run ] completed with state

/bot run

PR_Github #44462 [ run ] triggered by Bot. Commit:
So the latest CI run actually did encounter failures it believed were infra-related and auto-retried them. One small bug: the upload results stage fails on the re-run because "it has already been uploaded". I will work on a fix.
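One plausible shape for that fix is to make the upload step idempotent rather than letting a duplicate upload fail the stage. This is purely a hypothetical Python sketch of the idea; the actual fix landed in the Groovy pipeline and the names here (`upload_results_once`, `stage_key`, `upload_fn`) are invented for illustration.

```python
def upload_results_once(stage_key, uploaded, upload_fn):
    """Upload a stage's results only if a previous attempt hasn't already
    done so, instead of letting the re-run fail on a duplicate upload."""
    if stage_key in uploaded:
        return False                 # already uploaded by an earlier attempt
    upload_fn(stage_key)
    uploaded.add(stage_key)
    return True
```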
PR_Github #44462 [ run ] completed with state
…rious stage failures Signed-off-by: Derek Pitman <dpitman@nvidia.com>
/bot run

PR_Github #44515 [ run ] triggered by Bot. Commit:

PR_Github #44515 [ run ] completed with state

/bot run

PR_Github #44737 [ run ] triggered by Bot. Commit:

PR_Github #44737 [ run ] completed with state

/bot run

PR_Github #44747 [ run ] triggered by Bot. Commit:

PR_Github #44747 [ run ] completed with state

/bot run

PR_Github #44759 [ run ] triggered by Bot. Commit:

PR_Github #44759 [ run ] completed with state

/bot run

PR_Github #44774 [ run ] triggered by Bot. Commit:

PR_Github #44774 [ run ] completed with state

/bot run

PR_Github #44790 [ run ] triggered by Bot. Commit:

PR_Github #44790 [ run ] completed with state

/bot run

PR_Github #44801 [ run ] triggered by Bot. Commit:

PR_Github #44801 [ run ] completed with state

/bot run

PR_Github #44968 [ run ] triggered by Bot. Commit:

/bot run

PR_Github #44999 [ run ] triggered by Bot. Commit:

PR_Github #44999 [ run ] completed with state

/bot run

PR_Github #45196 [ run ] triggered by Bot. Commit:

PR_Github #45196 [ run ] completed with state
Summary by CodeRabbit
Description
Our test cases that run on more exotic hardware setups can fail for reasons outside our control (e.g., job preemption), but Jenkins doesn't recognize this and fails the entire test suite.
This PR adds a set of messages believed to be associated with node failures/preemptions and sets up retry logic in the event we hit one of these failures.
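The scheme described above can be sketched in Python (the real implementation is in jenkins/L0_Test.groovy; the pattern strings and function names below are illustrative assumptions, while the 3-attempt limit comes from the PR title and the 60-second cooldown from the review discussion):

```python
import time

MAX_ATTEMPTS = 3        # per the PR title: retry up to 3 times
COOLDOWN_SECONDS = 60   # cooldown between attempts noted in review

# Illustrative stand-ins for the infra-failure message list the PR adds.
INFRA_PATTERNS = ["node failure", "job preempted", "is no longer active"]

def is_infra_failure(exc):
    """Case-insensitive substring match over the failure text."""
    text = str(exc).lower()
    return any(p in text for p in INFRA_PATTERNS)

def run_with_infra_retry(stage_fn, sleep_fn=time.sleep):
    """Re-run stage_fn only for infra-looking failures; genuine test
    failures (or exhausted retries) propagate immediately."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return stage_fn()
        except RuntimeError as exc:
            if attempt == MAX_ATTEMPTS or not is_infra_failure(exc):
                raise
            sleep_fn(COOLDOWN_SECONDS)   # let the cluster settle
```

The key design point is the asymmetry: only failures matching a known infra pattern are retried, so a real test regression still fails the suite on the first attempt.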
Test Coverage
N/A, this is a CI change
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.