[None][fix] Guard CUDA event elapsed_time in perf_metrics_manager to prevent executor crash #12868
Tabrizian merged 5 commits into NVIDIA/TensorRT-LLM:main from yifjiang/TensorRT-LLM:fix/perf-metrics-cuda-event-crash
Conversation
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes. Pre-merge checks: ✅ 3 passed.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py (1)
175-180: Optional: add low-noise logging for timing failures. Consider emitting a throttled debug/warn log in this `except` path so persistent CUDA event-timing failures can be detected and diagnosed without impacting runtime behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py` around lines 175-180, in the `except RuntimeError` block (where `batch_gpu_forward_time` and `batch_gpu_sample_time` are set to 0.0), add a low-noise, throttled log message (debug or warn) that records the RuntimeError occurrence; implement throttling via a simple rate limit (e.g., a `last_log_time` timestamp or an occurrence counter with a reset interval) so repeated CUDA event-timing failures don't spam logs; use the module/class logger (e.g., the `perf_metrics_manager` logger) and include minimal context like "CUDA event timing failed for batch" plus the exception message when actually logging.
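The rate-limiting the prompt describes can be sketched with a small helper. This is an illustrative sketch only — `ThrottledLogger` and its parameters are hypothetical names, not part of the TensorRT-LLM codebase:

```python
import time

class ThrottledLogger:
    """Emit at most one log line per `interval_s` seconds (illustrative sketch)."""

    def __init__(self, log_fn, interval_s: float = 60.0):
        self._log_fn = log_fn          # e.g. logger.warning
        self._interval_s = interval_s
        self._last_log_time = None     # monotonic timestamp of the last emitted line
        self._suppressed = 0           # occurrences swallowed since then

    def log(self, msg: str) -> None:
        now = time.monotonic()
        if self._last_log_time is None or now - self._last_log_time >= self._interval_s:
            if self._suppressed:
                msg += f" ({self._suppressed} similar messages suppressed)"
            self._log_fn(msg)          # actually emit
            self._last_log_time = now
            self._suppressed = 0
        else:
            self._suppressed += 1      # rate-limited: count instead of logging
```

In the `except RuntimeError` path this could be called as `throttled.log(f"CUDA event timing failed for batch: {e}")`, so a persistent failure surfaces in the logs without flooding them on every batch.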
@yifjiang I think seeing the error here doesn't mean that it is a metrics error. It can be other issues, but we only see it here since we have synchronized the event. Given that there was a PR to fix the original issue (assertion for not finding the block), is it possible to try that and see if you are still running into the issue you were seeing? Please let me know if I'm missing any details, thanks. #12667
@Tabrizian Thanks for the pointer — will test with PR #12667 to see if it addresses the root cause. That said, regardless of what triggers the
The issue you described in "Broader fix: executor resilience" — where the executor thread dies silently and the server becomes a zombie (HTTP alive, health checks passing, but all inference requests hang forever) — looks like a real bug worth tracking. The root cause is that _event_loop_wrapper runs on a daemon thread, and when it dies, no one notifies the blocked await_responses() callers or updates the health status. Would you consider filing an NVBug for this so it can be addressed independently? Thanks!
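The zombie pattern described here can be reproduced in miniature without CUDA: a daemon worker dies on an unhandled exception, and nothing wakes the callers blocked on its output queue. This is a hypothetical stand-in, not the actual `py_executor.py` code:

```python
import queue
import threading

responses: "queue.Queue[str]" = queue.Queue()

def event_loop():
    # Stands in for elapsed_time() raising inside the executor loop.
    raise RuntimeError("CUDA event timing failed")

worker = threading.Thread(target=event_loop, daemon=True)
worker.start()
worker.join()
# threading's default excepthook prints the traceback to stderr,
# but no caller is notified and nothing is pushed to the queue.

# A blocking await_responses()-style get() would now hang forever.
# A timeout makes the hang observable:
try:
    responses.get(timeout=0.1)
    outcome = "got response"
except queue.Empty:
    outcome = "hung"
```

The process stays alive and healthy-looking the whole time, which is exactly the zombie state: the only symptom is that `outcome` here ends up as `"hung"`.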
… executor crash

Wrap the elapsed_time() calls in compute_batch_gpu_times() with try/except RuntimeError. If a CUDA event was not recorded on the current stream, elapsed_time() raises RuntimeError, which propagates up through the executor event loop and kills the executor thread. The main process and Dynamo runtime continue running (serving HTTP, responding to health probes), but with no executor thread, every inference request hangs forever. With this fix, a CUDA event timing failure logs 0.0 for that batch's metrics instead of crashing the executor.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
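A minimal sketch of the guard this commit describes, written duck-typed over objects exposing `elapsed_time()` so it runs without a GPU (in the real code these are `torch.cuda.Event` pairs inside `PerfMetricsManager`; the parameter names here are illustrative):

```python
def compute_batch_gpu_times(forward_start, forward_end, sample_end):
    """Return (forward_ms, sample_ms); fall back to 0.0 if timing fails."""
    try:
        # elapsed_time() raises RuntimeError if an event was never
        # recorded on the current stream.
        forward_ms = forward_start.elapsed_time(forward_end)
        sample_ms = forward_end.elapsed_time(sample_end)
    except RuntimeError:
        # Report 0.0 for this batch rather than letting the exception
        # propagate through the event loop and kill the executor thread.
        forward_ms = 0.0
        sample_ms = 0.0
    return forward_ms, sample_ms
```

A batch whose events were recorded reports real timings; a batch whose events were not recorded reports `(0.0, 0.0)` and the executor keeps running.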
Log the RuntimeError details so the failure is visible in logs rather than silently zeroing the metrics. This helps operators diagnose whether the timing failure indicates a deeper issue with the forward pass or stream synchronization. Uses tensorrt_llm.logger (codebase convention) instead of stdlib logging.getLogger. Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #43122 [ run ] triggered by Bot. Commit:

Filed as NVBug 6074898 — covers all three defects (no error responses to in-flight requests,

PR_Github #43122 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #43247 [ run ] triggered by Bot. Commit:

PR_Github #43247 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #43310 [ run ] triggered by Bot. Commit:

PR_Github #43310 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #43499 [ run ] triggered by Bot. Commit:

PR_Github #43499 [ run ] completed with state
…prevent executor crash (NVIDIA#12868) Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com> Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Summary

- `compute_batch_gpu_times()` in `PerfMetricsManager` calls `elapsed_time()` on CUDA events without error handling
- `elapsed_time()` raises `RuntimeError`
- `_event_loop_wrapper` → kills the executor thread

Fix

Wrap `elapsed_time()` calls in `try/except RuntimeError`. On failure, log `0.0` for that batch's GPU timing metrics instead of crashing the executor. This is safe because:

Changed file

`tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py` — `compute_batch_gpu_times()` method

Test plan

- (`ast.parse`)
- (`dynamo-trtllm:pr7391-trtllm12476-v4-x86-b200`) — confirmed `try/except RuntimeError` is present via `inspect.getsource`

Signed-off-by: Yifan Jiang 19356972+yifjiang@users.noreply.github.com
🤖 Generated with Claude Code
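The two test-plan checks can be sketched as follows. Here the patched source is a stand-in string, whereas the actual verification used `inspect.getsource` on the live `compute_batch_gpu_times()` inside the running container:

```python
import ast
import textwrap

# Stand-in for the patched method body (the real one lives in
# tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py).
patched_source = textwrap.dedent("""
    def compute_batch_gpu_times(self):
        try:
            forward_ms = self.forward_start.elapsed_time(self.forward_end)
        except RuntimeError:
            forward_ms = 0.0
        return forward_ms
""")

# Syntax check: ast.parse raises SyntaxError if the patch were malformed.
ast.parse(patched_source)

# Guard check: confirm the try/except RuntimeError guard is present.
assert "except RuntimeError" in patched_source
```

Against a deployed container, replacing `patched_source` with `inspect.getsource(PerfMetricsManager.compute_batch_gpu_times)` performs the same two checks on the code that is actually running.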
Stack Trace (reproduced on B200, disaggregated serving with `tensorrtllm-runtime:1.0.0`)

The executor thread dies silently. The main process and Dynamo runtime continue running (HTTP server, health probes return 200), but no inference requests can be processed — they all hang indefinitely waiting for the dead executor.
Broader fix: executor resilience
The immediate fix in this PR guards `compute_batch_gpu_times()`. However, the deeper issue is that `_event_loop_wrapper` in `py_executor.py` re-raises all exceptions, killing the executor thread on any error:

A more resilient approach would be to catch and log non-fatal errors (like metrics failures) while only terminating the executor for truly unrecoverable errors (OOM, CUDA device lost). Currently, a single metrics computation `RuntimeError` permanently kills the executor thread, leaving the server in a zombie state — HTTP alive, health probes passing, but zero inference throughput.
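One possible shape for that approach, sketched generically. The fatal-error classification below is illustrative only — it is not TensorRT-LLM's actual policy, and matching on message substrings is a simplification:

```python
# Markers that suggest an unrecoverable CUDA/runtime failure (illustrative).
FATAL_MARKERS = ("out of memory", "device-side assert", "cuda error")

def is_fatal(exc: BaseException) -> bool:
    msg = str(exc).lower()
    return any(marker in msg for marker in FATAL_MARKERS)

def event_loop_wrapper(step, on_error, max_steps=None):
    """Run step() repeatedly; record non-fatal errors, re-raise fatal ones."""
    steps = 0
    while max_steps is None or steps < max_steps:
        steps += 1
        try:
            step()
        except Exception as exc:
            if is_fatal(exc):
                raise          # truly unrecoverable: let the executor die loudly
            on_error(exc)      # recoverable (e.g. metrics): record and continue
```

With this shape, a metrics `RuntimeError` is handed to `on_error` (logging, health-status update) and the loop keeps serving, while an OOM or lost device still terminates the executor.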