[None][fix] Guard CUDA event elapsed_time in perf_metrics_manager to prevent executor crash #12868
Tabrizian merged 5 commits into NVIDIA/TensorRT-LLM:main from yifjiang/TensorRT-LLM:fix/perf-metrics-cuda-event-crash
Conversation
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes. Pre-merge checks: ✅ 3 passed.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py (1)
175-180: Optional: add low-noise logging for timing failures. Consider emitting a throttled debug/warn log in this `except` path so persistent CUDA event-timing failures can be detected and diagnosed without impacting runtime behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py` around lines 175-180, in the `except RuntimeError` block (where `batch_gpu_forward_time` and `batch_gpu_sample_time` are set to 0.0), add a low-noise, throttled log message (debug or warn) that records the RuntimeError occurrence; implement throttling via a simple rate limit (e.g., a `last_log_time` timestamp or an occurrence counter with a reset interval) so repeated CUDA event-timing failures don't spam logs; use the module/class logger (e.g., the `perf_metrics_manager` logger) and include minimal context like "CUDA event timing failed for batch" plus the exception message when actually logging.
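The rate-limiting the prompt describes can be sketched with a small helper. This is an illustrative sketch only — `ThrottledLogger` and its parameters are hypothetical names, not part of the TensorRT-LLM codebase:

```python
import time

class ThrottledLogger:
    """Emit at most one log line per `interval_s` seconds (illustrative sketch)."""

    def __init__(self, log_fn, interval_s: float = 60.0):
        self._log_fn = log_fn          # e.g. logger.warning
        self._interval_s = interval_s
        self._last_log_time = None     # monotonic timestamp of the last emitted line
        self._suppressed = 0           # occurrences swallowed since then

    def log(self, msg: str) -> None:
        now = time.monotonic()
        if self._last_log_time is None or now - self._last_log_time >= self._interval_s:
            if self._suppressed:
                msg += f" ({self._suppressed} similar messages suppressed)"
            self._log_fn(msg)          # actually emit
            self._last_log_time = now
            self._suppressed = 0
        else:
            self._suppressed += 1      # rate-limited: count instead of logging
```

In the `except RuntimeError` path this could be called as `throttled.log(f"CUDA event timing failed for batch: {e}")`, so a persistent failure surfaces in the logs without flooding them on every batch.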
@yifjiang I think seeing the error here doesn't mean that it is a metrics error. It can be other issues, but we only see it here since we have synchronized the event. Given that there was a PR to fix the original issue (assertion for not finding the block), is it possible to try that and see if you are still running into the issue you were seeing? Please let me know if I'm missing any details, thanks. #12667
@Tabrizian Thanks for the pointer — will test with PR #12667 to see if it addresses the root cause. That said, regardless of what triggers the
The issue you described in "Broader fix: executor resilience" — where the executor thread dies silently and the server becomes a zombie (HTTP alive, health checks passing, but all inference requests hang forever) — looks like a real bug worth tracking. The root cause is that _event_loop_wrapper runs on a daemon thread, and when it dies, no one notifies the blocked await_responses() callers or updates the health status. Would you consider filing an NVBug for this so it can be addressed independently? Thanks!
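The zombie pattern described here can be reproduced in miniature without CUDA: a daemon worker dies on an unhandled exception, and nothing wakes the callers blocked on its output queue. This is a hypothetical stand-in, not the actual `py_executor.py` code:

```python
import queue
import threading

responses: "queue.Queue[str]" = queue.Queue()

def event_loop():
    # Stands in for elapsed_time() raising inside the executor loop.
    raise RuntimeError("CUDA event timing failed")

worker = threading.Thread(target=event_loop, daemon=True)
worker.start()
worker.join()
# threading's default excepthook prints the traceback to stderr,
# but no caller is notified and nothing is pushed to the queue.

# A blocking await_responses()-style get() would now hang forever.
# A timeout makes the hang observable:
try:
    responses.get(timeout=0.1)
    outcome = "got response"
except queue.Empty:
    outcome = "hung"
```

The process stays alive and healthy-looking the whole time, which is exactly the zombie state: the only symptom is that `outcome` here ends up as `"hung"`.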
… executor crash

Wrap the elapsed_time() calls in compute_batch_gpu_times() with try/except RuntimeError. If a CUDA event was not recorded on the current stream, elapsed_time() raises RuntimeError, which propagates up through the executor event loop and kills the executor thread. The main process and Dynamo runtime continue running (serving HTTP, responding to health probes), but with no executor thread, every inference request hangs forever. With this fix, a CUDA event timing failure logs 0.0 for that batch's metrics instead of crashing the executor.

Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
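A minimal sketch of the guard this commit describes, written duck-typed over objects exposing `elapsed_time()` so it runs without a GPU (in the real code these are `torch.cuda.Event` pairs inside `PerfMetricsManager`; the parameter names here are illustrative):

```python
def compute_batch_gpu_times(forward_start, forward_end, sample_end):
    """Return (forward_ms, sample_ms); fall back to 0.0 if timing fails."""
    try:
        # elapsed_time() raises RuntimeError if an event was never
        # recorded on the current stream.
        forward_ms = forward_start.elapsed_time(forward_end)
        sample_ms = forward_end.elapsed_time(sample_end)
    except RuntimeError:
        # Report 0.0 for this batch rather than letting the exception
        # propagate through the event loop and kill the executor thread.
        forward_ms = 0.0
        sample_ms = 0.0
    return forward_ms, sample_ms
```

A batch whose events were recorded reports real timings; a batch whose events were not recorded reports `(0.0, 0.0)` and the executor keeps running.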
Log the RuntimeError details so the failure is visible in logs rather than silently zeroing the metrics. This helps operators diagnose whether the timing failure indicates a deeper issue with the forward pass or stream synchronization. Uses tensorrt_llm.logger (codebase convention) instead of stdlib logging.getLogger. Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
/bot run --disable-fail-fast

PR_Github #43122 [ run ] triggered by Bot. Commit:

Filed as NVBug 6074898 — covers all three defects (no error responses to in-flight requests,

PR_Github #43122 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #43247 [ run ] triggered by Bot. Commit:

PR_Github #43247 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #43310 [ run ] triggered by Bot. Commit:

PR_Github #43310 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #43499 [ run ] triggered by Bot. Commit:

PR_Github #43499 [ run ] completed with state
…prevent executor crash (NVIDIA#12868) Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com> Co-authored-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
Summary

- `compute_batch_gpu_times()` in `PerfMetricsManager` calls `elapsed_time()` on CUDA events without error handling
- `elapsed_time()` raises `RuntimeError`
- `_event_loop_wrapper` → kills the executor thread

Fix

Wrap `elapsed_time()` calls in `try/except RuntimeError`. On failure, log `0.0` for that batch's GPU timing metrics instead of crashing the executor. This is safe because:

Changed file

`tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py` — `compute_batch_gpu_times()` method

Test plan

- (`ast.parse`)
- (`dynamo-trtllm:pr7391-trtllm12476-v4-x86-b200`) — confirmed `try/except RuntimeError` is present via `inspect.getsource`

Signed-off-by: Yifan Jiang 19356972+yifjiang@users.noreply.github.com
🤖 Generated with Claude Code
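The two test-plan checks can be sketched as follows. Here the patched source is a stand-in string, whereas the actual verification used `inspect.getsource` on the live `compute_batch_gpu_times()` inside the running container:

```python
import ast
import textwrap

# Stand-in for the patched method body (the real one lives in
# tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py).
patched_source = textwrap.dedent("""
    def compute_batch_gpu_times(self):
        try:
            forward_ms = self.forward_start.elapsed_time(self.forward_end)
        except RuntimeError:
            forward_ms = 0.0
        return forward_ms
""")

# Syntax check: ast.parse raises SyntaxError if the patch were malformed.
ast.parse(patched_source)

# Guard check: confirm the try/except RuntimeError guard is present.
assert "except RuntimeError" in patched_source
```

Against a deployed container, replacing `patched_source` with `inspect.getsource(PerfMetricsManager.compute_batch_gpu_times)` performs the same two checks on the code that is actually running.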
Stack Trace (reproduced on B200, disaggregated serving with `tensorrtllm-runtime:1.0.0`)

The executor thread dies silently. The main process and Dynamo runtime continue running (HTTP server, health probes return 200), but no inference requests can be processed — they all hang indefinitely waiting for the dead executor.
Broader fix: executor resilience
The immediate fix in this PR guards `compute_batch_gpu_times()`. However, the deeper issue is that `_event_loop_wrapper` in `py_executor.py` re-raises all exceptions, killing the executor thread on any error:

A more resilient approach would be to catch and log non-fatal errors (like metrics failures) while only terminating the executor for truly unrecoverable errors (OOM, CUDA device lost). Currently, a single metrics computation `RuntimeError` permanently kills the executor thread, leaving the server in a zombie state — HTTP alive, health probes passing, but zero inference throughput.
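One possible shape for that approach, sketched generically. The fatal-error classification below is illustrative only — it is not TensorRT-LLM's actual policy, and matching on message substrings is a simplification:

```python
# Markers that suggest an unrecoverable CUDA/runtime failure (illustrative).
FATAL_MARKERS = ("out of memory", "device-side assert", "cuda error")

def is_fatal(exc: BaseException) -> bool:
    msg = str(exc).lower()
    return any(marker in msg for marker in FATAL_MARKERS)

def event_loop_wrapper(step, on_error, max_steps=None):
    """Run step() repeatedly; record non-fatal errors, re-raise fatal ones."""
    steps = 0
    while max_steps is None or steps < max_steps:
        steps += 1
        try:
            step()
        except Exception as exc:
            if is_fatal(exc):
                raise          # truly unrecoverable: let the executor die loudly
            on_error(exc)      # recoverable (e.g. metrics): record and continue
```

With this shape, a metrics `RuntimeError` is handed to `on_error` (logging, health-status update) and the loop keeps serving, while an OOM or lost device still terminates the executor.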