Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[PROF-15201] fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle#614

Draft
r1viollet wants to merge 11 commits into
mainDataDog/java-profiler:mainfrom
r1viollet/fix-sigprof-jfr-racesDataDog/java-profiler:r1viollet/fix-sigprof-jfr-racesCopy head branch name to clipboard
Draft

[PROF-15201] fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle#614
r1viollet wants to merge 11 commits into
mainDataDog/java-profiler:mainfrom
r1viollet/fix-sigprof-jfr-racesDataDog/java-profiler:r1viollet/fix-sigprof-jfr-racesCopy head branch name to clipboard

Conversation

@r1viollet

Copy link
Copy Markdown
Contributor

What does this PR do?:

Closes two TOCTOU races between the SIGPROF signal handler and JFR lifecycle transitions that could cause SIGSEGV or hangs in the test JVM during the 60-second recording cycle rotation.

Race 1 — stop() side (ctimer_linux.cpp):

disableEngines() sets _enabled=false, but a handler that already passed the _enabled=true check could still be executing inside recordSample() when _jfr.stop() freed JFR buffers → use-after-free → SIGSEGV (or hang if the crash is caught by crashtracking).

Fix: add an _inflight counter, incremented on every handler entry before the _enabled check, decremented on every exit path. CTimer::stop() calls drainInflight() after deleting per-thread timers, spinning until _inflight==0 before returning. The caller (Profiler::stop) then proceeds to _jfr.stop() only once all handlers have fully exited.

Race 2 — start() side (profiler.cpp):

enableEngines() set _enabled=true before _jfr.start() had completed. A SIGPROF delivered in that window would see _enabled=true and call recordSample() on partially-initialized JFR structures.

Fix: move enableEngines() to after both _jfr.start() and _cpu_engine->start() have returned successfully (immediately before _state.store(RUNNING)).

Motivation:

Discovered while investigating intermittent SIGSEGV (exit 139) and hang failures in DataDog/profiling-backend CI. Bisected to a dd-trace-java commit that changed instrumentation initialization timing, shifting when the 60-second recording cycle boundary fell relative to test thread activity — exposing both races reliably enough to isolate.

How to test the change?:

Controlled reproducer in DataDog/profiling-backend using AnalysisEndpointTest.testResourceExhausted with the bad dd-trace-java agent (0e13e90dac) and a patched libjavaProfiler.so:

  • Without fix: ~60% failure rate per iteration (SIGSEGV / hang)
  • Race 1 fix only (drainInflight): ~20% failure rate — Race 2 still active
  • Race 2 fix only (move enableEngines): ~40% failure rate — Race 1 still active
  • Both fixes together: 12/12 iterations clean against v_1.44.0 baseline

Additional Notes:

  • drainInflight() is an unbounded spin. In practice recordSample() completes in microseconds so this is safe, but a bounded spin with a log warning could be added as a follow-up.
  • The _inflight counter is incremented even when CriticalSection fails (handler returns early without touching JFR). This is intentional: it makes the drain conservative and guarantees the counter reaches zero only after all code paths between the counter increment and any potential JFR access have completed.
  • Related: Revert "Ignore capturing connection continuation for armeria (#11657)" dd-trace-java#11685 (revert of the dd-trace-java commit that exposed these races).

For Datadog employees:

  • This PR doesn't touch any of that.
  • JIRA: [PROF-XXXX]

… lifecycle

The CPU profiler sends SIGPROF to all threads via per-thread kernel timers.
The signal handler checks _enabled and, if true, calls recordSample() which
accesses JFR buffers. Two races existed around the recording cycle transition
(default every 60 s) where JFR structures could be in mid-init or mid-teardown
while the handler was active:

Race 1 — stop() side (TOCTOU on _enabled vs _jfr.stop()):
  A handler that passed the _enabled=true check could still be executing
  inside recordSample() when disableEngines() set _enabled=false and
  _jfr.stop() freed JFR buffers — use-after-free → SIGSEGV.

  Fix: add an _inflight counter (incremented on handler entry, decremented
  on all exits). CTimer::stop() calls drainInflight() after deleting per-
  thread timers, spinning until _inflight==0 before returning to the caller
  that proceeds to _jfr.stop(). Any handler that fires after disableEngines()
  sees _enabled=false and returns early without touching JFR.

Race 2 — start() side (enableEngines() before _jfr.start()):
  enableEngines() set _enabled=true before _jfr.start() had completed.
  A SIGPROF in that window would see _enabled=true and call recordSample()
  on partially-initialized JFR structures.

  Fix: move enableEngines() to after _jfr.start() and _cpu_engine->start()
  have both returned successfully (just before _state.store(RUNNING)).

Validated empirically: a controlled reproducer in DataDog/profiling-backend
(AnalysisEndpointTest.testResourceExhausted with a 60 s recording period)
showed ~60% failure rate without the fix (SIGSEGV / hang), 0% with both
fixes applied (12/12 iterations clean). Each fix alone only partially
addressed the failures, confirming both races were independently active.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented Jun 23, 2026

Copy link
Copy Markdown

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 15 Pipeline jobs failed

DataDog/java-profiler | benchmarks-candidate-aarch64: [alloc]   View in Datadog   GitLab

DataDog/java-profiler | benchmarks-candidate-aarch64: [cpu,wall,alloc,memleak]   View in Datadog   GitLab

DataDog/java-profiler | benchmarks-candidate-aarch64: [cpu,wall]   View in Datadog   GitLab

View all 15 failed jobs.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 8b069cb | Docs | Datadog PR Page | Give us feedback!

@dd-octo-sts

dd-octo-sts Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

CI Test Results

Run: #28387185356 | Commit: 3d12b45 | Duration: 16m 41s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-06-29 16:48:47 UTC

@dd-octo-sts

dd-octo-sts Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Reliability & Chaos Results

All reliability & chaos checks passed Pipeline: https://gitlab.ddbuild.io/DataDog/java-profiler/-/pipelines/121657817

@r1viollet r1viollet changed the title fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle [PROF-15201] fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle Jun 24, 2026
…iveness

Without a timeout, drainInflight() spins indefinitely if a SIGPROF handler
is stuck (e.g. deadlock inside recordSample()). This would prevent _jfr.stop()
from running and hang JVM shutdown.

Use clock_gettime(CLOCK_MONOTONIC) for a real wall-clock bound of 200ms.
If the deadline fires, log a warning and proceed; in that pathological case
the caller's JFR teardown may race with the stuck handler, but the JVM is
not permanently deadlocked. In normal operation (handlers complete in
microseconds) the timeout is never reached.

Refs: PROF-15201

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@r1viollet

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81184abc5c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
Comment thread ddprof-lib/src/main/cpp/ctimer_linux.cpp Outdated
…Guard

Manual increment/decrement of _inflight in signal handlers was error-prone
and led to a counter leak in CTimerJvmti::signalHandler's inInitWindow()
early return path. This would cause drainInflight() to spin for 200ms on
every profiler stop and emit spurious warnings about stuck handlers.

Solution: Add InflightGuard RAII class following the existing pattern in
guards.h (SignalHandlerScope, CriticalSection). The guard increments on
construction and decrements on destruction, making it impossible to forget
the decrement on any exit path.

Benefits:
- Eliminates the entire class of counter-leak bugs
- Makes all exit paths safe by construction
- Follows established patterns in the codebase (SignalHandlerScope)
- Self-documenting: InflightGuard at the start of a handler clearly
  signals its purpose

Changes:
- guards.h: Add InflightGuard class declaration
- guards.cpp: Implement InflightGuard with #ifdef __linux__ (no-op on
  non-Linux where CTimer doesn't exist)
- ctimer_linux.cpp: Replace all manual __atomic_fetch_add/_sub with
  a single InflightGuard declaration at the start of both signal handlers
@r1viollet

Copy link
Copy Markdown
Contributor Author

I'm worried about perf of such an approach.

@dd-octo-sts

dd-octo-sts Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Scan-Build Report

User:runner@runnervmmklqx
Working Directory:/home/runner/work/java-profiler/java-profiler/ddprof-lib/src/test/make
Command Line:make -j4 all
Clang Version:Ubuntu clang version 18.1.3 (1ubuntu1)
Date:Thu Jun 25 07:41:45 2026

Analyzer Failures

The analyzer had problems processing the following files:

ProblemSource FilePreprocessed FileSTDERR Output
Other Errorctimer_linux.cpp

Please consider submitting preprocessed files as .

CI failure revealed _inflight was protected, preventing InflightGuard access.
Also, having _inflight adjacent to _enabled creates false sharing:
- _enabled is read (ACQUIRE) by every signal handler on every thread
- _inflight is written (modify-release) by every signal handler
Sharing a cache line causes unnecessary cache line bouncing.

Solution: Move _inflight to public and align to 64-byte cache line boundary
(alignas(64)). This separates the read-only _enabled hot path from the
read-write _inflight counter, reducing cross-thread traffic.

Note: The counter is still globally updated, but the separate cache line
means _enabled reads no longer compete with _inflight writes.
@r1viollet r1viollet force-pushed the r1viollet/fix-sigprof-jfr-races branch from d28b55d to 938e91d Compare June 25, 2026 07:44
@r1viollet

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 938e91da7d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"This may indicate a stuck signal handler.",
DRAIN_TIMEOUT_NS / 1000000L,
__atomic_load_n(&_inflight, __ATOMIC_ACQUIRE));
break;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep JFR alive while SIGPROF handlers are in flight

When a SIGPROF handler is still inside recordSample()/recordSampleDelegated() for more than 200 ms, this break lets CTimer::stop() return with _inflight > 0; Profiler::stop() can then call _jfr.stop() and free the buffers that the handler is still using, recreating the use-after-free this guard is meant to prevent. If a liveness bound is required, the stop path should avoid tearing down JFR while the counter is non-zero rather than proceeding after the warning.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

…use-after-free

Codex P2: The 200ms timeout in drainInflight() previously broke with a warning
but still allowed Profiler::stop() to call _jfr.stop(), recreating the
use-after-free this guard is meant to prevent.

Solution: drainInflight() returns bool; Profiler::stop() skips _jfr.stop() if
timeout fires. This leaks ~few MB but prevents SIGSEGV. Timeout is pathological
(stuck handler), so leak is acceptable. Tested: forcing skip-stop works cleanly.

Normal operation unchanged: _jfr.stop() is called when drain succeeds (<1ms).
Only skips on timeout (200ms expired, handler still in-flight).
@r1viollet

Copy link
Copy Markdown
Contributor Author

@codex review

@r1viollet r1viollet marked this pull request as ready for review June 26, 2026 14:28
@r1viollet r1viollet requested a review from a team as a code owner June 26, 2026 14:28

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36ce1bda9a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
Comment on lines +1584 to +1588
if (drained) {
rotateDictsAndRun([&]{ _jfr.stop(); });
} else {
Log::error("Profiler::stop: skipping JFR teardown due to stuck SIGPROF handlers. "
"JFR resources (~few MB) will leak. Recording file may be incomplete.");

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep stop non-restartable when drain fails

When drainInflight() times out, this branch leaves the existing FlightRecorder recording alive but Profiler::stop() still falls through to _state.store(IDLE) and returns OK, so the rotation path or a user can immediately call start() while an old SIGPROF handler is still in flight. FlightRecorder::start()/newRecording() then installs a new _rec without having stopped the old one, and the old handler can later record into the new recording (or leak/overwrite the old one). Fresh evidence in this revision is that the false branch now skips _jfr.stop() but still treats the profiler as fully stopped; in the timeout case it should return an error or otherwise keep the profiler non-restartable until the in-flight handler is gone.

Useful? React with 👍 / 👎.

Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
// If draining times out (handlers are stuck), skip JFR teardown to prevent
// use-after-free. This leaks JFR resources (~few MB) but is far safer than
// freeing structures that stuck handlers are still accessing.
bool drained = CTimer::drainInflight();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drainInflight() only tracks CTimer::_inflight. When selectCpuEngine() returns &perf_events or &itimer (the standard fallback when ctimer.check() fails — e.g. kernels without timer_create, or -Devent=itimer), those handlers call recordSample() and write JFR buffers but never construct an InflightGuard. _inflight stays 0, so drainInflight() returns true immediately — _jfr.stop() can free JFR buffers while a live handler is still inside recordSample(). The TOCTOU/UAF this PR closes for CTimer remains open for the common fallback engines.

Fix options: (a) add InflightGuard to PerfEvents::signalHandler, ITimer::signalHandler, and ITimerJvmti::signalHandler; or (b) implement virtual Engine::drainInflight() and call _cpu_engine->drainInflight() so the drain dispatches to whichever engine is active.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes the PR more complex. I'll give it a shot to cover all variants.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 1e75102 (rename/move) + 97e9f99 (extend coverage).

Two commits:

  1. 1e75102 — moves the inflight counter and drain logic out of CTimer into a new SignalInflight module (header + impl). No behavior change; pure rename/move. The InflightGuard RAII type lives there too. Profiler::stop() now calls SignalInflight::drain().

  2. 97e9f99 — adds InflightGuard at handler entry (post foreign-signal filter, pre JFR-touching code) for the previously-uncovered handlers: PerfEvents, ITimer, ITimerJvmti, WallClockASGCT::sharedSignalHandler, WallClockJvmti::sharedSignalHandler. For wall, guarding at the dispatcher level transitively covers the inner per-engine signalHandler methods.

Went with the option-(a) extension — InflightGuard everywhere — rather than a virtual Engine::drainInflight(). The question 'can JFR be torn down right now?' is a single global question, not a per-engine question, so one shared counter + one drain site keeps it simple.

Risk-profile note: CTimer / PerfEvents are the high-probability paths (per-thread streams, many concurrent in-flight handlers). Wall is medium (up to ~reservoir_size handlers in flight from the last tgkill batch). ITimer is mostly theoretical (single in-flight max). All five plug into the same counter regardless, since the cost is one extra atomic per signal.

This is a larger surface than the original PR set out to fix — happy to split B2 (97e9f99) into a follow-up PR if you'd rather keep this one focused on the CTimer-specific fix. WDYT?

Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated
_thread_info.reportCounters();

rotateDictsAndRun([&]{ _jfr.stop(); });
if (drained) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When drainInflight() times out, _jfr.stop() is skipped but the profiler unconditionally sets _state = IDLE and returns Error::OK. The caller has no indication teardown was partial. A subsequent start() will call _jfr.start() against un-torn-down JFR structures while stuck handlers may still be live.

Consider returning a non-OK Error on the timeout path, or avoiding the IDLE transition, so callers know teardown was skipped and cannot blindly restart.

Comment thread ddprof-lib/src/main/cpp/ctimer_linux.cpp Outdated
}

struct timespec deadline;
clock_gettime(CLOCK_MONOTONIC, &deadline);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clock_gettime() return value not checked; on failure deadline is uninitialized and the timeout loop behaviour is incorrect. Defensive fix: if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0) { return false; } (and similarly for the in-loop call).

@rkennke rkennke left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one. I have a few suggestions.

Comment thread ddprof-lib/src/main/cpp/ctimer.h Outdated
Comment thread ddprof-lib/src/main/cpp/profiler.cpp Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This becomes dead code and can be removed (including its declaration).

Comment thread ddprof-lib/src/main/cpp/guards.h Outdated
Address Roman's review comments:

1. Make _inflight private with accessors (enterHandler/exitHandler/
   hasInflightHandlers). Restore protected access for _interval, _cstack,
   _signal, _max_timers, _timers (subclass CTimerJvmti needs them).

2. Move InflightGuard into ctimer.h as header-only. Removes engine coupling
   from guards module: guards.cpp no longer #ifdef-includes ctimer.h.
   Non-Linux variant is an empty stub class; the _active bool disappears
   since separate class definitions per platform handle the no-op case.

3. Remove dead Profiler::enableEngines() (replaced by explicit
   _cpu_engine->enableEvents(true) / _wall_engine->enableEvents(true)
   in profiler.cpp). disableEngines() is still used and kept.
@r1viollet r1viollet marked this pull request as draft June 29, 2026 14:44
Three small fixes:

1. ctimer.h: _enabled was moved from protected to private in the previous
   refactor (e7ca730). CTimerJvmti::signalHandler is a subclass member that
   reads _enabled directly, so the symbol must be protected. GCC enforces
   this strictly; clang on macOS did not catch it because the offending
   code is inside #ifdef __linux__. _inflight stays private (only accessed
   via the public enterHandler/exitHandler accessors).

2. ctimer_linux.cpp: remove duplicate #include <time.h> (jbachorik).
   Check clock_gettime() return value in drainInflight(); on failure,
   conservatively report timeout so the caller skips JFR teardown rather
   than spinning against an uninitialized deadline (jbachorik).

3. profiler.cpp: rewrite the wall/CPU enableEvents() comments. The old
   wording claimed wall clock 'doesn't access JFR in signal handlers',
   which is false (both engines write JFR events from signal handlers).
   Correct framing: wall's pthread checks _enabled once at startup and
   exits if false, so it must be enabled before _wall_engine->start();
   CPU handlers re-check _enabled on every signal, so the order vs
   _cpu_engine->start() is incidental (jbachorik).
@r1viollet

Copy link
Copy Markdown
Contributor Author

Todo: check code paths for crashes and signal handlers. Could we hit a point where we forget to decrement a counter ?

Address jbachorik review: when drainInflight() times out we previously
proceeded to set _state=IDLE and return Error::OK, leaving the caller free
to call start() against partially-torn-down structures and potentially
deadlocking on _rec_lock if a stuck handler still held it.

Move drainInflight() to immediately after disableEngines(), before any
other engine stop. On timeout, return Error and leave _state at RUNNING:

- caller cannot start() (state guard blocks it)
- caller can retry stop(); disableEngines() is an idempotent atomic store
  and no other engine stop has run yet, so there is no pthread_join double
  invocation hazard from BaseWallClock::stop()
- on retry, if handlers have finally drained, the teardown proceeds for
  the first time and completes normally

Safety of the reorder: after disableEngines() any signal that fires from
now on enters the handler, fails the _enabled check, decrements the
inflight counter and exits without ever touching JFR. So brief
inflight > 0 spikes between drainInflight() returning true and _jfr.stop()
do not race against JFR teardown.

Trade-off: on the timeout path, wall/alloc/native engines keep running
an extra ~200 ms before the caller retries. Acceptable for a pathological
path; the alternative (proceed and deadlock or crash) is worse.
Preparation for extending the JFR-teardown drain to all signal handlers
that write JFR (jbachorik review). The counter and drain logic were tied
to CTimer, but the question 'can JFR be torn down right now?' is a single
global question that any JFR-writing signal handler must answer. Tying it
to CTimer made it awkward to extend.

No behavior change. Pure rename/move:

- new signalInflight.h / signalInflight.cpp: SignalInflight class owns the
  cache-line-aligned counter, enter()/exit()/hasInflight()/drain() static
  methods, and the InflightGuard RAII type
- ctimer.h: drop _inflight, enterHandler, exitHandler, hasInflightHandlers,
  drainInflight, and the embedded InflightGuard class
- ctimer_linux.cpp: drop the counter definition and drainInflight()
  implementation
- profiler.cpp: Profiler::stop() now calls SignalInflight::drain() instead
  of CTimer::drainInflight()

Existing CTimer signal handlers still construct InflightGuard at entry;
that type now lives in signalInflight.h and the existing include of
ctimer.h is enough for them (ctimer_linux.cpp also includes signalInflight.h
explicitly for clarity).

Follow-up commit (B2) will add InflightGuard to PerfEvents, ITimer,
ITimerJvmti, WallClockASGCT, WallClockJvmti so they participate in the
drain too.
jbachorik review (profiler.cpp:1578): the inflight drain previously only
covered CTimer/CTimerJvmti, so the TOCTOU/UAF this PR closes for those
handlers remains open for the other CPU and wall fallback engines.

Add InflightGuard at signal-handler entry (after the foreign-signal
filter passes, before any JFR-writing code) for:

- PerfEvents::signalHandler         (CPU fallback, same risk shape as CTimer)
- ITimer::signalHandler             (CPU fallback when timer_create absent)
- ITimerJvmti::signalHandler        (CPU + jvmtistacks fallback)
- WallClockASGCT::sharedSignalHandler  (default wall handler)
- WallClockJvmti::sharedSignalHandler  (wall + jvmtistacks handler)

For the wall handlers the guard is placed in sharedSignalHandler — the
engine-specific inner signalHandler is dispatched from the shared scope,
so a single guard in the dispatcher covers every JFR-touching code path
in both wall variants.

Risk profile (highest to lowest probability of hitting the race in
practice): CTimer / PerfEvents (per-thread streams, ~10ms interval,
N concurrent in-flight handlers) > Wall (signals from the wall pthread,
~reservoir_size concurrent) > ITimer / ITimerJvmti (process-wide
setitimer, single in-flight at a time). All five plug into the same
SignalInflight counter so Profiler::stop() drains them uniformly.
InflightGuard's destructor can be bypassed if the signal handler frame is
unwound by a longjmp. In this codebase the only realistic trigger is J9's
SIGSEGV null-pointer-check handler chaining out of our segvHandler via
siglongjmp to a setjmp in normal Java code. Probability is very low (J9
would have to claim a fault our stack walker produced, which lies in our
own C++ code at addresses J9 has no reason to recognise) and the failure
mode is graceful: a stuck counter makes every subsequent stop() hit the
drain timeout and skip JFR teardown — the safety net this counter exists
to provide.

Document this explicitly rather than build out a workaround. Matches the
existing precedent in guards.h, where SignalHandlerScope's own depth
counter has the same limitation on the same J9 chain path.
@r1viollet

Copy link
Copy Markdown
Contributor Author

Reviewer call-out — longjmp leak limitation (cc @jbachorik @rkennke)

While reviewing the new SignalInflight counter for ways it could go wrong, I noticed InflightGuard's destructor can be bypassed by a longjmp that unwinds a signal handler frame. In this codebase the only realistic trigger is J9's SIGSEGV null-pointer-check handler: our segvHandler chains to J9 for unclaimed faults, and J9 may siglongjmp to a setjmp in normal Java code, unwinding past every active InflightGuard.

Documented in 8b069cb (signalInflight.h header comment). Summary:

  • Probability: very low. J9 only siglongjmps when the fault matches an annotated null-check site in Java bytecode. Our stack-walker faults are at addresses inside our own C++ code that J9 has no reason to claim — its handler should reject and let the next handler in the chain process the fault.
  • Failure mode if it ever fires: counter pinned > 0 permanently. Every subsequent Profiler::stop() hits drain()'s 200 ms timeout, returns Error, and refuses to tear down JFR. The JVM keeps running; the profiler is wedged until process exit. No crash, no use-after-free — the timeout path that the counter exists to gate is exactly the safety net for this case.
  • Precedent: SignalHandlerScope (guards.h) has the same limitation on the same J9 chain path for its _signal_depth counter, and the codebase accepts it. I followed that precedent rather than introducing a new mechanism.

Options if either of you considers this insufficient:

  • Watchdog: track whether _counter moved during the drain spin; log a more pointed warning ("counter pinned at N — possible longjmp leak") when it didn't. ~10 lines, diagnostic only.
  • Per-thread counter: move the counter into ProfiledThread::_inflight, zero it in segvHandler / busHandler before chaining (mirroring SIGNAL_HANDLER_GUARD_RELEASE). Actually fixes the leak, but drain() would iterate threads. ~30-50 lines plus a ProfiledThread field. Would also incidentally remove the global cache-line contention on the counter.

My lean is to ship with the documented limitation and revisit only if it actually fires in production. Happy to take either of the alternatives if you disagree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.