Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

test(jfr): regression test for dump+restart race (SIGSEGV in _Rb_tree_increment)#523

Closed
jbachorik wants to merge 19 commits into
mainDataDog/java-profiler:mainfrom
muse/crash-sigsegv-in-std-rb-tree-increment-cleanDataDog/java-profiler:muse/crash-sigsegv-in-std-rb-tree-increment-cleanCopy head branch name to clipboard
Closed

test(jfr): regression test for dump+restart race (SIGSEGV in _Rb_tree_increment)#523
jbachorik wants to merge 19 commits into
mainDataDog/java-profiler:mainfrom
muse/crash-sigsegv-in-std-rb-tree-increment-cleanDataDog/java-profiler:muse/crash-sigsegv-in-std-rb-tree-increment-cleanCopy head branch name to clipboard

Conversation

@jbachorik

@jbachorik jbachorik commented May 11, 2026

Copy link
Copy Markdown
Collaborator

What does this PR do?:
Adds a regression test: ConcurrentDumpRestartTest exercises the race between Recording::writeClasses (called during dump()) and Profiler::start(reset=true) which clears _class_map concurrently. Without the fix in the base branch (muse/sigsegv-in-profiler-dump), this race corrupts rbtree node pointers and causes a SIGSEGV in std::_Rb_tree_increment. A crash aborts the JVM, which JUnit reports as a test failure.

Motivation:
Three production crash events (2026-05-06 to 2026-05-08, fingerprint v10.DAECC680F0728EAB44F26DB0B91B703F) on Amazon Corretto 21 showed SIGSEGV in std::_Rb_tree_increment reached via Recording::writeCpoolRecording::writeClassesDictionary::collect. The fix (holding classMapSharedGuard() for the full duration of writeClasses) is in the base PR #516. This PR adds the regression test so the fix is verified and the race cannot regress silently.

Additional Notes:
Stacked on PR #516 (muse/sigsegv-in-profiler-dump). Merge this after PR #516 lands.

How to test the change?:

  • Run under ASAN build: ./gradlew :ddprof-test:testRelease --tests "*.ConcurrentDumpRestartTest". Without the fix the test crashes the JVM; with the fix it completes cleanly.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-14583

Unsure? Have a question? Request a review!

jbachorik and others added 14 commits May 7, 2026 12:19
walkVM ran the inserting Dictionary::lookup from a signal handler, calling
malloc/calloc — async-signal-unsafe. Recording::writeClasses also walked
_class_map without holding _class_map_lock, racing the exclusive clear()
that follows in Profiler::dump.

Switch the vtable-target site to bounded_lookup(..., 0) and skip the
synthetic frame on miss; hold _class_map_lock shared while writeClasses
recurses into Dictionary::collect; document the signal-safety hazard on
the inserting overloads; add a concurrent test for the lock discipline.

Fixes PROF-14549.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex flagged a remaining race after the initial fix: walkVM's read-only
bounded_lookup still traverses row->next and reads row->keys[j] without
holding _class_map_lock, while Profiler::dump's exclusive _class_map.clear()
runs in the unlockAll -> lock window and free()s those same nodes/strings.

Take _class_map_lock shared via OptionalSharedLockGuard (try-lock so the
signal handler never blocks). When an exclusive clear() is in progress,
drop the synthetic vtable-target frame instead of dereferencing freed
memory.

Also add the missing <cstring> include in dictionary_concurrent_ut.cpp;
strlen() previously worked only via a transitive include through
gtest_crash_handler.h.

Addresses review comments on PR #512.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
flightRecorder.cpp: refer to classMap()/classMapLock() rather
than the underlying _class_map/_class_map_lock fields.
dictionary_concurrent_ut.cpp: drop PROF-14549 ticket reference.

Co-Authored-By: muse <muse@noreply>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
… vs dump clear

Adds SignalHandlerBoundedLookupVsDumpClear: concurrent signal-handler threads
(OptionalSharedLockGuard + bounded_lookup(0)) vs a dump thread (exclusive lock
+ dict.clear()), reproducing the PROF-14550 crash signature without the fix.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fter-unlock in writeClasses; cap tryLockShared at 5 attempts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…unded variant

- ddprof-lib/src/main/cpp/spinLock.h:49 — restored tryLockShared() to spin until exclusive lock released; added tryLockSharedBounded() for signal-handler paths
- ddprof-lib/src/main/cpp/spinLock.h:102 — OptionalSharedLockGuard updated to use tryLockSharedBounded()

Co-Authored-By: muse <muse@noreply>
…(), not tryLockShared()

The previous commit added tryLockSharedBounded() and claimed OptionalSharedLockGuard
was updated to use it, but the constructor still called tryLockShared() — the spinning
(potentially unbounded) variant. Since OptionalSharedLockGuard is used in signal-handler
paths, it must use the bounded variant to be async-signal-safe.

Agent-Logs-Url: https://github.com/DataDog/java-profiler/sessions/4b50411f-3c52-4938-9d54-a43ada34b94e

Co-authored-by: jbachorik <738413+jbachorik@users.noreply.github.com>
- ddprof-lib/src/main/cpp/spinLock.h:115 — OptionalSharedLockGuard reverted to tryLockShared() (non-spurious) for FlightRecorder hot paths; BoundedOptionalSharedLockGuard added for signal-handler-only callers
- ddprof-lib/src/main/cpp/profiler.h:208 — classMapTrySharedGuard() now returns BoundedOptionalSharedLockGuard

Co-Authored-By: muse <muse@noreply>
…ounded unit tests

- spinlock_bounded_ut.cpp: 6 tests covering tryLockSharedBounded() and
  BoundedOptionalSharedLockGuard in isolation (free lock, exclusive-held
  failure, multiple shared holders, RAII destruction, concurrent stress)
- dictionary_concurrent_ut.cpp: update SignalHandlerBoundedLookupVsDumpClear
  to use BoundedOptionalSharedLockGuard matching the actual hotspotSupport.cpp
  code path (classMapTrySharedGuard())

Co-Authored-By: muse <muse@noreply>
…nalSharedLockGuard

Restore OptionalSharedLockGuard in SignalHandlerBoundedLookupVsDumpClear and
add SignalHandlerBoundedOptionalLookupVsDumpClear for BoundedOptionalSharedLockGuard,
which is what classMapTrySharedGuard() returns in hotspotSupport.cpp.

Co-Authored-By: muse <muse@noreply>
Exercises the race where Profiler::start(reset=true) clears _class_map
concurrently with Recording::writeClasses iterating its row keys, which
corrupted rbtree nodes and caused SIGSEGV in std::_Rb_tree_increment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik added the AI label May 11, 2026
@dd-octo-sts

dd-octo-sts Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

CI Test Results

Run: #25728877379 | Commit: 4eacf6c | Duration: 3h 0m 19s (longest job)

1 of 54 test jobs failed

Status Overview

JDK glibc-aarch64/asan glibc-aarch64/debug glibc-amd64/asan glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - - -
8-ibm - - - -
8-j9 - -
8-librca - - - -
8-orcl - - - -
11 - - - -
11-j9 - -
11-librca - - - -
17 - -
17-graal - -
17-j9 - -
17-librca - - - -
21 - -
21-graal - -
21-librca - - - -
25 - -
25-graal 🚫 - -
25-librca - - - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Failed Tests

glibc-aarch64/asan / 21-graal

Job: View logs

No detailed failure information available. Check the job logs.

Summary: Total: 54 | Passed: 52 | Failed: 1 | Cancelled: 1


Updated: 2026-05-12 13:42:13 UTC

@jbachorik jbachorik added the test:asan Run CI tests with AddressSanitizer configuration label May 11, 2026
@jbachorik jbachorik requested a review from Copilot May 11, 2026 21:16
@jbachorik

Copy link
Copy Markdown
Collaborator Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c814ce0ddf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a JUnit regression/stress test in ddprof-test to exercise the previously observed race between dump() (JFR constant-pool/class writing) and start(reset=true) (_class_map clearing), which historically could corrupt RB-tree iterators and crash the JVM (PROF-14583). This is intended to ensure the fix from stacked PR #516 doesn’t regress.

Changes:

  • Introduce ConcurrentDumpRestartTest that runs allocator worker threads while repeatedly dumping JFR recordings and concurrently stop/starting the profiler.
  • Add basic progress assertions (minimum number of dumps/restarts) to avoid trivially-successful runs.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jbachorik jbachorik changed the title test(jfr): regression test for PROF-14583 dump+restart race (SIGSEGV in _Rb_tree_increment) test(jfr): regression test for dump+restart race (SIGSEGV in _Rb_tree_increment) May 11, 2026
- Add Datadog Apache 2.0 copyright header
- Add volatile allocSink to prevent JIT from eliminating test allocations
- Build restart command from getProfilerCommand() + injected cstack
- Capture restarter thread exceptions; InterruptedException is expected
- Count only non-empty dumps (profiler.dump() is no-op when IDLE)

Co-Authored-By: muse <muse@noreply>
@jbachorik

Copy link
Copy Markdown
Collaborator Author

@codex review

@jbachorik

Copy link
Copy Markdown
Collaborator Author

@copilot review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46ebfb9b51

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot AI commented May 11, 2026

Copy link
Copy Markdown
Contributor

@codex review

Reviewed the latest changes at 46ebfb9; no additional issues found and no further code changes are needed.

@codex review

Completed another pass on 46ebfb9; the previously requested fixes are present and I did not find new issues requiring changes.

@copilot review

I re-reviewed the branch at 46ebfb9. Current test logic looks consistent with prior feedback and no extra patch is required.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 46ebfb9b51

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

jbachorik and others added 3 commits May 12, 2026 00:04
…locations

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
alloc=1ms is not recognized by Arguments::parse — only event=alloc and
memory=<interval>[:<config>] are valid. Switch to memory=262144:a so
allocating threads actually populate _class_map via ObjectSampler.

- ddprof-test/src/test/java/com/datadoghq/profiler/jfr/ConcurrentDumpRestartTest.java:68 — memory=262144:a,wall=5ms

Co-Authored-By: muse <muse@noreply>
Base automatically changed from muse/sigsegv-in-profiler-dump to muse/sigsegv-in-recording-writeclasses May 12, 2026 10:20
Base automatically changed from muse/sigsegv-in-recording-writeclasses to main May 12, 2026 11:02
@jbachorik

Copy link
Copy Markdown
Collaborator Author

Closing: ConcurrentDumpRestartTest moved to PR #522. The C++ test additions (dictionary_concurrent_ut.cpp, spinlock_bounded_ut.cpp) were already merged into main via PR #512.

@jbachorik jbachorik closed this May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI test:asan Run CI tests with AddressSanitizer configuration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.