Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Fix spark download deadlock#1475

Merged
wisechengyi merged 1 commit intomasterpolynote/polynote:masterfrom
fixrace3polynote/polynote:fixrace3Copy head branch name to clipboard
Nov 7, 2025
Merged

Fix spark download deadlock#1475
wisechengyi merged 1 commit intomasterpolynote/polynote:masterfrom
fixrace3polynote/polynote:fixrace3Copy head branch name to clipboard

Conversation

@wisechengyi
Copy link
Collaborator

@wisechengyi wisechengyi commented Nov 4, 2025

Problem

Race condition in Spark distribution download: The test setup lock file mechanism in build.sbt had several critical issues:

  1. Stale lock files were left behind when processes crashed or were interrupted, causing subsequent test runs to fail after waiting 10 minutes
  2. deleteOnExit() was unreliable in forked JVM processes
  3. Lock files weren't cleaned up properly on failures
  4. The wait logic didn't check if the actual work (extracting Spark) completed successfully
  5. Lock file naming only included Scala version, not Spark version

This resulted in CI failures like:

  [error] java.lang.Exception: Lock file /home/runner/work/polynote/polynote/target/spark/spark_2.12_test_setup_is_running.lock still exists after 600000 ms. Aborting.

Solution

Improved lock file mechanism with robust cleanup

1. Early exit optimization:

  • Check if destDir exists before any lock operations to avoid unnecessary locking

2. Atomic lock acquisition:

  • Use createNewFile() which is atomic on most filesystems
  • Wrap in try-catch to handle edge cases gracefully

3. Smarter wait logic:

  • Check both lockFile.exists() AND !destDir.exists() in the wait loop
  • Exit early if destination directory appears (setup completed successfully)
  • Reduced check interval from 10s to 5s for faster response
  • Display elapsed time for better debugging

4. Stale lock detection and recovery:

  • If timeout occurs but destDir still doesn't exist, assume stale lock
  • Automatically clean up stale lock files
  • Provide clear error message if setup truly failed

5. Guaranteed lock cleanup:

  • Move lock file deletion to finally block to ensure cleanup even on exceptions
  • Additional check to verify lock file exists before deletion
  • Log when lock is released for better observability

6. Version-specific lock files:

  • Include both Spark and Scala versions in lock file name: spark_3.1.2_scala_2.12_test_setup_is_running.lock
  • Allows different Spark/Scala combinations to run in parallel without conflicts

7. Double-check pattern:

  • After acquiring lock, check destDir again to handle race conditions where another process completed between checks

Benefits

  • More reliable tests: Eliminates stale lock file timeouts in CI
  • Self-healing: Automatically recovers from stale locks
  • Better observability: Clear logging of lock acquisition, waiting, and release
  • Parallel builds: Different Spark/Scala versions can safely build in parallel
  • Graceful failure: Proper cleanup even when exceptions occur

@wisechengyi wisechengyi changed the title lock Fix spark download deadlock Nov 4, 2025
@wisechengyi wisechengyi marked this pull request as ready for review November 4, 2025 20:25
@wisechengyi
Copy link
Collaborator Author

Additional testing: #1477

Copy link
Collaborator Author

wisechengyi commented Nov 5, 2025

Copy link
Collaborator

@jonathanindig jonathanindig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for tackling this!

@jonathanindig jonathanindig requested a review from Copilot November 7, 2025 16:58
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review any files in this pull request.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@wisechengyi wisechengyi merged commit 36307bc into master Nov 7, 2025
4 checks passed
@wisechengyi wisechengyi deleted the fixrace3 branch November 7, 2025 18:39
wisechengyi added a commit that referenced this pull request Nov 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Morty Proxy This is a proxified and sanitized view of the page, visit original site.