Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

BUG: Make np.nonzero threading safe #28361

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Feb 22, 2025
Merged

Conversation

eendebakpt
Copy link
Contributor

@eendebakpt eendebakpt commented Feb 20, 2025

We add a test for np.nonzero under multi-threading and make np.nonzero safe under the cpython free-threading build. We want to ensure concurrent invocations of the method on the same array do not corrupt the system. Correct results are not guaranteed:

  • If the underlying data is changing, the non-zero indices are not well-defined
  • If the underlying data is changing, we can get indices outside the size of the array due to a part of the return array not being initialized.

Also see #27519. An alternative approach would be to use locks to make the array read-only during the operation.

@eendebakpt eendebakpt changed the title ENH: Add multi-threading test for np.nonzero BUG: Make np.nonzero threading safe Feb 20, 2025
@eendebakpt eendebakpt requested a review from ngoldbaum February 20, 2025 19:34
@ngoldbaum
Copy link
Member

It looks like the TSAN tests are failing on the new test you added.

@eendebakpt
Copy link
Contributor Author

It looks like the TSAN tests are failing on the new test you added.

Hmmm. For np.nonzero the result is not well-defined if the underlying data is changing. So should we care about a data race between the thread executing the np.nonzero, and some other thread modifying the data? If the data is modified, our results are not valid anyway.

(I am not familiar enough with TSAN do definitely conclude that this is happening though)

@ngoldbaum
Copy link
Member

ngoldbaum commented Feb 20, 2025

NumPy doesn't guarantee anything about concurrently modifying a shared array. We decided not to add locking for the free-threaded build since the same is true on the GIL-enabled build.

IMO in the long run we need something like an immutable (maybe copy-on-write?) and thread-safe object similar to ndarray.

Can you trigger the crash you saw without repeatedly writing to x in the test closure?

Also, out of curiosity, how did you run across this?

@seberg
Copy link
Member

seberg commented Feb 20, 2025

Unless we have a way to more generally make array access like this safer, I think the test is just about "don't segfault when it happens". Is it maybe right to just skip the test in TSAN?!

@seberg
Copy link
Member

seberg commented Feb 20, 2025

Also, out of curiosity, how did you run across this?

This wasn't actual code, it was just a realization when looking at the code that it is clearly not thread safe.

@ngoldbaum
Copy link
Member

ngoldbaum commented Feb 20, 2025

The TSAN tests only run in all the test files that do "import threading":

`find numpy -name "test*.py" | xargs grep -l "import threading" | tr '\n' ' '` \

You could make that logic a little smarter to pick up tests we want to intentionally exclude. We don't want to run the full test suite because it takes more than an hour with TSAN on the github runners.

If you do exclude the test, maybe add some comments saying that we know this is racy and shouldn't be run under TSAN?

Another thing we could do is create a suppressions file and use it.

@ngoldbaum
Copy link
Member

Hmm, it's not great that the TSAN tests ran for 6 hours before failing. I'm going to push a patch to the github actions configuration for that job to check if we can avoid that and still fail the test if there's a TSAN failure. Sorry for pushing to your PR branch...

@charris charris added 00 - Bug 09 - Backport-Candidate PRs tagged should be backported 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703) and removed 01 - Enhancement labels Feb 21, 2025
Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for working on this and figuring out the suppressions. A few comments.

numpy/_core/src/multiarray/item_selection.c Outdated Show resolved Hide resolved
numpy/_core/tests/test_multithreading.py Outdated Show resolved Hide resolved
x = np.random.randint(4, size=10_000).astype(dtype)

def func(seed):
x[::2] = np.random.randint(2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure what the best thing to do here is. It's probably better for each thread to get its own private RNG so there's no cross-talk or shared RNG state. I know the new numpy RNG infrastructure had a lot of careful thought put into it to make it possible to do stuff like this.

Maybe @rkern has a suggestion here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think the randomization is really important, we could also just flip between many non-zeros and many zeros in the array. I removed the seed argument

numpy/_core/tests/test_multithreading.py Show resolved Hide resolved
@@ -121,7 +121,7 @@ jobs:
- name: Test
run: |
# These tests are slow, so only run tests in files that do "import threading" to make them count
TSAN_OPTIONS=allocator_may_return_null=1:halt_on_error=1 \
TSAN_OPTIONS="allocator_may_return_null=1:suppressions=tools\ci\tsan_suppressions.txt" \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe try using an absolute path on the runner?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try that, assuming the absolute path is relative stable

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, both GITHUB_WORKSPACE and RUNNER_WORKSPACE seem to point to the right location. Picking the first option.

@eendebakpt
Copy link
Contributor Author

There might be two data races going on here: one in np.nonzero, the other one because multiple threads are writing to the same array. To avoid the second one we can either add more suppression entries to the TSAN configuration, or refactor the test so that there is 1 thread modifying the array and multiple threads performing the np.nonzero operation. I am leaning towards the latter option.

@ngoldbaum
Copy link
Member

The latter sounds better indeed.

@ngoldbaum
Copy link
Member

Thanks @eendebakpt!

@ngoldbaum ngoldbaum merged commit a1f2d58 into numpy:main Feb 22, 2025
68 checks passed
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Feb 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug 39 - free-threading PRs and issues related to support for free-threading CPython (a.k.a. no-GIL, PEP 703)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.