BUG: Make np.nonzero threading safe #28361
Conversation
It looks like the TSAN tests are failing on the new test you added.
Hmmm. For np.nonzero the result is not well-defined if the underlying data is changing. So should we care about a data race between the thread executing the np.nonzero and some other thread modifying the data? If the data is modified, our results are not valid anyway. (I am not familiar enough with TSAN to definitively conclude that this is happening, though.)
NumPy doesn't guarantee anything about concurrently modifying a shared array. We decided not to add locking for the free-threaded build since the same is true on the GIL-enabled build. IMO in the long run we need something like an immutable (maybe copy-on-write?) and thread-safe object similar to ndarray. Can you trigger the crash you saw without repeatedly writing to the array? Also, out of curiosity, how did you run across this?
Unless we have a way to more generally make array access like this safer, I think the test is just about "don't segfault when it happens". Is it maybe right to just skip the test in TSAN?!
This wasn't actual code; it was just a realization, when looking at the code, that it is clearly not thread safe.
The TSAN tests only run the test files that do "import threading".
You could make that logic a little smarter to pick up tests we want to intentionally exclude. We don't want to run the full test suite because it takes more than an hour with TSAN on the GitHub runners. If you do exclude the test, maybe add some comments saying that we know this is racy and shouldn't be run under TSAN? Another thing we could do is create a suppressions file and use it.
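For context, a ThreadSanitizer suppressions file is just a list of patterns, one per line. A minimal hypothetical sketch (the symbol name below is illustrative; the real entries depend on the stacks TSAN reports):

    # tools/ci/tsan_suppressions.txt (sketch)
    # suppress reported data races whose stack contains this symbol
    race:PyArray_Nonzero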
Hmm, it's not great that the TSAN tests ran for 6 hours before failing. I'm going to push a patch to the GitHub Actions configuration for that job to check if we can avoid that and still fail the test if there's a TSAN failure. Sorry for pushing to your PR branch...
Thanks so much for working on this and figuring out the suppressions. A few comments.
x = np.random.randint(4, size=10_000).astype(dtype)

def func(seed):
    x[::2] = np.random.randint(2)
I'm not actually sure what the best thing to do here is. It's probably better for each thread to get its own private RNG so there's no cross-talk or shared RNG state. I know the new numpy RNG infrastructure had a lot of careful thought put into it to make it possible to do stuff like this.
Maybe @rkern has a suggestion here?
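For what it's worth, a minimal sketch of the per-thread RNG pattern the newer infrastructure supports (names and sizes here are illustrative, not taken from this PR):

    import concurrent.futures
    import numpy as np

    n_threads = 4
    # spawn() yields statistically independent child seeds, so each thread gets
    # its own Generator and there is no shared RNG state between threads
    seeds = np.random.SeedSequence(1234).spawn(n_threads)
    rngs = [np.random.default_rng(s) for s in seeds]

    def worker(rng):
        # draw only from this thread's private Generator
        return rng.integers(0, 2, size=10)

    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(worker, rngs))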
I do not think the randomization is really important; we could also just flip between many non-zeros and many zeros in the array. I removed the seed argument.
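For reference, a sketch of that RNG-free variant (illustrative, not the exact test code):

    import numpy as np

    x = np.random.randint(4, size=10_000)

    # alternate the written value instead of drawing it randomly, so the array
    # flips between many non-zeros and many zeros across iterations
    for i in range(10):
        x[::2] = i % 2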
@@ -121,7 +121,7 @@ jobs:
      - name: Test
        run: |
          # These tests are slow, so only run tests in files that do "import threading" to make them count
-         TSAN_OPTIONS=allocator_may_return_null=1:halt_on_error=1 \
+         TSAN_OPTIONS="allocator_may_return_null=1:suppressions=tools/ci/tsan_suppressions.txt" \
maybe try using an absolute path on the runner?
I'll try that, assuming the absolute path is relatively stable.
You can use an environment variable to get the base directory: https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/store-information-in-variables#default-environment-variables
Ok, both GITHUB_WORKSPACE and RUNNER_WORKSPACE seem to point to the right location. Picking the first option.
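Concretely, that turns the suppressions path into something like this (a sketch; GITHUB_WORKSPACE is one of the default environment variables on GitHub-hosted runners):

    # equivalent to the inline prefix used in the workflow's run step
    export TSAN_OPTIONS="allocator_may_return_null=1:suppressions=$GITHUB_WORKSPACE/tools/ci/tsan_suppressions.txt"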
There might be two data races going on here: one in ...
The latter sounds better indeed.
Thanks @eendebakpt!
We add a test for np.nonzero under multi-threading and make np.nonzero safe under the CPython free-threading build. We want to ensure that concurrent invocations of the method on the same array do not crash or corrupt memory; correct results are not guaranteed.
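A minimal sketch of that kind of stress test (illustrative; the actual test added in the PR may differ):

    import threading
    import numpy as np

    x = np.random.randint(4, size=10_000)

    def writer():
        # keep mutating the array while other threads call np.nonzero on it
        for _ in range(100):
            x[::2] = np.random.randint(2)

    def reader():
        for _ in range(100):
            idx = np.nonzero(x)[0]
            # only sanity-check the result size; the indices need not reflect
            # a consistent snapshot of the array
            assert idx.size <= x.size

    threads = [threading.Thread(target=writer)]
    threads += [threading.Thread(target=reader) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()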
Also see #27519. An alternative approach would be to use locks to make the array read-only during the operation.