Add 'with_python' (i.e. with_gil) to prange and parallel #6562
base: master
Conversation
This is mostly targeted at freethreading builds, but should run on normal builds (with a lot of deadlock avoidance). It's designed to be a more efficient alternative to:

```cython
for i in prange(...):
    with gil:
        ...
```
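As a hedged sketch of how the new option might be spelled: the `with_python` name is taken from the PR title, and the exact signature (including whether `nogil=True` is still required) is an assumption, not something confirmed in this thread.

```cython
# Hypothetical usage sketch: `with_python` comes from the PR title
# (formerly `with_gil`); the exact keyword semantics may differ from
# what was merged.
from cython.parallel import prange

def stringify(list items):
    cdef Py_ssize_t i, n = len(items)
    for i in prange(n, nogil=True, with_python=True):
        # The loop body can work with Python objects directly instead of
        # wrapping everything in an explicit `with gil:` block. Thread-safe
        # use of those objects remains the caller's responsibility.
        items[i] = str(items[i])
```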
Cython/Compiler/Nodes.py
Outdated
```python
warning(
    self.pos,
    "'with_gil' is experimental. Cython does almost nothing to ensure "
    "thread safe use of Python variables (including reference counting) "
    "and so it is completely your responsibility. Use with caution!", 999)
```
I don't intend to keep this warning forever. But I think we need more control of PyObject variables in parallel blocks before we can really start recommending it.
Left a couple of questions. Maybe the name of the parameter should also be changed from `with_gil` to something that hints more at the free-threaded build?
Cython/Compiler/Nodes.py
Outdated
```python
# Any firstprivate creates a barrier at least on GCC (and thus
# a deadlock if we're using the GIL). And there's also an implicit
```
Have you examined whether this would also be relevant if someone's using critical sections inside an OpenMP loop? If so, should we document that?
I'm assuming you mean `with cython.critical_section`? (I ask because OpenMP has a "critical section" feature that we use internally to guard certain small blocks... I think we use that safely though.)

I don't think `cython.critical_section` should be an issue here. The barrier just means that all threads wait until they're ready to go before the very first iteration, so the code is simply trying not to hold the GIL when that happens. There shouldn't be an issue with anyone using critical sections or other kinds of locks within the loop itself.
```
However, there is now some experimental support for running with the GIL in
freethreading builds. You must either use these functions in a no-gil block,
```
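For context, the "no-gil block" usage that sentence refers to is the long-standing pattern shown below (standard Cython, given here only as background, not part of this PR's diff):

```cython
from cython.parallel import prange

def sum_of_squares(double[:] x):
    cdef double total = 0
    cdef Py_ssize_t i
    # prange must run without the GIL: either pass nogil=True or, as here,
    # place it inside an explicit `with nogil:` block.
    with nogil:
        for i in prange(x.shape[0]):
            # `total` is recognised by Cython as a reduction variable.
            total += x[i] * x[i]
    return total
```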
I guess that this wording will be confusing for most people that are not intimately familiar with how the free-threaded build works internally. Maybe we could reword it to something else like "with the thread state saved"? Although that seems suboptimal as well.
Ditto for other places.
Yes, good point.

I think we're stuck with the naming for the existing `with gil:`/`with nogil:` features. But I can add an explanation in the free-threading docs about what that means in the context of a freethreading build.
But since this is a new feature I can probably pick a different name for the parameter. And I'll try to rephrase the docs to be a bit clearer.
I've renamed the parameter to `with_python`.
That sounds better. What do you think about something like …?
Probably too technical. I assume most users would rather not know about thread state at all. (Although I know the existing …)

I prefer …
It opens the possibility of a deadlock, seems like the wrong thing to do, and I don't think releasing the GIL is too expensive.

It seems more reliable than a complicated juggling of the GIL with a barrier.
So, I looked over this and read through the documentation snippets, but I still feel unsure about the goal and effects of this change. I gathered that the new …

Could you maybe extend the PR description with some kind of problem description and motivation?
It's pretty close to writing:

…

It's slightly more efficient (just by not using the …).

Mainly because it's a bit of a pitfall with the non-freethreading interpreter. It means that people who accidentally omit …
So … can't we make the idiom above more efficient then? Is it only …?

But isn't that expected? I think the documentation states pretty clearly that parallelism requires releasing the GIL. Wasn't it always the case that you get sequential "threads" if you run a …?
I'm possibly not explaining very well:

…

is transformed to something pretty close to

…

(but a little faster).

Right now you get a compile-time error if you try to write a …

But yes - in principle we could just try to speed up …
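The code blocks in the reply above are missing here; as a hedged sketch only (not the author's actual snippets), the explicit idiom that the new option is "pretty close to" would look like this for the hypothetical `stringify` example earlier:

```cython
# Sketch of the equivalent explicit spelling from the PR description. Unlike
# the new option (after the later "nowait" change), it re-acquires the GIL /
# attaches a thread state on every iteration.
from cython.parallel import prange

def stringify_explicit(list items):
    cdef Py_ssize_t i, n = len(items)
    for i in prange(n, nogil=True):
        with gil:
            items[i] = str(items[i])
```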
Ah, so, we basically push people into releasing the GIL even if they directly need it afterwards, is that it? Ok, that use case doesn't seem hugely important to me – if you want parallelism, you probably want it for parallel code, not sequential code. But I can see that it can be useful to do some kind of Python level initialisation inside of the thread first. We also have parallel sections for stuff like that, but well, it's still a use case. And I think the obvious idiom for it is …

Do you have an example of real world code where this is used?
BTW, we don't have any benchmarks for our parallel code, neither micro nor macro. Since you're working on the freethreading and parallel code, maybe you could come up with at least some microbenchmarks for this kind of mixture between Python and nogil code? I'm not sure if they'll run well in CI (could be that we have a single CPU core there), but we'll see. At least sequential locking and critical sections might still be candidates for measuring their overhead.
The use case is that in freethreaded Python this code is parallel. This is mainly targeted at people using freethreaded Python who want to use …

The main limitation is that it isn't currently much better than writing …

I subsequently had a thought about how to avoid needing to release the GIL on each loop iteration (at least in some cases), so maybe I should look at that first - it's easier to justify something new if it genuinely does lead to better performance.
Make the loop "nowait" and force synchronization after it
So I think I've been able to drop the "release the GIL on every go round", which should be an improvement. A quick timing test:

On freethreading Python 3.13: …

On GIL Python 3.13: …

(i.e. when the GIL is contended then constantly swapping it is absolutely awful). So the upshot is, it's a mild improvement for the freethreading builds (for which it's intended) and a dramatic improvement for the GIL build (where it isn't intended).
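The actual timing script and numbers are not shown here; purely as an illustration of the kind of microbenchmark being discussed (not the author's test), a harness could look like the following, assuming a hypothetical compiled module `prange_sketches` containing the `stringify`/`stringify_explicit` sketches above:

```python
# Hypothetical timing harness: build the Cython sketches into a module,
# then run this under both a free-threading and a GIL interpreter.
import time

def best_of(fn, data, repeats=5):
    best = float("inf")
    for _ in range(repeats):
        items = list(data)                  # fresh copy each run
        start = time.perf_counter()
        fn(items)
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    # `prange_sketches` is a hypothetical module name for the sketches above.
    from prange_sketches import stringify, stringify_explicit
    data = list(range(1_000_000))
    print("with_python prange:", best_of(stringify, data))
    print("prange + with gil: ", best_of(stringify_explicit, data))
```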
`freethreading_compatible=True`

Isn't this a good indicator for whether we should warn about the GIL being released or not when we enter a prange loop? Even with your speed improvement, it still matters in non-FT builds, right?
I think what you're suggesting is:

…?

If so, that seems reasonable to me.

Yes - on non-FT builds it's running sequentially, just not fighting over the GIL quite so often.
Yes, that's what I was thinking.