Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Add 'with_python' (i.e. with_gil) to prange and parallel #6562

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: master
Choose a base branch
Loading
from

Conversation

da-woods
Copy link
Contributor

This is mostly targetted at freethreading builds, but should run on normal builds (with a lot of deadlock avoidance).

It's designed to be a more efficient alternative to

for i in prange(...):
  with gil:
    ...

This is mostly targetted at freethreading builds, but should
run on normal builds (with a lot of deadlock avoidance).

It's designed to be a more efficient alternative to

```
for i in prange(...):
  with gil:
    ...
```
Comment on lines 9485 to 9489
warning(
self.pos,
"'with_gil' is experimental. Cython does almost nothing to ensure "
"thread safe use of Python variables (including reference counting) "
"and so it is completely your responsibility. Use with caution!", 999)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't intend to keep this warning forever. But I think we need more control of PyObject variables in parallel blocks before we can really start recommending it.

Copy link
Contributor

@lysnikolaou lysnikolaou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of questions. Maybe also the name of the paremeter should be changed from with_gil to something more hinting at the free-threaded build?

Comment on lines 10515 to 10516
# Any firstprivate creates a barrier at least on GCC (and thus
# a deadlock if we're using the GIL). And there's also an implicit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you examined whether this would also be relevant if someone's using critical sections inside an openmp loop? If so, should we document that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming you mean with cython.critical_section? (I ask because OpenMP has a "critical_section" feature that we use internally to guard certain small blocks... I think we use that safely though).

I don't think that's cython.critical_section should be an issue here. The barrier is just that it waits for all threads to be ready to go before the absolutely first iteration. So it's just trying not to hold the GIL when that happens. There shouldn't be an issue with anyone using critical sections or other kinds of locks within the loop itself.

Comment on lines +23 to +24
However, there is now some experimental support for running with the GIL in
freethreading builds. You must either use these functions in a no-gil block,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that this wording will be confusing for most people that are not intimately familiar with how the free-threaded build works internally. Maybe we could reword it to something else like "with the thread state saved"? Although that seems suboptimal as well.

Ditto for other places.

Copy link
Contributor Author

@da-woods da-woods Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes good point.

I think we're stuck with the naming for the existing with gil:/with nogil features. But I can add an explanation if the free-threading docs about what that means the the context of a freethreading build.

But since this is a new feature I can probably pick a different name for the parameter. And I'll try to rephrase the docs to be a bit clearer.

@da-woods da-woods changed the title Add 'with_gil' to prange and parallel Add 'with_python' (i.e. with_gil) to prange and parallel Jan 8, 2025
@da-woods
Copy link
Contributor Author

da-woods commented Jan 8, 2025

I've renamed with_gil in this context to with_python and tried to clarify the docs here.

@lysnikolaou
Copy link
Contributor

I've renamed with_gil in this context to with_python and tried to clarify the docs here.

That sounds better. What do you think about something like with_thread_state_saved?

@da-woods
Copy link
Contributor Author

da-woods commented Jan 8, 2025

I've renamed with_gil in this context to with_python and tried to clarify the docs here.

That sounds better. What do you think about something like with_thread_state_saved?

Probably too technical. I assume most users would rather not know about thread state at all. (Although I know the existing with gil/with nogil are similarly technical).

I prefer with_python because it really just describes what you can do.

@da-woods da-woods added this to the 3.2 milestone May 3, 2025
da-woods added 2 commits May 3, 2025 11:57
It opens the posibility of a deadock, seems like the wrong thing
to do, and I don't think releasing the GIL is too expensive.
Cython/Compiler/Nodes.py Outdated Show resolved Hide resolved
da-woods added 2 commits May 29, 2025 19:11
It seems more reliable than a complicated jugling of the GIL
with a barrier
@scoder
Copy link
Contributor

scoder commented May 31, 2025

So, I looked over this and read through the documentation snippets, but I still feel unsure about the goal and effects of this change. I gathered that the new with_python=True option assures a valid thread state for the OpenMP threads but does not itself control the GIL state, is that correct? Why can't all OpenMP threads have Python thread states right away, without a special option?

Could you maybe extend the PR description with some kind of problem description and motivation?

@da-woods
Copy link
Contributor Author

It's pretty close to writing

for i in prange(N, nogil=True):
    with gil:
        ... # code

It's slightly more efficient (just by not using the PyGILState API as much) but not as much as I hoped when I first started looking at it - the big limitation is that Cython does have to release and re-acquire the GIL each time you go round the loop to avoid deadlock (or somehow eliminate last_private from the loop, which is is what requires the barrier).

Why can't all OpenMP threads have Python thread states right away, without a special option?

Mainly because it's a bit of a pitfall with the non-freethreading interpreter. It means that people that accidentally omit nogil=True will not get any parallelism even if they aren't using the GIL. So at least for me it seems better to make it explicit

@scoder
Copy link
Contributor

scoder commented May 31, 2025

It's pretty close to writing

for i in prange(N, nogil=True):
    with gil:
        ... # code

It's slightly more efficient

So … can't we make the idiom above more efficient then? Is it only for in prange() directly followed by with gil or also a with gil appearing somewhere later in the (otherwise nogil) loop? The first should be easy to detect.

people that accidentally omit nogil=True will not get any parallelism even if they aren't using the GIL

But isn't that expected? I think the documentation states pretty clearly that parallelism requires releasing the GIL. Wasn't it always the case that you get sequential "threads" if you run a prange loop without releasing the GIL?

@da-woods
Copy link
Contributor Author

I'm possibly not explaining very well

for i in prange(N, with_python=True):
    ... # code

is transformed to something pretty close to

for i in prange(N, nogil=True):
    with gil:
        ... # code

(but a little faster)

people that accidentally omit nogil=True will not get any parallelism even if they aren't using the GIL

But isn't that expected? I think the documentation states pretty clearly that parallelism requires releasing the GIL. Wasn't it always the case that you get sequential "threads" if you run a prange loop without releasing the GIL?

Right now you get a compile-time error if you try to write a prange loop without releasing the GIL.


But yes - in principle we could just try to speed up prange(N, nogil=True): with gil: instead and just document that as the idiomatic approach. Given that it doesn't seem possible to safely go round the loop without releasing the GIL the benefit of this PR is fairly minor anyway

@scoder
Copy link
Contributor

scoder commented May 31, 2025

Right now you get a compile-time error if you try to write a prange loop without releasing the GIL.

Ah, so, we basically push people into releasing the GIL even if they directly need it afterwards, is that it? Ok, that use case doesn't seem hugely important to me – if you want parallelism, you probably want it for parallel code, not sequential code. But I can see that it can be useful to do some kind of Python level initialisation inside of the thread first. We also have parallel sections for stuff like that, but well, it's still a use case. And I think the obvious idiom for it is for in prange: with gil:. If there's a way to optimise such code, why not. I don't see why we should bother users with yet another option.

Do you have an example of real world code where this is used?

@scoder
Copy link
Contributor

scoder commented May 31, 2025

BTW, we don't have any benchmarks for our parallel code, neither micro nor macro. Since you're working on the freethreading and parallel code, maybe you could come up with at least some microbenchmarks for this kind of mixture between Python and nogil code? I'm not sure if they'll run well in CI (could be that we have a single CPU core there), but we'll see. At least sequential locking and critical sections might still be candidates for measuring their overhead.

@da-woods
Copy link
Contributor Author

Ah, so, we basically push people into releasing the GIL even if they directly need it afterwards, is that it? Ok, that use case doesn't seem hugely important to me – if you want parallelism, you probably want it for parallel code, not sequential code

The use case is that in freethreaded Python this code is parallel.


This is mainly targeted at people using freethreaded Python who want to use prange loops without manually messing around with the GIL. (On non-freethreaded Python the intention is that it works, but clearly the GIL means it won't work well).

The main limitation is that it isn't currently much better than writing for i in prange(N, nogil=True): with gil: ..., so it's mainly a better expression of intent than better C code.

I subsequently had a thought about how to avoid needing to release the GIL on each loop (at least in some cases) so maybe I should look at that first - it's easier to justify something new if it genuinely does lead to better performance

Make the loop "nowait" and force synchronization after it
@da-woods
Copy link
Contributor Author

So I think I've been able to drop the "release the GIL on every go round", which should be an improvement.

A quick timing test:

# cython: freethreading_compatible=True

# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp

from cython.parallel import prange

def f(i):
    return i

def test_prange_with_python(int N):
    cdef int i = 0

    out = [None]*N

    for i in prange(N, with_python=True):
        out[i] = f(i)
    
def test_old_way(int N):
    cdef int i = 0

    out = [None]*N

    for i in prange(N, nogil=True):
        with gil:
            out[i] = f(i)

def do_timing(N):
    from datetime import datetime
    t0 = datetime.now()
    test_old_way(N)
    t1 = datetime.now()
    test_prange_with_python(N)
    t2 = datetime.now()

    print(f"old way:", t1-t0)
    print(f"new way:", t2-t1)

On freethreading Python 3.13 N=1000000 the with_python version is consistently about 10% faster (i.e. noticeable but not huge):

old way: 0:00:00.249138
new way: 0:00:00.221372

On GIL Python 3.13 N=1000000 the with_python version is dramatically faster

old way: 0:00:28.616594
new way: 0:00:00.073102

(i.e. when the GIL is contended then constantly swapping it is absolutely awful).


So the upshot is, it's a mild improvement for the freethreading builds (for which it's intended) and a dramatic improvement for the GIL build (where it isn't intended)

@scoder
Copy link
Contributor

scoder commented Jun 1, 2025 via email

@da-woods
Copy link
Contributor Author

da-woods commented Jun 1, 2025

Isn't this a good indicator for whether we should warn about the GIL being released or not when we enter a prange loop?

I think what you're suggesting is:

  • Remove the explicit with_gil directive,
  • If the user omits nogil=True then don't error - let them use the GIL.
  • Warn if they don't release the GIL and don't mark it as freethreading_compatible

?

If so, that seems reasonable to me.

Even with your speed improvement, it still matters in non-FT builds, right?

Yes - on non-FT builds it's running sequentially, just not fighting over the GIL quite so often.

@scoder
Copy link
Contributor

scoder commented Jun 2, 2025

  • Remove the explicit with_gil directive,
  • If the user omits nogil=True then don't error - let them use the GIL.
  • Warn if they don't release the GIL and don't mark it as freethreading_compatible

Yes, that's what I was thinking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.