ENH: speed up matmul for non-contiguous operands #23588 #23752


Merged: 1 commit merged into numpy:main on Mar 9, 2025

Conversation

@xor2k (Contributor) commented May 11, 2023

Should provide a solution for #23588

Further speed/memory optimizations are still possible (I'll work on them over the coming days), but this is an initial working solution that does the trick. It is already open for feedback.

Closes gh-23123, gh-23588
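For context, a minimal sketch (shapes chosen arbitrarily, not taken from the PR) of the kind of non-contiguous operands this is about: sliced views and the .real/.imag views of complex arrays have non-trivial strides, so matmul could not hand them to BLAS directly before this change.

import numpy as np

a = np.ones((100, 8, 8))                   # contiguous stack of matrices
v = np.ones((100, 8, 16))[:, :, ::2]       # sliced view: the last axis is strided
c = (np.ones((100, 8, 8)) + 1j).real       # .real of a complex array is strided too
print(a.flags["C_CONTIGUOUS"], v.flags["C_CONTIGUOUS"], c.flags["C_CONTIGUOUS"])
_ = v @ v                                  # previously hit the non-BLAS fallback loop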

@charris changed the title from "speed up matmul #23588" to "MAINT: speed up matmul #23588" on May 11, 2023
@MatteoRaso (Contributor):

Thanks for the PR. Please add benchmark results so we can measure the speedup.

@mattip (Member) left a comment:

Seems reasonable, and much simpler than I imagined. Nice!

#if @USEBLAS@ && defined(HAVE_CBLAS)
free(tmp_ip1);
free(tmp_ip2);
free(tmp_op);
Member:

Checking for NULL would save a syscall.

Contributor Author:

I checked free on https://godbolt.org and indeed, it always produces a syscall, even with the latest GCC and Clang at -O3. Quite disappointing. I've therefore added checks for NULL.

if(
tmp_ip1 == NULL || tmp_ip2 == NULL || tmp_op == NULL
) {
@TYPE@_matmul_inner_noblas(ip1, is1_m, is1_n,
Member:

It might be nice to refactor this to error out rather than take the slow path. The case where malloc fails is the case that will be really slow in the slow path. Erroring would require refactoring this to be a new-style inner loop function (with a return value), which could be done later.

Member:

I think it should work to grab the GIL, set the memory error and then just return. It would be nicer with a new-style loop of course, but a larger refactor indeed.

Contributor Author:

The solution was quite simple; it should be done now.

@seberg (Member) left a comment:

We should indeed make sure there are tests, although code coverage suggests they exist, so I guess it should be good (malloc errors cannot be covered reasonably).

LGTM mostly. I think we should make it a hard error: just check once that the error actually works by changing the code. And a bit of refactoring/double-checking of the iteration order would be good.

We should indeed have a small benchmark to prove it's worth it. Probably just adding one that uses striding in one or two places is enough. Alternatively, one that works on arr.real or arr.imag (for a complex array), which also uses strides implicitly.
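As a rough illustration of what such a benchmark could compare (a sketch only, not the benchmark that was actually added; the sizes are arbitrary):

import numpy as np

a = np.ones((200, 32, 32))                      # contiguous baseline
a_strided = np.ones((200, 32, 64))[:, :, ::2]   # striding in the last axis
c = np.ones((200, 32, 32)) + 1j                 # c.real / c.imag are strided views

# In IPython one would then compare, e.g.:
#   %timeit a @ a
#   %timeit a_strided @ a_strided
#   %timeit c.real @ c.real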


}
if(tmp_op == NULL) {
tmp_op = (@typ@ *)malloc(sizeof(@typ@) * dm * dp);
}
Member:

In practice, most of the time we don't have to copy all ops, just some. The info should already be available from the earlier probing code.

For example, the output in particular is very unlikely to require copying, because it's typically a freshly created array.

Contributor Author:

I've noticed that too, and I've fixed it.

Contributor:

The copying of the output when it is already aligned still seems to be there!

Member:

This would indeed be nice to fix, I think. The arrays could be:

  1. Already blas-able (either C or F order).
  2. We should also deal with memory order nicely (this could make a huge speed difference when it kicks in!):
    • If an array is already blas-able, we should not copy it. Right now that isn't the case, because when the branch is taken, it is always taken for all 3 arguments.
    • Even when we are copying, you should check whether the array is in F or C memory order (but not contiguous), i.e. F order means that strides[2] > strides[1] (strides[0] is the outer loop). In that case: 1. the copy should swap iteration order, and 2. pass the transpose flag to BLAS. (A rough Python-level illustration of this stride check follows below.)
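A rough Python-level illustration of that stride check (a sketch under the assumption that the operand is a (batch, m, n) stack, with strides[0] the outer-loop stride; the helper name is made up):

import numpy as np

def inner_order(arr):
    # arr is a (batch, m, n) stack; compare the strides of the two core axes.
    s_m, s_n = arr.strides[1], arr.strides[2]
    return "C-like (no transpose needed)" if s_n < s_m else "F-like (swap loops / transpose flag)"

c_stack = np.ones((4, 3, 5))                     # C order: strides[2] < strides[1]
f_stack = np.ones((5, 3, 4)).transpose(2, 1, 0)  # F-like inner matrices: strides[2] > strides[1]
print(inner_order(c_stack), inner_order(f_stack))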

@mhvk (Contributor) commented Feb 10, 2025:

I guess somewhat related, we have to ensure that all paths get hit by the tests.

EDIT: Note that one could, e.g., create a single array and use the axes argument to swap axes, etc.
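For example, something along these lines would exercise a transposed (F-order-like) operand from a single base array via the axes argument (a hypothetical test sketch, not the test that was added):

import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((8, 5, 5))

# Map the first operand's core axes in swapped order, i.e. use its transpose
# without materializing a copy; the default mapping would be [(-2, -1)] * 3.
out_axes = np.matmul(base, base, axes=[(-1, -2), (-2, -1), (-2, -1)])
np.testing.assert_allclose(out_axes, np.swapaxes(base, -1, -2) @ base)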

@xor2k (Contributor Author) commented Feb 14, 2025:

This would indeed be nice to fix, I think. The arrays could be:

  1. Already blas-able (either C or F order).

  2. We should also deal with memory order nicely (this could make a huge speed difference when it kicks in!):

    • If an array is already blas-able, we should not copy it. Right now that isn't the case, because when the branch is taken, it is always taken for all 3 arguments.
    • Even when we are copying, you should check whether the array is in F or C memory order (but not contiguous), i.e. F order means that strides[2] > strides[1] (strides[0] is the outer loop). In that case: 1. the copy should swap iteration order, and 2. pass the transpose flag to BLAS.

2.1 is done by the latest version of the PR. However, I have a question about that transpose flag: it concerns the BLAS matrix-matrix multiplication (it can't affect the copying, since copying does not have a transpose flag). Is there a need to modify the signature of @TYPE@_matmul_matrixmatrix, or is it sufficient to just swap some strides? The copying will definitely normalize the matrix strides, so strides[2] < strides[1] will be ensured after the copy.

Contributor Author:

The copying of the output when it is already aligned still seems to be there!

I've implemented that in the most recent version, too.

Contributor Author:

I guess somewhat related, we have to ensure that all paths get hit by the tests.

EDIT: Note that one could, e.g., create a single array and use the axes argument to swap axes, etc.

Do you mean tests or benchmarks? During development I had the impression that the test coverage is quite good. The benchmarks could use an update.

Member:

Unfortunately, C-coverage isn't run in the test suite anymore. FWIW, I suspect you are right and test coverage is fairly good. The one thing I could imagine missing would be a strided out= argument.
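A hypothetical sketch of such a test, writing into a strided, non-contiguous out= view of a larger buffer (illustration only, not the test suite's code):

import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal((3, 4))
b = rng.standard_normal((4, 5))
buf = np.zeros((6, 5))
out = buf[::2]                      # non-contiguous output view
np.matmul(a, b, out=out)
np.testing.assert_allclose(out, a @ b)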

numpy/core/src/umath/matmul.c.src (outdated review thread, resolved)
}
else {
for (n = 0; n < dn; n++) {
for (m = 0; m < dm; m++) {
Member:

I didn't check and it is probably true. But you should ensure that is_m < is_n (which depends on the memory order). If that is not true, the loop order needs to be swapped.

Contributor Author:

Can you go into a little more detail about what should be done if that is the case?

Contributor Author:

It would probably make sense to cover this with another test case, wouldn't it?

Member:

Well, I guess lapack supports a.T @ b.T = o.T mainly, and for that it would make sense to swap things early on (m and n). It might not matter too much in practice, since it may be that for larger matrices (things don't fit into L1 comfortably?), the actual matmul dominates anyway.

There is one other thing which could make sense. IIRC, when n is small (mainly?) it's not really worthwhile to do the copy at all; from very rough timing I think n < 10 might be a reasonable cutoff.

@seberg (Member) commented May 18, 2023

@xor2k you asked yesterday about what to do here.
First, we should have very basic additional benchmarks I think. Nothing fancy, just one that shows a good speedup. That is largely because it is de-facto policy to have a chance of avoiding future regression.

This is maybe mainly a general improvement. On 1.23.4 (probably with OpenBLAS), I get:

In [1]: a = np.ones((10000, 2, 4))[:, :, ::2]
In [2]: %timeit a @ a
78.4 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: a = np.ones((10000, 2, 2))
In [4]: %timeit a @ a
507 µs ± 537 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

With your branch, the first case takes the same path as the contiguous one:

In [2]: %timeit a @ a
615 µs ± 2.98 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We can see that the blas overhead is significant.

I don't want to map that out exactly, and yes, different BLAS versions will behave very differently. Honestly, we could also opt to ignore it, because currently we already run into this "slow" path when the input is contiguous.
But maybe we can find a very rough heuristic to avoid the 10x slowdown?!

As for the discussion about order: let's ignore it, I think it is irrelevant, sorry. If arrays are tiny, the cache will take care of things. If (working) arrays are large, the matmul should dominate by a huge amount anyway.


The other point is that we still need to fix the error handling if the malloc fails. If it fails, you need to grab the GIL and set a memory error; you can grep for NPY_ALLOW_C_API_DEF to see the pattern. Admittedly, it's slightly terrifying, because without using the new API (which is a lot more work), we may call the inner loop multiple times and rely on malloc failing the same way each time.
It should be safe in the sense that we definitely get a memory error at the end, even if the malloc later decides to be successful.

As far as I understood Matti, he also thought that going into the (potentially super slow) fallback path when we are that low on memory is probably not useful.

@xor2k (Contributor Author) commented May 19, 2023

Okay, I understand. However, this barely affects this pull request, as it is an issue with the use of BLAS in general, so it affected NumPy even before my pull request. Other programming languages have reported that "issue" (I won't call it a bug) as well; see, e.g., a long discussion with benchmarks going back to 2013:

JuliaLang/julia#3239

So one can basically compile NumPy without BLAS support, e.g. by running

NPY_BLAS_ORDER= NPY_LAPACK_ORDER= pip install -e .

in a test venv/conda environment, and many small matrix multiplications will be much faster. I suggest opening a fresh issue for the many-small-matmul performance. I have some ideas already.

I'll be on vacation next week but will be available the week after.

@xor2k (Contributor Author) commented May 31, 2023

@xor2k you asked yesterday about what to do here. First, we should have very basic additional benchmarks I think. Nothing fancy, just one that shows a good speedup. That is largely because it is de-facto policy to have a chance of avoiding future regression.

This is maybe mainly a general improvement. On 1.23.4 (probably with OpenBLAS), I get:

In [1]: a = np.ones((10000, 2, 4))[:, :, ::2]
In [2]: %timeit a @ a
78.4 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: a = np.ones((10000, 2, 2))
In [4]: %timeit a @ a
507 µs ± 537 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

With your branch, the first case takes the same path as the contiguous one:

In [2]: %timeit a @ a
615 µs ± 2.98 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We can see that the blas overhead is significant.

The main issue is that BLAS does not support "batched matrix multiply" (BMM), which would be really useful in cases like this; compare

https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-batch-gemm-operations.html

In PyTorch, they have it:

https://pytorch.org/docs/stable/generated/torch.bmm.html
https://stackoverflow.com/questions/62015172/how-to-find-c-source-code-of-torch-bmm-of-pytorch

BMM would allow skipping the outer loop and passing the batch dimension (10000) as an argument to the accelerated call, instead of making it a loop of size 10000.
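To make the terminology concrete, a small sketch (illustration only, not NumPy internals): "batched" means one call covers the whole stack, whereas without BMM support the C loop has to issue one GEMM per matrix, which is what the Python loop below imitates.

import numpy as np

a = np.ones((10000, 2, 2))
batched = a @ a                        # one call over the whole batch axis
looped = np.stack([x @ x for x in a])  # morally one GEMM call per 2x2 matrix
assert np.allclose(batched, looped)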

I see these options:

  1. Ignore the problem. It existed before, introduces an order-of-magnitude inefficiency, and my commit only worsens it by a few percent.
  2. Try to figure out how exactly PyTorch does a BMM on a CPU and try to reimplement it.
  3. Use PyTorch as a backend for NumPy.
  4. Provide an option method=force_naive to the matmul function so that the user can override this manually.
  5. Run a mini-benchmark when calling import numpy. Importing NumPy takes quite long anyway; a few milliseconds of benchmarking won't change that much. Then use the benchmark results to decide whether to use naive multiplication or BLAS on a per-call basis.
  6. Run the benchmarks from 5 at build time. This would complicate the build, though, and the systems where NumPy runs might differ from those where it was built.
  7. Hard-code some plausible thresholds based on benchmarks on modern systems. I've got a MacBook M1 and a Ryzen 5950X at hand.

Any thoughts?

@seberg (Member) commented May 31, 2023

Only 1 and 7 sound reasonable to me; everything else is unnecessarily complicated or just impossible.

@merny93 commented Aug 11, 2023

I would argue that ignoring the problem is perfectly fair (vote for option 1).

This pull request aims to fix a ~100x slowdown caused by completely disregarding BLAS and defaulting to an extremely naïve matrix multiply. The only users that will notice a performance reduction are those explicitly taking advantage of this bug/feature to trick NumPy into avoiding BLAS for small arrays by breaking the strides.

In any case, a 100 times speedup on large operations (seconds) is worth more than a small offset on small operations. Overhead is always present in numpy and users tend to be aware of this trade-off which gives good performance for large operations.

@xor2k (Contributor Author) commented Nov 20, 2023

Sorry for my late answer. I have made some tests and unfortunately could not come up with safe values to hardcode. I would also vote for option 1.

@mattip (Member) commented Dec 14, 2023

Could you run the benchmarks again?

@seiko2plus (Member) left a comment:

LGTM, just the part related to memory allocations: we should use our custom memory allocators.

@xor2k (Contributor Author) commented Apr 6, 2024

LGTM, just the part related to memory allocations: we should use our custom memory allocators.

What exactly do you mean? PyMem_Malloc() and PyMem_Free()? Or something NumPy-specific?

@xor2k (Contributor Author) commented Feb 2, 2025

Must have accidentally closed the pull request, reopening.

@xor2k force-pushed the main branch 2 times, most recently from 4227c83 to a44d0c6, on February 15, 2025
@seberg changed the title from "MAINT: speed up matmul for non-aligned input #23588" to "MAINT: speed up matmul for non-contiguous operands #23588" on Feb 16, 2025
@xor2k force-pushed the main branch 12 times, most recently from 11f5430 to 7d2a0b6, on February 19, 2025
@xor2k requested a review from seberg on February 19, 2025
@xor2k closed this on Feb 21, 2025
@xor2k reopened this on Feb 21, 2025
@seberg changed the title from "MAINT: speed up matmul for non-contiguous operands #23588" to "ENH: speed up matmul for non-contiguous operands #23588" on Mar 9, 2025
@seberg (Member) left a comment:

Thanks Michael, sorry for this taking so long. I'll resist the temptation to think about whether this can be shortened a bit and just put it in :).

It is a significant improvement after all!

If anyone notices a performance regression in their use case and stumbles on this: please let us know; this could (and maybe should) have a heuristic to decide when to skip BLAS.

@seberg dismissed seiko2plus's stale review on March 9, 2025:

Let's follow up, it's a tiny thing and we use malloc in many places still (even if that isn't great).

@seberg merged commit f104291 into numpy:main on Mar 9, 2025
70 of 79 checks passed
@xor2k (Contributor Author) commented Mar 9, 2025

Thank you! It took long indeed. However, I think the true bottleneck was a proper assessment of how much of a speedup this merge request actually provides, which should be essential for every such undertaking and can be done conclusively, though it took a while to come up with the right approach.

Regarding tiny (but also small) matrices, I think the BLAS frameworks should fix their performance issues themselves first (if there are any), so that it is never necessary to avoid BLAS in the first place. I advertised #23752 (comment) in the respective GitHub repositories of the open source frameworks (references can be found in this merge request as well); let's see what happens.

@xor2k (Contributor Author) commented Mar 16, 2025

Btw., if somebody is interested: here is the code to render the graphs above: https://gist.github.com/xor2k/b2b7d1d5e87bfe8a8a30c2d0c7e12f9e

Successfully merging this pull request may close these issues.

BUG: Multiplication by real part of complex matrix very slow
8 participants