ENH: speed up matmul for non-contiguous operands #23588 #23752


Merged: 1 commit merged into numpy:main on Mar 9, 2025

Conversation

@xor2k (Contributor) commented May 11, 2023

Should provide a solution for #23588

Further speed/memory optimizations are still possible (I'll work on them over the coming days), but this is an initial working solution that does the trick. It is already open for feedback.

Closes gh-23123, gh-23588
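For context, a minimal sketch (shapes chosen arbitrarily, not taken from the PR) of the kind of non-contiguous operands this is about: sliced views and the .real/.imag views of complex arrays have non-trivial strides, so matmul could not hand them to BLAS directly before this change.

import numpy as np

a = np.ones((100, 8, 8))                   # contiguous stack of matrices
v = np.ones((100, 8, 16))[:, :, ::2]       # sliced view: the last axis is strided
c = (np.ones((100, 8, 8)) + 1j).real       # .real of a complex array is strided too
print(a.flags["C_CONTIGUOUS"], v.flags["C_CONTIGUOUS"], c.flags["C_CONTIGUOUS"])
_ = v @ v                                  # previously hit the non-BLAS fallback loop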

@charris changed the title from "speed up matmul #23588" to "MAINT: speed up matmul #23588" on May 11, 2023
@MatteoRaso (Contributor):

Thanks for the PR. Please add benchmark results so we can measure the speedup.

@mattip (Member) left a comment:

Seems reasonable, and much simpler than I imagined. Nice!

#if @USEBLAS@ && defined(HAVE_CBLAS)
free(tmp_ip1);
free(tmp_ip2);
free(tmp_op);
Member:

Checking for NULL would save a syscall.

Contributor Author:

I checked free on https://godbolt.org and indeed, it always produces a syscall, even with the latest GCC and Clang at -O3. Quite disappointing. I've therefore added checks for NULL.

if(
tmp_ip1 == NULL || tmp_ip2 == NULL || tmp_op == NULL
) {
@TYPE@_matmul_inner_noblas(ip1, is1_m, is1_n,
Member:

It might be nice to refactor this to error out rather than take the slow path. The case where malloc fails is the case that will be really slow in the slow path. Erroring would require refactoring this to be a new-style inner loop function (with a return value), which could be done later.

Member:

I think it should work to grab the GIL, set the memory error and then just return. It would be nicer with a new-style loop of course, but a larger refactor indeed.

Contributor Author:

The solution was quite simple; it should be done now.

@seberg (Member) left a comment:

We should indeed make sure there are tests, although code coverage suggests they exist, so I guess it should be good (malloc errors cannot be covered reasonably).

LGTM mostly. I think we should make it a hard error: just check once that the error actually works by changing the code. And a bit of refactoring/double-checking of the iteration order would be good.

We should indeed have a small benchmark to prove it's worth it. Probably just adding one that uses striding in one or two places is enough. Alternatively, one that works on arr.real or arr.imag (for a complex array), which also uses strides implicitly.
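As a rough illustration of what such a benchmark could compare (a sketch only, not the benchmark that was actually added; the sizes are arbitrary):

import numpy as np

a = np.ones((200, 32, 32))                      # contiguous baseline
a_strided = np.ones((200, 32, 64))[:, :, ::2]   # striding in the last axis
c = np.ones((200, 32, 32)) + 1j                 # c.real / c.imag are strided views

# In IPython one would then compare, e.g.:
#   %timeit a @ a
#   %timeit a_strided @ a_strided
#   %timeit c.real @ c.real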


}
if(tmp_op == NULL) {
tmp_op = (@typ@ *)malloc(sizeof(@typ@) * dm * dp);
}
Member:

In practice, most of the time we don't have to copy all ops, just some. The info should already be available from the earlier probing code.

For example, the output in particular is very unlikely to require copying, because it's typically a freshly created array.

Contributor Author:

I've noticed that too, and I've fixed it.

Contributor:

The copying of the output when it is already aligned still seems to be there!

Member:

This would indeed be nice to fix, I think. The arrays could be:

  1. Already blas-able (either C or F order).
  2. We should also deal with memory order nicely (this could make a huge speed difference when it kicks in!):
    • If an array is already blas-able, we should not copy it. Right now that isn't the case, because when the branch is taken, it is always taken for all 3 arguments.
    • Even when we are copying, you should check whether the array is in F or C memory order (but not contiguous), i.e. F order means that strides[2] > strides[1] (strides[0] is the outer loop). In that case: 1. the copy should swap iteration order, and 2. pass the transpose flag to BLAS. (A rough Python-level illustration of this stride check follows below.)
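A rough Python-level illustration of that stride check (a sketch under the assumption that the operand is a (batch, m, n) stack, with strides[0] the outer-loop stride; the helper name is made up):

import numpy as np

def inner_order(arr):
    # arr is a (batch, m, n) stack; compare the strides of the two core axes.
    s_m, s_n = arr.strides[1], arr.strides[2]
    return "C-like (no transpose needed)" if s_n < s_m else "F-like (swap loops / transpose flag)"

c_stack = np.ones((4, 3, 5))                     # C order: strides[2] < strides[1]
f_stack = np.ones((5, 3, 4)).transpose(2, 1, 0)  # F-like inner matrices: strides[2] > strides[1]
print(inner_order(c_stack), inner_order(f_stack))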

@mhvk (Contributor) commented Feb 10, 2025:

I guess somewhat related, we have to ensure that all paths get hit by the tests.

EDIT: Note that one could, e.g., create a single array and use the axes argument to swap axes, etc.
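For example, something along these lines would exercise a transposed (F-order-like) operand from a single base array via the axes argument (a hypothetical test sketch, not the test that was added):

import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((8, 5, 5))

# Map the first operand's core axes in swapped order, i.e. use its transpose
# without materializing a copy; the default mapping would be [(-2, -1)] * 3.
out_axes = np.matmul(base, base, axes=[(-1, -2), (-2, -1), (-2, -1)])
np.testing.assert_allclose(out_axes, np.swapaxes(base, -1, -2) @ base)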

@xor2k (Contributor Author) commented Feb 14, 2025:

This would indeed be nice to fix, I think. The arrays could be:

  1. Already blas-able (either C or F order).

  2. We should also deal with memory order nicely (this could make a huge speed difference when it kicks in!):

    • If an array is already blas-able, we should not copy it. Right now that isn't the case, because when the branch is taken, it is always taken for all 3 arguments.
    • Even when we are copying, you should check whether the array is in F or C memory order (but not contiguous), i.e. F order means that strides[2] > strides[1] (strides[0] is the outer loop). In that case: 1. the copy should swap iteration order, and 2. pass the transpose flag to BLAS.

2.1 is done by the latest version of the PR. However, I have a question about that transpose flag: it concerns the BLAS matrix-matrix multiplication (it can't affect the copying, since copying does not have a transpose flag). Is there a need to modify the signature of @TYPE@_matmul_matrixmatrix, or is it sufficient to just swap some strides? The copying will definitely normalize the matrix strides, so strides[2] < strides[1] will be ensured after the copy.

Contributor Author:

The copying of the output when it is already aligned still seems to be there!

I've implemented that in the most recent version, too.

Contributor Author:

I guess somewhat related, we have to ensure that all paths get hit by the tests.

EDIT: Note that one could, e.g., create a single array and use the axes argument to swap axes, etc.

Do you mean tests or benchmarks? During development I had the impression that the test coverage is quite good. The benchmarks could use an update.

Member:

Unfortunately, C-coverage isn't run in the test suite anymore. FWIW, I suspect you are right and test coverage is fairly good. The one thing I could imagine missing would be a strided out= argument.
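A hypothetical sketch of such a test, writing into a strided, non-contiguous out= view of a larger buffer (illustration only, not the test suite's code):

import numpy as np

rng = np.random.default_rng(42)
a = rng.standard_normal((3, 4))
b = rng.standard_normal((4, 5))
buf = np.zeros((6, 5))
out = buf[::2]                      # non-contiguous output view
np.matmul(a, b, out=out)
np.testing.assert_allclose(out, a @ b)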

numpy/core/src/umath/matmul.c.src (outdated review thread, resolved)
}
else {
for (n = 0; n < dn; n++) {
for (m = 0; m < dm; m++) {
Member:

I didn't check and it is probably true. But you should ensure that is_m < is_n (which depends on the memory order). If that is not true, the loop order needs to be swapped.

Contributor Author:

Can you go into a little more detail about what should be done if that is the case?

Contributor Author:

It would probably make sense to cover this with another test case, wouldn't it?

Member:

Well, I guess lapack supports a.T @ b.T = o.T mainly, and for that it would make sense to swap things early on (m and n). It might not matter too much in practice, since it may be that for larger matrices (things don't fit into L1 comfortably?), the actual matmul dominates anyway.

There is one other thing which could make sense. IIRC, when n is small (mainly?) it's not really worthwhile to do the copy at all; from very rough timing I think n < 10 might be a reasonable cutoff.

@seberg (Member) commented May 18, 2023

@xor2k you asked yesterday about what to do here.
First, we should have very basic additional benchmarks I think. Nothing fancy, just one that shows a good speedup. That is largely because it is de-facto policy to have a chance of avoiding future regression.

This is maybe mainly a general improvement. On 1.23.4 (probably with OpenBLAS), I get:

In [1]: a = np.ones((10000, 2, 4))[:, :, ::2]
In [2]: %timeit a @ a
78.4 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: a = np.ones((10000, 2, 2))
In [4]: %timeit a @ a
507 µs ± 537 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

With your branch, the first case takes the same path as the contiguous one:

In [2]: %timeit a @ a
615 µs ± 2.98 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We can see that the blas overhead is significant.

I don't want to map that out exactly, and yes, different BLAS versions will behave very differently. Honestly, we could also opt to ignore it, because currently we already run into this "slow" path when the input is contiguous.
But maybe we can find a very rough heuristic to avoid the 10x slowdown?!

As for the discussion about order: let's ignore it, I think it is irrelevant, sorry. If arrays are tiny, the cache will take care of things. If (working) arrays are large, the matmul should dominate by a huge amount anyway.


The other point is that we still need to fix the error handling if the malloc fails. If it fails, you need to grab the GIL and set a memory error; you can grep for NPY_ALLOW_C_API_DEF to see the pattern. Admittedly, it's slightly terrifying, because without using the new API (which is a lot more work), we may call the inner loop multiple times and rely on malloc failing the same way each time.
It should be safe in the sense that we definitely get a memory error at the end, even if the malloc later decides to be successful.

As far as I understood Matti, he also thought that going into the (potentially super slow) fallback path when we are that low on memory is probably not useful.

@xor2k (Contributor Author) commented May 19, 2023

Okay, I understand. However, this barely affects this pull request, as it is an issue with the use of BLAS in general, so it affected NumPy even before my pull request. Other programming languages have reported that "issue" (I won't call it a bug) as well; see, e.g., a long discussion with benchmarks going back to 2013:

JuliaLang/julia#3239

So one can basically compile NumPy without BLAS support, e.g. by running

NPY_BLAS_ORDER= NPY_LAPACK_ORDER= pip install -e .

in a test venv/conda environment, and many small matrix multiplications will be much faster. I suggest opening a fresh issue for the many-small-matmul performance. I have some ideas already.

I'll be on vacation next week but will be available the week after.

@xor2k (Contributor Author) commented May 31, 2023

@xor2k you asked yesterday about what to do here. First, we should have very basic additional benchmarks I think. Nothing fancy, just one that shows a good speedup. That is largely because it is de-facto policy to have a chance of avoiding future regression.

This is maybe mainly a general improvement. On 1.23.4 (probably with OpenBLAS), I get:

In [1]: a = np.ones((10000, 2, 4))[:, :, ::2]
In [2]: %timeit a @ a
78.4 µs ± 204 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: a = np.ones((10000, 2, 2))
In [4]: %timeit a @ a
507 µs ± 537 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

With your branch, the first case takes the same path as the contiguous one:

In [2]: %timeit a @ a
615 µs ± 2.98 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

We can see that the blas overhead is significant.

The main issue is that BLAS does not support "batched matrix multiply" (BMM), which would be really useful in cases like this; compare

https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
https://www.intel.com/content/www/us/en/developer/articles/technical/introducing-batch-gemm-operations.html

In PyTorch, they have it:

https://pytorch.org/docs/stable/generated/torch.bmm.html
https://stackoverflow.com/questions/62015172/how-to-find-c-source-code-of-torch-bmm-of-pytorch

BMM would allow skipping the outer loop and passing the batch dimension (10000) as an argument to the accelerated call, instead of making it a loop of size 10000.
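To make the terminology concrete, a small sketch (illustration only, not NumPy internals): "batched" means one call covers the whole stack, whereas without BMM support the C loop has to issue one GEMM per matrix, which is what the Python loop below imitates.

import numpy as np

a = np.ones((10000, 2, 2))
batched = a @ a                        # one call over the whole batch axis
looped = np.stack([x @ x for x in a])  # morally one GEMM call per 2x2 matrix
assert np.allclose(batched, looped)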

I see these options:

  1. Ignore the problem. It existed before, introduces an order-of-magnitude inefficiency, and my commit only worsens it by a few percent.
  2. Try to figure out how exactly PyTorch does a BMM on a CPU and try to reimplement it.
  3. Use PyTorch as a backend for NumPy.
  4. Provide an option method=force_naive to the matmul function so that the user can override this manually.
  5. Run a mini-benchmark when calling import numpy. Importing NumPy takes quite long anyway; a few milliseconds of benchmarking won't change that much. Then use the benchmark results to decide whether to use naive multiplication or BLAS on a per-call basis.
  6. Run the benchmarks from 5 at build time. This would complicate the build, though, and the systems where NumPy runs might differ from those where it was built.
  7. Hard-code some plausible thresholds based on benchmarks on modern systems. I've got a MacBook M1 and a Ryzen 5950X at hand.

Any thoughts?

@seberg (Member) commented May 31, 2023

Only 1 and 7 sound reasonable to me; everything else is unnecessarily complicated or just impossible.

@merny93 commented Aug 11, 2023

I would argue that ignoring the problem is perfectly fair (vote for option 1).

This pull request aims to fix a ~100x slowdown caused by completely disregarding BLAS and defaulting to an extremely naïve matrix multiply. The only users that will notice a performance reduction are those explicitly taking advantage of this bug/feature to trick NumPy into avoiding BLAS for small arrays by breaking the strides.

In any case, a 100 times speedup on large operations (seconds) is worth more than a small offset on small operations. Overhead is always present in numpy and users tend to be aware of this trade-off which gives good performance for large operations.

@xor2k (Contributor Author) commented Nov 20, 2023

Sorry for my late answer. I have made some tests and unfortunately could not come up with safe values to hardcode. I would also vote for option 1.

@mattip (Member) commented Dec 14, 2023

Could you run the benchmarks again?

@seiko2plus (Member) left a comment:

LGTM, just the part related to memory allocations: we should use our custom memory allocators.

@xor2k (Contributor Author) commented Apr 6, 2024

LGTM, just the part related to memory allocations: we should use our custom memory allocators.

What exactly do you mean? PyMem_Malloc() and PyMem_Free()? Or something NumPy-specific?

@xor2k (Contributor Author) commented Feb 2, 2025

Must have accidentally closed the pull request, reopening.

@xor2k force-pushed the main branch 2 times, most recently from 4227c83 to a44d0c6, on February 15, 2025
@seberg changed the title from "MAINT: speed up matmul for non-aligned input #23588" to "MAINT: speed up matmul for non-contiguous operands #23588" on Feb 16, 2025
@xor2k force-pushed the main branch 12 times, most recently from 11f5430 to 7d2a0b6, on February 19, 2025
@xor2k requested a review from seberg on February 19, 2025
@xor2k closed this on Feb 21, 2025
@xor2k reopened this on Feb 21, 2025
@seberg changed the title from "MAINT: speed up matmul for non-contiguous operands #23588" to "ENH: speed up matmul for non-contiguous operands #23588" on Mar 9, 2025
@seberg (Member) left a comment:

Thanks Michael, sorry for this taking so long. I'll resist the temptation to think about whether this can be shortened a bit and just put it in :).

It is a significant improvement after all!

If anyone notices a performance regression in their use case and stumbles on this: please let us know; this could (and maybe should) have a heuristic to decide when to skip BLAS.

@seberg dismissed seiko2plus's stale review on March 9, 2025:

Let's follow up, it's a tiny thing and we use malloc in many places still (even if that isn't great).

@seberg merged commit f104291 into numpy:main on Mar 9, 2025
70 of 79 checks passed
@xor2k (Contributor Author) commented Mar 9, 2025

Thank you! It took long indeed. However, I think the true bottleneck was a proper assessment of how much of a speedup this merge request actually provides, which should be essential for every such undertaking and can be done conclusively, though it took a while to come up with the right approach.

Regarding tiny (but also small) matrices, I think the BLAS frameworks should fix their performance issues themselves first (if there are any), so that it is never necessary to avoid BLAS in the first place. I advertised #23752 (comment) in the respective GitHub repositories of the open source frameworks (references can be found in this merge request as well); let's see what happens.

@xor2k (Contributor Author) commented Mar 16, 2025

Btw., if somebody is interested: here is the code to render the graphs above: https://gist.github.com/xor2k/b2b7d1d5e87bfe8a8a30c2d0c7e12f9e

Successfully merging this pull request may close these issues.

BUG: Multiplication by real part of complex matrix very slow
8 participants