ENH: speed up matmul for non-contiguous operands #23588 #23752
Conversation
Thanks for the PR. Please add benchmark results so we can measure the speedup.
Seems reasonable, and much simpler than I imagined. Nice!
numpy/core/src/umath/matmul.c.src (Outdated)
#if @USEBLAS@ && defined(HAVE_CBLAS)
    free(tmp_ip1);
    free(tmp_ip2);
    free(tmp_op);
Checking for NULL would save a syscall.
Checked free on https://godbolt.org and indeed, it always creates a syscall, even with the latest GCC and Clang and -O3. Quite disappointing. I've thus added checks for NULL.
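For illustration, a minimal sketch of the NULL-checked cleanup (hypothetical helper; the buffer names mirror the diff above, and the real loop does this inline):

```c
#include <stdlib.h>

/* Sketch of the cleanup pattern discussed above; tmp_ip1/tmp_ip2/tmp_op
 * are the temporary contiguous copies of the two inputs and the output.
 * free(NULL) is well-defined but still a call into the allocator, so the
 * NULL checks skip it when no buffer was allocated. */
static void
free_tmp_buffers(void *tmp_ip1, void *tmp_ip2, void *tmp_op)
{
    if (tmp_ip1 != NULL) {
        free(tmp_ip1);
    }
    if (tmp_ip2 != NULL) {
        free(tmp_ip2);
    }
    if (tmp_op != NULL) {
        free(tmp_op);
    }
}
```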
numpy/core/src/umath/matmul.c.src (Outdated)
if (
    tmp_ip1 == NULL || tmp_ip2 == NULL || tmp_op == NULL
) {
    @TYPE@_matmul_inner_noblas(ip1, is1_m, is1_n,
It might be nice to refactor this to error out rather than take the slow path. The case where malloc fails is the case that will be really slow in the slow path. Erroring would require refactoring this to be a new-style inner loop function (with a return value), which could be done later.
I think it should work to grab the GIL, set the memory error and then just return. It would be nicer with a new-style loop of course, but a larger refactor indeed.
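A rough sketch of that approach (illustrative names, not the PR's code; the real loop embeds this inline rather than calling a helper): re-acquire the GIL, set a MemoryError, clean up, and return without computing.

```c
#include <Python.h>
#include <stdlib.h>

/* Sketch: error path after the temporary-buffer mallocs inside an
 * old-style (void-returning) inner loop that runs without the GIL. */
static void
report_malloc_failure_and_cleanup(void *tmp_ip1, void *tmp_ip2, void *tmp_op)
{
    /* Re-acquire the GIL before touching the Python error state. */
    PyGILState_STATE gil_state = PyGILState_Ensure();
    PyErr_NoMemory();
    PyGILState_Release(gil_state);

    /* This path is cold, so plain free() of possibly-NULL pointers is fine. */
    free(tmp_ip1);
    free(tmp_ip2);
    free(tmp_op);
}
```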
The solution was quite simple; it should be done now.
We should indeed make sure there are tests, although code coverage suggests they exist, so I guess it should be good (malloc errors cannot be covered reasonably).
LGTM mostly. I think we should make it a hard error: just try once that the error actually works by changing the code. And a bit of refactoring/double checking of the iteration order would be good.
We should indeed have a small benchmark to prove it's worth it. Probably adding one that uses striding in one or two places is enough. Alternatively, one that works on arr.real or arr.imag (for a complex array), which also uses strides implicitly.
numpy/core/src/umath/matmul.c.src (Outdated)
}
if (tmp_op == NULL) {
    tmp_op = (@typ@ *)malloc(sizeof(@typ@) * dm * dp);
}
In practice, most of the time we don't have to copy all ops, just some. The info should already be available from the earlier probing code.
For example, the output especially is very unlikely to require copying, because it's typically a freshly created array.
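As a hedged illustration of the per-operand decision (hypothetical helper, double case only, and a deliberately simplified contiguity check):

```c
#include <stddef.h>
#include <stdlib.h>

/* Sketch: copy an operand into a C-contiguous scratch buffer only when it
 * is not already usable by BLAS.  The "blasable" test here is simplified
 * to "C-contiguous in its 2-D core"; the real code has more to check.
 * Returns the buffer to hand to BLAS; *copied tells the caller whether it
 * must free the buffer afterwards. */
static double *
maybe_copy_operand(char *src, ptrdiff_t stride_outer, ptrdiff_t stride_inner,
                   ptrdiff_t n_outer, ptrdiff_t n_inner, int *copied)
{
    *copied = 0;
    if (stride_inner == (ptrdiff_t)sizeof(double) &&
        stride_outer == n_inner * (ptrdiff_t)sizeof(double)) {
        return (double *)src;               /* already C-contiguous: no copy */
    }
    double *buf = malloc(sizeof(double) * (size_t)(n_outer * n_inner));
    if (buf == NULL) {
        return NULL;                        /* caller raises the MemoryError */
    }
    for (ptrdiff_t i = 0; i < n_outer; i++) {
        for (ptrdiff_t j = 0; j < n_inner; j++) {
            buf[i * n_inner + j] =
                *(double *)(src + i * stride_outer + j * stride_inner);
        }
    }
    *copied = 1;
    return buf;
}
```

The output needs the same treatment in reverse: if it had to be copied, the result is copied back after the GEMM call; if not, BLAS can write into it directly.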
I've noticed that, too & I've fixed it.
The copying of the output when it is already aligned still seems to be there!
This would indeed be good to fix, I think. The arrays could be:
- Already blas-able (either C or F).
- We should also deal with memory order nicely (this could make a huge speed difference when it kicks in!):
  - If it is already blasable, we should not copy. Right now that isn't the case, because when the branch is taken, it is always taken for all 3 arguments.
  - Even when we are copying, you should check whether the array is F or C memory order (but not contiguous). I.e. F order means that strides[2] > strides[1] (strides[0] is the outer loop). In that case: 1. the copy should swap iteration order and 2. pass the transpose flag to blas (a concrete sketch follows below).
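To make the transpose-flag idea concrete, here is a hedged sketch for the double-precision case. It is an illustrative function, not the PR's code; ptrdiff_t stands in for npy_intp, the strides are only compared (never dereferenced), and each operand is assumed to already be contiguous in either C or F order within its 2-D core.

```c
#include <stddef.h>
#include <cblas.h>

/* Hedged sketch (double case, illustrative names).  A is (dm x dn),
 * B is (dn x dp), C is (dm x dp) and assumed C-contiguous.  An F-ordered
 * operand is handled by passing the transpose flag instead of copying. */
static void
gemm_respecting_order(ptrdiff_t dm, ptrdiff_t dn, ptrdiff_t dp,
                      const double *a, ptrdiff_t is1_m, ptrdiff_t is1_n,
                      const double *b, ptrdiff_t is2_n, ptrdiff_t is2_p,
                      double *c)
{
    /* F order: the stride along the second core axis is the larger one. */
    int a_is_f = is1_n > is1_m;
    int b_is_f = is2_p > is2_n;

    CBLAS_TRANSPOSE trans_a = a_is_f ? CblasTrans : CblasNoTrans;
    CBLAS_TRANSPOSE trans_b = b_is_f ? CblasTrans : CblasNoTrans;

    /* For an F-ordered operand the buffer holds its transpose in row-major
     * layout, so the leading dimension is the other core axis. */
    int lda = (int)(a_is_f ? dm : dn);
    int ldb = (int)(b_is_f ? dn : dp);

    cblas_dgemm(CblasRowMajor, trans_a, trans_b,
                (int)dm, (int)dp, (int)dn,
                1.0, a, lda, b, ldb,
                0.0, c, (int)dp);
}
```

Only the trans flags and leading dimensions carry the memory-order information here; whether that is threaded through the existing signature or handled by normalizing strides beforehand is an implementation choice.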
I guess somewhat related, we have to ensure that all paths get hit by the tests.
EDIT: Note that one could, e.g., create a single array and use the axes argument to swap axes, etc.
This would indeed be good to fix, I think. The arrays could be:
Already blas-able (either C or F).
We should also deal with memory order nicely (this could make a huge speed difference when it kicks in!):
- If it is already blasable, we should not copy. Right now that isn't the case, because when the branch is taken, it is always taken for all 3 arguments.
- Even when we are copying, you should check whether the array is F or C memory order (but not contiguous). I.e. F order means that strides[2] > strides[1] (strides[0] is the outer loop). In that case: 1. the copy should swap iteration order and 2. pass the transpose flag to blas.

2.1 is done by the latest version of the PR. However, I have a question about that transpose flag: it concerns the BLAS matrix-matrix multiplication (it can't affect copying, since copying does not have a transpose flag). Is there a need to modify the signature of @TYPE@_matmul_matrixmatrix, or is it sufficient to just swap some strides? The copying will definitely normalize the matrix strides, so strides[2] < strides[1] will be ensured after the copy.
The copying of the output when it is already aligned still seems to be there!
I've implemented that in the most recent version, too.
I guess somewhat related, we have to ensure that all paths get hit by the tests.
EDIT: Note that one could, e.g., create a single array and use the axes argument to swap axes, etc.
Do you mean tests or benchmarks? During development I had the impression that the test coverage is quite good. The benchmarks could use an update.
Unfortunately, C coverage isn't run in the test suite anymore. FWIW, I suspect you are right and test coverage is fairly good. The one thing I could imagine missing would be a strided out= argument.
numpy/core/src/umath/matmul.c.src (Outdated)
}
else {
    for (n = 0; n < dn; n++) {
        for (m = 0; m < dm; m++) {
I didn't check and it is probably true. But you should ensure that is_m < is_n (which depends on the memory order). If that is not true, the loop order needs to be swapped.
Can you go into a little more detail about what should be done if that is the case?
Would probably make sense to cover this with another test case, wouldn't it?
Well, I guess lapack mainly supports a.T @ b.T = o.T, and for that it would make sense to swap things early on (m and n). It might not matter too much in practice, since it may be that for larger matrices (things don't fit into L1 comfortably?), the actual matmul dominates anyway.
There is one other thing which could make sense. IIRC when n is small (mainly?) it's not really worthwhile to do the copy at all; from very rough timing I think n < 10 might just be a reasonable threshold.
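Pulling the copy-order point of this thread together, here is a hedged sketch of what the swapped iteration could look like. The helper and names are illustrative (double case only; is_m/is_n are the byte strides of the operand's two core axes), and the n < 10 copy-skip heuristic mentioned above would sit in front of any call like this.

```c
#include <stddef.h>

/* Sketch: copy a strided (dm x dn) operand into a contiguous buffer,
 * choosing the iteration order from the strides.  In the F-ordered case
 * the buffer keeps F order (i.e. holds the transpose in C order), and the
 * GEMM call then gets the transpose flag for this operand instead of a
 * transposing copy. */
static void
copy_core(char *src, ptrdiff_t is_m, ptrdiff_t is_n,
          ptrdiff_t dm, ptrdiff_t dn, double *dst)
{
    if (is_n <= is_m) {
        /* C-ish order: inner loop walks the smaller (column) stride,
         * writes into dst are contiguous. */
        for (ptrdiff_t m = 0; m < dm; m++) {
            for (ptrdiff_t n = 0; n < dn; n++) {
                dst[m * dn + n] = *(double *)(src + m * is_m + n * is_n);
            }
        }
    }
    else {
        /* F-ish order: swapped loops, inner loop walks the smaller (row)
         * stride; dst is filled in F order, writes stay contiguous. */
        for (ptrdiff_t n = 0; n < dn; n++) {
            for (ptrdiff_t m = 0; m < dm; m++) {
                dst[n * dm + m] = *(double *)(src + m * is_m + n * is_n);
            }
        }
    }
}
```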
@xor2k you asked yesterday about what to do here. This is maybe mainly a general improvement; on 1.23.4 (probably with openblas), I get:
with your branch, the first one goes into the same path as the contiguous one:
We can see that the blas overhead is significant. I don't want to map that out exactly, and yes, different blas versions will behave very differently. Honestly, we can also opt to ignore it, because currently we already run into the "slow" path when the input is contiguous.
As for the discussion about order: let's ignore it, I think it is irrelevant, sorry. If arrays are tiny, the cache will take care of things. If (working) arrays are large, the matmul should dominate by a huge amount anyway.
The other point is that we still need to fix the error handling if the malloc fails. If it fails, you need to grab the GIL and set a memory error; you can grep for how this is done elsewhere. As far as I understood Matti, he also thought that going into the (potentially super slow) fallback path when we are so low on memory is probably not useful.
Okay, I understand. However, this barely affects this pull request, as it is an issue that affects the use of BLAS in general, so it has affected Numpy even before my pull request. Other programming languages have reported that "issue" (I won't call it a bug) as well; see e.g. a long discussion with benchmarks going back to 2013. So one can basically compile Numpy without BLAS support, e.g. by running
in a test venv/conda environment, and many small matrix multiplications will be much faster. I suggest opening a fresh issue for the many-small-matmul performance. I have some ideas already. I'll be on vacation next week but will be available the week after.
The main issue is that BLAS does not support "batched matrix multiply" (BMM), which would be really useful in cases like that; compare https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/. In PyTorch, they have it: https://pytorch.org/docs/stable/generated/torch.bmm.html. BMM would allow skipping the outer loop and taking the batch dimension (10000) as an argument to the accelerated call instead of making it a loop of size 10000. I see these options:
Any thoughts?
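For context on why BMM matters here, a schematic sketch of what the stacked case amounts to today: one GEMM call per matrix in the batch, so the per-call BLAS overhead is paid batch-size times. Names are illustrative (double case, all arrays assumed C-contiguous); this is not numpy's actual loop, just the shape of it.

```c
#include <stddef.h>
#include <cblas.h>

/* Schematic: batched matmul as a loop of individual GEMM calls.
 * a has shape (batch, m, n), b has shape (batch, n, p), c has shape
 * (batch, m, p), all C-contiguous.  A strided-batched GEMM would replace
 * this whole loop with a single library call. */
static void
batched_matmul(ptrdiff_t batch, ptrdiff_t m, ptrdiff_t n, ptrdiff_t p,
               const double *a, const double *b, double *c)
{
    for (ptrdiff_t i = 0; i < batch; i++) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    (int)m, (int)p, (int)n,
                    1.0, a + i * m * n, (int)n,
                    b + i * n * p, (int)p,
                    0.0, c + i * m * p, (int)p);
    }
}
```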
Only 1 and 7 sound reasonable to me; everything else is unnecessarily complicated or just impossible.
I would argue that ignoring the problem is perfectly fair (I vote for option 1). This pull request aims to fix a ~100x slowdown caused by completely disregarding BLAS and defaulting to an extremely naïve matrix multiply. The only users that will notice a performance reduction are ones that are explicitly taking advantage of this bug/feature to trick numpy into avoiding BLAS for small arrays by breaking the strides. In any case, a 100x speedup on large operations (seconds) is worth more than a small offset on small operations. Overhead is always present in numpy, and users tend to be aware of this trade-off, which gives good performance for large operations.
Sorry for my late answer. I have made some tests and unfortunately could not come up with safe values to hardcode. I would also vote for option 1.
Could you run the benchmarks again?
LGTM; just for the part related to memory allocations, we should use our custom memory allocators.
What exactly do you mean? PyMem_Malloc() and PyMem_Free()? Or something NumPy specific?
Must have accidentally closed the pull request, reopening.
Thanks Michael, sorry for this taking so long. I'll resist the temptation to think about whether this can be shortened a bit and just put it in :).
It is a significant improvement after all!
If anyone notices a performance regression in their use-case and stumbles on this: Please let us know, this could (and maybe should) have a heuristic to decide to skip blas.
Let's follow up, it's a tiny thing and we use malloc in many places still (even if that isn't great).
Thank you! It took long indeed. However, I think the true bottleneck after all was a proper assessment of how much of a speedup this merge request actually brings, which should be essential for every such undertaking and can be done conclusively, though it took a while to come up with the right approach. Regarding tiny (but also small) matrices, I think the BLAS frameworks should fix their performance issues themselves first (if there are any), so that it is never necessary to avoid BLAS in the first place. I advertised #23752 (comment) in the respective GitHub repositories of the open-source frameworks (references can be found in this merge request as well); let's see what will happen.
Btw., if somebody is interested: here is the code to render the graphs above: https://gist.github.com/xor2k/b2b7d1d5e87bfe8a8a30c2d0c7e12f9e
Here is a link to the LinkedIn post I made about it: https://www.linkedin.com/posts/michael-siebert-67bb3338_made-matmul-a-core-function-of-numpy-around-activity-7304556069832331264-wgz6
Should provide a solution for #23588
Further speed/memory optimizations are still possible (I'll work on them these days), but this is an initial working solution that does the trick. Open for feedback already.
Closes gh-23123, gh-23588