Description
Describe the issue:
The implementation of matmul does some basic checks to verify whether the arrays can be passed to gemm, and if it deems this is not possible it falls back to a noblas routine that has no regard for the memory cache, resulting in ~100x slowdowns. Here is the offending bit of source code.
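For illustration, the effect of the check is easy to see from Python: a strided view is neither C- nor F-contiguous, which is what pushes matmul off the gemm path (this mirrors the spirit of the eligibility test, not the actual C source):

import numpy as np

x = np.random.randn(100, 100)[::2, ::2]  # strided view into a larger array

# Neither flag is set, so the array cannot be handed to gemm as-is.
print(x.flags['C_CONTIGUOUS'], x.flags['F_CONTIGUOUS'])  # False False
print(x.strides)  # (1600, 16): both strides skip every other element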
dot, on the other hand, is a lot more flexible, attempting to make a copy before passing the array to the BLAS routine. For small arrays the difference is not significant, but for large arrays this results in much better performance. This behavior is explicitly visible in the source code (lines 234 and 243 are the bad-stride copies) and can be confirmed by profiling, as shown by hpaulj.
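The practical workaround today is to do that copy by hand. Something like the following (my own sketch, not library code) restores the fast gemm path for matmul:

import numpy as np

x = np.random.randn(2600, 2600)[::2, ::2]
y = np.random.randn(2600, 2600)[::2, ::2]

# Copying the badly-strided views into contiguous buffers first, which is
# roughly what dot does internally, lets matmul use BLAS again.
result = np.matmul(np.ascontiguousarray(x), np.ascontiguousarray(y))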
Neither of the docs pages for dot or matmul mentions this behavior. On the contrary, the dot page states: "If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred."
There are a handful of similar issues open:
- BUG: Multiplication by real part of complex matrix very slow #23123
- complextype
- views
- BUG: Matrix multiplication with integers not using the cache friendly loop order #23260
I am not sure what the best fix is. Updating the docs would be a good start. Perhaps matmul should try to copy too, at least for large arrays?
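As a rough sketch of what that could look like, expressed with the public API rather than the C internals (hypothetical, the names are mine):

import numpy as np

def matmul_copying(a, b):
    # Hypothetical fix: mimic dot by copying badly-strided operands
    # into contiguous memory before multiplying.
    if not (a.flags['C_CONTIGUOUS'] or a.flags['F_CONTIGUOUS']):
        a = np.ascontiguousarray(a)
    if not (b.flags['C_CONTIGUOUS'] or b.flags['F_CONTIGUOUS']):
        b = np.ascontiguousarray(b)
    return np.matmul(a, b)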
Reproduce the code example:
import numpy as np
from timeit import timeit

N = 2600
xx = np.random.randn(N, N)
yy = np.random.randn(N, N)

# Take strided views: same memory, but no longer contiguous.
x = xx[::2, ::2]
y = yy[::2, ::2]
assert np.shares_memory(x, xx)
assert np.shares_memory(y, yy)

dot = timeit('np.dot(x, y)', number=10, globals=globals())
matmul = timeit('np.matmul(x, y)', number=10, globals=globals())
print('time for np.matmul: ', matmul)
print('time for np.dot: ', dot)
Error message:
time for np.matmul: 29.04214559996035
time for np.dot: 0.2680714999441989
Runtime information:
> import sys, numpy; print(numpy.__version__); print(sys.version)
1.24.2
3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
> print(numpy.show_runtime())
[{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
'found': ['SSSE3',
'SSE41',
'POPCNT',
'SSE42',
'AVX',
'F16C',
'FMA3',
'AVX2'],
'not_found': ['AVX512F',
'AVX512CD',
'AVX512_SKX',
'AVX512_CLX',
'AVX512_CNL',
'AVX512_ICL']}},
{'architecture': 'Prescott',
'filepath': 'C:\\Users\\______\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\numpy\\.libs\\libopenblas64__v0.3.21-gcc_10_3_0.dll',
'internal_api': 'openblas',
'num_threads': 24,
'prefix': 'libopenblas',
'threading_layer': 'pthreads',
'user_api': 'blas',
'version': '0.3.21'}]
None
Context for the issue:
The scientific computing community regularly uses the @ shorthand on large arrays. Developers expect that with NumPy they do not need to think about the CPU cache, and that operations run close to compute limits rather than memory-bandwidth limits.
At the very least, a warning should be provided to users so that they know why their code is running orders of magnitude slower than expected.
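For concreteness, the kind of warning I have in mind (a hypothetical sketch, not an existing NumPy message):

import warnings

# Hypothetical: emitted when matmul rejects the BLAS path on
# badly-strided inputs and falls back to the slow inner loop.
warnings.warn(
    "matmul: operands are not BLAS-compatible; falling back to a slow "
    "non-BLAS loop. Consider np.ascontiguousarray() on the inputs.",
    RuntimeWarning,
)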