Description
Describe the issue:
I was comparing the performance of numpy and numba for matrix multiplication. I used integer inputs so that numpy does not dispatch to BLAS and the comparison is fairer (both numpy and numba then rely only on their own vectorized loops).
Based on a discussion with another user, it appears that the loop order used by @TYPE@_matmul_inner_noblas is not the most cache-friendly one.
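As a quick sanity check (not part of the original report), one way to see that the integer path avoids BLAS is to compare float64 and integer matmul timings: when numpy is linked against a BLAS, the float64 path is typically much faster than the generic integer loop. The array sizes below are illustrative only.
import timeit
import numpy as np

n = 1000
A_f = np.random.rand(n, n)                    # float64: typically handled by BLAS if numpy is linked against one
B_f = np.random.rand(n, n)
A_i = np.random.randint(1, 100, size=(n, n))  # integer: handled by the generic (noblas) inner loop
B_i = np.random.randint(1, 100, size=(n, n))

print("float64 @:", timeit.timeit(lambda: A_f @ B_f, number=3))
print("int     @:", timeit.timeit(lambda: A_i @ B_i, number=3))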
Reproduce the code example:
import numpy as np
from numba import njit, prange

@njit
def matrix_multiplication(A, B):  # i, j, k order: innermost loop runs along contiguous rows of C and B
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(n):
            for k in range(p):
                C[i, k] += A[i, j] * B[j, k]
    return C

@njit
def matrix_multiplication2(A, B):  # i, k, j order, equivalent to numpy: innermost loop strides down columns of B
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i in prange(m):
        for k in range(p):
            for j in range(n):
                C[i, k] += A[i, j] * B[j, k]
    return C
m = 1000
n = 1000
p = 1000
A = np.random.randint(1, 100, size=(m, n))
B = np.random.randint(1, 100, size=(n, p))
# compile function
matrix_multiplication(A, B)
matrix_multiplication2(A, B)
%timeit matrix_multiplication(A, B)
%timeit matrix_multiplication2(A, B)
%timeit A @ B
# numpy is a little slower than matrix_multiplication but faster than matrix_multiplication2
# matrix_multiplication:  1.62 s ± 167 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# matrix_multiplication2: 2.48 s ± 157 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# A @ B:                  2 s ± 37.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
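For reference, if the snippet above is run as a plain script rather than in IPython (where %timeit is unavailable), an equivalent measurement with timeit might look like the sketch below; it reuses the arrays and functions defined above, and its output would not be the numbers reported here.
import timeit

for label, fn in [("matrix_multiplication", lambda: matrix_multiplication(A, B)),
                  ("matrix_multiplication2", lambda: matrix_multiplication2(A, B)),
                  ("A @ B", lambda: A @ B)]:
    t = timeit.timeit(fn, number=3) / 3  # average seconds per call over 3 runs
    print(f"{label}: {t:.2f} s per call")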
Error message:
No response
Runtime information:
NumPy version: 1.23.5
Python version: 3.10.0 (tags/v3.10.0:b494f59, Oct 4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)]
Context for the issue:
No response