Description
I noticed that `histogram2d` (which uses `histogramdd`) takes the same time for bins that are uniform by construction (specified by a number of bins and a range) as for bins that may not be uniform (specified as an array of edges). This is due to line 1058 in 8829b80, which internally makes an array of edges to later use with `searchsorted`.

Is there a reason not to calculate the bin index directly for the case of uniform binning (as appears to be done for the 1D `histogram`)?

Locally, I've been testing an implementation here, which does exactly this and passes unit testing.
In my use case, I get a 4-5x speedup. More generally, I observe the following speedups for a 2D Gaussian distribution where both dimensions have uniform binning.
Reproducing code example:
import numpy as np
import time
np.random.seed(42)
xy = np.random.normal(0, 1, (int(5e6), 2))
# warmup
_ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[100, 100], range=[[-3, 3], [-3, 3]])
# bins are uniform by construction
t0 = time.perf_counter()
h1 = np.histogram2d(xy[:, 0], xy[:, 1], bins=[100, 100], range=[[-3, 3], [-3, 3]])
print(time.perf_counter() - t0)
# bins may not be uniform; build the edges before starting the timer
edges = np.linspace(-3, 3, 100 + 1)
t0 = time.perf_counter()
h2 = np.histogram2d(xy[:, 0], xy[:, 1], bins=[edges, edges])
print(time.perf_counter() - t0)
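To make the question concrete, here is a rough sketch of what "directly calculating the bin index" could look like for uniform 2D binning. It is not the implementation mentioned above and not NumPy's code: `uniform_hist2d` and its `range_` parameter are hypothetical names used only for illustration. It also skips the care needed for values that land within floating-point rounding of a bin edge, where plain index arithmetic and a `searchsorted` lookup may disagree.

```python
import numpy as np

def uniform_hist2d(x, y, bins, range_):
    """Hypothetical helper: 2D histogram for uniform bins via index arithmetic."""
    nx, ny = bins
    (xmin, xmax), (ymin, ymax) = range_
    # Keep only samples inside the closed range, as np.histogram2d does.
    keep = (x >= xmin) & (x <= xmax) & (y >= ymin) & (y <= ymax)
    x, y = x[keep], y[keep]
    # Scale each coordinate to a fractional bin position and truncate.
    ix = ((x - xmin) * (nx / (xmax - xmin))).astype(np.intp)
    iy = ((y - ymin) * (ny / (ymax - ymin))).astype(np.intp)
    # Samples exactly on the upper edge belong to the last bin.
    ix[ix == nx] = nx - 1
    iy[iy == ny] = ny - 1
    # Count flattened (ix, iy) pairs, then restore the 2D shape.
    flat = ix * ny + iy
    return np.bincount(flat, minlength=nx * ny).reshape(nx, ny)
```

This replaces the per-sample binary search over an edge array with one multiply and one truncation per coordinate, which is presumably where a speedup for uniform bins would come from.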
NumPy/Python version information:
1.19.0
3.7.3 (default, Mar 27 2019, 09:23:39) [Clang 10.0.0 (clang-1000.11.45.5)]