BUG: Limit the maximal number of bins for automatic histogram binning #28426

Merged
merged 22 commits into from
Mar 17, 2025
doc/release/upcoming_changes/28426.change.rst (6 changes: 6 additions & 0 deletions)
@@ -0,0 +1,6 @@
+Changes to automatic bin selection in numpy.histogram
+-----------------------------------------------------
+The automatic bin selection algorithm in ``numpy.histogram`` has been modified
+to avoid out-of-memory errors for samples with low variation.
+For full control over the selected bins, the user can set
+the ``bins`` or ``range`` parameters of ``numpy.histogram``.
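
As an illustration of the last sentence of the release note, the sketch below passes explicit `bins` and `range` arguments so that no automatic bin selection takes place. The sample is borrowed from the regression test in this PR; the choice of 4 bins is arbitrary and only for illustration.

```python
import numpy as np

# Low-variation sample borrowed from test_gh_28400 in this PR
e = 1 + 1e-12
data = np.array([0, 1, 1, 1, 1, 1, e, e, e, e, e, e, 2])

# Explicit bin count and range: the "auto" estimator is never consulted
counts, edges = np.histogram(data, bins=4, range=(0.0, 2.0))

# Explicit bin edges give full control over where the boundaries fall
counts_custom, edges_custom = np.histogram(data, bins=[0.0, 0.5, 1.5, 2.0])
```

Either form keeps the number of bins fixed regardless of how little the sample varies.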
numpy/lib/_histograms_impl.py (24 changes: 9 additions & 15 deletions)
@@ -228,28 +228,24 @@ def _hist_bin_fd(x, range):

 def _hist_bin_auto(x, range):
     """
-    Histogram bin estimator that uses the minimum width of the
-    Freedman-Diaconis and Sturges estimators if the FD bin width is non-zero.
-    If the bin width from the FD estimator is 0, the Sturges estimator is used.
+    Histogram bin estimator that uses the minimum width of the relaxed
+    Freedman-Diaconis and Sturges estimators if the FD bin width does
+    not result in a large number of bins. The relaxed Freedman-Diaconis estimator
+    limits the bin width to at least half the sqrt estimate to avoid small bins.

     The FD estimator is usually the most robust method, but its width
     estimate tends to be too large for small `x` and bad for data with limited
     variance. The Sturges estimator is quite good for small (<1000) datasets
     and is the default in the R language. This method gives good off-the-shelf
     behaviour.

-    If there is limited variance the IQR can be 0, which results in the
-    FD bin width being 0 too. This is not a valid bin width, so
-    ``np.histogram_bin_edges`` chooses 1 bin instead, which may not be optimal.
-    If the IQR is 0, it's unlikely any variance-based estimators will be of
-    use, so we revert to the Sturges estimator, which only uses the size of the
-    dataset in its calculation.

     Parameters
     ----------
     x : array_like
         Input data that is to be histogrammed, trimmed to range. May not
         be empty.
+    range : Tuple with range for the histogram

     Returns
     -------
@@ -261,12 +257,10 @@ def _hist_bin_auto(x, range):
     """
     fd_bw = _hist_bin_fd(x, range)
     sturges_bw = _hist_bin_sturges(x, range)
-    del range  # unused
-    if fd_bw:
-        return min(fd_bw, sturges_bw)
-    else:
-        # limited variance, so we return a len dependent bw estimator
-        return sturges_bw
+    sqrt_bw = _hist_bin_sqrt(x, range)
+    # heuristic to limit the maximal number of bins
+    fd_bw_corrected = max(fd_bw, sqrt_bw / 2)
+    return min(fd_bw_corrected, sturges_bw)


 # Private dict initialized at module load time
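
The effect of the change can be sketched outside NumPy's private helpers. The function below is a hypothetical stand-in, not the library code: it assumes the usual definitions of the three widths (Freedman-Diaconis as `2 * IQR * n**(-1/3)`, Sturges as `ptp / (log2(n) + 1)`, sqrt as `ptp / sqrt(n)`) and compares the old and new selection rules on the sample from the regression test.

```python
import math
import numpy as np

def bin_widths_sketch(x):
    """Return (old_auto_width, new_auto_width) for a non-empty 1-D sample.

    Hypothetical helper for illustration; it mirrors the logic shown in the
    diff but is not NumPy's implementation.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    data_range = np.ptp(x)
    q75, q25 = np.percentile(x, [75, 25])

    fd_bw = 2.0 * (q75 - q25) * n ** (-1.0 / 3.0)   # Freedman-Diaconis
    sturges_bw = data_range / (np.log2(n) + 1.0)    # Sturges
    sqrt_bw = data_range / np.sqrt(n)               # square-root rule

    # Old rule: fall back to Sturges only when the FD width is exactly 0
    old_bw = min(fd_bw, sturges_bw) if fd_bw else sturges_bw
    # New rule: never let the FD width drop below half the sqrt width
    new_bw = min(max(fd_bw, sqrt_bw / 2), sturges_bw)
    return old_bw, new_bw

e = 1 + 1e-12
Z = [0, 1, 1, 1, 1, 1, e, e, e, e, e, e, 2]
old_bw, new_bw = bin_widths_sketch(Z)

# Number of equal-width bins needed to cover the range [0, 2]
old_bins = math.ceil(2 / old_bw)   # on the order of 1e12: tiny but non-zero IQR
new_bins = math.ceil(2 / new_bw)   # single digits
```

Under these assumptions the old rule is only protected against an IQR of exactly 0, so a tiny non-zero IQR still yields an enormous bin count, while the new rule bounds the width from below.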
numpy/lib/tests/test_histograms.py (14 changes: 11 additions & 3 deletions)
@@ -416,6 +416,13 @@ def test_gh_23110(self):
         expected_hist = np.array([1, 0])
         assert_array_equal(hist, expected_hist)

+    def test_gh_28400(self):
+        e = 1 + 1e-12
+        Z = [0, 1, 1, 1, 1, 1, e, e, e, e, e, e, 2]
+        counts, edges = np.histogram(Z, bins="auto")
+        assert len(counts) < 10
+        assert edges[0] == Z[0]
+        assert edges[-1] == Z[-1]

 class TestHistogramOptimBinNums:
     """
@@ -502,15 +509,16 @@ def test_novariance(self):

     def test_limited_variance(self):
         """
-        Check when IQR is 0, but variance exists, we return the sturges value
-        and not the fd value.
+        Check when IQR is 0, but variance exists, we return a reasonable value.
         """
         lim_var_data = np.ones(1000)
         lim_var_data[:3] = 0
         lim_var_data[-4:] = 100

         edges_auto = histogram_bin_edges(lim_var_data, 'auto')
-        assert_equal(edges_auto, np.linspace(0, 100, 12))
+        assert_equal(edges_auto[0], 0)
+        assert_equal(edges_auto[-1], 100.)
+        assert len(edges_auto) < 100

         edges_fd = histogram_bin_edges(lim_var_data, 'fd')
         assert_equal(edges_fd, np.array([0, 100]))
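
The relaxed assertions in `test_limited_variance` can be motivated with a short calculation. The sketch below assumes that, for this sample, the selected width is half the sqrt width (the IQR is 0, so the FD width is 0) and that the bin count is the range divided by the width, rounded up; both are assumptions about how the width is turned into edges rather than something shown in the diff.

```python
import numpy as np

# Same sample as test_limited_variance: IQR is 0 but the range is 100
lim_var_data = np.ones(1000)
lim_var_data[:3] = 0
lim_var_data[-4:] = 100

first_edge, last_edge = lim_var_data.min(), lim_var_data.max()

# Assumption: the relaxed FD width (half the sqrt width) is the one selected
width = (last_edge - first_edge) / np.sqrt(lim_var_data.size) / 2

# Assumption: equal-width bins covering the full range, count rounded up
n_bins = int(np.ceil((last_edge - first_edge) / width))
edges = np.linspace(first_edge, last_edge, n_bins + 1)
```

With 1000 samples this gives roughly `2 * sqrt(1000)` bins, which is why the test now checks only the end points and an upper bound on the number of edges instead of an exact `linspace`.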