DOC: clarify int type constraint in numpy.bincount #23552

Status: Open — onitake wants to merge 5 commits into main from doc/bincount-int-dtype

Conversation

@onitake commented Apr 7, 2023

This docstring change clarifies the int dtype constraint in np.bincount().

The input elements are not only required to be positive integers, as stated; they must actually be of the native int type. Otherwise, np.bincount() raises a TypeError when the input dtype is wider than the native array index type.
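
A minimal reproduction (a sketch: it assumes a platform where np.intp is 32 bits wide, and the exact error text varies between NumPy versions):

import numpy as np

data = np.array([0, 1, 1, 2], dtype=np.int64)
np.bincount(data)
# On such a platform this raises something like:
# TypeError: Cannot cast array data from dtype('int64') to dtype('int32')
# according to the rule 'safe'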

This was discovered while debugging the https://github.com/mikedh/trimesh library on a 32-bit system. That library uses np.int64 gratuitously, which causes many unit tests to fail: https://salsa.debian.org/3dprinting-team/trimesh/-/jobs/4103005#L2836

It would be very helpful if this were clearly explained in the NumPy documentation.

Fixes: #23526

@onitake force-pushed the doc/bincount-int-dtype branch 3 times, most recently from dc714b1 to 11df95e on April 7, 2023 at 17:23
@MatteoRaso (Contributor)

Thanks for the PR. I'm a little confused about what you mean, though. If I'm understanding the problem correctly, the code would still work without error if you used np.int8 or np.int16, which aren't the native int type. A more accurate doc example might be "this code can't use dtype=np.int64 on a 32-bit architecture".
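
For illustration, a sketch of the narrower-dtype case (comments show the expected counts):

import numpy as np

# int8/int16 inputs cast safely to a 32-bit index type, so these
# succeed even where np.int64 input would raise:
np.bincount(np.array([0, 1, 1, 2], dtype=np.int8))   # -> array([1, 2, 1])
np.bincount(np.array([0, 1, 1, 2], dtype=np.int16))  # -> array([1, 2, 1])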

@onitake (Author) commented Apr 8, 2023

You're right, a smaller data type will also work.
I was too focused on the wrong assumptions made in the Python module I was debugging (the developers used int64 all over the place), and I didn't even think of that.

Nevertheless, I think the limitation should be made clear. I'll try to rewrite the text.

@MatteoRaso (Contributor) left a comment

Ok, just a few minor details that need to be fixed up, and then this should be ready to merge.

EDIT: Also, the documentation on line 931 has to be tweaked.

Two review threads on numpy/core/multiarray.py (outdated, resolved)
@onitake (Author) commented Apr 9, 2023

> EDIT: Also, the documentation on line 931 has to be tweaked.

Done.

@MatteoRaso (Contributor) left a comment

LGTM.

@rossbar (Contributor) left a comment

Out of curiosity, is this something that should be called out? I suspect that using np.int64 on a 32-bit system would cause problems everywhere, not just with bincount.

@onitake (Author) commented Apr 10, 2023

> Out of curiosity, is this something that should be called out? I suspect that using np.int64 on a 32-bit system would cause problems everywhere, not just with bincount.

Aside from bincount, I identified two other functions that depend on a native index type: take and repeat; at least, those are the ones that were causing problems in the trimesh module. I suppose there may be others as well.
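
A sketch of the same failure mode in those two functions (again assuming a platform where np.intp is 32 bits wide):

import numpy as np

a = np.arange(4)
# Both functions use their integer arguments as indices/counts,
# so the same safe-cast rule applies:
np.take(a, np.array([0, 2], dtype=np.int64))          # TypeError where np.intp is 32-bit
np.repeat(a, np.array([1, 2, 1, 1], dtype=np.int64))  # likewise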

@MatteoRaso (Contributor) commented Apr 11, 2023

> Out of curiosity, is this something that should be called out? I suspect that using np.int64 on a 32-bit system would cause problems everywhere, not just with bincount.

It's not obvious that this is the case, though, since 32-bit systems can still support 64-bit integers.

https://stackoverflow.com/questions/23038451/how-does-a-32-bit-processor-support-64-bit-integers

np.float128 can also be used just fine on x86_64 architecture (and maybe 32-bit systems, although I have no way of checking), so the disparity between the size of the dtype and the architecture isn't always a problem.

@rkern (Member) commented Apr 11, 2023

The actual requirement is more that the input must be safely castable to the integer type used for indexing, i.e. np.intp. All of the affected functions use the inputs as indices (or use them to manipulate indices). Lots of other functions work fine with np.int64 on 32-bit systems; you can add, multiply, and subtract them all you like, but when you want to use them as indices, we need to convert them to np.intp first, and the default safe casting raises an error in such cases because that would be a downcast.

Just a note: np.intp isn't the "native integer dtype"; we usually call the result of np.dtype(int) the native integer dtype, i.e. whatever a C long is. For very annoying reasons, on 64-bit Windows these are not the same: the 64-bitness specifies the integer type of memory addresses (and thus indices), so np.intp is np.int64, but np.dtype(int) is still np.int32 because Windows decided to keep the C long type 32-bit.
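
The distinction can be checked directly with np.can_cast; a sketch (the np.dtype(int) behavior shown is the 64-bit-Windows situation described above, as of NumPy 1.x):

import numpy as np

# What matters for indexing is safe-castability to np.intp:
np.can_cast(np.int64, np.intp, casting='safe')  # False on 32-bit, True on 64-bit
# On 64-bit Windows (NumPy 1.x), np.intp is int64 but np.dtype(int)
# is int32, because a C long stayed 32-bit there:
print(np.dtype(np.intp), np.dtype(int))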

@onitake (Author) commented Apr 11, 2023

Does that mean dtype=int really is the appropriate type and will work on all platforms?
Because that's the way I've worked around the problems in trimesh: https://salsa.debian.org/3dprinting-team/trimesh/-/blob/master/debian/patches/02-array-index-32bit-architectures.patch

@rkern (Member) commented Apr 11, 2023

No, dtype=np.intp.

@onitake (Author) commented Apr 12, 2023

> No, dtype=np.intp.

Oh. I see. That's quite confusing. 😞

@seberg (Member) commented Apr 12, 2023

FWIW, I am hoping we can change intp to be the default integer type in NumPy 2.0 (which could hopefully be released next year). That changes behavior a fair bit on Windows, but for most users getting int64 rather than int32 really shouldn't matter in practice.

In many places (apparently not bincount), NumPy uses "same-kind" casting for indexing-related operations, because otherwise the int64 -> int32 cast failing is tedious. In practice nobody really notices, but this is unsafe (you can pass very large integer index values that should raise an error but don't).

(It would be nice to find a solution for this too, but we do not have it.)
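
To make the two casting rules concrete, a small sketch:

import numpy as np

# 'safe' rejects the potentially lossy int64 -> int32 downcast,
# while 'same_kind' allows it:
np.can_cast(np.int64, np.int32, casting='safe')       # False
np.can_cast(np.int64, np.int32, casting='same_kind')  # True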

@onitake (Author) commented May 4, 2023

A possible solution could be to add new casting modes to ndarray.astype:

  • Always downcast, but print a warning when there is data loss
  • Downcast when there is no data loss, but raise an error if a value exceeds the range of the target type
  • Downcast with saturation, i.e. limit the range to the target type and never exceed the positive or negative maximum value

All of these would come at a performance cost, but some vectorized optimization may be possible.
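
A minimal sketch of the third (saturating) variant, built from existing NumPy operations; astype_saturate is a hypothetical helper name, not a proposed API:

import numpy as np

def astype_saturate(arr, dtype):
    # Clamp to the representable range of the target dtype before
    # casting, so out-of-range values saturate instead of wrapping.
    info = np.iinfo(dtype)
    return np.clip(arr, info.min, info.max).astype(dtype)

astype_saturate(np.array([2**40, -(2**40), 7], dtype=np.int64), np.int32)
# -> array with values [2147483647, -2147483648, 7], dtype=int32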

@charris (Member) commented May 10, 2025

This looks good, but numpy/core/multiarray.py has been renamed to numpy/_core/multiarray.py. Could you make the update?

@onitake force-pushed the doc/bincount-int-dtype branch from 4fdf66a to 782351e on May 11, 2025 at 08:19
@onitake force-pushed the doc/bincount-int-dtype branch from 6281125 to ace6dbd on May 11, 2025 at 08:29
@onitake (Author) commented May 11, 2025

I rebased the branch, and also added a few more tweaks based on the discussion above.

@seberg (Member) commented May 11, 2025

@onitake thanks for the update. Unfortunately (or fortunately, actually!), we put in #28355, which probably means this doesn't happen anymore -- that isn't ideal, but in practice probably better than raising annoyingly (the tests wouldn't notice, due to not building docs on 32-bit).

(The PR was about allowing unsigned, which is safer; I am not sure we actually thought of int64 -> int32. I am happy to keep it, though, because that is what indexing also does in practice. We should -- which I guess doesn't mean it'll happen soon -- have a mode that raises if the int64 is out of bounds...)

Anyway, long story, while the changes look nice, I think we can basically just close it.

@seberg added the label 57 - Close? (issues which may be closable unless discussion continued) on May 11, 2025
@onitake (Author) commented May 12, 2025

What does that mean in practice, i.e. how would the following cases be treated with the new implementation?

  1. On a 32-bit arch (where sizeof(np.intp)==4): np.bincount is called on a list of np.uint64 that contains only values between 0 and 1 million.
  2. On a 32-bit arch: np.bincount is called on a list of np.uint64 that contains random values from the whole 0..2⁶⁴-1 interval.

Since the patch seems to force casting in some cases, it would be interesting to know the effects from a correctness perspective (e.g. casts with undefined results), a performance perspective (e.g. extra operations), and a user-experience perspective (e.g. whether runtime exceptions would be generated).

@seberg (Member) commented May 12, 2025

> it would be interesting to know the effects from a correctness (e.g. cast with undefined result), performance (e.g. extra operations), and user experience perspective (e.g. if runtime exceptions would be generated).

Right now, no correctness is enforced; you get potentially wrong results. But if you really have a few random values like you suggest, in practice you will probably get some form of memory error either way, I would think.
In the future, depending on the performance impact, it might reliably raise an error.

@onitake (Author) commented May 13, 2025

@seberg In that case, I don't think the patch in #28355 should have been accepted as-is.
The previous implementation produced an error when the data type was too large, even when no values were outside the acceptable range.

What we have now can lead to undefined behavior, because truncation will produce incorrect results that don't necessarily trigger a memory error. Consider the following test case:

import numpy as np
# All values lie just above 2**32, so a 32-bit index type truncates
# them to small values in the range 1..14:
data = np.random.default_rng().integers(low=0x100000001, high=0x10000000f, dtype=np.int64, size=10)
np.bincount(data)

When run on an architecture where the array index type is 32 bits wide, bincount will happily truncate the random values to their lower 32 bits (0x100000001 = 2³² + 1 truncates to 1, and so on), producing a bincount result where some of the first 15 entries have nonzero counts, which is incorrect.
On a 64-bit architecture, an array with roughly 4.3 billion entries would be produced, which is correct and definitely within the capabilities of current hardware.

@onitake (Author) commented May 13, 2025

FYI: I tested this in a 32-bit test environment, and it did in fact produce an incorrect result:

>>> import numpy as np
>>> data = np.random.default_rng().integers(low=0x100000001, high=0x10000000f, dtype=np.int64, size=10)
>>> data
array([4294967299, 4294967305, 4294967300, 4294967309, 4294967304,
       4294967310, 4294967309, 4294967300, 4294967306, 4294967301],
      dtype=int64)
>>> result = np.bincount(data)
>>> result
array([0, 0, 0, 1, 2, 1, 0, 0, 1, 1, 1, 0, 0, 2, 1])
>>> result.dtype
dtype('int32')

Labels: 04 - Documentation; 57 - Close? (issues which may be closable unless discussion continued)
Linked issue (may be closed by this PR): DOC: bincount documentation should mention native index type requirement