DOC: clarify int type constraint in numpy.bincount #23552

Status: Open — onitake wants to merge 5 commits into main from doc/bincount-int-dtype

Conversation

@onitake commented Apr 7, 2023

This docstring change clarifies the int dtype constraint in np.bincount().

The input elements are not only required to be positive integers, as stated; they must actually be of the native int type. Otherwise, np.bincount() raises a TypeError when the input dtype is wider than the native array index type.
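
A minimal reproduction (a sketch: it assumes a platform where np.intp is 32 bits wide, and the exact error text varies between NumPy versions):

import numpy as np

data = np.array([0, 1, 1, 2], dtype=np.int64)
np.bincount(data)
# On such a platform this raises something like:
# TypeError: Cannot cast array data from dtype('int64') to dtype('int32')
# according to the rule 'safe'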

This was discovered while debugging the https://github.com/mikedh/trimesh library on a 32-bit system. That library uses np.int64 gratuitously, which causes many unit tests to fail: https://salsa.debian.org/3dprinting-team/trimesh/-/jobs/4103005#L2836

It would be very helpful if this were clearly explained in the NumPy documentation.

Fixes: #23526

@onitake force-pushed the doc/bincount-int-dtype branch 3 times, most recently from dc714b1 to 11df95e on April 7, 2023 at 17:23
@MatteoRaso (Contributor)

Thanks for the PR. I'm a little confused about what you mean, though. If I'm understanding the problem correctly, the code would still work without error if you used np.int8 or np.int16, which aren't the native int type. A more accurate doc example might be "this code can't use dtype=np.int64 on a 32-bit architecture".
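
For illustration, a sketch of the narrower-dtype case (comments show the expected counts):

import numpy as np

# int8/int16 inputs cast safely to a 32-bit index type, so these
# succeed even where np.int64 input would raise:
np.bincount(np.array([0, 1, 1, 2], dtype=np.int8))   # -> array([1, 2, 1])
np.bincount(np.array([0, 1, 1, 2], dtype=np.int16))  # -> array([1, 2, 1])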

@onitake (Author) commented Apr 8, 2023

You're right, a smaller data type will also work.
I was too focused on the wrong assumptions made in the Python module I was debugging (the developers used int64 all over the place), and I didn't even think of that.

Nevertheless, I think the limitation should be made clear. I'll try to rewrite the text.

@MatteoRaso (Contributor) left a comment

Ok, just a few minor details that need to be fixed up, and then this should be ready to merge.

EDIT: Also, the documentation on line 931 has to be tweaked.

Two review threads on numpy/core/multiarray.py (outdated, resolved)
@onitake (Author) commented Apr 9, 2023

> EDIT: Also, the documentation on line 931 has to be tweaked.

Done.

@MatteoRaso (Contributor) left a comment

LGTM.

@rossbar (Contributor) left a comment

Out of curiosity, is this something that should be called out? I suspect that using np.int64 on a 32-bit system would cause problems everywhere, not just with bincount.

@onitake (Author) commented Apr 10, 2023

> Out of curiosity, is this something that should be called out? I suspect that using np.int64 on a 32-bit system would cause problems everywhere, not just with bincount.

Aside from bincount, I identified two other functions that depend on a native index type: take and repeat; at least, those are the ones that were causing problems in the trimesh module. I suppose there may be others as well.
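
A sketch of the same failure mode in those two functions (again assuming a platform where np.intp is 32 bits wide):

import numpy as np

a = np.arange(4)
# Both functions use their integer arguments as indices/counts,
# so the same safe-cast rule applies:
np.take(a, np.array([0, 2], dtype=np.int64))          # TypeError where np.intp is 32-bit
np.repeat(a, np.array([1, 2, 1, 1], dtype=np.int64))  # likewise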

@MatteoRaso (Contributor) commented Apr 11, 2023

> Out of curiosity, is this something that should be called out? I suspect that using np.int64 on a 32-bit system would cause problems everywhere, not just with bincount.

It's not obvious that this is the case, though, since 32-bit systems can still support 64-bit integers.

https://stackoverflow.com/questions/23038451/how-does-a-32-bit-processor-support-64-bit-integers

np.float128 can also be used just fine on x86_64 architecture (and maybe 32-bit systems, although I have no way of checking), so the disparity between the size of the dtype and the architecture isn't always a problem.

@rkern (Member) commented Apr 11, 2023

The actual requirement is more that the input must be safely castable to the integer type used for indexing, i.e. np.intp. All of the affected functions use the inputs as indices (or use them to manipulate indices). Lots of other functions work fine with np.int64 on 32-bit systems; you can add, multiply, and subtract them all you like, but when you want to use them as indices, we need to convert them to np.intp first, and the default safe casting raises an error in such cases because that would be a downcast.

Just a note: np.intp isn't the "native integer dtype"; we usually call the result of np.dtype(int) the native integer dtype, i.e. whatever a C long is. For very annoying reasons, on 64-bit Windows these are not the same: the 64-bitness specifies the integer type of memory addresses (and thus indices), so np.intp is np.int64, but np.dtype(int) is still np.int32 because Windows decided to keep the C long type 32-bit.
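
The distinction can be checked directly with np.can_cast; a sketch (the np.dtype(int) behavior shown is the 64-bit-Windows situation described above, as of NumPy 1.x):

import numpy as np

# What matters for indexing is safe-castability to np.intp:
np.can_cast(np.int64, np.intp, casting='safe')  # False on 32-bit, True on 64-bit
# On 64-bit Windows (NumPy 1.x), np.intp is int64 but np.dtype(int)
# is int32, because a C long stayed 32-bit there:
print(np.dtype(np.intp), np.dtype(int))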

@onitake (Author) commented Apr 11, 2023

Does that mean dtype=int really is the appropriate type and will work on all platforms?
Because that's the way I've worked around the problems in trimesh: https://salsa.debian.org/3dprinting-team/trimesh/-/blob/master/debian/patches/02-array-index-32bit-architectures.patch

@rkern (Member) commented Apr 11, 2023

No, dtype=np.intp.

@onitake (Author) commented Apr 12, 2023

> No, dtype=np.intp.

Oh. I see. That's quite confusing. 😞

@seberg (Member) commented Apr 12, 2023

FWIW, I am hoping we can change intp to be the default integer type in NumPy 2.0 (which could hopefully be released next year). That changes behavior a fair bit on Windows, but for most users getting int64 rather than int32 really shouldn't matter in practice.

In many places (apparently not bincount), NumPy uses "same-kind" casting for indexing-related operations, because otherwise the int64 -> int32 cast failing is tedious. In practice nobody really notices, but this is unsafe (you can pass very large integer index values that should raise an error but don't).

(It would be nice to find a solution for this too, but we do not have it.)
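
To make the two casting rules concrete, a small sketch:

import numpy as np

# 'safe' rejects the potentially lossy int64 -> int32 downcast,
# while 'same_kind' allows it:
np.can_cast(np.int64, np.int32, casting='safe')       # False
np.can_cast(np.int64, np.int32, casting='same_kind')  # True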

@onitake (Author) commented May 4, 2023

A possible solution could be to add new casting modes to ndarray.astype:

  • Always downcast, but print a warning when there is data loss
  • Downcast when there is no data loss, but raise an error if a value exceeds the range of the target type
  • Downcast with saturation, i.e. limit the range to the target type and never exceed the positive or negative maximum value

All of these would come at a performance cost, but some vectorized optimization may be possible.
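
A minimal sketch of the third (saturating) variant, built from existing NumPy operations; astype_saturate is a hypothetical helper name, not a proposed API:

import numpy as np

def astype_saturate(arr, dtype):
    # Clamp to the representable range of the target dtype before
    # casting, so out-of-range values saturate instead of wrapping.
    info = np.iinfo(dtype)
    return np.clip(arr, info.min, info.max).astype(dtype)

astype_saturate(np.array([2**40, -(2**40), 7], dtype=np.int64), np.int32)
# -> array with values [2147483647, -2147483648, 7], dtype=int32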

@charris (Member) commented May 10, 2025

This looks good, but numpy/core/multiarray.py has been renamed to numpy/_core/multiarray.py. Could you make the update?

@onitake force-pushed the doc/bincount-int-dtype branch from 4fdf66a to 782351e on May 11, 2025 at 08:19
@onitake force-pushed the doc/bincount-int-dtype branch from 6281125 to ace6dbd on May 11, 2025 at 08:29
@onitake (Author) commented May 11, 2025

I rebased the branch, and also added a few more tweaks based on the discussion above.

@seberg (Member) commented May 11, 2025

@onitake thanks for the update. Unfortunately (or fortunately, actually!), we put in #28355, which probably means this doesn't happen anymore -- that isn't ideal, but in practice probably better than raising annoyingly (the tests wouldn't notice, due to not building docs on 32-bit).

(The PR was about allowing unsigned, which is safer; I am not sure we actually thought of int64 -> int32. I am happy to keep it, though, because that is what indexing also does in practice. We should -- which I guess doesn't mean it'll happen soon -- have a mode that raises if the int64 is out of bounds...)

Anyway, long story, while the changes look nice, I think we can basically just close it.

@seberg added the label 57 - Close? (issues which may be closable unless discussion continued) on May 11, 2025
@onitake (Author) commented May 12, 2025

What does that mean in practice, i.e. how would the following cases be treated with the new implementation?

  1. On a 32-bit arch (where sizeof(np.intp)==4): np.bincount is called on a list of np.uint64 that contains only values between 0 and 1 million.
  2. On a 32-bit arch: np.bincount is called on a list of np.uint64 that contains random values from the whole 0..2⁶⁴-1 interval.

Since the patch seems to force casting in some cases, it would be interesting to know the effects from a correctness perspective (e.g. casts with undefined results), a performance perspective (e.g. extra operations), and a user-experience perspective (e.g. whether runtime exceptions would be generated).

@seberg (Member) commented May 12, 2025

> it would be interesting to know the effects from a correctness (e.g. cast with undefined result), performance (e.g. extra operations), and user experience perspective (e.g. if runtime exceptions would be generated).

Right now, no correctness is enforced; you get potentially wrong results. But if you really have a few random values like you suggest, in practice you will probably get some form of memory error either way, I would think.
In the future, depending on the performance impact, it might reliably raise an error.

@onitake (Author) commented May 13, 2025

@seberg In that case, I don't think the patch in #28355 should have been accepted as-is.
The previous implementation produced an error when the data type was too large, even when no values were outside the acceptable range.

What we have now can lead to undefined behavior, because truncation will produce incorrect results that don't necessarily trigger a memory error. Consider the following test case:

import numpy as np
# All values lie just above 2**32, so a 32-bit index type truncates
# them to small values in the range 1..14:
data = np.random.default_rng().integers(low=0x100000001, high=0x10000000f, dtype=np.int64, size=10)
np.bincount(data)

When run on an architecture where the array index type is 32 bits wide, bincount will happily truncate the random values to their lower 32 bits (0x100000001 = 2³² + 1 truncates to 1, and so on), producing a bincount result where some of the first 15 entries have nonzero counts, which is incorrect.
On a 64-bit architecture, an array with roughly 4.3 billion entries would be produced, which is correct and definitely within the capabilities of current hardware.

@onitake (Author) commented May 13, 2025

FYI: I tested this in a 32-bit test environment, and it did in fact produce an incorrect result:

>>> import numpy as np
>>> data = np.random.default_rng().integers(low=0x100000001, high=0x10000000f, dtype=np.int64, size=10)
>>> data
array([4294967299, 4294967305, 4294967300, 4294967309, 4294967304,
       4294967310, 4294967309, 4294967300, 4294967306, 4294967301],
      dtype=int64)
>>> result = np.bincount(data)
>>> result
array([0, 0, 0, 1, 2, 1, 0, 0, 1, 1, 1, 0, 0, 2, 1])
>>> result.dtype
dtype('int32')

Labels: 04 - Documentation; 57 - Close? (issues which may be closable unless discussion continued)
Linked issue (may be closed by this PR): DOC: bincount documentation should mention native index type requirement