DOC: clarify int type constraint in numpy.bincount #23552
Conversation
Force-pushed from dc714b1 to 11df95e.
Thanks for the PR. I'm a little confused about what you mean, though. If I'm understanding the problem correctly, then I think the code would still work without error if you used
You're right, a smaller data type will also work. Nevertheless, I think the limitation should be made clear. I'll try to rewrite the text.
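As a small illustration of the exchange above, here is a sketch (my assumption: castability against `np.intp`, the platform array-index type, is the relevant condition) showing that data fitting the index type can go to `bincount` directly, while wider input can be cast down first when its values fit:

```python
import numpy as np

data64 = np.array([0, 1, 1, 3], dtype=np.int64)

# On builds where np.intp (the array-index type) is at least as wide as
# the input dtype, bincount can take the data as-is; otherwise, casting
# down first avoids the TypeError -- but only if the values actually fit.
if np.can_cast(data64.dtype, np.intp, casting='safe'):
    counts = np.bincount(data64)
else:
    counts = np.bincount(data64.astype(np.intp))

print(counts.tolist())  # [1, 2, 0, 1]
```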
Ok, just a few minor details that need to be fixed up, and then this should be ready to merge.
EDIT: Also, the documentation on line 931 has to be tweaked.
Done.
LGTM.
Out of curiosity, is this something that should be called out? I suspect that using np.int64 on a 32-bit system would cause problems everywhere, not just with bincount.
Aside from
It's not obvious that this is the case, though, since 32-bit systems can still support 64-bit integers. https://stackoverflow.com/questions/23038451/how-does-a-32-bit-processor-support-64-bit-integers
The actual requirement is more like we need the input to be
Just a note,
Does that mean
No,
Oh. I see. That's quite confusing. 😞
FWIW, I am hoping we can change
In many places (apparently not bincount), NumPy uses "same-kind" casting for indexing-related operations, because otherwise the int64 -> int32 step not going through would be tedious. In practice nobody really notices that, but it is unsafe: you can generate very large integer index values that should raise an error but don't. (It would be nice to find a solution for this too, but we do not have one.)
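The "safe" vs. "same-kind" distinction described here can be seen directly with np.can_cast:

```python
import numpy as np

# int64 -> int32 is not value-preserving, so it is rejected under "safe"
# casting rules...
safe = np.can_cast(np.int64, np.int32, casting='safe')
print(safe)       # False

# ...but it is permitted under "same_kind" casting, which is what many
# indexing-related code paths use -- so out-of-range int64 indices can
# slip through without an error.
same_kind = np.can_cast(np.int64, np.int32, casting='same_kind')
print(same_kind)  # True
```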
A possible solution could be to add new casting modes to ndarray.astype:
All of these would come at a performance cost, but some vectorized optimization may be possible.
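A value-checking casting mode of that sort could look roughly like this sketch (`checked_astype` is a hypothetical helper for illustration, not a NumPy API):

```python
import numpy as np

def checked_astype(arr, dtype):
    # Hypothetical "raise on out-of-range values" cast: verify the values
    # fit the target integer type before casting, instead of silently
    # wrapping the way ndarray.astype does.
    info = np.iinfo(dtype)
    if arr.size and (arr.min() < info.min or arr.max() > info.max):
        raise OverflowError(f"values do not fit in {np.dtype(dtype)}")
    return arr.astype(dtype)

ok = checked_astype(np.array([1, 2, 3], dtype=np.int64), np.int32)
print(ok.tolist())  # [1, 2, 3]

big = np.array([2**32 + 1], dtype=np.int64)
try:
    checked_astype(big, np.int32)
except OverflowError as e:
    print("rejected:", e)
```

The full-array min/max scan is exactly the performance cost mentioned above; a fused check-and-cast loop could reduce it.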
This looks good, but
Force-pushed from 4fdf66a to 782351e.
Force-pushed from 6281125 to ace6dbd.
I rebased the branch, and also added a few more tweaks based on the discussion above.
@onitake thanks for the update. Unfortunately (or fortunately, actually!), we put in #28355, which probably means this doesn't happen anymore. That isn't ideal, but in practice probably better than raising annoyingly (the tests wouldn't notice, due to not building docs on 32 bits). (That PR was about allowing unsigned, which is safer; I am not sure we actually thought of int64 -> int32. I am happy to keep it, though, because that is what indexing also does in practice. We should have a mode that raises if the int64 is out of bounds -- which I guess doesn't mean it'll happen soon...) Anyway, long story: while the changes look nice, I think we can basically just close this.
What does that mean in practice, i.e. how would the following cases be treated with the new implementation?
Since the patch seems to force casting in some cases, it would be interesting to know the effects from a correctness (e.g. a cast with an undefined result), performance (e.g. extra operations), and user-experience perspective (e.g. whether runtime exceptions would be generated).
Right now, no correctness is enforced; you get potentially wrong results. But if you really have a few random values like you suggest, in practice you will probably get some form of memory error either way, I would think.
@seberg In that case, I don't think the patch in #28355 should have been accepted as-is. What we have now can lead to undefined behavior, because truncation will produce incorrect results that don't necessarily trigger a memory error. Consider the following test case:

```python
import numpy as np

data = np.random.default_rng().integers(low=0x100000001, high=0x10000000f, dtype=np.int64, size=10)
np.bincount(data)
```

When run on an architecture where the array index type is 32 bits wide, bincount will happily truncate the random values to their lower 32 bits, leading to a bincount result where the first 15 entries may have a value, which is not correct.
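The truncation in question can be reproduced on any platform by performing the 64-to-32-bit cast explicitly (here `np.int32` stands in for the 32-bit index type):

```python
import numpy as np

# Values just above 2**32; on a 32-bit build, bincount would truncate
# these to their low 32 bits.
data = np.array([0x100000001, 0x10000000e], dtype=np.int64)

truncated = data.astype(np.int32)  # silently wraps, like the 32-bit path
print(truncated.tolist())          # [1, 14] -- the high bits are dropped

# bincount then counts entirely wrong values:
print(np.bincount(truncated).tolist())
```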
FYI: I tested this in a 32-bit test environment, and it did in fact produce an incorrect result.
This doc string change clarifies the int dtype constraint in np.bincount(). The input elements are not only required to be positive integers, as currently stated; they must actually fit the native int type. Otherwise, np.bincount raises a TypeError when the input type is wider than the native array index type.
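To make "native array index type" concrete: it corresponds to np.intp, whose width depends on the build (this check is illustrative only, not part of the doc change):

```python
import numpy as np

# np.intp is NumPy's array-index type: 32 bits on a 32-bit build,
# 64 bits on a 64-bit build.
index_bits = np.dtype(np.intp).itemsize * 8
print(index_bits)  # 32 or 64, depending on the platform

# int64 input is "wider than the native array index type" exactly when:
wider = np.dtype(np.int64).itemsize > np.dtype(np.intp).itemsize
print(wider)  # True on 32-bit builds, False on 64-bit builds
```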
This was discovered while debugging the https://github.com/mikedh/trimesh library on a 32-bit system. This library uses np.int64 gratuitously, which causes many unit tests to fail: https://salsa.debian.org/3dprinting-team/trimesh/-/jobs/4103005#L2836
It would be very helpful if this were clearly explained in the NumPy documentation.
Fixes: #23526