MAINT: Speed up numpy.nonzero. #18368
Conversation
Please note the NumPy …

I should mention that I just found out that the workstation on which I ran the benchmarking script also had the …
    to_jmp = nonzero_idxs_dispatcher_ND((void*)data, multi_index, PyArray_SHAPE(self), PyArray_STRIDES(self), dtype->type_num, nonzero_count, ndim);
} else if (dtype->byteorder == '<') {
    to_jmp = nonzero_idxs_dispatcher_ND((void*)data, multi_index, PyArray_SHAPE(self), PyArray_STRIDES(self), dtype->type_num, nonzero_count, ndim);
} else if (dtype->byteorder == '|' && dtype->elsize == 1) {
Fly-by comment; I have not looked at this enough yet. We have PyArray_ISNBO for this; the dispatch as it stands seems overly complex. (I am also confused about the == '<' check: are you assuming a little-endian machine?)
I might love to think about putting this on the dtypes, but it is probably just hassle here.
One other thing to think about: make sure you time your results with Fortran-ordered (and sliced Fortran-ordered) arrays. Also, you may not really need the 2-D or 3-D loops much if you check whether you can iterate the array as 1-D instead. We have macros for that: PyArray_TRIVIALLY_ITERABLE.
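A quick way to sanity-check those layouts from Python (my own sketch; the array size and iteration count are arbitrary, and the timings are machine-dependent):

```python
import numpy as np
import timeit

# Same data in C order, Fortran order, and as a sliced Fortran view
a = np.random.default_rng(0).random((1000, 1000)) > 0.5
f = np.asfortranarray(a)
for name, arr in [("C-contiguous", a), ("F-contiguous", f), ("F sliced", f[::2, ::2])]:
    t = timeit.timeit(lambda: np.nonzero(arr), number=5)
    print(f"{name:>13}: {t:.4f}s")
```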
The optimization exploits C-contiguousness of arrays (without slices). With a little bit of change the code can be adjusted to work with slices. However, for such cases the bottleneck will be cache misses when you have large arrays (and slices covering a small part), so we might as well fall back to the generic iterator approach (that's what it currently does for slices). For F-contiguous arrays I will have to use different loops (starting iteration from the first dimension and then moving inwards, instead of the opposite needed for C-contiguous arrays). I can definitely add those loops to also cover F-contiguous arrays alongside the ones I have for C-contiguous arrays.
The reason I have multiple loops is to squeeze out as much performance as possible. Alternatively, I could use a single loop and rely on multiple conditionals within it, which would be slower.
The way NumPy works with byte ordering confuses me a bit. Initially I did not have those conditions checking byteorder, and there is this one test case which was failing: https://github.com/numpy/numpy/blob/master/numpy/core/tests/test_regression.py#L1674-L1679 . The failure happens after byteswapping. Why is NumPy treating a byteswapped array as if it was not byteswapped? My optimization does take into account that the array got byteswapped, and the result changes after byteswapping; the old implementation, however, ignores the byteswapping. For the sake of consistency I thought to follow that same behaviour. The byteorder condition hack that you referenced allowed the optimization to pass that test (since the optimization is skipped if the byteorder was changed to big-endian, and the default iterator-based implementation is used instead).
So, by default, all arrays except the ones with elsize == 1 will have byteorder '=' irrespective of the endianness of the system? If that is the case, when does an array have '>' or '<'? What I intend to do is place some conditions that detect that the byte order was changed, so that I can fall back to the default iterator approach, which gives the same result even after byteswapping. Do you have any recommendation that is more reliable than that hack? There is one build failure which I am guessing is due to that hack being unreliable.
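For reference, the byteorder characters in question can be inspected from Python (a quick check of my own, not part of the PR):

```python
import numpy as np

native = np.dtype(np.int64)
assert native.byteorder == '='              # multi-byte native types report '='
assert np.dtype(np.uint8).byteorder == '|'  # single-byte types: order not applicable

swapped = native.newbyteorder()             # explicitly non-native dtype
assert swapped.byteorder in ('>', '<')
assert not swapped.isnative
```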
Nevermind, yeah. For nonzero it matters a lot what the input shape was, since it decides how many outputs you need.
By default it is probably '=', or '|' if byte swapping is not valid. But don't rely on that, since it doesn't always hold currently. You can also request these explicitly, e.g. arr = np.ones(3, dtype=">i8").
The arr.byteswap() and arr.newbyteorder() methods are a bit strange maybe, since one swaps the bytes without modifying the dtype, and the other swaps the dtype but not the data. So in both cases the represented values change. If you do both, the represented values stay the same though (which is what that test does).
Just use that macro, which does the correct thing by rather checking the opposite: if it is not native byte order then it must be '>' (or '<' on a big-endian machine).
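The byteswap()/newbyteorder() distinction described above can be demonstrated directly (my own illustration; the dtype swap is done via dtype.newbyteorder() plus a view, which is equivalent):

```python
import numpy as np

a = np.arange(3, dtype=np.int64)

swapped = a.byteswap()                     # swaps the bytes, dtype unchanged
assert swapped.dtype == a.dtype
assert swapped[1] != a[1]                  # represented value changed

reinterp = a.view(a.dtype.newbyteorder())  # swaps the dtype, data unchanged
assert reinterp.dtype.byteorder != a.dtype.byteorder
assert reinterp[1] != a[1]                 # represented value changed here too

both = a.byteswap().view(a.dtype.newbyteorder())
assert (both == a).all()                   # doing both preserves the values
```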
I have replaced that hack with PyArray_ISNBO. Should I go ahead and implement the loops for F-contiguous arrays? Is there any documentation available on these handy macros, or can one only dig into the codebase to find them (or just ask around) 😄?
Most of these macros are public API, which is largely documented in the reference manual. Beyond that there isn't much documentation, so in some cases there may not be any.
Sorry about the F-contiguous thing; it may be nice, but I am not sure it is worth the trouble (at least as a first step). Just be sure that it doesn't get much slower for large F-contiguous arrays.
@seberg, I have pushed recent commits that optimize for F-contiguous arrays and views, i.e. arrays with a Fortran-based memory layout. So, currently, any aligned and contiguous array will enjoy the optimization. That's it from me for now; you can continue with the review. Regarding the speed on F-contiguous arrays: it matches that of C-contiguous arrays given the shapes are transposed. So a C-contiguous array with shape [m, n] will attain the same speed as an F-contiguous array with shape [n, m], given they have the same values (since loops stop early if there are no remaining nonzero values).
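The C/F equivalence claimed above is easy to spot-check (my own snippet): the result must be layout-independent, with the same indices in the same C order.

```python
import numpy as np

c = np.zeros((4, 3))
c[1, 2] = c[3, 0] = 1.0
f = np.asfortranarray(c)   # same values, Fortran memory layout
assert f.flags.f_contiguous

assert np.array_equal(np.nonzero(f), np.nonzero(c))
```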
I was just now surprised by how much slower nonzero is compared to the comparison itself, i.e. …
There are a few things left to do, but I could help a bit if you are running low on enthusiasm: …
Question to be sure (maybe it is obvious from the tests, in which case, don't bother): …
Backburner: …
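The comparison-vs-nonzero gap mentioned above can be measured with a sketch like the following (my own snippet; timings are machine-dependent, so no numbers are claimed):

```python
import numpy as np
import timeit

x = np.random.default_rng(0).random(1_000_000)
t_cmp = timeit.timeit(lambda: x > 0.5, number=20)
t_nz = timeit.timeit(lambda: np.nonzero(x > 0.5), number=20)
print(f"comparison alone: {t_cmp:.4f}s   nonzero(comparison): {t_nz:.4f}s")
```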
I am actually not acquainted with the NumPy template language. It is possible to combine the F- and C-style loops using more macros, but I intentionally left them separated for better code readability and maintainability. Still, if we must combine the F- and C-style loops, how about we try to do that in a fresh PR? We can then also think about ways to generalize to higher-dimensional loops. The output of …
The use of …
My question about the order of the output is about this example: …
I.e. it's easy to imagine that the last might instead return: …
which is also correct but different.
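To make the ordering question concrete (my own example, standing in for the ones collapsed above): NumPy currently emits indices in C (row-major) traversal order, while a column-major traversal would be an equally valid but different answer.

```python
import numpy as np

x = np.array([[0, 1],
              [1, 1]])
i, j = np.nonzero(x)

# Current behaviour: row-major traversal order
assert list(zip(i, j)) == [(0, 1), (1, 0), (1, 1)]
# A column-major traversal would give [(1, 0), (0, 1), (1, 1)] instead.
```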
Actually, this implementation returns the second one: …
I have to think about it, but it is a subtle change that I would prefer we don't include here yet (to ensure that the rest doesn't get stalled because of it!).
I like the …
I meant progress on this PR getting stalled because of the question of whether changing the output order is OK. A new …
@seberg, I am getting this error after modifying the low-level strided loops file: …
@touqir14 sounds like the "preprocessor" is breaking on something; maybe you are using @elsize@ but did not properly define a loop.
There was a typo in one of the templating constructs I wrote. I have pushed the ported code to the lowlevel_strided_loops file. Let me know how it looks.
}

#define nonzero_idxs_3D_F(data, idxs, shape, strides, nonzero_count, dtype) \
I was hoping to avoid these giant inline macros; maybe a bit out of habit in NumPy, but I don't really like them, and you could make a repeat for the dtype.
You can use such a repeat even inside the "dispatcher" function body, and here itself; then these become proper functions.
Now, I am not quite sure it will actually end up much better, since we don't need the functions (as in, functions you can get a pointer for) here.
I have converted C and F style loops into functions and condensed the dispatcher function as you suggested @seberg .
Just added the benchmarks as well. Using the SIMD-optimized count_nonzero function (applicable for bool and int), doing np.nonzero(array) is faster than np.nonzero(array > 0.5).
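A small consistency check of my own illustrates why the SIMD count is a useful first pass: the index array np.nonzero returns is exactly count_nonzero elements long.

```python
import numpy as np

x = np.random.default_rng(0).random(100_000) > 0.5   # bool array
idx, = np.nonzero(x)
assert idx.size == np.count_nonzero(x)
assert x[idx].all()   # every returned index hits a True element
```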
Btw, what's wrong with CI's smoke test? Looks like there is a problem with pip.
@mattip, I have fixed the errors. Would you be able to review this?
Hmm. @seberg I would think this refactor should bring …
@mattip you are right, it's a step in the opposite direction unfortunately, although not a particularly big one. The problem is that nonzero is slightly special, because it has to allocate the correct output shape first (using …). I couldn't think of an obvious way to wrangle it into a gufunc, but maybe that is also because we don't have many gufuncs yet. So, considering how massive the speedup is, I was willing to make an exception, with the hope that we can consolidate it again.
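The shape-first constraint can be sketched as a pure-Python reference model (not the C implementation): one pass to size the output, one pass to fill it.

```python
import numpy as np

def nonzero_1d(a):
    n = np.count_nonzero(a)            # first pass: output size
    out = np.empty(n, dtype=np.intp)   # allocate the exact result up front
    k = 0
    for i, v in enumerate(a):          # second pass: fill in the indices
        if v != 0:
            out[k] = i
            k += 1
    return (out,)

x = np.array([0, 3, 0, 0, 5, 7])
assert np.array_equal(nonzero_1d(x)[0], np.nonzero(x)[0])
```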
ok, thanks
Close/reopen to restart CI
CI configuration needs updating. The errors are not due to this PR.
Thanks, I trust that the code does the right thing. I am sorry about the bazillion style nits, but the code is pretty liberal with breaking the NumPy style guidelines.
I am wondering if you can't rewrite the while loops as for loops (potentially with an empty test field). Also, I think this might be one place where using NPY_GCC_OPT_3 is worthwhile, and likely a much bigger difference than manually optimizing away a single if.
I would be happier if we can write the if here; it just makes for easier-to-read code. But if you time it and the micro-optimization is worth it even with -O3, then that is fine as well.
If it's super annoying, I could try to do some of the style fixups for you, I guess.
npy_intp a = 0;
while (added_count < nonzero_count) {
    *idxs = a;
    bool to_increment = ((bool) *data);
I am pretty sure this is better written as != 0. Casting NaN or inf to bool might set the "invalid" floating-point exception, which we do not want here. We probably never actually check the FPE flags here, but I still think it's better.
Can you write this to_increment as an if? I trust the compiler can generate roughly the same code; this is not some important micro-optimization?
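For the record, NaN and inf compare unequal to zero, so the != 0 form reports them as nonzero just like the bool cast would, while avoiding the possible invalid-FPE of a float-to-bool cast at the C level (quick check of the expected semantics, my own snippet):

```python
import numpy as np

a = np.array([0.0, np.nan, np.inf, -0.0, 1.0])
# -0.0 == 0, so only indices 1, 2 and 4 are nonzero
assert np.array_equal(np.nonzero(a)[0], [1, 2, 4])
```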
/**begin repeat
 *
 * #dtype = npy_bool, npy_byte, npy_byte, npy_uint16, npy_int16, npy_uint32, npy_int32, npy_uint64, npy_int64, npy_float, npy_double#
 * #name = bool, u8, i8, u16, i16, u32, i32, u64, i64, f32, f64#
Long lines, everywhere. Please stay below 80 characters, unless the overhang is small and wrapping seems untidy.
#define ptr_assignment1(ptr1, val1, stride1) *ptr1 = val1; ptr1 += stride1;
#define ptr_assignment2(ptr1, val1, stride1, ptr2, val2, stride2) ptr_assignment1(ptr1, val1, stride1) *ptr2 = val2; ptr2 += stride2;
To be honest, I can't quite parse what ptr_assignment2 means exactly. The first macro seems OK to just inline, and the second probably as well?
/**begin repeat1
 *
 * #layout=C,F#
Didn't we say not to deal with F right now?
bool to_increment = ((bool) *data);
idxs += ((int) to_increment);
added_count += ((int) to_increment);
data = (@dtype@ *) (((char*) data) + stride);
Data is typed, so += will work just fine. If this is C-contiguous, data++; will work. But I think it is good if this is triggered for all 1-D arrays.
assert_equal(nzs, 0)
Excessive white space. There should be one empty line between def's in classes. We are pretty strict on PEP 8.
assert_equal(np.count_nonzero(y_view), len(idxs_0))
nzs = 0
for i,j in zip(idxs_0, idxs_1):
    if y_view[i,j] == 0:
Suggested change:
-    if y_view[i,j] == 0:
+    if y_view[i, j] == 0:
for i,j in zip(idxs_0, idxs_1):
    if x_view[i,j] == 0:
        nzs += 1
assert_equal(nzs, 0)
You can just use assert nzs == 0 if nzs is a Python scalar.
for i in range(iters):  # check for slices
    x = ((2**33)*np.random.randn(10, 10)).astype(np.int32)
    x.flat[[2,6,9,15,21]] = 0
Suggested change:
-    x.flat[[2,6,9,15,21]] = 0
+    x.flat[[2, 6, 9, 15, 21]] = 0
param_names = ["dtype", "shape"]

def setup(self, dtype, size):
    self.x = np.random.randint(0,2,size=size).astype(dtype)
We also need benchmarks for "sparse" and "nonsparse" data.
That is, please generate a dataset which has very few nonzeros (maybe only a single 1 somewhere in the middle or at the end), and a dataset in which everything is 1.
It might also be good to seed this, just to stabilize the benchmarks a bit (although they tend to fluctuate absurdly often anyway).
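A seeded benchmark covering both extremes might look like this (hypothetical ASV-style class; the class and parameter names are my own, not from the PR):

```python
import numpy as np

class NonzeroSparsity:
    # hypothetical ASV benchmark: one lonely nonzero vs. all-ones vs. random
    params = ['sparse', 'dense', 'random']
    param_names = ['density']

    def setup(self, density):
        n = 1_000_000
        if density == 'sparse':
            x = np.zeros(n, dtype=np.int64)
            x[n // 2] = 1                   # single nonzero in the middle
        elif density == 'dense':
            x = np.ones(n, dtype=np.int64)  # everything nonzero
        else:
            rng = np.random.RandomState(0)  # seeded to stabilize the benchmark
            x = rng.randint(0, 2, size=n).astype(np.int64)
        self.x = x

    def time_nonzero(self, density):
        np.nonzero(self.x)
```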
while (added_count < nonzero_count && b < size_1) {
    *idxs_1 = b;
    npy_uintp to_increment = (npy_uintp) ((bool) *data);
    idxs_1 += (idxs_stride & (-to_increment));
Oh, forgot to press enter here I think. This deserved a 🤯 from me... I somewhat hope we could just do an if (*data != 0) and let the compiler do its magic, but if you say it's worthwhile I am happy. I did try with godbolt (I couldn't find the "share link" today), and I admit the compiler does produce less code with this trick.
(I would like to make sure it is worthwhile with the -O3 macro in use.)
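The trick in question advances the pointer by stride & -flag, which evaluates to stride when the element is nonzero and to 0 otherwise. A small model of the C expression in Python integers (my own illustration):

```python
def branchless_advance(ptr, stride, value):
    # mirrors `idxs_1 += idxs_stride & (-to_increment)` from the C loop
    flag = 1 if value != 0 else 0   # to_increment
    return ptr + (stride & -flag)   # -1 is all one-bits, -0 is zero

assert branchless_advance(100, 8, 5) == 108  # nonzero element: advance
assert branchless_advance(100, 8, 0) == 100  # zero element: stay put
```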
for i,j in zip(idxs_0, idxs_1):
    if y_view[i,j] == 0:
        nzs += 1
assert_equal(nzs, 0)
And one more comment that got lost: these tests seem very repetitive. Can you try to use pytest.mark.parametrize, or create a helper function like check_nonzero_result?
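A helper along the suggested lines could look like this (the name check_nonzero_result comes from the comment above; the body is my own sketch):

```python
import numpy as np
from numpy.testing import assert_equal

def check_nonzero_result(x):
    # replaces the repeated inline index-checking loops in the tests
    idxs = np.nonzero(x)
    assert_equal(len(idxs[0]), np.count_nonzero(x))
    for t in zip(*idxs):        # every returned index must hit a nonzero
        assert x[t] != 0

check_nonzero_result(np.array([[0, 1, 0], [2, 0, 3]]))
```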
@seberg It looks like there hasn't been any activity on this PR since March. Since I'd like to be able to use this optimization, would it make any sense for me to fork the repo this PR is from, attempt to address your feedback, and make a new PR?
@danielg1111 It is generally OK as long as the original work is credited, usually by working on top of the existing commits. It would smooth the way if @touqir14 weighed in and gave permission.
@danielg1111 @charris I haven't pushed some of my local updates yet, which I hope to do in the next few days. I have been a little busy and haven't worked on this for a while. @danielg1111, if you can wait a bit this PR should be completed.
A faster nonzero would definitely speed up some code I am writing to work in Avizo. Does anyone know if these fixes will be available anytime soon? Thanks!
@touqir14 @danielg1111 The PR has been stale for some time, but it would be nice to get it going again. Does either one of you want to continue with it at this moment? I checked with a recent …
I sincerely apologize to the NumPy team for dragging this PR on for so long. I have been quite busy with my work and some of my side projects. I have research paper submissions on May 3rd, and right after that I will make this my priority, @eendebakpt. Thank you for understanding.
@touqir14 No problem. Ping me once you start working, and I'll help review any updates.
Just to note, since we now have quite a bit of C++ code elsewhere, it might make sense to aim for C++. And maybe it would have been nicer from the start to just create a new file for it (even …
I am going to close the PR since it is stale. I do think it is a great idea, though, and very much worthwhile! Please do not hesitate to open a new PR based on this! But at this point it seems like a small project to continue, unfortunately. @seiko2plus just in case this grabs your interest. EDIT: Just to note, this was discussed on the triage call today.
This would be great to have. Shame that it is closed.
Since the PR has been stale for a long time, I intend to pick up the work. I will start by making PRs for the benchmarks and unit tests. Then I will make a PR for the most important case: C-contiguous float64 and int64. If someone else is still interested in picking this up, let me know.
This PR solves issue 11569. @tylerjereddy also requested optimizations for large boolean arrays. I have added optimizations to numpy.nonzero for contiguous and aligned numpy arrays. Generally, the speedup is at least 2 times and the maximum speed gain observed is about 8 times. Currently, the optimizations are exclusive to all int types, float types and bool type for arrays of dimension 1 to 3. If needed, optimizations for higher dimensional arrays can be added. Test cases have been added.
The following code illustrates some of the speed gains: