MAINT: Speed up numpy.nonzero. #18368


Closed · wants to merge 15 commits

Conversation

@touqir14 (Contributor) commented Feb 8, 2021

This PR addresses issue 11569; @tylerjereddy also requested optimizations for large boolean arrays. It adds optimizations to numpy.nonzero for contiguous, aligned numpy arrays. The speedup is generally at least 2x, with a maximum observed gain of about 8x. Currently the optimizations cover all int types, float types, and bool, for arrays of 1 to 3 dimensions; optimizations for higher-dimensional arrays can be added if needed. Test cases have been added.

The following code illustrates some of the speed gains:

import numpy as np

x = np.random.rand(1000000) > 0.5
%timeit np.nonzero(x)
666 µs ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
4.33 ms ± 17.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) [Without optimization]

x = np.random.rand(1000,1000) > 0.5
%timeit np.nonzero(x)
1.18 ms ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
9.55 ms ± 32.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) [Without optimization]

x = np.random.rand(100,100,100) > 0.5
%timeit np.nonzero(x)
1.48 ms ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
9.57 ms ± 44.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) [Without optimization]

x = np.random.rand(1000000) > 0.9
%timeit np.nonzero(x)
655 µs ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
1.77 ms ± 7.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [Without optimization]

x = np.random.rand(1000,1000) > 0.9
%timeit np.nonzero(x)
1.02 ms ± 2.46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
4.37 ms ± 12.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) [Without optimization]

x = np.random.rand(100,100,100) > 0.9
%timeit np.nonzero(x)
1.22 ms ± 9.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
3.9 ms ± 65.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) [Without optimization]

x = np.random.randint(0,9, size=(1000,1000), dtype=np.int32)
%timeit np.nonzero(x)
1.69 ms ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
7.83 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) [Without optimization]

x = np.random.randint(0,2, size=(1000,1000), dtype=np.int32)
%timeit np.nonzero(x)
1.49 ms ± 5.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
10.4 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) [Without optimization]

x = np.random.randint(0,9, size=(1000,1000), dtype=np.int64)
%timeit np.nonzero(x)
2.3 ms ± 6.37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
8.15 ms ± 29.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) [Without optimization]

x = np.random.randint(0,2, size=(1000,1000), dtype=np.int64)
%timeit np.nonzero(x)
1.93 ms ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) [With optimization]
10.7 ms ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) [Without optimization]

@charris charris changed the title Speeding up numpy.nonzero. MAINT: Speed up numpy.nonzero. Feb 8, 2021
@charris (Member) commented Feb 8, 2021

Please note the NumPy C_STYLE_GUIDE.

@touqir14 (Contributor, Author) commented Feb 8, 2021

I should mention that I just found out that the workstation on which I ran the benchmarking script also had the numpy.count_nonzero optimization that I implemented in PR 18183 and so a minor part of the speedups is coming from there.

to_jmp = nonzero_idxs_dispatcher_ND((void*)data, multi_index, PyArray_SHAPE(self), PyArray_STRIDES(self), dtype->type_num, nonzero_count, ndim);
} else if (dtype->byteorder == '<') {
to_jmp = nonzero_idxs_dispatcher_ND((void*)data, multi_index, PyArray_SHAPE(self), PyArray_STRIDES(self), dtype->type_num, nonzero_count, ndim);
} else if (dtype->byteorder == '|' && dtype->elsize == 1) {
Review comment (Member):
Fly-by comment; I have not looked at this enough yet. We have PyArray_ISNBO for this; the dispatch as-is seems overly complex. (I am also confused about the == '<' check: are you assuming a little-endian machine?)

I might love to think about putting this on the dtypes, but it probably is just hassle here.

One other thing to think about: Make sure you time your results with fortran (and sliced fortran arrays). Also, you may not really need 2D or 3D loops much, if you check whether you can iterate it as a 1-D array instead. We have macros for that: PyArray_TRIVIALLY_ITERABLE
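The trivially-iterable idea can be sketched at the Python level (a rough analogue, not the C macro itself): when the data is contiguous, the array can be scanned as 1-D and the N-D indices recovered afterwards in one shot.

```python
import numpy as np

def nonzero_via_flat(x):
    # Analogue of the trivially-iterable idea: scan the array as 1-D,
    # then convert the flat positions back to N-D indices.
    flat_idx = np.flatnonzero(x)            # works on the raveled view
    return np.unravel_index(flat_idx, x.shape)

x = np.arange(12).reshape(3, 4) % 3 == 0
assert all(np.array_equal(a, b)
           for a, b in zip(nonzero_via_flat(x), np.nonzero(x)))
```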

@touqir14 (Contributor, Author) replied Feb 8, 2021:

The optimization exploits the C-contiguousness of arrays (without slices). With a little change, the code could be adjusted to work with slices; however, for such cases the bottleneck will be cache misses when you have large arrays (and slices covering a small part), so we might as well fall back to the generic iterator approach (that is what it currently does for slices). For F-contiguous arrays I will have to use different loops (starting iteration from the first dimension and moving inwards, instead of the opposite, as needed for C-contiguous arrays). I can definitely add those loops to also cover F-contiguous arrays alongside the ones I have for C-contiguous arrays.

The reason I have multiple loops is to squeeze out as much performance as possible. Alternatively, I could use a single loop and rely on multiple conditionals within it, which would be slower.

The way NumPy handles byte ordering is confusing me a bit. Initially I did not have those byteorder checks, and this one test case was failing: https://github.com/numpy/numpy/blob/master/numpy/core/tests/test_regression.py#L1674-L1679 . The failure happens after byteswapping. Why does NumPy treat a byteswapped array as if it was not byteswapped? My optimization does take into account that the array was byteswapped, so the result changes after byteswapping; the old implementation, however, ignores the byteswapping. For the sake of consistency I decided to follow the same behaviour. The byteorder-condition hack you referenced allowed the optimization to pass that test (the optimization is skipped if the byteorder was changed to big-endian, and the default iterator-based implementation is used instead).

So, by default, all arrays except those with elsize=1 will have byteorder '=' irrespective of the endianness of the system? If that is the case, when does an array have '>' or '<'? What I intend to do is add conditions that detect that the byte order was changed, so I can fall back to the default iterator approach, which gives the same result even after byteswapping. Do you have any recommendation more reliable than that hack? There is one build failure which I am guessing is due to that hack being unreliable.

Reply (Member):
Nevermind, yeah. For nonzero it matters a lot what the input shape was, since it decides how many outputs you need.

By default it is probably = or | if byte swapping is not valid. But don't rely on that, since it doesn't always hold currently. You can also request these explicitly, e.g. arr = np.ones(3, dtype=">i8").

The arr.byteswap() and arr.newbyteorder() commands are a bit strange maybe, since one swaps the bytes, without modifying the dtype and the other swaps the dtype but not the data. So in both cases the represented values changed. If you do both, the represented values stay the same though. (which that test does)

Just use that macro, which does the correct thing by checking the opposite: if it is not native byte order then it must be > (or < on a big-endian machine).
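A small Python sketch of the byteswap/newbyteorder semantics described above (using a view with dtype.newbyteorder(), since arrays no longer have a newbyteorder method in recent NumPy):

```python
import numpy as np

a = np.arange(3, dtype="<i8")

# byteswap() swaps the bytes but keeps the dtype: the represented values change.
swapped = a.byteswap()
assert not np.array_equal(swapped, a)

# Changing only the dtype's byte order (data untouched) also changes the values.
reinterpreted = a.view(a.dtype.newbyteorder())
assert not np.array_equal(reinterpreted, a)

# Doing both swaps the bytes *and* the dtype, so the represented values are
# unchanged -- which is what the regression test relies on.
both = a.byteswap().view(a.dtype.newbyteorder())
assert np.array_equal(both, a)
```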

Reply (Contributor, Author):

I have replaced that hack with PyArray_ISNBO. Should I go ahead and implement the loops for F-contiguous arrays? Is there any documentation available on these handy macros, or can one only dig into the codebase to find them (or just ask around) 😄?

Reply (Member):

Most of these macros are public API, which is largely documented in the reference manual. Beyond that there isn't much documentation, so in some cases there may not be any.

Sorry about the F-contiguous thing; it may be nice, but I am not sure it is worth the trouble (at least as a first step). Just make sure it doesn't get much slower for large F-contiguous arrays.

@touqir14 (Contributor, Author) commented Feb 9, 2021

@seberg, I have pushed commits that optimize arrays and views that are F-contiguous, including arrays with a Fortran-based memory layout. Currently, any aligned and contiguous array will enjoy the optimization. That's it from me for now; you can continue with the review.

Regarding the speed on F-contiguous arrays, the speed matches that of C-contiguous arrays given the shapes are transposed. So, a C-contiguous array with shape: [m, n] will attain the same speed as an F-contiguous array with shape: [n, m] given they have the same values (since loops stop early if there are no remaining nonzero values).

@seberg (Member) commented Feb 12, 2021

I was just now surprised by how much slower nonzero is compared to the comparison itself, i.e. trues = x > 0.5; np.nonzero(trues) compared to np.nonzero(x > 0.5), that realigns my view on this entirely. I think we definitely should do it.

In [8]: x = np.random.rand(1000,1000)
In [9]: %timeit x > 0.5
487 µs ± 59.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: %timeit np.nonzero(x > 0.5)
9.24 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

There are a few things left to do, but I could help a bit if you are running low on enthusiasm:

  1. We should have benchmarks for all the cases you explicitly made faster. I would also like to include at least one or two for the np.nonzero(double_array > 0.5) style call, because I think that is effectively the more common case. I assume this means we need to add new ones to numpy/benchmark/.
  2. You have pretty big macros right now. Those macros should be functions written in the NumPy "templating" language. The easiest way here will be to move those functions into lowlevel_strided_loops.c.src. That will likely allow you to combine the F- and C-order functions into a single one. (Not sure about functions for the number of dimensions.)

Question to be sure (maybe it is obvious from the tests, in which case, don't bother):

  • Optimizing F order as well is nice, but I think np.nonzero might always return C-order indices right now. We must not change that. We should have a test for it, and if the current code does change it, maybe just pull that out again for the moment.

Backburner:

  • There are still some C-style nits (long lines, missing brackets)
  • I am a bit curious (if I am not off track again) whether skipping the count_nonzero pass and instead using realloc is useful. I guess it should not be, since counting is so much faster than the actual nonzero core?
  • The huge speedup seems almost "too large", I am wondering if the generic path is not implemented much slower than it should be.

@touqir14 (Contributor, Author) commented:

I am actually not acquainted with the NumPy templating language. It is possible to combine the F- and C-style loops using more macros, but I intentionally left them separate for better code readability and maintainability. Still, if we must combine them, how about we try that in a fresh PR? We could then also think about ways to generalize to higher-dimensional loops.

The output of np.nonzero is a tuple of 1D arrays. F style or C style does not affect the output. Is that what you were concerned about?

@touqir14 (Contributor, Author) commented Feb 12, 2021

The use of count_nonzero actually benefits nonzero's speed in the sparse case. count_nonzero probably accounts for 1/6th of the time or less, which I think is worth it for sparse inputs. The higher the dimension, the lower the relative overhead of count_nonzero. The SIMD optimization from 18183 has really been useful for keeping the overhead low.
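The two-pass structure being discussed — count first to size the output exactly, then fill — can be sketched in Python (a hypothetical illustration of the idea, not the C implementation):

```python
import numpy as np

def nonzero_1d_two_pass(a):
    # Pass 1: count, so the index array can be allocated exactly once.
    n = np.count_nonzero(a)
    out = np.empty(n, dtype=np.intp)
    # Pass 2: fill in the flat indices of the nonzero elements.
    k = 0
    for i, v in enumerate(a):
        if v != 0:
            out[k] = i
            k += 1
    return out

a = np.array([0, 3, 0, 5, 7])
assert np.array_equal(nonzero_1d_two_pass(a), np.nonzero(a)[0])
```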

@seberg (Member) commented Feb 12, 2021

The .c.src files are a bit unique, but not very tricky; see here for details: https://numpy.org/doc/stable/reference/distutils_guide.html#other-files. I am not sure it makes sense to combine the Fortran/C cases into a single chunk, so I am fine with leaving that for now.

My question about the order of the output is about this example:

In [4]: np.nonzero(np.ones((2, 2)))
Out[4]: (array([0, 0, 1, 1]), array([0, 1, 0, 1]))

In [5]: np.nonzero(np.ones((2, 2), order="F"))
Out[5]: (array([0, 0, 1, 1]), array([0, 1, 0, 1]))

I.e. it's easy to imagine that the last might instead return:

(array([0, 1, 0, 1]), array([0, 0, 1, 1]))

which is also correct but different.
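Whichever ordering is returned, the index arrays select exactly the same set of elements, which is why both results are "correct" — a quick sketch:

```python
import numpy as np

x = np.array([[0, 1], [2, 3]])
rows, cols = np.nonzero(x)                   # C-order result
perm = np.argsort(cols * x.shape[0] + rows)  # reorder into F (column-major) order
f_rows, f_cols = rows[perm], cols[perm]

# Both orderings index exactly the nonzero elements, just in a different order.
assert sorted(x[rows, cols]) == sorted(x[f_rows, f_cols]) == [1, 2, 3]
```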

@touqir14 (Contributor, Author) commented:

Actually, this implementation returns the second one: (array([0, 1, 0, 1]), array([0, 0, 1, 1])). This is unavoidable if we do not want to lose performance: for F-style arrays, F-style loops are always faster. Is it really important to preserve the order? In fact, if those indices are later used to index that F-style array for some BLAS operation, this should be faster than having the output in C order.

@seberg (Member) commented Feb 12, 2021

I have to think about it, but it is a subtle change that I would prefer we don't include here yet (to ensure the rest doesn't get stalled because of it!).
The solution might also be to add np.nonzero(..., order="C"). I.e. an order parameter (we could limit support to only C or K or so, where K would mean "I don't care").

@touqir14 (Contributor, Author) commented:

I like the order parameter approach. Just wondering, do you know which ones might get stalled due to the change in the ordering?

@seberg (Member) commented Feb 12, 2021

I meant progress on this PR getting stalled on the question of whether changing the output order is OK. A new order keyword argument is likely pretty uncontroversial, but it also feels like a good followup.

@touqir14 (Contributor, Author) commented:

@seberg, I am getting this error after modifying the low-level strided loops file: ValueError: In "numpy/core/src/multiarray/lowlevel_strided_loops.c.src" loop at line 1: no definition of key "elsize" — and I haven't modified anything related. It's probably something common with a known fix.

@seberg (Member) commented Feb 12, 2021

@touqir14 sounds like the "preprocessor" is breaking on something; maybe you are using @elsize@ but did not properly define a loop elsize=1,2,4, or you edited some /**begin repeat that is not quite right and a later one is confused by it. If you can show the diff, I can take a look.

@touqir14 (Contributor, Author) commented:

There was a typo in one of the templating directives. I have pushed the ported code to the lowlevel_strided_loops file. Let me know how it looks.

}


#define nonzero_idxs_3D_F(data, idxs, shape, strides, nonzero_count, dtype) \
Review comment (Member):

I was hoping to avoid these giant inline macros, maybe a bit out of habit in NumPy, but I don't really like them, and you could make a repeat for the dtype. You can use such a repeat even inside the "dispatcher" function body, and here itself; then these become proper functions.

Now, I am not quite sure it will actually end up much better, since we don't need the functions (as in, functions you can get a pointer to here).

Reply (Contributor, Author):

I have converted C and F style loops into functions and condensed the dispatcher function as you suggested @seberg .

Reply (Contributor, Author):

Just added the benchmarks as well. With the SIMD-optimized count_nonzero (applicable to bool and int), np.nonzero(array) is faster than np.nonzero(array > 0.5).

Reply (Contributor, Author):

By the way, what's wrong with CI's smoke test? Looks like there is a problem with pip.

@touqir14 (Contributor, Author) commented:

@mattip , I have fixed the errors. Would you be able to review this?

@mattip (Member) commented Feb 25, 2021

Hmm. @seberg I would think this refactor should bring nonzero closer to the ufunc style of code, no? Or is this PR a first step in that direction?

@seberg (Member) commented Feb 25, 2021

@mattip you are right, it's a step in the opposite direction unfortunately, although not a particularly big one. The problem is that nonzero is slightly special, because it has to allocate the correct output shape first (using count_nonzero).

I couldn't think of an obvious way to wrangle it into a gufunc, but maybe that is also because we don't have many gufuncs yet, so considering how massive the speedup is, I was willing to make an exception with the hope we can consolidate it again.

@mattip (Member) commented Feb 25, 2021

ok, thanks

@touqir14 (Contributor, Author) commented Mar 3, 2021

Any progress with the review? @seberg @mattip

Base automatically changed from master to main March 4, 2021 02:05
@touqir14 (Contributor, Author) commented Mar 5, 2021

Close/reopen to restart CI

@touqir14 touqir14 closed this Mar 5, 2021
@touqir14 touqir14 reopened this Mar 5, 2021
@touqir14 (Contributor, Author) commented Mar 5, 2021

CI configuration needs updating. The errors are not due to this PR.

@seberg (Member) left a review:

Thanks, I trust that the code does the right thing. I am sorry about the bazillion style nits, but the code is pretty liberal with breaking the NumPy style guidelines.

I am wondering if you can't rewrite the while loops as for loops (potentially with an empty test field). Also, I think this might be one place where using NPY_GCC_OPT_3 is worthwhile, and likely a much bigger difference than manually optimizing away a single if.

I would be happier if we could write the if here; it just makes for easier-to-read code. But if you have timed it and the micro-optimization is worth it even with -O3, then that is fine as well.

If it's super annoying, I could try to do some of the style fixups for you, I guess.

npy_intp a = 0;
while (added_count < nonzero_count) {
*idxs = a;
bool to_increment = ((bool) *data);
Review comment (Member):

I am pretty sure this is better written as != 0. Casting NaN or inf to bool might set the "invalid" floating-point exception, which we do not want here. We probably never actually check the FPE flags here, but I still think it's better.

Can you write this to_increment as an if? I trust the compiler can generate roughly the same code; this is not some important micro-optimization, is it?
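For reference, nonzero must count NaN and inf as nonzero, which the != 0 comparison handles naturally (NaN != 0 is True) without any float-to-bool cast; a quick check of the required semantics:

```python
import numpy as np

x = np.array([0.0, np.nan, np.inf, -0.0, 1.0])
# NaN and inf are nonzero; both 0.0 and -0.0 are zero.
assert list(np.nonzero(x)[0]) == [1, 2, 4]
assert list(np.nonzero(x != 0)[0]) == [1, 2, 4]
```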

/**begin repeat
*
 * #dtype = npy_bool, npy_ubyte, npy_byte, npy_uint16, npy_int16, npy_uint32, npy_int32, npy_uint64, npy_int64, npy_float, npy_double#
* #name = bool, u8, i8, u16, i16, u32, i32, u64, i64, f32, f64#
Review comment (Member):

Long lines everywhere. Please stay below 80 characters, unless the overhang is small and wrapping seems untidy.



#define ptr_assignment1(ptr1, val1, stride1) *ptr1 = val1; ptr1 += stride1;
#define ptr_assignment2(ptr1, val1, stride1, ptr2, val2, stride2) ptr_assignment1(ptr1, val1, stride1) *ptr2 = val2; ptr2 += stride2;
Review comment (Member):

To be honest, I can't parse well what "ptr_assignment2" means exactly. The first macro seems OK to just inline, and probably the second as well.


/**begin repeat1
*
* #layout=C,F#
Review comment (Member):

Didn't we say not to deal with F right now?

bool to_increment = ((bool) *data);
idxs += ((int) to_increment);
added_count += ((int) to_increment);
data = (@dtype@ *) (((char*) data) + stride);
Review comment (Member):

data is typed, so += will work just fine. If this is C-contiguous, data++ will work. But I think it is good if this is triggered for all 1-D arrays.

assert_equal(nzs, 0)



Review comment (Member):

Excessive whitespace. There should be one empty line between defs in classes. We are pretty strict on PEP 8.

assert_equal(np.count_nonzero(y_view), len(idxs_0))
nzs = 0
for i,j in zip(idxs_0, idxs_1):
if y_view[i,j] == 0:
Review comment (Member):

Suggested change
if y_view[i,j] == 0:
if y_view[i, j] == 0:

for i,j in zip(idxs_0, idxs_1):
if x_view[i,j] == 0:
nzs += 1
assert_equal(nzs, 0)
Review comment (Member):

You can just use assert nzs == 0 if nzs is a Python scalar.


for i in range(iters): # check for slices
x = ((2**33)*np.random.randn(10, 10)).astype(np.int32)
x.flat[[2,6,9,15,21]] = 0
Review comment (Member):

Suggested change
x.flat[[2,6,9,15,21]] = 0
x.flat[[2, 6, 9, 15, 21]] = 0

param_names = ["dtype", "shape"]

def setup(self, dtype, size):
self.x = np.random.randint(0,2,size=size).astype(dtype)
Review comment (Member):

We also need benchmarks for "sparse" and "nonsparse" data.

That is, please generate a dataset which has very few nonzeros (maybe only a single 1 somewhere in the middle or at the end), and a dataset in which everything is 1.
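Hypothetical generators for those two benchmark datasets (the names are illustrative, not from the PR):

```python
import numpy as np

def make_sparse(shape):
    # Nearly all zeros: a single nonzero in the middle exercises early exits.
    x = np.zeros(shape, dtype=bool)
    x.flat[x.size // 2] = True
    return x

def make_dense(shape):
    # Everything nonzero: the output is as large as the input.
    return np.ones(shape, dtype=bool)

assert np.count_nonzero(make_sparse((1000, 1000))) == 1
assert np.count_nonzero(make_dense((1000, 1000))) == 1000 * 1000
```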

Review comment (Member):

It might also be good to seed this, just to stabilize benchmarks a bit (although they tend to fluctuate absurdly often anyway).
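Seeding the benchmark data could look like this (default_rng is the modern API; the seed value is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1234)   # fixed seed -> identical data every run
x = rng.integers(0, 2, size=(1000, 1000)).astype(bool)

# The same seed always reproduces the same array.
y = np.random.default_rng(1234).integers(0, 2, size=(1000, 1000)).astype(bool)
assert np.array_equal(x, y)
```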

while (added_count < nonzero_count && b<size_1) {
*idxs_1 = b;
npy_uintp to_increment = (npy_uintp) ((bool) *data);
idxs_1 += (idxs_stride & (-to_increment));
Review comment (Member):

Oh, I forgot to press enter here, I think. This deserved a 🤯 from me... I somewhat hope we could just do an if (*data != 0) and let the compiler do its magic, but if you say it's worthwhile I am happy. I did try with godbolt (I couldn't find the "share link" today), and I admit the compiler does produce less code with this trick.

(I would like to make sure it is worthwhile with -O3 macro used)
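The mask trick being discussed can be sketched with Python integers (two's-complement & semantics carry over: -1 acts as an all-ones mask):

```python
def masked_add(acc, stride, flag):
    # flag is 0 or 1; -flag is -1 (all ones) when set, 0 otherwise,
    # so `stride & -flag` adds either `stride` or nothing, with no branch.
    return acc + (stride & -flag)

assert masked_add(10, 8, 1) == 18   # flag set: pointer advances
assert masked_add(10, 8, 0) == 10   # flag clear: pointer stays
```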

for i,j in zip(idxs_0, idxs_1):
if y_view[i,j] == 0:
nzs += 1
assert_equal(nzs, 0)
Review comment (Member):

And one more comment that got lost. These tests seem very repetitive; can you try to use pytest.mark.parametrize or create a helper function like check_nonzero_result?
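A sketch of such a helper along the lines suggested (the name check_nonzero_result and its exact assertions are illustrative):

```python
import numpy as np

def check_nonzero_result(x):
    # One assertion bundle replacing the repeated per-test loops.
    idxs = np.nonzero(x)
    assert len(idxs) == x.ndim                  # one index array per axis
    assert len(idxs[0]) == np.count_nonzero(x)  # correct output length
    assert np.all(x[idxs] != 0)                 # every index hits a nonzero

check_nonzero_result(np.array([[0, 1], [2, 0]]))
check_nonzero_result(np.zeros((3, 3), dtype=np.int64))
```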

@dgiger42 commented Jul 9, 2021

@seberg It looks like there hasn't been any activity on this PR since March. Since I'd like to be able to use this optimization, would it make any sense for me to fork the repo this PR is from, attempt to address your feedback, and make a new PR?

@charris (Member) commented Jul 10, 2021

@danielg1111 It is generally OK as long as the original work is credited, usually by working on top of existing commits. It would smooth the way if @touqir14 weighed in and gave permission.

@touqir14 (Contributor, Author) commented:

@danielg1111 @charris I haven't pushed some of my local updates yet which I hope to do in the next few days. I have been a little busy and haven't worked on this for a while. @danielg1111 if you can wait a bit this PR should be completed.

@GaryJHarveyJr commented:
A faster nonzero would definitely speed up some code I am writing to work in Avizo. Does anyone know if these fixes will be available anytime soon? Thanks!

@eendebakpt (Contributor) commented:

@touqir14 @danielg1111 The PR has been stale for some time, but it would be nice to get it going again. Does either of you want to continue with it at the moment?

I checked with a recent numpy, and there is still room for improvement in nonzero.

@touqir14 (Contributor, Author) commented:

I sincerely apologize to the NumPy team for dragging this PR out for so long. I have been quite busy with my work and some of my side projects. I have research paper submissions on May 3rd, and right after that I will make this my priority, @eendebakpt. Thank you for understanding.

@eendebakpt (Contributor) commented:

@touqir14 No problem. Ping me once you start working, and I'll help review any updates.

@seberg (Member) commented May 30, 2022

Just to note: since we now have quite a bit of C++ code elsewhere, it might make sense to aim for C++. And maybe it would have been nicer from the start to just create a new file for it (even nonzero.cpp).
(I don't remember if that would mean a full round from the start, but back then we didn't really have C++ yet...)

@seberg added the "64 - Good Idea" label (Inactive PR with a good start or idea. Consider studying it if you are working on a related issue.) Nov 2, 2022
@seberg (Member) commented Dec 14, 2022

I am going to close the PR since it is stale. I do think it is a great idea, though, and very much worthwhile! Please do not hesitate to open a new PR based on this!

But at this point it seems like a small project to continue unfortunately. @seiko2plus just in case this grabs your interest.

EDIT: Just to note, this was discussed on the triage call today.

@seberg seberg closed this Dec 14, 2022
@ogencoglu commented:
This would be great to have. Shame that it is closed.

@eendebakpt (Contributor) commented:

Since the PR has been stale for a long time, I intend to pick up the work. I will start by making PRs for the benchmarks and unit tests. Then I will make a PR for the most important cases: C-contiguous float64 and int64. If someone is still interested in picking this up, let me know.

Labels: 03 - Maintenance · 64 - Good Idea