-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
WIP, MAINT: NeighborsBase KDTree
upstream
#31358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
WIP, MAINT: NeighborsBase KDTree
upstream
#31358
Conversation
* This is an extension of the concept in scikit-learngh-31347--here, part of the usage of in-house `KDTree` in `NeighborsBase` is replaced by its upstream version from SciPy. This is a much more challenging effort that clearly shows some substantial differences between the two `KDTree` APIs/methods, and the shims needed to address them. At the moment, there is still a small number of residual test failures (29 locally) in the full testsuite. * Some kind of API unification/equivalence of offerings seems likely to be needed for these kinds of replacements to be more sustainable (the shims added here were quite time consuming to figure out). Some of the test expectations may also be debatable for cases with i.e., degenerate input.
❌ Linting issuesThis PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling You can see the details of the linting issues under the
|
self._sp_tree = spKDTree( | ||
X, | ||
self.leaf_size, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replacing _tree
itself was much messier, so I tried to keep this scoped for prototyping.
for sub_arr in chunked_results[0]: | ||
nn_vals.append(len(sub_arr)) | ||
if return_distance: | ||
dd, ii = self._sp_tree.query(X, k=max(nn_vals), distance_upper_bound=radius) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needing to call query_ball_point
above and query
here is a demonstration of API differences in KDTree
between our libraries causing awkwardness in substituted workflows.
if sort_results: | ||
sort_inds = np.argsort(finite_indices) | ||
sorted_inds = finite_indices[sort_inds] | ||
sorted_dists = finite_dists[sort_inds] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different handling of ragged data structures and inf
/invalid values also seems to be present, requiring these additional shims.
Some of this work was done in person with @virchan today. |
This is an extension of the concept in MAINT: mutual information using upstream KDTree #31347--here, part of the usage of in-house
KDTree
inNeighborsBase
is replaced by its upstream version from SciPy. This is a much more challenging effort that clearly shows some substantial differences between the twoKDTree
APIs/methods, and the shims needed to address them. At the moment, there is still a small number of residual test failures (29 locally) in the full testsuite.Some kind of API unification/equivalence of offerings seems likely to be needed for these kinds of replacements to be more sustainable (the shims added here were quite time consuming to figure out). Some of the test expectations may also be debatable for cases with i.e., degenerate input.