WIP, MAINT: NeighborsBase KDTree upstream #31358

Open
wants to merge 1 commit into base: main
Conversation

tylerjereddy
Contributor

  • This is an extension of the concept in MAINT: mutual information using upstream KDTree #31347: here, part of the usage of the in-house KDTree in NeighborsBase is replaced by its upstream counterpart from SciPy. This is a much more challenging effort that clearly exposes some substantial differences between the two KDTree APIs/methods and the shims needed to bridge them. At the moment there is still a small number of residual test failures (29 locally) in the full test suite.

  • Some kind of API unification/equivalence of offerings seems likely to be needed for these kinds of replacements to be sustainable (the shims added here were quite time consuming to figure out). Some of the test expectations may also be debatable for cases with, e.g., degenerate input. A minimal sketch of the main API difference follows below.
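
For readers following along, here is a minimal standalone sketch (not part of this diff) of the API gap being worked around: the in-house sklearn.neighbors.KDTree exposes query_radius with optional distances, while SciPy's scipy.spatial.KDTree splits that into query_ball_point (indices only) and query with a distance_upper_bound (fixed-width, inf-padded output). All names below are illustrative.

import numpy as np
from scipy.spatial import KDTree as spKDTree      # upstream tree
from sklearn.neighbors import KDTree as skKDTree  # in-house tree

rng = np.random.default_rng(0)
X = rng.random((50, 3))
radius = 0.3

# In-house tree: a single radius query returns ragged per-point index
# arrays, optionally paired with the corresponding distances.
sk_tree = skKDTree(X, leaf_size=30)
ind, dist = sk_tree.query_radius(X[:5], r=radius, return_distance=True)

# Upstream tree: the closest analogue is query_ball_point (indices only);
# distances within the radius have to be recovered separately, e.g. via a
# k-nearest query capped by distance_upper_bound, which pads missing
# neighbors with distance == inf and index == n.
sp_tree = spKDTree(X, leafsize=30)
ind_sp = sp_tree.query_ball_point(X[:5], r=radius)
dd, ii = sp_tree.query(X[:5], k=10, distance_upper_bound=radius)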


❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


ruff check

ruff detected issues. Please run ruff check --fix --output-format=full locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


sklearn/neighbors/_base.py:1308:89: E501 Line too long (92 > 88)
     |
1306 |                 nn_vals.append(len(sub_arr))
1307 |             if return_distance:
1308 |                 dd, ii = self._sp_tree.query(X, k=max(nn_vals), distance_upper_bound=radius)
     |                                                                                         ^^^^ E501
1309 |                 dd_new = []
1310 |                 ii_new = []
     |

sklearn/neighbors/_base.py:1326:89: E501 Line too long (98 > 88)
     |
1324 |                 ii = ii_new
1325 |                 try:
1326 |                     chunked_results = [(np.asarray(ii, dtype=int), np.asarray(dd, dtype=X.dtype))]
     |                                                                                         ^^^^^^^^^^ E501
1327 |                 except ValueError:
1328 |                     chunked_results = [(np.asarray(ii, dtype=object), np.asarray(dd, dtype=object))]
     |

sklearn/neighbors/_base.py:1328:89: E501 Line too long (100 > 88)
     |
1326 |                     chunked_results = [(np.asarray(ii, dtype=int), np.asarray(dd, dtype=X.dtype))]
1327 |                 except ValueError:
1328 |                     chunked_results = [(np.asarray(ii, dtype=object), np.asarray(dd, dtype=object))]
     |                                                                                         ^^^^^^^^^^^^ E501
1329 |             else:
1330 |                 for idx, sub_ele in enumerate(chunked_results[0]):
     |

Found 3 errors.

ruff format

ruff detected issues. Please run ruff format locally and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.11.7.


--- sklearn/neighbors/_base.py
+++ sklearn/neighbors/_base.py
@@ -1305,7 +1305,9 @@
             for sub_arr in chunked_results[0]:
                 nn_vals.append(len(sub_arr))
             if return_distance:
-                dd, ii = self._sp_tree.query(X, k=max(nn_vals), distance_upper_bound=radius)
+                dd, ii = self._sp_tree.query(
+                    X, k=max(nn_vals), distance_upper_bound=radius
+                )
                 dd_new = []
                 ii_new = []
                 for i in range(len(dd)):
@@ -1323,9 +1325,13 @@
                 dd = dd_new
                 ii = ii_new
                 try:
-                    chunked_results = [(np.asarray(ii, dtype=int), np.asarray(dd, dtype=X.dtype))]
+                    chunked_results = [
+                        (np.asarray(ii, dtype=int), np.asarray(dd, dtype=X.dtype))
+                    ]
                 except ValueError:
-                    chunked_results = [(np.asarray(ii, dtype=object), np.asarray(dd, dtype=object))]
+                    chunked_results = [
+                        (np.asarray(ii, dtype=object), np.asarray(dd, dtype=object))
+                    ]
             else:
                 for idx, sub_ele in enumerate(chunked_results[0]):
                     chunked_results[0][idx] = np.sort(chunked_results[0][idx])

1 file would be reformatted, 917 files already formatted

Generated for commit: b29ab20. Link to the linter CI: here

self._sp_tree = spKDTree(
    X,
    self.leaf_size,
)
Contributor Author

Replacing _tree itself was much messier, so I tried to keep this scoped for prototyping.
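
For context, a hypothetical sketch of the scoped approach described above: the upstream tree is built next to the existing in-house tree during fitting rather than replacing self._tree outright, so only the prototyped code path needs to change. The class below is illustrative, not the actual NeighborsBase code.

from scipy.spatial import KDTree as spKDTree
from sklearn.neighbors import KDTree as skKDTree

class NeighborsSketch:
    """Illustrative stand-in for the prototyping approach, not the real class."""

    def __init__(self, leaf_size=30):
        self.leaf_size = leaf_size

    def _fit(self, X):
        # Keep the in-house tree so most existing code paths stay untouched...
        self._tree = skKDTree(X, leaf_size=self.leaf_size)
        # ...and add the upstream tree alongside it for the prototyped
        # radius-neighbors path (leafsize is SciPy's second positional argument).
        self._sp_tree = spKDTree(X, self.leaf_size)
        return self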

for sub_arr in chunked_results[0]:
    nn_vals.append(len(sub_arr))
if return_distance:
    dd, ii = self._sp_tree.query(X, k=max(nn_vals), distance_upper_bound=radius)
Contributor Author

Needing to call query_ball_point above and query here demonstrates how the API differences between the two KDTree implementations make substituted workflows awkward.
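
As a standalone illustration of that two-call pattern (assumed shape only; the actual shim in this diff differs in its surrounding details), query_ball_point supplies the ragged per-point neighbor counts, and query then retrieves distances for a fixed k sized to the largest neighborhood:

import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
X = rng.random((100, 2))
radius = 0.2
tree = KDTree(X)

# First call: ragged lists of in-radius indices, one list per query point.
neigh_ind = tree.query_ball_point(X, r=radius)
nn_vals = [len(ind) for ind in neigh_ind]

# Second call: a k-nearest query sized to the largest neighborhood, with the
# same radius as an upper bound; shorter rows are padded with inf distances
# and index == n.
dd, ii = tree.query(X, k=max(nn_vals), distance_upper_bound=radius)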

if sort_results:
    sort_inds = np.argsort(finite_indices)
    sorted_inds = finite_indices[sort_inds]
    sorted_dists = finite_dists[sort_inds]
Contributor Author

The two trees also differ in how they handle ragged data structures and inf/invalid values, which requires these additional shims.
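
A hedged sketch of the clean-up this implies (illustrative names; the real shim may differ): SciPy's query pads every row to length k with distance == inf and index == n, so the padding has to be stripped and the surviving entries re-sorted to recover a ragged, in-house-style result.

import numpy as np
from scipy.spatial import KDTree

X = np.random.default_rng(0).random((100, 2))
tree = KDTree(X)
dd, ii = tree.query(X, k=5, distance_upper_bound=0.1)

neigh_ind, neigh_dist = [], []
for dist_row, ind_row in zip(dd, ii):
    finite = np.isfinite(dist_row)          # drop the inf / index == n padding
    finite_dists = dist_row[finite]
    finite_indices = ind_row[finite]
    sort_inds = np.argsort(finite_indices)  # mirror the index-based sort above
    neigh_ind.append(finite_indices[sort_inds])
    neigh_dist.append(finite_dists[sort_inds])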

@tylerjereddy
Contributor Author

Some of this work was done in person with @virchan today.
