MAINT: mutual information using upstream KDTree #31347
Conversation
Force-pushed from 90d9ed9 to d30b4fc (compare)
ny = kd.query_radius(y, radius, count_only=True, return_distance=False)
kd = KDTree(y)
ny = kd.query_ball_point(y, radius, p=np.inf)
ny = [len(sub_list) for sub_list in ny]
Here and elsewhere in this diff, we can do even better (see below). It seems that the two APIs have effectively developed via convergent evolution to offer similar options with different names.
kd = KDTree(y)
- ny = kd.query_ball_point(y, radius, p=np.inf)
- ny = [len(sub_list) for sub_list in ny]
+ ny = kd.query_ball_point(y, radius, p=np.inf, return_length=True)
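For context, here is a minimal sketch (not part of the diff) of the equivalence this suggestion relies on: the counts from the in-house tree's `query_radius(..., count_only=True)` and the upstream tree's `query_ball_point(..., p=np.inf, return_length=True)` should agree when both use the Chebyshev (max-norm) metric that the replacement queries with via `p=np.inf`. The data and variable names below are made up for illustration.

```python
# Illustrative sketch: both trees should report identical neighbor counts
# within a Chebyshev (max-norm) radius. Data and names are synthetic.
import numpy as np
from scipy.spatial import KDTree as SciPyKDTree
from sklearn.neighbors import KDTree as SklearnKDTree

rng = np.random.default_rng(0)
y = rng.standard_normal((200, 2))
radius = 0.3

# sklearn: in-house KDTree with an explicit Chebyshev metric
skl_tree = SklearnKDTree(y, metric="chebyshev")
counts_skl = skl_tree.query_radius(y, radius, count_only=True)

# SciPy: upstream KDTree; p=np.inf selects the same max-norm metric, and
# return_length=True returns counts directly (no Python-level len() loop)
scipy_tree = SciPyKDTree(y)
counts_scipy = scipy_tree.query_ball_point(y, radius, p=np.inf, return_length=True)

np.testing.assert_array_equal(counts_skl, counts_scipy)
```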
* This patch aims to test the waters for reducing community duplication of effort/maintenance with KDTree. In particular, a few select use cases of the `sklearn` in-house `KDTree` by the mutual information infrastructure are replaced with upstream `KDTree` from SciPy. The full test suite appears to pass locally.
Force-pushed from d30b4fc to da64dc8 (compare)
I applied the simplification mentioned at #31347 (comment) and tests were still happy locally.
* This is an extension of the concept in scikit-learn gh-31347: here, part of the usage of the in-house `KDTree` in `NeighborsBase` is replaced by its upstream version from SciPy. This is a much more challenging effort that clearly shows some substantial differences between the two `KDTree` APIs/methods, and the shims needed to address them. At the moment, there is still a small number of residual test failures (29 locally) in the full test suite.
* Some kind of API unification/equivalence of offerings seems likely to be needed for these kinds of replacements to be more sustainable (the shims added here were quite time consuming to figure out). Some of the test expectations may also be debatable for cases with, e.g., degenerate input.
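To make the "substantial differences between the two `KDTree` APIs/methods" concrete, here is one representative, hedged example (not taken from the patch itself): the two `query` methods return differently shaped results for `k=1`, so a shim has to either reshape or pass `k=[1]`. The data below is synthetic and purely illustrative.

```python
# Sketch of one representative API difference between the two KDTree
# implementations; data and names are illustrative, not from the patch.
import numpy as np
from scipy.spatial import KDTree as SciPyKDTree
from sklearn.neighbors import KDTree as SklearnKDTree

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))

# sklearn keeps a 2-D (n_queries, k) shape even for k=1 ...
dist_skl, ind_skl = SklearnKDTree(X).query(X, k=1)
print(dist_skl.shape, ind_skl.shape)      # (50, 1) (50, 1)

# ... while SciPy squeezes the last axis when k is the scalar 1.
dist_sp, ind_sp = SciPyKDTree(X).query(X, k=1)
print(dist_sp.shape, ind_sp.shape)        # (50,) (50,)

# Passing k as a length-1 sequence restores the (n_queries, 1) shape.
dist_sp2, ind_sp2 = SciPyKDTree(X).query(X, k=[1])
print(dist_sp2.shape, ind_sp2.shape)      # (50, 1) (50, 1)
```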
If scikit-learn can upstream all its requirements for KDTree and friends into SciPy, then I would be for using SciPy's KDTree. Given the differences between scikit-learn's KDTree and SciPy's KDTree, I suspect this will be a heavy lift.
Cc'ing @glemaitre in advance, as this may affect imbalanced-learn.
This patch aims to test the waters for reducing community duplication of effort/maintenance with `KDTree`. In particular, a few select use cases of the `sklearn` in-house `KDTree` by the mutual information infrastructure are replaced with upstream `KDTree` from SciPy. The full test suite appears to pass locally.

If there is an appetite for this, we could spend some time on it (scientific-python/summit-2025#31). I've placed some more detailed discussion points below. This may also interest @sturlamolden.
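For orientation, the public entry points whose internals this touches are the k-NN based mutual information estimators in `sklearn.feature_selection`; their observable behavior should be unchanged by the swap (the test suite passing is the evidence for that). A small, purely illustrative usage sketch on synthetic data:

```python
# The KDTree swap is internal to the k-NN based mutual information
# estimators; callers keep using the same public API. Synthetic data only.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = X[:, 0] + 0.1 * rng.standard_normal(500)  # only feature 0 is informative

mi = mutual_info_regression(X, y, n_neighbors=3, random_state=0)
print(mi.shape)  # (3,) -- one non-negative MI estimate per feature
```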
More detailed Analysis of the Situation

Potential Advantages of Doing This:
* The SciPy `KDTree` has been heavily optimized and supports concurrency in many places that the `sklearn` version does not, as far as I can tell.
* Contributors working on `KDTree` could focus their efforts on a single (or at least, fewer) implementation(s). This is the main advantage I had in mind.

Considerations, Drawbacks, Points that are not clear:
* It may be best to leave the existing `KDTree` sources and offerings alone for quite some time to see how that goes first, progressively performing replacements in cases where there's an obvious benefit, or at least no loss of performance or functionality.
* The `KDTree` code is quite complex. The features offered by the two libraries are not identical, and if you have a code base using two types of `KDTree`, it may temporarily make things even more complex.
* There don't appear to be `asv` benchmarks for your `KDTree` implementation? We'd want to add those to avoid performance regressions, I suspect, and/or to demonstrate improvements where SciPy adds the `workers`/concurrency option (a small sketch of that option follows this list).
* `sklearn` uses a "base" `BinaryTree` that is "specialized" to `KDTree` and `BallTree` in an object-oriented style. SciPy doesn't have `BallTree` (should it?). How would this be coordinated if there was an appetite for upstreaming?
* What about the `sklearn` activity surrounding 32- and 64-bit "variations" or type preservation with the binary tree infrastructure? I'm not so sure how this would fly/work for SciPy.
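To make the `workers`/concurrency point above concrete, here is a small sketch of the upstream option that the `sklearn` `KDTree` does not appear to expose. The data is synthetic and any speedup depends on problem size and hardware; this is illustrative only.

```python
# SciPy's KDTree query methods accept a `workers` argument; workers=-1 uses
# all available CPU cores. The sklearn KDTree does not appear to offer an
# equivalent option. Data here is synthetic and illustrative.
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.standard_normal((100_000, 3))
tree = KDTree(points)

# Serial query (default).
counts_serial = tree.query_ball_point(points, r=0.1, p=np.inf, return_length=True)

# Same query spread across all available cores; results are identical.
counts_parallel = tree.query_ball_point(
    points, r=0.1, p=np.inf, return_length=True, workers=-1
)

assert np.array_equal(counts_serial, counts_parallel)
```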