Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

MAINT: mutual information using upstream KDTree #31347

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
Loading
from

Conversation

tylerjereddy
Copy link
Contributor

@tylerjereddy tylerjereddy commented May 9, 2025

  • This patch aims to test the waters for reducing community duplication of effort/maintenance with KDTree. In particular, a few select use cases of the sklearn in-house KDTree by the mutual information infrastructure are replaced with upstream KDTree from SciPy. The full test suite appears to pass locally.

If there is an appetite for this, we could spend some time on it (scientific-python/summit-2025#31). I've placed some more detailed discussion points below. This may also interest @sturlamolden.


More detailed Analysis of the Situation

Potential Advantages of Doing This:

  1. The SciPy version of KDTree has been heavily optimized and supports concurrency in many places that the sklearn version does not, as far as I can tell.
  2. Reduced duplication of effort--community folks who are experts on KDTree could focus their efforts on a single (or at least, less) implementation(s). This is the main one I had in mind.

Considerations, Drawbacks, Points that are not clear:

  1. An obvious question is what "replacing" would even mean and how far it would go--we could just replace a few obvious cases to start, like here, but leave the in-house KDTree sources and offerings alone for quite some time to see how that goes first, progressively performing replacements in cases where there's an obvious benefit, or at least no loss of performance or functionality.
  2. The switchover would require reviewer bandwidth, and KDTree code is quite complex. The features offered by both libraries are not identical, and if you have a code base using two types of KDTree it may temporarily make things even more complex.
  3. (minor) you don't have asv benchmarks for your KDTree implementation I don't think? We'd want to add that to avoid performance regressions I suspect, and/or to demonstrate improvements where SciPy adds the workers/concurrency option.
  4. A design challenge is that sklearn uses a "base" BinaryTree that is "specialized" to KDTree and BallTree in an object-oriented style. SciPy doesn't have BallTree (should it?)--how would this be coordinated if there was an appetite for upstreaming?
  5. There appears to be some sklearn activity surrounding 32- and 64-bit "variations" or type preservation with the binary tree infrastructure? I'm not so sure how this would fly/work for SciPy.
  6. Compatibility with pipelines may require quite a bit more thought when upstreaming.

Copy link

github-actions bot commented May 9, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: da64dc8. Link to the linter CI: here

ny = kd.query_radius(y, radius, count_only=True, return_distance=False)
kd = KDTree(y)
ny = kd.query_ball_point(y, radius, p=np.inf)
ny = [len(sub_list) for sub_list in ny]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere in this diff, we can do even better (see below). It seems that the two APIs have effectively developed via convergent evolution to offer similar options with different names.

     kd = KDTree(y)
-    ny = kd.query_ball_point(y, radius, p=np.inf)
-    ny = [len(sub_list) for sub_list in ny]
+    ny = kd.query_ball_point(y, radius, p=np.inf, return_length=True)

* This patch aims to test the waters for reducing
community duplication of effort/maintenance with
KDTree. In particular, a few select use cases
of the `sklearn` in-house `KDTree` by the mutual information
infrastructure are replaced with upstream `KDTree` from SciPy.
The full test suite appears to pass locally.
@tylerjereddy tylerjereddy force-pushed the treddy_kdtree_txn_1 branch from d30b4fc to da64dc8 Compare May 12, 2025 00:01
@tylerjereddy
Copy link
Contributor Author

I applied the simplification mentioned at #31347 (comment) and tests were still happy locally.

tylerjereddy added a commit to tylerjereddy/scikit-learn that referenced this pull request May 12, 2025
* This is an extension of the concept in scikit-learngh-31347--here,
part of the usage of in-house `KDTree` in `NeighborsBase`
is replaced by its upstream version from SciPy. This is a much
more challenging effort that clearly shows some substantial
differences between the two `KDTree` APIs/methods, and the shims
needed to address them. At the moment, there is still a small
number of residual test failures (29 locally) in the full testsuite.

* Some kind of API unification/equivalence of offerings seems likely
to be needed for these kinds of replacements to be more sustainable (the
shims added here were quite time consuming to figure out). Some of the
test expectations may also be debatable for cases with i.e., degenerate
input.
@thomasjpfan
Copy link
Member

If scikit-learn can upstream all it's requirements for KDTree and friends into SciPy, then I would be for using SciPy's KDTree. Given the differences between scikit-learn's KDTree and SciPy's KDTree, I suspect this will be a heavy lift.

@virchan
Copy link
Member

virchan commented May 13, 2025

Cc'ing @glemaitre in advance, as this may affect imbalanced-learn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.