MAINT: mutual information using upstream KDTree #31347
Conversation
Force-pushed from 90d9ed9 to d30b4fc (compare)
ny = kd.query_radius(y, radius, count_only=True, return_distance=False)
kd = KDTree(y)
ny = kd.query_ball_point(y, radius, p=np.inf)
ny = [len(sub_list) for sub_list in ny]
Here and elsewhere in this diff, we can do even better (see below). It seems that the two APIs have effectively developed via convergent evolution to offer similar options with different names.
kd = KDTree(y)
- ny = kd.query_ball_point(y, radius, p=np.inf)
- ny = [len(sub_list) for sub_list in ny]
+ ny = kd.query_ball_point(y, radius, p=np.inf, return_length=True)
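For context, here is a minimal sketch (not part of the diff) of the equivalence this suggestion relies on: the counts from the in-house tree's `query_radius(..., count_only=True)` and the upstream tree's `query_ball_point(..., p=np.inf, return_length=True)` should agree when both use the Chebyshev (max-norm) metric that the replacement queries with via `p=np.inf`. The data and variable names below are made up for illustration.

```python
# Illustrative sketch: both trees should report identical neighbor counts
# within a Chebyshev (max-norm) radius. Data and names are synthetic.
import numpy as np
from scipy.spatial import KDTree as SciPyKDTree
from sklearn.neighbors import KDTree as SklearnKDTree

rng = np.random.default_rng(0)
y = rng.standard_normal((200, 2))
radius = 0.3

# sklearn: in-house KDTree with an explicit Chebyshev metric
skl_tree = SklearnKDTree(y, metric="chebyshev")
counts_skl = skl_tree.query_radius(y, radius, count_only=True)

# SciPy: upstream KDTree; p=np.inf selects the same max-norm metric, and
# return_length=True returns counts directly (no Python-level len() loop)
scipy_tree = SciPyKDTree(y)
counts_scipy = scipy_tree.query_ball_point(y, radius, p=np.inf, return_length=True)

np.testing.assert_array_equal(counts_skl, counts_scipy)
```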
* This patch aims to test the waters for reducing community duplication of effort/maintenance with KDTree. In particular, a few select use cases of the `sklearn` in-house `KDTree` by the mutual information infrastructure are replaced with upstream `KDTree` from SciPy. The full test suite appears to pass locally.
Force-pushed from d30b4fc to da64dc8 (compare)
I applied the simplification mentioned at #31347 (comment) and tests were still happy locally.
* This is an extension of the concept in scikit-learn gh-31347: here, part of the usage of the in-house `KDTree` in `NeighborsBase` is replaced by its upstream version from SciPy. This is a much more challenging effort that clearly shows some substantial differences between the two `KDTree` APIs/methods, and the shims needed to address them. At the moment, there is still a small number of residual test failures (29 locally) in the full test suite.
* Some kind of API unification/equivalence of offerings seems likely to be needed for these kinds of replacements to be more sustainable (the shims added here were quite time consuming to figure out). Some of the test expectations may also be debatable for cases with, e.g., degenerate input.
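To make the "substantial differences between the two `KDTree` APIs/methods" concrete, here is one representative, hedged example (not taken from the patch itself): the two `query` methods return differently shaped results for `k=1`, so a shim has to either reshape or pass `k=[1]`. The data below is synthetic and purely illustrative.

```python
# Sketch of one representative API difference between the two KDTree
# implementations; data and names are illustrative, not from the patch.
import numpy as np
from scipy.spatial import KDTree as SciPyKDTree
from sklearn.neighbors import KDTree as SklearnKDTree

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))

# sklearn keeps a 2-D (n_queries, k) shape even for k=1 ...
dist_skl, ind_skl = SklearnKDTree(X).query(X, k=1)
print(dist_skl.shape, ind_skl.shape)      # (50, 1) (50, 1)

# ... while SciPy squeezes the last axis when k is the scalar 1.
dist_sp, ind_sp = SciPyKDTree(X).query(X, k=1)
print(dist_sp.shape, ind_sp.shape)        # (50,) (50,)

# Passing k as a length-1 sequence restores the (n_queries, 1) shape.
dist_sp2, ind_sp2 = SciPyKDTree(X).query(X, k=[1])
print(dist_sp2.shape, ind_sp2.shape)      # (50, 1) (50, 1)
```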
If scikit-learn can upstream all its requirements for KDTree and friends into SciPy, then I would be for using SciPy's KDTree. Given the differences between scikit-learn's KDTree and SciPy's KDTree, I suspect this will be a heavy lift.
Cc'ing @glemaitre in advance, as this may affect imbalanced-learn.
This patch aims to test the waters for reducing community duplication of effort/maintenance with `KDTree`. In particular, a few select use cases of the `sklearn` in-house `KDTree` by the mutual information infrastructure are replaced with upstream `KDTree` from SciPy. The full test suite appears to pass locally.

If there is an appetite for this, we could spend some time on it (scientific-python/summit-2025#31). I've placed some more detailed discussion points below. This may also interest @sturlamolden.
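For orientation, the public entry points whose internals this touches are the k-NN based mutual information estimators in `sklearn.feature_selection`; their observable behavior should be unchanged by the swap (the test suite passing is the evidence for that). A small, purely illustrative usage sketch on synthetic data:

```python
# The KDTree swap is internal to the k-NN based mutual information
# estimators; callers keep using the same public API. Synthetic data only.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
y = X[:, 0] + 0.1 * rng.standard_normal(500)  # only feature 0 is informative

mi = mutual_info_regression(X, y, n_neighbors=3, random_state=0)
print(mi.shape)  # (3,) -- one non-negative MI estimate per feature
```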
More detailed Analysis of the Situation

Potential Advantages of Doing This:
* The SciPy `KDTree` has been heavily optimized and supports concurrency in many places that the `sklearn` version does not, as far as I can tell.
* Contributors working on `KDTree` could focus their efforts on a single (or at least, fewer) implementation(s). This is the main advantage I had in mind.

Considerations, Drawbacks, Points that are not clear:
* It may be best to leave the existing `KDTree` sources and offerings alone for quite some time to see how that goes first, progressively performing replacements in cases where there's an obvious benefit, or at least no loss of performance or functionality.
* The `KDTree` code is quite complex. The features offered by the two libraries are not identical, and if you have a code base using two types of `KDTree`, it may temporarily make things even more complex.
* There don't appear to be `asv` benchmarks for your `KDTree` implementation? We'd want to add those to avoid performance regressions, I suspect, and/or to demonstrate improvements where SciPy adds the `workers`/concurrency option (a small sketch of that option follows this list).
* `sklearn` uses a "base" `BinaryTree` that is "specialized" to `KDTree` and `BallTree` in an object-oriented style. SciPy doesn't have `BallTree` (should it?). How would this be coordinated if there was an appetite for upstreaming?
* What about the `sklearn` activity surrounding 32- and 64-bit "variations" or type preservation with the binary tree infrastructure? I'm not so sure how this would fly/work for SciPy.
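To make the `workers`/concurrency point above concrete, here is a small sketch of the upstream option that the `sklearn` `KDTree` does not appear to expose. The data is synthetic and any speedup depends on problem size and hardware; this is illustrative only.

```python
# SciPy's KDTree query methods accept a `workers` argument; workers=-1 uses
# all available CPU cores. The sklearn KDTree does not appear to offer an
# equivalent option. Data here is synthetic and illustrative.
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.standard_normal((100_000, 3))
tree = KDTree(points)

# Serial query (default).
counts_serial = tree.query_ball_point(points, r=0.1, p=np.inf, return_length=True)

# Same query spread across all available cores; results are identical.
counts_parallel = tree.query_ball_point(
    points, r=0.1, p=np.inf, return_length=True, workers=-1
)

assert np.array_equal(counts_serial, counts_parallel)
```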