-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
PERF Implement PairwiseDistancesReduction
backend for RadiusNeighbors.predict_proba
#26828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Micky774
merged 17 commits into
scikit-learn:main
from
OmarManzoor:radius_neighbors_predict_pwdb
Sep 25, 2023
Merged
Changes from all commits
Commits
Show all changes
17 commits
Select commit
Hold shift + click to select a range
6b2700a
PERF Implement PairwiseDistancesReduction backend for RadiusNeighbors…
OmarManzoor 0ad3127
FIX index issue in _parallel_on_X_prange_iter_finalize
OmarManzoor c46e590
Add tests for wrong method usages
OmarManzoor 04dd4b2
* Use a mem view and numpy array to handle outliers
OmarManzoor 495e97e
Remove tolist() when printing outliers in error
OmarManzoor 9d0fc3e
Merge branch 'main' into radius_neighbors_predict_pwdb
OmarManzoor 65d8b96
Add changelog
OmarManzoor 3154843
Applied PR suggestions
6af6134
Merge branch 'main' into radius_neighbors_predict_pwdb
OmarManzoor a4fa941
Merge branch 'main' into radius_neighbors_predict_pwdb
OmarManzoor 32026e7
Import WeightingStrategy from the common _classmode
OmarManzoor 3a8d391
Merge branch 'main' into radius_neighbors_predict_pwdb
OmarManzoor 61e3235
Merge branch 'main' into radius_neighbors_predict_pwdb
OmarManzoor daf0443
Address PR suggestions
OmarManzoor a973325
Merge branch 'main' into radius_neighbors_predict_pwdb
jjerphan 0615c5f
Merge branch 'main' into radius_neighbors_predict_pwdb
OmarManzoor a1e722d
Updates: PR suggestions
OmarManzoor File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -23,6 +23,10 @@ | |
RadiusNeighbors32, | ||
RadiusNeighbors64, | ||
) | ||
from ._radius_neighbors_classmode import ( | ||
RadiusNeighborsClassMode32, | ||
RadiusNeighborsClassMode64, | ||
) | ||
|
||
|
||
def sqeuclidean_row_norms(X, num_threads): | ||
|
@@ -597,3 +601,153 @@ def compute( | |
"Only float64 or float32 datasets pairs are supported at this time, " | ||
f"got: X.dtype={X.dtype} and Y.dtype={Y.dtype}." | ||
) | ||
|
||
|
||
class RadiusNeighborsClassMode(BaseDistancesReductionDispatcher): | ||
"""Compute radius-based class modes of row vectors of X using the | ||
those of Y. | ||
|
||
For each row-vector X[i] of the queries X, find all the indices j of | ||
row-vectors in Y such that: | ||
|
||
dist(X[i], Y[j]) <= radius | ||
|
||
RadiusNeighborsClassMode is typically used to perform bruteforce | ||
radius neighbors queries when the weighted mode of the labels for | ||
the nearest neighbors within the specified radius are required, | ||
such as in `predict` methods. | ||
Comment on lines
+615
to
+618
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not related to this PR: I think such parts of the doc-string of |
||
|
||
This class is not meant to be instantiated, one should only use | ||
its :meth:`compute` classmethod which handles allocation and | ||
deallocation consistently. | ||
""" | ||
|
||
@classmethod | ||
def valid_metrics(cls) -> List[str]: | ||
excluded = { | ||
# Euclidean is technically usable for RadiusNeighborsClassMode | ||
# but it would not be competitive. | ||
# TODO: implement Euclidean specialization using GEMM. | ||
"euclidean", | ||
"sqeuclidean", | ||
} | ||
return sorted(set(BaseDistancesReductionDispatcher.valid_metrics()) - excluded) | ||
|
||
@classmethod | ||
def compute( | ||
cls, | ||
X, | ||
Y, | ||
radius, | ||
weights, | ||
Y_labels, | ||
unique_Y_labels, | ||
outlier_label, | ||
metric="euclidean", | ||
chunk_size=None, | ||
metric_kwargs=None, | ||
strategy=None, | ||
): | ||
"""Return the results of the reduction for the given arguments. | ||
Parameters | ||
---------- | ||
X : ndarray of shape (n_samples_X, n_features) | ||
The input array to be labelled. | ||
Y : ndarray of shape (n_samples_Y, n_features) | ||
The input array whose class membership is provided through | ||
the `Y_labels` parameter. | ||
radius : float | ||
The radius defining the neighborhood. | ||
weights : ndarray | ||
The weights applied to the `Y_labels` when computing the | ||
weighted mode of the labels. | ||
Y_labels : ndarray | ||
An array containing the index of the class membership of the | ||
associated samples in `Y`. This is used in labeling `X`. | ||
unique_Y_labels : ndarray | ||
An array containing all unique class labels. | ||
outlier_label : int, default=None | ||
Label for outlier samples (samples with no neighbors in given | ||
radius). In the default case when the value is None if any | ||
outlier is detected, a ValueError will be raised. The outlier | ||
label should be selected from among the unique 'Y' labels. If | ||
it is specified with a different value a warning will be raised | ||
and all class probabilities of outliers will be assigned to be 0. | ||
OmarManzoor marked this conversation as resolved.
Show resolved
Hide resolved
|
||
metric : str, default='euclidean' | ||
The distance metric to use. For a list of available metrics, see | ||
the documentation of :class:`~sklearn.metrics.DistanceMetric`. | ||
Currently does not support `'precomputed'`. | ||
chunk_size : int, default=None, | ||
The number of vectors per chunk. If None (default) looks-up in | ||
scikit-learn configuration for `pairwise_dist_chunk_size`, | ||
and use 256 if it is not set. | ||
metric_kwargs : dict, default=None | ||
Keyword arguments to pass to specified metric function. | ||
strategy : str, {'auto', 'parallel_on_X', 'parallel_on_Y'}, default=None | ||
The chunking strategy defining which dataset parallelization are made on. | ||
For both strategies the computations happens with two nested loops, | ||
respectively on chunks of X and chunks of Y. | ||
Strategies differs on which loop (outer or inner) is made to run | ||
in parallel with the Cython `prange` construct: | ||
- 'parallel_on_X' dispatches chunks of X uniformly on threads. | ||
Each thread then iterates on all the chunks of Y. This strategy is | ||
embarrassingly parallel and comes with no datastructures | ||
synchronisation. | ||
- 'parallel_on_Y' dispatches chunks of Y uniformly on threads. | ||
Each thread processes all the chunks of X in turn. This strategy is | ||
a sequence of embarrassingly parallel subtasks (the inner loop on Y | ||
chunks) with intermediate datastructures synchronisation at each | ||
iteration of the sequential outer loop on X chunks. | ||
- 'auto' relies on a simple heuristic to choose between | ||
'parallel_on_X' and 'parallel_on_Y': when `X.shape[0]` is large enough, | ||
'parallel_on_X' is usually the most efficient strategy. | ||
When `X.shape[0]` is small but `Y.shape[0]` is large, 'parallel_on_Y' | ||
brings more opportunity for parallelism and is therefore more efficient | ||
despite the synchronization step at each iteration of the outer loop | ||
on chunks of `X`. | ||
- None (default) looks-up in scikit-learn configuration for | ||
`pairwise_dist_parallel_strategy`, and use 'auto' if it is not set. | ||
Returns | ||
------- | ||
probabilities : ndarray of shape (n_samples_X, n_classes) | ||
An array containing the class probabilities for each sample. | ||
""" | ||
if weights not in {"uniform", "distance"}: | ||
raise ValueError( | ||
"Only the 'uniform' or 'distance' weights options are supported" | ||
f" at this time. Got: {weights=}." | ||
) | ||
if X.dtype == Y.dtype == np.float64: | ||
return RadiusNeighborsClassMode64.compute( | ||
X=X, | ||
Y=Y, | ||
radius=radius, | ||
weights=weights, | ||
Y_labels=np.array(Y_labels, dtype=np.intp), | ||
unique_Y_labels=np.array(unique_Y_labels, dtype=np.intp), | ||
outlier_label=outlier_label, | ||
metric=metric, | ||
chunk_size=chunk_size, | ||
metric_kwargs=metric_kwargs, | ||
strategy=strategy, | ||
) | ||
|
||
if X.dtype == Y.dtype == np.float32: | ||
return RadiusNeighborsClassMode32.compute( | ||
X=X, | ||
Y=Y, | ||
radius=radius, | ||
weights=weights, | ||
Y_labels=np.array(Y_labels, dtype=np.intp), | ||
unique_Y_labels=np.array(unique_Y_labels, dtype=np.intp), | ||
outlier_label=outlier_label, | ||
metric=metric, | ||
chunk_size=chunk_size, | ||
metric_kwargs=metric_kwargs, | ||
strategy=strategy, | ||
) | ||
|
||
raise ValueError( | ||
"Only float64 or float32 datasets pairs are supported at this time, " | ||
f"got: X.dtype={X.dtype} and Y.dtype={Y.dtype}." | ||
) |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.