Add reflection to reduce bias near boundary for KernelDensity (fixes #27023) #29370
Conversation
Force-pushed: 4601ce2 → 93be818, then 56e147f → f2054af, then 628ef9b → a67908a.
Thanks for the PR @tillahoffmann. Please ping us in a month if you still don't have a reviewer here. We should be able to allocate some resources then. In the meantime, you don't need to force push commits. We squash and merge at the end anyway. Force pushing makes it hard to see the changes over time.
Sounds like a plan; will get in touch later in the summer.
Hi @adrinjalali, thanks for the help with this PR. As discussed, I wanted to ping regarding a review.
Maybe @jeremiedbb or @adam2392 would be able to have a look?
Maybe @lorentzenchr could have a look here? Or @ogrisel?
Thanks @tillahoffmann for the PR! Could you tell us where this procedure is discussed or justified in Boneva et al., or elsewhere? In particular, I'm curious why we need to reflect over all faces (points, lines, up to facets) of the hypercube in the n-dimensional setting; it seems to me that Boneva et al. only deal with the 1D case. More generally, do we have a mathematical justification for the reflection, or is it just a hack?
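For reference, the textbook 1D reflection estimator on an interval $[a, b]$ (my understanding of the standard formulation, not necessarily the exact form implemented in this PR) is

$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} \left[ K\!\left(\frac{x - x_i}{h}\right) + K\!\left(\frac{x - (2a - x_i)}{h}\right) + K\!\left(\frac{x - (2b - x_i)}{h}\right) \right], \qquad a \le x \le b,$$

i.e. each sample is mirrored across both endpoints so the mass that would leak outside $[a, b]$ is folded back in.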
```python
@@ -187,6 +201,23 @@ def _choose_algorithm(self, algorithm, metric):
        )
        return algorithm

    def _evaluate_hypercube_faces(self):
```
A docstring would be helpful to understand the shape and meaning of the `faces` array. In particular, the nan values seem to play an important role. If I understood correctly, in the unit cube, the point (1, 0, 1) is the face `[1, 0, 1]`, the edge (0, 0, 0)–(0, 0, 1) is the face `[0, 0, nan]`, and the plane through (0, 0, 0), (0, 0, 1), and (0, 1, 0) is the face `[0, nan, nan]`?
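For instance, a sketch of the encoding I have in mind (the helper name, signature, and the choice to exclude the interior are my assumptions, not necessarily what the PR does):

```python
import itertools

import numpy as np


def hypercube_faces(bounds):
    """Enumerate the proper faces of the box given by ``bounds``, shape (n_features, 2).

    Each row encodes one face: a finite entry pins that feature to its lower
    or upper bound, while nan leaves the feature free on that face.
    """
    faces = []
    for choice in itertools.product([0, 1, None], repeat=len(bounds)):
        if all(c is None for c in choice):
            continue  # all features free corresponds to the interior, not a face
        faces.append(
            [np.nan if c is None else bounds[d][c] for d, c in enumerate(choice)]
        )
    return np.asarray(faces)


# The unit square has 3**2 - 1 = 8 proper faces: 4 corners and 4 edges.
print(hypercube_faces(np.array([[0.0, 1.0], [0.0, 1.0]])))
```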
Also I think it would be better to define it as an external function and not a method:

```python
def _evaluate_hypercube_faces(bounds):
    # ...
    return faces
```
> indicate that a feature is not bounded. The computation cost grows exponentially
> with :code:`n_features`.
> The computation cost grows exponentially

That's true and a bit worrisome; maybe we should raise a `UserWarning` when `n_features` is above a threshold? The number of faces grows as `3**n_features`, and since we store the scores for each face, I am wondering if we could run into out-of-memory errors.
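Something along these lines, perhaps (the helper name and the threshold of 15 bounded features are placeholders and would need benchmarking):

```python
import warnings

# Hypothetical threshold; 3**15 is roughly 14 million faces.
MAX_BOUNDED_FEATURES = 15


def _warn_if_many_faces(n_features):
    if n_features > MAX_BOUNDED_FEATURES:
        warnings.warn(
            f"bounds with n_features={n_features} requires evaluating on the "
            f"order of 3**{n_features} = {3**n_features} hypercube faces; "
            "this may be slow or run out of memory.",
            UserWarning,
        )
```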
Is there maybe a version of the algorithm that takes the `bandwidth` into account? My intuition is that only samples close to the boundary need to be fixed/reflected. For compact-support kernels (tophat, cosine, etc.) that's clearly the case: only samples within one `bandwidth` of the boundary would contribute. For the gaussian and exponential kernels, I think the contribution is exponentially suppressed as we move away from the boundary. I am wondering if we are not doing a lot of unnecessary computations in the n-dimensional setting.
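Roughly what I have in mind, as a sketch (the function name and the cutoff of a few bandwidths are assumptions):

```python
import numpy as np


def _near_boundary_mask(X, bounds, bandwidth, n_bandwidths=4.0):
    """Mask of samples within a few bandwidths of any face of the bounds.

    For compact-support kernels one bandwidth suffices; for gaussian or
    exponential kernels the tail beyond a few bandwidths is negligible.
    """
    lower, upper = bounds[:, 0], bounds[:, 1]
    cutoff = n_bandwidths * bandwidth
    return np.any((X - lower < cutoff) | (upper - X < cutoff), axis=1)
```

Only the samples selected by such a mask would need to be reflected.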
```python
        if np.any(upper <= lower):
            raise ValueError(
                f"invalid bounds: upper ({upper}) is not larger than lower "
                f"({lower})"
            )
        if np.any(X < lower):
            raise ValueError("samples must be larger than lower bound")
        if np.any(X > upper):
            raise ValueError("samples must be smaller than upper bound")
```
I would also raise a `ValueError` if `X` does not have `n_features = X.shape[1] = bounds.shape[0]`.
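Something like this, say (just a sketch of the check):

```python
if X.shape[1] != bounds.shape[0]:
    raise ValueError(
        f"X has {X.shape[1]} features, but bounds was defined for "
        f"{bounds.shape[0]} features"
    )
```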
```python
def test_kde_hypercube_faces(n_dims, expected_counts):
    # Check that the number of faces of each kind match theory from
    # https://en.wikipedia.org/wiki/Hypercube#Faces.
    bounds = np.arange(2 * n_dims).reshape((n_dims, 2))
    kde = KernelDensity(bounds=bounds)
    faces = kde._evaluate_hypercube_faces()
    n_nans = np.isnan(faces).sum(axis=1)
    actual_counts = np.bincount(n_nans)
    np.testing.assert_array_equal(actual_counts, expected_counts)
```
Comments could be helpful to explain the test, in line with the docstring of `_evaluate_hypercube_faces`: e.g. `n_nans` is the dimension of the face (an m-cube), and the count of m-cubes in the n-hypercube should match the theory.
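For reference, the counts the test should expect follow from the standard formula on the Wikipedia page it links to (plain Python, just to make the theory explicit):

```python
from math import comb


def expected_face_counts(n_dims):
    # m-dimensional faces of an n-cube: choose which m features are free,
    # then pick lower/upper for each of the remaining n - m features,
    # i.e. comb(n, m) * 2 ** (n - m).
    return [comb(n_dims, m) * 2 ** (n_dims - m) for m in range(n_dims + 1)]


print(expected_face_counts(2))  # [4, 4, 1]: 4 corners, 4 edges, the square itself
print(expected_face_counts(3))  # [8, 12, 6, 1]
```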
```python
def test_kde_bounded_density():
    n_samples, n_features = (100_000, 2)
```
`n_samples=100_000` is quite big. How long does the test take? Could you have a test with a lower `n_samples=100`?
```python
    # Without any bounds, we expect a bias to 0.25 at the corners, 0.5 at the
    # edges, and no bias in the center.
    kde = KernelDensity(bandwidth="scott").fit(X_train)
    np.testing.assert_allclose(
```
Suggested change:

```diff
-    np.testing.assert_allclose(
+    assert_allclose(
```

This equivalent function is imported from `sklearn.utils._testing` at the beginning of `test_kde.py`.
```python
    # With full bounds, we don't get any bias.
    kde = KernelDensity(bandwidth="scott", bounds=[[0, 1], [0, 1]]).fit(X_train)
    np.testing.assert_allclose(
```
Same here and for the following assertions.
```python
    def _more_tags(self):
        return {
            "_xfail_checks": {
                "check_sample_weights_invariance": (
                    "sample_weight must have positive values"
                ),
            }
        }
```
Suggested change:

```diff
-    def _more_tags(self):
-        return {
-            "_xfail_checks": {
-                "check_sample_weights_invariance": (
-                    "sample_weight must have positive values"
-                ),
-            }
-        }
```

This should fix the error in the Azure pipeline test. You can also merge the main branch, because `_more_tags` has been removed since.
This PR introduces a new keyword argument `bounds` for `KernelDensity` to reduce bias on a bounded domain, as discussed in #27023. Specifically, for density estimation, each point to be scored is reflected on the faces of the hypercube defined by `bounds` such that mass that may have "leaked out of the bounds" is accounted for. This method was developed in Boneva et al. (1971).

It may be beneficial to validate that the bounds satisfy `lower_bound < upper_bound`, but I wasn't able to figure out how to achieve that using `_parameter_constraints`. Input much appreciated.
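For illustration, here is a minimal 1D sketch of the reflection idea using the existing public API (done by hand here; the new `bounds` argument would perform the equivalent computation internally):

```python
import numpy as np

from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 1))  # samples on the bounded domain [0, 1]

# Reflect the samples across both boundaries and fit on the augmented data.
X_reflected = np.concatenate([X, -X, 2 - X])
kde = KernelDensity(bandwidth=0.05).fit(X_reflected)

# Rescale by 3 because the augmented sample is three times larger; the
# estimate is only valid for points inside [0, 1].
grid = np.linspace(0, 1, 5)[:, None]
density = 3 * np.exp(kde.score_samples(grid))
print(density)  # close to 1.0 everywhere, including at the boundaries
```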