Add reflection to reduce bias near boundary for KernelDensity (fixes #27023). #29370


Open
wants to merge 5 commits into main
Conversation

tillahoffmann

@tillahoffmann tillahoffmann commented Jun 29, 2024

This PR introduces a new keyword argument bounds for KernelDensity to reduce bias on a bounded domain as discussed in #27023. Specifically, for density estimation, each point to be scored is reflected on the faces of the hypercube defined by bounds such that mass that may have "leaked out of the bounds" is accounted for. This method was developed in Boneva et al. (1971).

It may be beneficial to validate that the bounds satisfy lower_bound < upper_bound, but I wasn't able to figure out how to achieve that using _parameter_constraints. Input much appreciated.
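For intuition, here is a minimal NumPy sketch of the reflection idea in 1D (not the PR's multivariate implementation; the helper names are illustrative): mass that a plain Gaussian KDE would place beyond a boundary is folded back by also evaluating the estimate at each query point mirrored on the two bounds.

```python
import numpy as np

def gaussian_kde(x, samples, bandwidth):
    # Plain 1D Gaussian KDE evaluated at the points x.
    z = (x[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

def reflected_kde(x, samples, bandwidth, lower, upper):
    # Reflection correction: add the density at the query point mirrored
    # on each boundary, so mass that "leaked out of the bounds" is
    # accounted for.
    return (
        gaussian_kde(x, samples, bandwidth)
        + gaussian_kde(2 * lower - x, samples, bandwidth)
        + gaussian_kde(2 * upper - x, samples, bandwidth)
    )

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=10_000)
x = np.array([0.0])
# The true density of Uniform(0, 1) is 1 everywhere; at the boundary the
# plain estimate is biased toward ~0.5 and the reflected one toward ~1.
plain = gaussian_kde(x, samples, 0.05)
corrected = reflected_kde(x, samples, 0.05, 0.0, 1.0)
```

At x = 0 the mirror image on the lower bound coincides with the query point itself, so the correction roughly doubles the plain estimate there, recovering the true density.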


github-actions bot commented Jun 29, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6991853. Link to the linter CI: here

@adrinjalali
Member

Thanks for the PR @tillahoffmann. Please ping us in a month if you still don't have a reviewer here. We should be able to allocate some resources then. In the meantime, you don't need to force push commits. We squash and merge at the end anyway. Force pushing makes it hard to see the changes over time.

@tillahoffmann
Author

Sounds like a plan; will get in touch later in the summer.

@tillahoffmann
Author

Hi @adrinjalali, thanks for the help with this PR. As discussed, I wanted to ping regarding a review.

@adrinjalali
Member

Maybe @jeremiedbb or @adam2392 would be able to have a look?

@adrinjalali
Member

Maybe @lorentzenchr could have a look here? Or @ogrisel?

@antoinebaker
Contributor

antoinebaker commented Nov 27, 2024

Thanks @tillahoffmann for the PR!

Could you tell us where this procedure is discussed or justified in Boneva et al., or elsewhere? In particular, I'm curious why we need to reflect over all faces (points, lines, up to facets) of the hypercube in the n-dimensional setting. It seems to me that Boneva et al. only deal with the 1D case. More generally, do we have a mathematical justification for the reflection, or is it just a hack?

@@ -187,6 +201,23 @@ def _choose_algorithm(self, algorithm, metric):
        )
        return algorithm

    def _evaluate_hypercube_faces(self):
Contributor

A docstring would be helpful to understand the shape and meaning of the faces array. In particular, the nan values seem to play an important role. If I understood correctly, in the unit cube, the point (1,0,1) is the face [1,0,1], the edge (0,0,0)-(0,0,1) is the face [0,0,nan], the plane (0,0,0)-(0,0,1)-(0,1,0) is the face [0, nan, nan] ?

Contributor

Also I think it would be better to define it as an external function and not a method.

def _evaluate_hypercube_faces(bounds):
    # ...
    return faces
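A self-contained sketch of how such a standalone helper might enumerate the faces, assuming the nan-encoding described above (a finite entry pins a coordinate to a bound, nan leaves it free); the function name and exact layout are illustrative, not the PR's code:

```python
import itertools
import numpy as np

def evaluate_hypercube_faces(bounds):
    """Enumerate the faces of the hypercube defined by ``bounds``.

    ``bounds`` has shape (n_features, 2) with columns (lower, upper).
    Each row of the result encodes one face: a finite entry pins that
    coordinate to a bound, while nan leaves it free, so the number of
    nans equals the dimension of the face (vertex, edge, ..., facet).
    """
    bounds = np.asarray(bounds, dtype=float)
    choices = [(lower, upper, np.nan) for lower, upper in bounds]
    faces = np.array(list(itertools.product(*choices)))
    # Drop the all-nan combination: it is the interior, not a face.
    return faces[~np.isnan(faces).all(axis=1)]
```

For the unit square this yields 3**2 - 1 = 8 rows: 4 vertices (no nans) and 4 edges (one nan each).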

Comment on lines +94 to +95
indicate that a feature is not bounded. The computation cost grows exponentially
with :code:`n_features`.
Contributor

The computation cost grows exponentially

That's true and a bit worrisome; maybe we should raise a UserWarning when n_features is above a threshold? The number of faces grows as 3**n_features, and since we store the scores for each face, I am wondering if we could run into out-of-memory errors.
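A sketch of such a guard, with a hypothetical helper name and threshold (the face count 3**n_features - 1 excludes the interior of the hypercube):

```python
import warnings

def warn_if_bounds_expensive(n_features, max_features=10):
    # Hypothetical guard: the reflection scheme scores each point against
    # every proper face of the hypercube, of which there are
    # 3**n_features - 1, so both time and memory grow exponentially.
    n_faces = 3**n_features - 1
    if n_features > max_features:
        warnings.warn(
            f"bounds with n_features={n_features} requires evaluating "
            f"{n_faces} reflected copies of each point, which may be slow "
            "or exhaust memory",
            UserWarning,
        )
    return n_faces
```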

Contributor

Is there maybe a version of the algorithm that takes the bandwidth into account? My intuition is that only samples close to the boundary need to be fixed/reflected. For compact-support kernels (tophat, cosine, etc.) that's clearly the case: only samples within a bandwidth of the boundary would contribute. For the gaussian and exponential kernels, I think the contribution is exponentially suppressed as we move away from the boundary. I am wondering if we are not doing a lot of unnecessary computations in the n-dimensional setting.

Comment on lines +432 to +440
if np.any(upper <= lower):
    raise ValueError(
        f"invalid bounds: upper ({upper}) is not larger than lower "
        f"({lower})"
    )
if np.any(X < lower):
    raise ValueError("samples must be larger than lower bound")
if np.any(X > upper):
    raise ValueError("samples must be smaller than upper bound")
Contributor

I would also raise a ValueError if X does not have n_features = X.shape[1] = bounds.shape[0].
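A hedged sketch of that extra check (the helper name and error message are illustrative, not the PR's code):

```python
import numpy as np

def check_bounds_shape(X, bounds):
    # Suggested extra validation: bounds must supply one (lower, upper)
    # pair per feature, i.e. bounds.shape[0] == X.shape[1].
    X = np.asarray(X)
    bounds = np.asarray(bounds)
    if bounds.shape[0] != X.shape[1]:
        raise ValueError(
            f"bounds has {bounds.shape[0]} rows but X has "
            f"{X.shape[1]} features; they must match"
        )
```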

Comment on lines +266 to +274
def test_kde_hypercube_faces(n_dims, expected_counts):
    # Check that the number of faces of each kind match theory from
    # https://en.wikipedia.org/wiki/Hypercube#Faces.
    bounds = np.arange(2 * n_dims).reshape((n_dims, 2))
    kde = KernelDensity(bounds=bounds)
    faces = kde._evaluate_hypercube_faces()
    n_nans = np.isnan(faces).sum(axis=1)
    actual_counts = np.bincount(n_nans)
    np.testing.assert_array_equal(actual_counts, expected_counts)
Contributor

Comments could be helpful to explain the test, in line with the docstring of _evaluate_hypercube_faces; e.g. n_nans is the dimension m of the face/m-cube, and the count of m-cubes in the n-hypercube should match the theoretical $2^{n-m}\binom{n}{m}$.
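The formula can be checked directly; this helper just illustrates the expected counts per face dimension:

```python
from math import comb

def expected_face_counts(n):
    # Number of m-dimensional faces of an n-cube: 2**(n-m) * C(n, m).
    # Index m of the returned list corresponds to the number of free
    # (nan) coordinates of a face.
    return [2 ** (n - m) * comb(n, m) for m in range(n + 1)]
```

For n = 3 this gives [8, 12, 6, 1]: 8 vertices, 12 edges, 6 square facets, and the cube itself; the counts sum to 3**n.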

Comment on lines +292 to +293
def test_kde_bounded_density():
    n_samples, n_features = (100_000, 2)
Contributor

n_samples=100_000 is quite big. How long does the test take? Could you add a test with a lower n_samples=100?

    # Without any bounds, we expect a bias to 0.25 at the corners, 0.5 at the edges, and
    # no bias in the center.
    kde = KernelDensity(bandwidth="scott").fit(X_train)
    np.testing.assert_allclose(
Contributor

Suggested change
np.testing.assert_allclose(
assert_allclose(

Equivalent function imported from sklearn.utils._testing at the beginning of test_kde.py


    # With full bounds, we don't get any bias.
    kde = KernelDensity(bandwidth="scott", bounds=[[0, 1], [0, 1]]).fit(X_train)
    np.testing.assert_allclose(
Contributor

Same here and for the following assertions.

Comment on lines +412 to +419
def _more_tags(self):
    return {
        "_xfail_checks": {
            "check_sample_weights_invariance": (
                "sample_weight must have positive values"
            ),
        }
    }
Contributor

Suggested change
def _more_tags(self):
    return {
        "_xfail_checks": {
            "check_sample_weights_invariance": (
                "sample_weight must have positive values"
            ),
        }
    }

This should fix the error in the azure pipeline test. You can also merge the main branch, because _more_tags has been removed since.
