Add reflection to reduce bias near boundary for KernelDensity (fixes #27023) #29370
Conversation
Force-pushed: 4601ce2 → 93be818, then 56e147f → f2054af, then 628ef9b → a67908a.
Thanks for the PR @tillahoffmann. Please ping us in a month if you still don't have a reviewer here. We should be able to allocate some resources then. In the meantime, you don't need to force push commits. We squash and merge at the end anyway. Force pushing makes it hard to see the changes over time.
Sounds like a plan; will get in touch later in the summer.
Hi @adrinjalali, thanks for the help with this PR. As discussed, I wanted to ping regarding a review.
Maybe @jeremiedbb or @adam2392 would be able to have a look?
Maybe @lorentzenchr could have a look here? Or @ogrisel?
Thanks @tillahoffmann for the PR! Could you tell us where this procedure is discussed or justified in Boneva et al., or elsewhere? In particular, I'm curious why we need to reflect over all faces (points, lines, up to facets) of the hypercube in the n-dimensional setting; it seems to me that Boneva et al. only deal with the 1D case. More generally, do we have a mathematical justification for the reflection, or is it just a hack?
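For reference, the textbook 1D reflection estimator on an interval $[a, b]$ (my understanding of the standard formulation, not necessarily the exact form implemented in this PR) is

$$\hat f(x) = \frac{1}{nh} \sum_{i=1}^{n} \left[ K\!\left(\frac{x - x_i}{h}\right) + K\!\left(\frac{x - (2a - x_i)}{h}\right) + K\!\left(\frac{x - (2b - x_i)}{h}\right) \right], \qquad a \le x \le b,$$

i.e. each sample is mirrored across both endpoints so the mass that would leak outside $[a, b]$ is folded back in.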
```python
@@ -187,6 +201,23 @@ def _choose_algorithm(self, algorithm, metric):
        )
        return algorithm

    def _evaluate_hypercube_faces(self):
```
A docstring would be helpful to understand the shape and meaning of the `faces` array. In particular, the nan values seem to play an important role. If I understood correctly, in the unit cube, the point (1, 0, 1) is the face `[1, 0, 1]`, the edge (0, 0, 0)–(0, 0, 1) is the face `[0, 0, nan]`, and the plane through (0, 0, 0), (0, 0, 1), and (0, 1, 0) is the face `[0, nan, nan]`?
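For instance, a sketch of the encoding I have in mind (the helper name, signature, and the choice to exclude the interior are my assumptions, not necessarily what the PR does):

```python
import itertools

import numpy as np


def hypercube_faces(bounds):
    """Enumerate the proper faces of the box given by ``bounds``, shape (n_features, 2).

    Each row encodes one face: a finite entry pins that feature to its lower
    or upper bound, while nan leaves the feature free on that face.
    """
    faces = []
    for choice in itertools.product([0, 1, None], repeat=len(bounds)):
        if all(c is None for c in choice):
            continue  # all features free corresponds to the interior, not a face
        faces.append(
            [np.nan if c is None else bounds[d][c] for d, c in enumerate(choice)]
        )
    return np.asarray(faces)


# The unit square has 3**2 - 1 = 8 proper faces: 4 corners and 4 edges.
print(hypercube_faces(np.array([[0.0, 1.0], [0.0, 1.0]])))
```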
Also I think it would be better to define it as an external function and not a method:

```python
def _evaluate_hypercube_faces(bounds):
    # ...
    return faces
```
> indicate that a feature is not bounded. The computation cost grows exponentially
> with :code:`n_features`.
> The computation cost grows exponentially

That's true and a bit worrisome; maybe we should raise a `UserWarning` when `n_features` is above a threshold? The number of faces grows as `3**n_features`, and since we store the scores for each face, I am wondering if we could run into out-of-memory errors.
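Something along these lines, perhaps (the helper name and the threshold of 15 bounded features are placeholders and would need benchmarking):

```python
import warnings

# Hypothetical threshold; 3**15 is roughly 14 million faces.
MAX_BOUNDED_FEATURES = 15


def _warn_if_many_faces(n_features):
    if n_features > MAX_BOUNDED_FEATURES:
        warnings.warn(
            f"bounds with n_features={n_features} requires evaluating on the "
            f"order of 3**{n_features} = {3**n_features} hypercube faces; "
            "this may be slow or run out of memory.",
            UserWarning,
        )
```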
Is there maybe a version of the algorithm that takes the `bandwidth` into account? My intuition is that only samples close to the boundary need to be fixed/reflected. For compact-support kernels (tophat, cosine, etc.) that's clearly the case: only samples within one `bandwidth` of the boundary would contribute. For the gaussian and exponential kernels, I think the contribution is exponentially suppressed as we move away from the boundary. I am wondering if we are not doing a lot of unnecessary computations in the n-dimensional setting.
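Roughly what I have in mind, as a sketch (the function name and the cutoff of a few bandwidths are assumptions):

```python
import numpy as np


def _near_boundary_mask(X, bounds, bandwidth, n_bandwidths=4.0):
    """Mask of samples within a few bandwidths of any face of the bounds.

    For compact-support kernels one bandwidth suffices; for gaussian or
    exponential kernels the tail beyond a few bandwidths is negligible.
    """
    lower, upper = bounds[:, 0], bounds[:, 1]
    cutoff = n_bandwidths * bandwidth
    return np.any((X - lower < cutoff) | (upper - X < cutoff), axis=1)
```

Only the samples selected by such a mask would need to be reflected.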
```python
        if np.any(upper <= lower):
            raise ValueError(
                f"invalid bounds: upper ({upper}) is not larger than lower "
                f"({lower})"
            )
        if np.any(X < lower):
            raise ValueError("samples must be larger than lower bound")
        if np.any(X > upper):
            raise ValueError("samples must be smaller than upper bound")
```
I would also raise a `ValueError` if `X` does not have `n_features = X.shape[1] = bounds.shape[0]`.
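Something like this, say (just a sketch of the check):

```python
if X.shape[1] != bounds.shape[0]:
    raise ValueError(
        f"X has {X.shape[1]} features, but bounds was defined for "
        f"{bounds.shape[0]} features"
    )
```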
```python
def test_kde_hypercube_faces(n_dims, expected_counts):
    # Check that the number of faces of each kind match theory from
    # https://en.wikipedia.org/wiki/Hypercube#Faces.
    bounds = np.arange(2 * n_dims).reshape((n_dims, 2))
    kde = KernelDensity(bounds=bounds)
    faces = kde._evaluate_hypercube_faces()
    n_nans = np.isnan(faces).sum(axis=1)
    actual_counts = np.bincount(n_nans)
    np.testing.assert_array_equal(actual_counts, expected_counts)
```
Comments could be helpful to explain the test, in line with the docstring of `_evaluate_hypercube_faces`: e.g. `n_nans` is the dimension of the face (an m-cube), and the count of m-cubes in the n-hypercube should match the theory.
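For reference, the counts the test should expect follow from the standard formula on the Wikipedia page it links to (plain Python, just to make the theory explicit):

```python
from math import comb


def expected_face_counts(n_dims):
    # m-dimensional faces of an n-cube: choose which m features are free,
    # then pick lower/upper for each of the remaining n - m features,
    # i.e. comb(n, m) * 2 ** (n - m).
    return [comb(n_dims, m) * 2 ** (n_dims - m) for m in range(n_dims + 1)]


print(expected_face_counts(2))  # [4, 4, 1]: 4 corners, 4 edges, the square itself
print(expected_face_counts(3))  # [8, 12, 6, 1]
```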
```python
def test_kde_bounded_density():
    n_samples, n_features = (100_000, 2)
```
`n_samples=100_000` is quite big. How long does the test take? Could you have a test with a lower `n_samples=100`?
```python
    # Without any bounds, we expect a bias to 0.25 at the corners, 0.5 at the
    # edges, and no bias in the center.
    kde = KernelDensity(bandwidth="scott").fit(X_train)
    np.testing.assert_allclose(
```
Suggested change:

```diff
-    np.testing.assert_allclose(
+    assert_allclose(
```

This equivalent function is imported from `sklearn.utils._testing` at the beginning of `test_kde.py`.
```python
    # With full bounds, we don't get any bias.
    kde = KernelDensity(bandwidth="scott", bounds=[[0, 1], [0, 1]]).fit(X_train)
    np.testing.assert_allclose(
```
Same here and for the following assertions.
```python
    def _more_tags(self):
        return {
            "_xfail_checks": {
                "check_sample_weights_invariance": (
                    "sample_weight must have positive values"
                ),
            }
        }
```
Suggested change:

```diff
-    def _more_tags(self):
-        return {
-            "_xfail_checks": {
-                "check_sample_weights_invariance": (
-                    "sample_weight must have positive values"
-                ),
-            }
-        }
```

This should fix the error in the Azure pipeline test. You can also merge the main branch, because `_more_tags` has been removed since.
This PR introduces a new keyword argument `bounds` for `KernelDensity` to reduce bias on a bounded domain, as discussed in #27023. Specifically, for density estimation, each point to be scored is reflected on the faces of the hypercube defined by `bounds` such that mass that may have "leaked out of the bounds" is accounted for. This method was developed in Boneva et al. (1971).

It may be beneficial to validate that the bounds satisfy `lower_bound < upper_bound`, but I wasn't able to figure out how to achieve that using `_parameter_constraints`. Input much appreciated.
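For illustration, here is a minimal 1D sketch of the reflection idea using the existing public API (done by hand here; the new `bounds` argument would perform the equivalent computation internally):

```python
import numpy as np

from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 1))  # samples on the bounded domain [0, 1]

# Reflect the samples across both boundaries and fit on the augmented data.
X_reflected = np.concatenate([X, -X, 2 - X])
kde = KernelDensity(bandwidth=0.05).fit(X_reflected)

# Rescale by 3 because the augmented sample is three times larger; the
# estimate is only valid for points inside [0, 1].
grid = np.linspace(0, 1, 5)[:, None]
density = 3 * np.exp(kde.score_samples(grid))
print(density)  # close to 1.0 everywhere, including at the boundaries
```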