Description
Describe the bug
I was using the kernel density estimator
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html
with 'silverman' or 'scott' as the bandwidth argument. I found that the bandwidth selected by these rules is independent of the actual scale of the dataset. In fact, I was shocked to find that the bandwidth calculation in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/neighbors/_kde.py for 'silverman' and 'scott' does not take the scale of the data into account at all.
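For illustration, here is my paraphrase of the 'scott'/'silverman' branches as I read the current source (not an exact copy): only the shape of X enters the computation, which is why the value below matches the reproducer output no matter how the data is scaled.

# My reading of sklearn/neighbors/_kde.py: only X.shape is used, never the values of X.
n_samples, n_features = 1000, 2  # X.shape in the reproducer below

bw_scott = n_samples ** (-1.0 / (n_features + 4))
bw_silverman = (n_samples * (n_features + 2) / 4.0) ** (-1.0 / (n_features + 4))

print(bw_scott)      # 0.31622776601683794
print(bw_silverman)  # 0.31622776601683794 (coincides with Scott when n_features == 2)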
Suppose I fit the model kde to some 2D data X and read the bandwidth as kde.bandwidth_. Next, I fit the same model to the same data X but with all elements multiplied by, say, 20, and read kde.bandwidth_ again. These two values of kde.bandwidth_ are equal (the bandwidth is calculated from the shape of X alone, see the source code). But they should obviously differ by a factor of 20 if the bandwidth were computed in a truly scale-adaptive manner.
For reference, scipy's KDE https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html computes the covariance of the data to capture its scale. I think this is the right thing to do.
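As a comparison (my own snippet, not part of the reproducer), here is a quick check of scipy's behaviour: the dimensionless Scott factor is the same for both fits, but the kernel covariance tracks the data covariance, so the effective bandwidth scales with the data.

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))

# gaussian_kde expects an array of shape (n_features, n_samples)
kde_small = gaussian_kde(X.T, bw_method="scott")
kde_large = gaussian_kde(X.T * 20, bw_method="scott")

print(kde_small.factor == kde_large.factor)  # True: the dimensionless factor is scale-free
print(np.diag(kde_large.covariance) / np.diag(kde_small.covariance))  # ~400, i.e. 20**2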
Note that if the bandwidth is incorrect, everything derived from it is incorrect too, including the estimated probability densities of samples, etc.
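As a stopgap, one can compute a scale-aware bandwidth outside the estimator and pass it as a number. The helper below is hypothetical (not part of scikit-learn): it multiplies Scott's dimensionless factor by an average per-feature standard deviation, collapsed to one scalar because KernelDensity only supports a single isotropic bandwidth.

import numpy as np
from sklearn.neighbors import KernelDensity

def scott_bandwidth(X):
    # Hypothetical helper: Scott's dimensionless factor times the mean
    # per-feature standard deviation, so the bandwidth follows the data scale.
    n_samples, n_features = X.shape
    factor = n_samples ** (-1.0 / (n_features + 4))
    return factor * X.std(axis=0).mean()

X = np.random.randn(1000, 2)
kde = KernelDensity(bandwidth=scott_bandwidth(X)).fit(X)
kde_scaled = KernelDensity(bandwidth=scott_bandwidth(X * 20)).fit(X * 20)
print(kde.bandwidth, kde_scaled.bandwidth)  # now differ by roughly a factor of 20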
Steps/Code to Reproduce
import numpy as np
from sklearn.neighbors import KernelDensity

X = np.random.randn(1000, 2)
kde = KernelDensity(bandwidth='scott')
kde.fit(X)
print(kde.bandwidth_)

kde.fit(X * 20)  # same data, scaled by 20
print(kde.bandwidth_)  # identical bandwidth despite the change of scale
Expected Results
Different bandwidths for data sets with different scales.
Actual Results
0.31622776601683794
0.31622776601683794
Versions
1.2.1