Warn if eps is not specified for DBSCAN #13541
Conversation
I think the issue is that we expect all the estimators to be instantiable (or whatever this word is supposed to be :P) w/o any given parameters. But especially in this case, I agree that a default value makes little sense. Or it could be somehow a smart value taken from the data itself, i.e. an "auto" mode.
it would be great to have a better default with some "auto" mode indeed.
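For reference, one way such an "auto" mode could be approximated is the classic k-distance heuristic. The sketch below is standard NearestNeighbors usage, not code from this PR; the 90th-percentile cutoff is an arbitrary placeholder for the "knee" of the curve:

```python
# Sketch of a possible "auto" heuristic for eps (not part of this PR):
# use the k-distance curve and take a high quantile as a starting point.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)     # distances to the min_samples nearest points
k_dist = np.sort(distances[:, -1])  # farthest of those, per sample, sorted

# Crude "knee" stand-in: the 90th percentile of the k-distances.
eps_guess = np.quantile(k_dist, 0.90)
print(f"suggested eps: {eps_guess:.3f}")
```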
All sklearn estimators require defaults for (hyper)parameters.
Thanks. I prefer your description. We can't change the default like that as it breaks backwards compatibility. I'm okay with deprecating towards making it automatic, or maybe just smaller, by default. I don't know how "auto" would work with sparse precomputed graph input.
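For context, the sparse precomputed input mentioned above looks roughly like this (standard scikit-learn usage, not code from this PR). The graph only stores distances up to the radius it was built with, so an automatic eps would only ever see those stored entries:

```python
# DBSCAN on a sparse, precomputed neighborhood graph: entries farther than
# the construction radius are simply absent, so eps must stay below it.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import radius_neighbors_graph

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Sparse distance graph: only pairs closer than 1.0 are stored.
D = radius_neighbors_graph(X, radius=1.0, mode="distance")

db = DBSCAN(eps=0.5, min_samples=5, metric="precomputed").fit(D)
print(db.labels_[:10])
```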
any pragmatic decision that leads to doc/docstring clarification works for me.
Updated: it now warns if the user does not specify eps. The description change is now a separate pull request, #13563.
In the long run, I'd suggest looking into using the computed …
Maybe we should have a way to tag those parameters whose default values make little sense (e.g., eps in DBSCAN, n_clusters in KMeans). We can raise a self-defined warning when users use the default value of these parameters.
Personally I like this PR (maybe we should define a warning, e.g., DefaultMeaninglessWarning?)
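A minimal sketch of that idea, assuming a hypothetical DefaultMeaninglessWarning category (nothing like it exists in scikit-learn today):

```python
# Hypothetical sketch only: this warning category is not part of scikit-learn.
import warnings


class DefaultMeaninglessWarning(UserWarning):
    """Raised when a parameter is left at a default that rarely makes sense."""


def warn_meaningless_default(name, default):
    # Called by an estimator when `name` was left at its placeholder default.
    warnings.warn(
        f"The default {name}={default!r} is unlikely to suit your data; "
        f"please set {name} explicitly.",
        DefaultMeaninglessWarning,
    )
```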
I think …
I don't like it and I don't know how the decision was made. I'll open an issue to discuss how to handle those parameters whose default values make little sense.
@kno10 the linter fails and there's a conflict. Otherwise LGTM!
The old default value of 0.5 is quite arbitrary. It may work well for 2D normalized toy examples, but for real data the user really should choose this parameter rather than rely on the default being "usually good"; it won't be good most of the time. Hence, warn the user if they do not specify the eps parameter.
Rebased, and fixed the line lengths to make lint happy (hopefully).
Rebased & reformatted to line length 79.
So this now raises a UserWarning unless warn is passed. This is certainly a departure from history and I'd like to hear the thoughts of others. @GaelVaroquaux what do you think of this "there is no good default" warning?
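For readers skimming the thread, one plausible reading of the approach is the sentinel-default pattern sketched below (a hypothetical MyDBSCAN stand-in, not the actual diff): default eps to a sentinel, emit a UserWarning in fit, then fall back to the historical 0.5 for backwards compatibility.

```python
# Hypothetical stand-in illustrating the sentinel-default pattern; the real
# PR changes sklearn.cluster.DBSCAN itself.
import warnings


class MyDBSCAN:
    def __init__(self, eps="warn", min_samples=5):
        self.eps = eps
        self.min_samples = min_samples

    def fit(self, X):
        eps = self.eps
        if eps == "warn":
            warnings.warn(
                "The default eps=0.5 is rarely appropriate; please choose "
                "eps based on your data (e.g. from a k-distance plot).",
                UserWarning,
            )
            eps = 0.5  # keep the historical behaviour
        # ... the actual DBSCAN clustering would run here with `eps` ...
        self.eps_ = eps
        return self
```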
The default eps=0.5 parameter makes little sense for most data sets. A good value depends on the data set and the distance function. 0.5 may work for a particular dimensionality of normalized data with Euclidean distance and a typical toy-example data set size, but it will usually be a bad choice. Even most of the examples in sklearn use different values. If you have lots of data points, for example, a smaller eps may be necessary. Hence, the user has to experiment with this parameter; the default usually won't give good results. By making it required instead of optional, we can prevent this mistake.
The description of the eps parameter is misleading. This was recently discussed on SO: https://stackoverflow.com/a/55388827/1939754 ("DBSCAN eps incorrect behaviour"). As user Anonymousse stated, the old description is wrong (eps is the distance from some core point, not a bound on pairwise distances), and in particular it is not a global diameter.
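A small self-contained illustration of the behaviour described in that answer (made-up one-dimensional data, standard DBSCAN API): points chained 0.4 apart all land in one cluster whose extent is far larger than eps, because eps only bounds the distance to some core point, not the cluster diameter.

```python
# Points on a line spaced 0.4 apart: each is within eps=0.5 of its
# neighbours, so density-reachability chains them into a single cluster
# whose overall extent (3.6) is much larger than eps.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.arange(0, 4, 0.4).reshape(-1, 1)   # 10 points: 0.0, 0.4, ..., 3.6
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)                     # one cluster: [0 0 0 0 0 0 0 0 0 0]
print(float(X.max() - X.min()))   # cluster extent ~3.6, far beyond eps
```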