Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Warn if eps is not specificed for DBSCAN #13541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

kno10
Copy link
Contributor

@kno10 kno10 commented Mar 28, 2019

  1. The default eps=0.5 parameter makes little sense for most data sets. A good value depends on the data set and distance function. 0.5 may work for a particularly dimensionality of normalized data with Euclidean and a typical toy example data set size, but usually will be a bad choice. Even most of the examples from sklearn use different values. If you have lots of data points, a smaller eps may be necessary, for example.
    Hence, the user has to experiment with this parameter, the default usually won't give good results. By making it required instead of optional, we can prevent this mistake.

  2. The description of the eps parameter is misleading. This was recently discussed on SO: https://stackoverflow.com/a/55388827/1939754 ("DBSCAN eps incorrect behaviour")
    As user Anonymousse stated, the old description is wrong (it's distance from some core point, not pairwise distances), but in particular it also is not a global diameter.

@adrinjalali
Copy link
Member

I think the issue is that we expect all the estimators to be instantiable (or whatever this word is supposed to be :P) w/o any given parameters. But specially in this case, I agree that a default value makes little sense. Or it could be somehow a smart value taken from the data itself, i.e. an auto option.

@agramfort
Copy link
Member

agramfort commented Mar 28, 2019 via email

@jnothman
Copy link
Member

jnothman commented Mar 29, 2019 via email

@agramfort
Copy link
Member

agramfort commented Mar 29, 2019 via email

@kno10 kno10 changed the title Remove bad default eps of DBSCAN, clarify parameter description Warn if eps is not specificed for DBSCAN Apr 2, 2019
@kno10
Copy link
Contributor Author

kno10 commented Apr 2, 2019

Updated:
the patch now effectively retains the default value of 0.5 for backwards compatibility.

But if the user does not specify eps, a warning will be emitted. Choosing eps=0.5 is enough to quieten the warning, if that parameter actually works for your data set.

The description change is now a separate pull request, #13563

@kno10
Copy link
Contributor Author

kno10 commented Apr 2, 2019

On the long run, a auto mode would supposedly be a good idea.

I'd suggest to look into using the computed neighborhoods for this, as the DBSCAN authors suggested to look into a knee of the (minpts-1)-nearest-neighbor plots. A simple approach could be to use a quantile, such that 1/2 of the data set become core points. This should work for surprisingly many users and data sets. But the current code uses return_distance=False, and performance characteristics may change if distances are computed (the original DBSCAN did not use NearestNeighbors, only aforementioned heuristic does).

@qinhanmin2014
Copy link
Member

Maybe we should have a way to tag those parameters whose default value make little sense (e.g., eps in DBSCAN, n_clusters in KMeans). We can raise a self-defined warning when users use the default value of these parameters.

@qinhanmin2014
Copy link
Member

personally I like this PR (maybe we should define a warning, e.g., DefaultMeaninglessWarning?)

@kno10
Copy link
Contributor Author

kno10 commented Apr 3, 2019

I think DefaultMeaninglessWarning is a good idea, because this makes it easier to find all these locations later on, and survey for example how many algorithms don't have good defaults.
Because at some point the decision that "all parameters should have a default" could be reconsidered, and then one could more easily remove all the bad default parameters introduced just to satisfy this policy. Spectral clustering has such a parameter, n_clusters=8, that shouldn't have a default.

@qinhanmin2014
Copy link
Member

Because at some point the decision that "all parameters should have a default" could be reconsidered

I don't like it and I don't know how the decision was made.

I'll open an issue to discuss how to handle those parameters whose default value make little sense.

@adrinjalali
Copy link
Member

@kno10 linter fails and there's a conflict. Otherwise LGTM!

@qinhanmin2014
Copy link
Member

@kno10 linter fails and there's a conflict. Otherwise LGTM!

We've already updated the doc. I think we need to make decision in #13570 before merging this PR.

The old default value of 0.5 is quite arbitrary.
It may work well for 2D normalized toy examples, but for real data the user really should choose this parameter, not rely on the default to be "usually good" - it won't be good most of the time.

Hence, warn the user if he does not specify the eps parameter.
@kno10
Copy link
Contributor Author

kno10 commented Apr 4, 2019

Rebased, and fixed the line lengths to make lint happy (hopefully).

@kno10
Copy link
Contributor Author

kno10 commented Apr 4, 2019

Rebased & reformatted to line length 79.

@jnothman
Copy link
Member

jnothman commented Apr 4, 2019

So this now raises a UserWarning unless warn is passed. This is certainly a departure from history and I'd like to hear the thoughts of others. @GaelVaroquaux what do you think of this "there is no good default" warning?

@amueller amueller added the Needs Decision Requires decision label Aug 6, 2019
Base automatically changed from master to main January 22, 2021 10:51
@jeremiedbb jeremiedbb added the Superseded PR has been replace by a newer PR label Feb 11, 2022
@cmarmo
Copy link
Contributor

cmarmo commented Jul 10, 2022

I am closing this pull request as superseded by #14942.
The addition of this warning is still subject to some discussion. Please check #14942 for further details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:cluster Needs Decision Requires decision Superseded PR has been replace by a newer PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.