Warn if eps is not specificed for DBSCAN #13541

kno10 · Mar 28, 2019

The default eps=0.5 parameter makes little sense for most data sets. A good value depends on the data set and distance function. 0.5 may work for a particularly dimensionality of normalized data with Euclidean and a typical toy example data set size, but usually will be a bad choice. Even most of the examples from sklearn use different values. If you have lots of data points, a smaller eps may be necessary, for example.
Hence, the user has to experiment with this parameter, the default usually won't give good results. By making it required instead of optional, we can prevent this mistake.
The description of the eps parameter is misleading. This was recently discussed on SO: https://stackoverflow.com/a/55388827/1939754 ("DBSCAN eps incorrect behaviour")
As user Anonymousse stated, the old description is wrong (it's distance from some core point, not pairwise distances), but in particular it also is not a global diameter.

adrinjalali · Mar 28, 2019

I think the issue is that we expect all the estimators to be instantiable (or whatever this word is supposed to be :P) w/o any given parameters. But specially in this case, I agree that a default value makes little sense. Or it could be somehow a smart value taken from the data itself, i.e. an auto option.

agramfort · Mar 28, 2019

it would be great to have a better default with some "auto" mode indeed. All sklearn estimators require defaults for (hyper)parameters.

…

jnothman · Mar 29, 2019

Thanks. I prefer your description. We can't change the default like that as it breaks backwards compatibility. I'm okay with deprecating towards making it automatic, or maybe just smaller, by default. I don't know how Auto would work with sparse precomputed graph input.

agramfort · Mar 29, 2019

any pragmatic decision that leads to doc/docstring clarification works for me.

kno10 · Apr 2, 2019

Updated:
the patch now effectively retains the default value of 0.5 for backwards compatibility.

But if the user does not specify eps, a warning will be emitted. Choosing eps=0.5 is enough to quieten the warning, if that parameter actually works for your data set.

The description change is now a separate pull request, #13563

kno10 · Apr 2, 2019

On the long run, a auto mode would supposedly be a good idea.

I'd suggest to look into using the computed neighborhoods for this, as the DBSCAN authors suggested to look into a knee of the (minpts-1)-nearest-neighbor plots. A simple approach could be to use a quantile, such that 1/2 of the data set become core points. This should work for surprisingly many users and data sets. But the current code uses return_distance=False, and performance characteristics may change if distances are computed (the original DBSCAN did not use NearestNeighbors, only aforementioned heuristic does).

qinhanmin2014 · Apr 2, 2019

Maybe we should have a way to tag those parameters whose default value make little sense (e.g., eps in DBSCAN, n_clusters in KMeans). We can raise a self-defined warning when users use the default value of these parameters.

qinhanmin2014 · Apr 2, 2019

personally I like this PR (maybe we should define a warning, e.g., DefaultMeaninglessWarning?)

kno10 · Apr 3, 2019

I think DefaultMeaninglessWarning is a good idea, because this makes it easier to find all these locations later on, and survey for example how many algorithms don't have good defaults.
Because at some point the decision that "all parameters should have a default" could be reconsidered, and then one could more easily remove all the bad default parameters introduced just to satisfy this policy. Spectral clustering has such a parameter, n_clusters=8, that shouldn't have a default.

qinhanmin2014 · Apr 3, 2019

Because at some point the decision that "all parameters should have a default" could be reconsidered

I don't like it and I don't know how the decision was made.

I'll open an issue to discuss how to handle those parameters whose default value make little sense.

adrinjalali · Apr 4, 2019

@kno10 linter fails and there's a conflict. Otherwise LGTM!

qinhanmin2014 · Apr 4, 2019

@kno10 linter fails and there's a conflict. Otherwise LGTM!

We've already updated the doc. I think we need to make decision in #13570 before merging this PR.

The old default value of 0.5 is quite arbitrary. It may work well for 2D normalized toy examples, but for real data the user really should choose this parameter, not rely on the default to be "usually good" - it won't be good most of the time. Hence, warn the user if he does not specify the eps parameter.

kno10 · Apr 4, 2019

Rebased, and fixed the line lengths to make lint happy (hopefully).

kno10 · Apr 4, 2019

Rebased & reformatted to line length 79.

jnothman · Apr 4, 2019

So this now raises a UserWarning unless warn is passed. This is certainly a departure from history and I'd like to hear the thoughts of others. @GaelVaroquaux what do you think of this "there is no good default" warning?

cmarmo · Jul 10, 2022

I am closing this pull request as superseded by #14942.
The addition of this warning is still subject to some discussion. Please check #14942 for further details.

kno10 force-pushed the patch-7 branch from 6cf9ea3 to 5e8d116 Compare April 2, 2019 14:12

kno10 changed the title ~~Remove bad default eps of DBSCAN, clarify parameter description~~ Warn if eps is not specificed for DBSCAN Apr 2, 2019

qinhanmin2014 mentioned this pull request Apr 3, 2019

The default value of some parameters make little sense #13570

Open

kno10 force-pushed the patch-7 branch from 5e8d116 to 419b378 Compare April 4, 2019 14:57

amueller added the Needs Decision Requires decision label Aug 6, 2019

adrinjalali mentioned this pull request Sep 10, 2019

Warning the user of bad default values, starting by dbscan.eps #14942

Closed

github-actions bot added the module:cluster label Mar 2, 2020

Base automatically changed from master to main January 22, 2021 10:51

jeremiedbb added the Superseded PR has been replace by a newer PR label Feb 11, 2022

cmarmo closed this Jul 10, 2022

kno10 mentioned this pull request Nov 26, 2024

DBSCAN too slow and consumes too much memory for large datasets: a simple tweak can fix this. #17650

Open

kno10 mentioned this pull request Feb 13, 2025

DBSCAN++: Run DBSCAN on 100x larger datasets, up to 100x faster in subsampling #30523

Open

Search code, repositories, users, issues, pull requests...

Uh oh!

Warn if eps is not specificed for DBSCAN #13541

Warn if eps is not specificed for DBSCAN #13541

Uh oh!

Conversation

kno10 commented Mar 28, 2019

Uh oh!

adrinjalali commented Mar 28, 2019

Uh oh!

agramfort commented Mar 28, 2019 via email

Uh oh!

jnothman commented Mar 29, 2019 via email

Uh oh!

agramfort commented Mar 29, 2019 via email

Uh oh!

kno10 commented Apr 2, 2019

Uh oh!

kno10 commented Apr 2, 2019

Uh oh!

qinhanmin2014 commented Apr 2, 2019

Uh oh!

qinhanmin2014 commented Apr 2, 2019

Uh oh!

kno10 commented Apr 3, 2019

Uh oh!

qinhanmin2014 commented Apr 3, 2019

Uh oh!

adrinjalali commented Apr 4, 2019

Uh oh!

qinhanmin2014 commented Apr 4, 2019

Uh oh!

kno10 commented Apr 4, 2019

Uh oh!

kno10 commented Apr 4, 2019

Uh oh!

jnothman commented Apr 4, 2019

Uh oh!

cmarmo commented Jul 10, 2022

Uh oh!

Uh oh!