Commit 51b1852

DOC Clarify eps parameter importance in DBSCAN (#13563)
kno10 authored and jnothman committed
1 parent 8042d74 commit 51b1852

2 files changed: 35 additions, 5 deletions
doc/modules/clustering.rst

17 additions & 1 deletion
@@ -752,6 +752,18 @@ Any core sample is part of a cluster, by definition. Any sample that is not a
 core sample, and is at least ``eps`` in distance from any core sample, is
 considered an outlier by the algorithm.
 
+While the parameter ``min_samples`` primarily controls how tolerant the
+algorithm is towards noise (on noisy and large data sets it may be desirable
+to increase this parameter), the parameter ``eps`` is *crucial to choose
+appropriately* for the data set and distance function and usually cannot be
+left at the default value. It controls the local neighborhood of the points.
+When chosen too small, most data will not be clustered at all (and labeled
+as ``-1`` for "noise"). When chosen too large, it causes close clusters to
+be merged into one cluster, and eventually the entire data set to be returned
+as a single cluster. Some heuristics for choosing this parameter have been
+discussed in literature, for example based on a knee in the nearest neighbor
+distances plot (as discussed in the references below).
+
 In the figure below, the color indicates cluster membership, with large circles
 indicating core samples found by the algorithm. Smaller circles are non-core
 samples that are still part of a cluster. Moreover, the outliers are indicated
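Reviewer note: the new paragraph's claims about ``eps`` being too small or too large can be checked with a small sketch. The data below is synthetic (two 6x6 grids of points, an assumption made purely for illustration), and the three ``eps`` values are picked by hand for this layout rather than by any recommended procedure.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two 6x6 grids with 0.2 spacing, shifted 5 units apart (illustrative data only).
axis = np.linspace(0.0, 1.0, 6)
grid = np.array([(x, y) for x in axis for y in axis])
X = np.vstack([grid, grid + [5.0, 0.0]])


def summarize(eps):
    """Return (number of clusters, number of noise points) for a given eps."""
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels) - {-1})
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


too_small = summarize(0.01)   # smaller than the grid spacing: no point has neighbors
reasonable = summarize(0.3)   # covers adjacent and diagonal grid neighbors
too_large = summarize(10.0)   # larger than the whole data set's diameter

print(too_small, reasonable, too_large)
```

With the smallest value every point is labeled ``-1``, with the middle value the two grids come out as two clusters, and with the largest value everything collapses into a single cluster, which matches the behavior the added paragraph describes.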
@@ -793,7 +805,7 @@ by black points below.
 
 This implementation is by default not memory efficient because it constructs
 a full pairwise similarity matrix in the case where kd-trees or ball-trees cannot
-be used (e.g. with sparse matrices). This matrix will consume n^2 floats.
+be used (e.g., with sparse matrices). This matrix will consume n^2 floats.
 A couple of mechanisms for getting around this are:
 
 - A sparse radius neighborhood graph (where missing entries are presumed to
@@ -814,6 +826,10 @@ by black points below.
 In Proceedings of the 2nd International Conference on Knowledge Discovery
 and Data Mining, Portland, OR, AAAI Press, pp. 226–231. 1996
 
+* "DBSCAN revisited, revisited: why and how you should (still) use DBSCAN."
+Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
+In ACM Transactions on Database Systems (TODS), 42(3), 19.
+
 .. _birch:
 
 Birch
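Reviewer note on the memory-consumption hunk in this file: the sparse radius neighborhood graph it mentions can be sketched as follows. This assumes ``sklearn.neighbors.radius_neighbors_graph`` is used to build the graph and that ``DBSCAN`` is then run with ``metric="precomputed"``; the data and the ``eps`` value are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph

# Two Gaussian blobs (synthetic data, for illustration only).
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(100, 2)),
])
eps = 0.5

# Sparse matrix that stores only the pairwise distances that are <= eps.
graph = radius_neighbors_graph(X, radius=eps, mode="distance")

# Cluster on the precomputed sparse graph instead of the raw features.
labels_sparse = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit(graph).labels_

# Reference run on the raw features, for comparison.
labels_dense = DBSCAN(eps=eps, min_samples=5).fit(X).labels_

print("stored distances:", graph.nnz, "of", X.shape[0] ** 2)
print("same labels:", np.array_equal(labels_sparse, labels_dense))
```

The graph holds far fewer entries than the full n^2 matrix whenever ``eps`` is small relative to the spread of the data, which is the point of the workaround.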

sklearn/cluster/dbscan_.py

18 additions & 4 deletions
@@ -35,8 +35,11 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None,
         ``metric='precomputed'``.
 
     eps : float, optional
-        The maximum distance between two samples for them to be considered
-        as in the same neighborhood.
+        The maximum distance between two samples for one to be considered
+        as in the neighborhood of the other. This is not a maximum bound
+        on the distances of points within a cluster. This is the most
+        important DBSCAN parameter to choose appropriately for your data set
+        and distance function.
 
     min_samples : int, optional
         The number of samples (or total weight) in a neighborhood for a point
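Reviewer note: the new docstring sentence "This is not a maximum bound on the distances of points within a cluster" is easy to demonstrate. The sketch below uses a hypothetical chain of evenly spaced points; each point only reaches its direct neighbors, yet the whole chain ends up in one cluster.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# 20 points on a line, spaced 1.0 apart: 0, 1, 2, ..., 19 (illustrative data).
X = np.arange(20, dtype=float).reshape(-1, 1)

eps = 1.5  # each point reaches only its immediate left and right neighbors
labels = DBSCAN(eps=eps, min_samples=2).fit(X).labels_

# Distance between the two most distant points that share cluster 0.
cluster_points = X[labels == 0].ravel()
span = cluster_points.max() - cluster_points.min()

print("labels:", sorted(set(labels)))
print("eps:", eps, "cluster span:", span)
```

The cluster spans 19 units even though ``eps`` is 1.5, because membership propagates from one core sample to the next.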
@@ -128,6 +131,10 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None,
     Algorithm for Discovering Clusters in Large Spatial Databases with Noise".
     In: Proceedings of the 2nd International Conference on Knowledge Discovery
     and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996
+
+    Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
+    DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.
+    ACM Transactions on Database Systems (TODS), 42(3), 19.
     """
     if not eps > 0.0:
         raise ValueError("eps must be positive.")
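Reviewer note: the clustering.rst change in this commit points to a "knee in the nearest neighbor distances plot" as a heuristic for ``eps``. A minimal sketch of computing that curve follows, assuming ``sklearn.neighbors.NearestNeighbors`` and synthetic data. Tying the neighbor count to ``min_samples`` is one common convention and is an assumption here; locating the knee itself is left to visual inspection.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# One dense blob plus scattered background points (synthetic, illustrative).
rng = np.random.RandomState(0)
blob = rng.normal(loc=0.0, scale=0.2, size=(200, 2))
background = rng.uniform(low=-5.0, high=5.0, size=(20, 2))
X = np.vstack([blob, background])

min_samples = 5

# Querying with the training data returns each point as its own nearest
# neighbor (distance 0), so the last column is the distance to the
# (min_samples - 1)-th other point.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)

# Sorted k-distance curve: plot this and look for the knee by eye.
k_distances = np.sort(distances[:, -1])

print("median k-distance:", np.median(k_distances))
print("largest k-distance:", k_distances[-1])
```

Plotting ``k_distances`` against its index should show a long flat stretch for the blob followed by a sharp rise for the background points; a candidate ``eps`` sits near where the rise begins.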
@@ -195,8 +202,11 @@ class DBSCAN(BaseEstimator, ClusterMixin):
     Parameters
     ----------
     eps : float, optional
-        The maximum distance between two samples for them to be considered
-        as in the same neighborhood.
+        The maximum distance between two samples for one to be considered
+        as in the neighborhood of the other. This is not a maximum bound
+        on the distances of points within a cluster. This is the most
+        important DBSCAN parameter to choose appropriately for your data set
+        and distance function.
 
     min_samples : int, optional
         The number of samples (or total weight) in a neighborhood for a point
@@ -300,6 +310,10 @@ class DBSCAN(BaseEstimator, ClusterMixin):
     Algorithm for Discovering Clusters in Large Spatial Databases with Noise".
     In: Proceedings of the 2nd International Conference on Knowledge Discovery
     and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996
+
+    Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017).
+    DBSCAN revisited, revisited: why and how you should (still) use DBSCAN.
+    ACM Transactions on Database Systems (TODS), 42(3), 19.
     """
 
     def __init__(self, eps=0.5, min_samples=5, metric='euclidean',
