Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Commit 889966d

Browse filesBrowse files
x006amueller
authored andcommitted
Updates for DBSCAN clsutering docs
Please enter the commit message for your changes. Lines starting
1 parent 38868a6 commit 889966d
Copy full SHA for 889966d

File tree

Expand file treeCollapse file tree

1 file changed

+45
-11
lines changed
Filter options
Expand file treeCollapse file tree

1 file changed

+45
-11
lines changed

‎doc/modules/clustering.rst

Copy file name to clipboardExpand all lines: doc/modules/clustering.rst
+45-11Lines changed: 45 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -428,17 +428,51 @@ enable only merging of neighboring pixels on an image, as in the
428428
DBSCAN
429429
======
430430

431-
The :class:`DBSCAN` algorithm clusters data by finding core points which have
432-
many neighbours within a given radius. After a core point is found, the cluster
433-
is expanded by adding its neighbours to the current cluster and recursively
434-
checking if any are core points. Formally, a point is considered a core point
435-
if it has more than min_points points which are of a similarity greater than
436-
the given threshold eps. This is shown in the figure below, where the color
437-
indicates cluster membership and large circles indicate core points found by
438-
the algorithm. Moreover, the algorithm can detect outliers, indicated by black
439-
points below. The outliers are defined as points which do not belong to
440-
any current cluster and do not have enough close neighbours to start a
441-
new cluster.
431+
The :class:`DBSCAN` algorithm views clusters as areas of high density
432+
separated by areas of low density. Due to this rather generic view, clusters
433+
found by DBSCAN can be any shape, as opposed to k-means which assumes that
434+
clusters are convex shaped. The central component to the DBSCAN is the concept
435+
of *core samples*, which are samples that are in areas of high density. A
436+
cluster is therefore a set of core samples, each highly similar to each other
437+
and a set of non-core samples that are similar to a core sample (but are not
438+
themselves core samples). There are two parameters to the algorithm,
439+
`min_points` and `eps`, which define formally what we mean when we say *dense*.
440+
A higher `min_points` or lower `eps` indicate higher density necessary to form
441+
a cluster.
442+
443+
More formally, we define a core sample as being a sample in the dataset such
444+
that there exists `min_samples` other samples with a similarity higher than
445+
`eps` to it, which are defined as *neighbors* of the core sample. This tells
446+
us that the core sample is in a dense area of the vector space. A cluster
447+
is a set of core samples, that can be built by recursively by taking a core
448+
sample, finding all of its neighbors that are core samples, finding all of
449+
*their* neighbors that are core samples, and so on. A cluster also has a
450+
set of non-core samples, which are samples that are neighbors of a core sample
451+
in the cluster but are not themselves core samples. Intuitively, these samples
452+
are on the fringes of a cluster.
453+
454+
Any core sample is part of a cluster, by definition. Further, any cluster must
455+
have at least `min_samples` points in it, following the definition of a core
456+
sample. For any sample that is not a core sample, and does not have a
457+
similarity higher than `eps` to a core sample, it is considered an outlier by
458+
the algorithm.
459+
460+
The algorithm is non-deterministic, however the core samples themselves will
461+
always belong to the same clusters (although the labels themselves may be
462+
different). The non-determinism comes from deciding on which cluster a
463+
non-core sample belongs to. A non-core sample can be have a similarity higher
464+
than `eps` to two core samples in different classes. Following from the
465+
triangular inequality, those two core samples would be less similar than
466+
`eps` from each other -- else they would be in the same class. The non-core
467+
sample is simply assigned to which ever cluster is generated first, where
468+
the order is determined randomly within the code. Other than *that*, the
469+
algorithm is deterministic, making the results relatively stable between
470+
iterations on the same data.
471+
472+
In the figure below, the color indicates cluster membership, with large circles
473+
indicating core samples found by the algorithm. Smaller circles are non-core
474+
samples that are still part of a cluster. Moreover, the outliers are indicated
475+
by black points below.
442476

443477
.. |dbscan_results| image:: ../auto_examples/cluster/images/plot_dbscan_1.png
444478
:target: ../auto_examples/cluster/plot_dbscan.html

0 commit comments

Comments
0 (0)
Morty Proxy This is a proxified and sanitized view of the page, visit original site.