@@ -428,17 +428,51 @@ enable only merging of neighboring pixels on an image, as in the
428
428
DBSCAN
429
429
======
430
430
431
- The :class: `DBSCAN ` algorithm clusters data by finding core points which have
432
- many neighbours within a given radius. After a core point is found, the cluster
433
- is expanded by adding its neighbours to the current cluster and recursively
434
- checking if any are core points. Formally, a point is considered a core point
435
- if it has more than min_points points which are of a similarity greater than
436
- the given threshold eps. This is shown in the figure below, where the color
437
- indicates cluster membership and large circles indicate core points found by
438
- the algorithm. Moreover, the algorithm can detect outliers, indicated by black
439
- points below. The outliers are defined as points which do not belong to
440
- any current cluster and do not have enough close neighbours to start a
441
- new cluster.
431
+ The :class: `DBSCAN ` algorithm views clusters as areas of high density
432
+ separated by areas of low density. Due to this rather generic view, clusters
433
+ found by DBSCAN can be any shape, as opposed to k-means which assumes that
434
+ clusters are convex shaped. The central component to the DBSCAN is the concept
435
+ of *core samples *, which are samples that are in areas of high density. A
436
+ cluster is therefore a set of core samples, each highly similar to each other
437
+ and a set of non-core samples that are similar to a core sample (but are not
438
+ themselves core samples). There are two parameters to the algorithm,
439
+ `min_points ` and `eps `, which define formally what we mean when we say *dense *.
440
+ A higher `min_points ` or lower `eps ` indicate higher density necessary to form
441
+ a cluster.
442
+
443
+ More formally, we define a core sample as being a sample in the dataset such
444
+ that there exists `min_samples ` other samples with a similarity higher than
445
+ `eps ` to it, which are defined as *neighbors * of the core sample. This tells
446
+ us that the core sample is in a dense area of the vector space. A cluster
447
+ is a set of core samples, that can be built by recursively by taking a core
448
+ sample, finding all of its neighbors that are core samples, finding all of
449
+ *their * neighbors that are core samples, and so on. A cluster also has a
450
+ set of non-core samples, which are samples that are neighbors of a core sample
451
+ in the cluster but are not themselves core samples. Intuitively, these samples
452
+ are on the fringes of a cluster.
453
+
454
+ Any core sample is part of a cluster, by definition. Further, any cluster must
455
+ have at least `min_samples ` points in it, following the definition of a core
456
+ sample. For any sample that is not a core sample, and does not have a
457
+ similarity higher than `eps ` to a core sample, it is considered an outlier by
458
+ the algorithm.
459
+
460
+ The algorithm is non-deterministic, however the core samples themselves will
461
+ always belong to the same clusters (although the labels themselves may be
462
+ different). The non-determinism comes from deciding on which cluster a
463
+ non-core sample belongs to. A non-core sample can be have a similarity higher
464
+ than `eps ` to two core samples in different classes. Following from the
465
+ triangular inequality, those two core samples would be less similar than
466
+ `eps ` from each other -- else they would be in the same class. The non-core
467
+ sample is simply assigned to which ever cluster is generated first, where
468
+ the order is determined randomly within the code. Other than *that *, the
469
+ algorithm is deterministic, making the results relatively stable between
470
+ iterations on the same data.
471
+
472
+ In the figure below, the color indicates cluster membership, with large circles
473
+ indicating core samples found by the algorithm. Smaller circles are non-core
474
+ samples that are still part of a cluster. Moreover, the outliers are indicated
475
+ by black points below.
442
476
443
477
.. |dbscan_results | image :: ../auto_examples/cluster/images/plot_dbscan_1.png
444
478
:target: ../auto_examples/cluster/plot_dbscan.html
0 commit comments