Commit 233c7dc

enhanced (hopefully) DBScan documentation; killed some whitespace along the way...
1 parent c012d4f

1 file changed: +28 -27 lines changed

doc/modules/clustering.rst

@@ -55,20 +55,20 @@ Overview of clustering methods
      - number of clusters
      - Very large `n_samples`, medium `n_clusters` with
        :ref:`MiniBatch code <mini_batch_kmeans>`
-     - General-purpose, even cluster size, flat geometry, not too many clusters
-     - Distances between points
+     - General-purpose, even cluster size, flat geometry, not too many clusters
+     - Distances between points

    * - :ref:`Affinity propagation <affinity_propagation>`
-     - damping, sample preference
+     - damping, sample preference
      - Not scalable with n_samples
      - Many clusters, uneven cluster size, non-flat geometry
      - Graph distance (e.g. nearest-neighbor graph)

    * - :ref:`Mean-shift <mean_shift>`
-     - bandwidth
+     - bandwidth
      - Not scalable with n_samples
      - Many clusters, uneven cluster size, non-flat geometry
-     - Distances between points
+     - Distances between points

    * - :ref:`Spectral clustering <spectral_clustering>`
      - number of clusters
@@ -80,13 +80,13 @@ Overview of clustering methods
      - number of clusters
      - Large `n_samples` and `n_clusters`
      - Many clusters, possibly connectivity constraints
-     - Distances between points
+     - Distances between points

    * - :ref:`DBSCAN <dbscan>`
      - neighborhood size
      - Very large `n_samples`, medium `n_clusters`
      - Non-flat geometry, uneven cluster sizes
-     - Distances between nearest points
+     - Distances between nearest points

    * - :ref:`Gaussian mixtures <mixture>`
      - many
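
As a minimal sketch of how the parameters named in this table map onto the corresponding estimator constructors (the blob data and parameter values are illustrative assumptions, not part of this documentation)::

    from sklearn.cluster import KMeans, AffinityPropagation, MeanShift, DBSCAN
    from sklearn.datasets import make_blobs

    # Illustrative data: three well-separated Gaussian blobs.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Each constructor argument matches a "parameters" entry in the table.
    km = KMeans(n_clusters=3).fit(X)               # number of clusters
    ap = AffinityPropagation(damping=0.9).fit(X)   # damping, sample preference
    ms = MeanShift(bandwidth=2.0).fit(X)           # bandwidth
    db = DBSCAN(eps=0.5, min_samples=5).fit(X)     # neighborhood size
    print(km.labels_[:10], db.labels_[:10])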
@@ -116,7 +116,7 @@ be specified. It scales well to large number of samples and has been used
 across a large range of application areas in many different fields. It is
 also equivalent to the expectation-maximization algorithm when setting the
 covariance matrix to be diagonal, equal and small. The K-means algorithm
-aims to choose centroids :math:`C` that minimise the within cluster sum of
+aims to choose centroids :math:`C` that minimise the within-cluster sum of
 squares objective function on a dataset :math:`X` of :math:`n` samples:

 .. math:: J(X, C) = \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)
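
The objective can be evaluated directly; a short sketch (ours; `kmeans_objective` is a hypothetical helper for illustration, not a scikit-learn function)::

    import numpy as np

    def kmeans_objective(X, centroids):
        # J(X, C): squared distance from each sample to its nearest
        # centroid, summed over all n samples.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return sq_dists.min(axis=1).sum()

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1]])
    C = np.array([[0.0, 0.1], [5.0, 5.0]])
    print(kmeans_objective(X, C))  # small, since every point is near a centroid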
@@ -156,8 +156,8 @@ centroids to be (generally) distant from each other, leading to provably better
 results than random initialisation.

 A parameter can be given to allow K-means to be run in parallel, called
-`n_jobs`. Giving this parameter a positive value uses that many processors
-(default=1). A value of -1 uses all processors, with -2 using one less, and so
+`n_jobs`. Giving this parameter a positive value uses that many processors
+(default=1). A value of -1 uses all processors, with -2 using one less, and so
 on. Parallelization generally speeds up computation at the cost of memory (in
 this case, multiple copies of centroids need to be stored, one for each job).

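A minimal sketch of the option described above, assuming a scikit-learn version in which `KMeans` still accepts `n_jobs` as documented here (the parameter was removed from `KMeans` in later releases)::

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10000, centers=8, random_state=0)

    # n_jobs=-1 uses all processors; each job stores its own copy of the
    # centroids, trading memory for speed as noted above.
    km = KMeans(n_clusters=8, n_init=10, n_jobs=-1).fit(X)
    print(km.inertia_)
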
@@ -500,16 +500,17 @@ separated by areas of low density. Due to this rather generic view, clusters
 found by DBSCAN can be any shape, as opposed to k-means which assumes that
 clusters are convex shaped. The central component to the DBSCAN is the concept
 of *core samples*, which are samples that are in areas of high density. A
-cluster is therefore a set of core samples, each highly similar to each other
-and a set of non-core samples that are similar to a core sample (but are not
+cluster is therefore a set of core samples, each close to the others
+(as measured by some distance measure),
+and a set of non-core samples that are close to a core sample (but are not
 themselves core samples). There are two parameters to the algorithm,
-`min_points` and `eps`, which define formally what we mean when we say *dense*.
-A higher `min_points` or lower `eps` indicate higher density necessary to form
+`min_samples` and `eps`, which formally define what we mean when we say *dense*.
+A higher `min_samples` or lower `eps` means a higher density is necessary to form
 a cluster.

 More formally, we define a core sample as being a sample in the dataset such
-that there exists `min_samples` other samples with a similarity higher than
-`eps` to it, which are defined as *neighbors* of the core sample. This tells
+that there exist `min_samples` other samples within a distance of
+`eps`, which are defined as *neighbors* of the core sample. This tells
 us that the core sample is in a dense area of the vector space. A cluster
 is a set of core samples that can be built by recursively taking a core
 sample, finding all of its neighbors that are core samples, finding all of
@@ -520,24 +521,24 @@ are on the fringes of a cluster.

 Any core sample is part of a cluster, by definition. Further, any cluster has
 at least `min_samples` points in it, following the definition of a core
-sample. For any sample that is not a core sample, and does not have a
-similarity higher than `eps` to a core sample, it is considered an outlier by
+sample. Any sample that is not a core sample, and is more than `eps` away
+from every core sample, is considered an outlier by
 the algorithm.

 The algorithm is non-deterministic; however, the core samples themselves will
 always belong to the same clusters (although the labels themselves may be
 different). The non-determinism comes from deciding on which cluster a
-non-core sample belongs to. A non-core sample can be have a similarity higher
+non-core sample belongs to. A non-core sample can have a distance lower
 than `eps` to two core samples in different classes. Following from the
-triangular inequality, those two core samples would be less similar than
+triangle inequality, those two core samples would be more distant than
 `eps` from each other -- else they would be in the same class. The non-core
 sample is simply assigned to whichever cluster is generated first, where
 the order is determined randomly within the code. Other than the ordering of
 the dataset, the algorithm is deterministic, making the results relatively
 stable between iterations on the same data.

 In the figure below, the color indicates cluster membership, with large circles
-indicating core samples found by the algorithm. Smaller circles are non-core
+indicating core samples found by the algorithm. Smaller circles are non-core
 samples that are still part of a cluster. Moreover, the outliers are indicated
 by black points below.

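To make `eps` and `min_samples` concrete, a short sketch (the two-moons data are an illustrative assumption, not the figure's dataset)::

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaved half-circles: non-convex clusters that k-means would
    # split incorrectly, but that DBSCAN can recover from density alone.
    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

    # A higher min_samples or a lower eps demands higher density.
    db = DBSCAN(eps=0.3, min_samples=5).fit(X)

    print(db.core_sample_indices_[:5])  # indices of the core samples
    print(np.unique(db.labels_))        # cluster labels; -1 marks outliers
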
@@ -819,7 +820,7 @@ Drawbacks

 * :ref:`example_cluster_plot_adjusted_for_chance_measures.py`: Analysis of
   the impact of the dataset size on the value of clustering measures
-  for random assignments. This example also includes the Adjusted Rand
+  for random assignments. This example also includes the Adjusted Rand
   Index.


@@ -864,7 +865,7 @@ following equation, from Vinh, Epps, and Bailey, (2009). In this equation,
           \frac{a_i!b_j!(N-a_i)!(N-b_j)!}{N!n_{ij}!(a_i-n_{ij})!(b_j-n_{ij})!
           (N-a_i-b_j+n_{ij})!}

-Using the expected value, the adjusted mutual information can then be
+Using the expected value, the adjusted mutual information can then be
 calculated using a similar form to that of the adjusted Rand index:

 .. math:: \text{AMI} = \frac{\text{MI} - E[\text{MI}]}{\max(H(U), H(V)) - E[\text{MI}]}
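
Both adjusted scores are available in `sklearn.metrics`; a brief usage sketch with made-up labelings::

    from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

    labels_true = [0, 0, 0, 1, 1, 1]
    labels_pred = [0, 0, 1, 1, 2, 2]

    # Adjusted for chance: random labelings score near 0.0 and a perfect
    # match (up to a permutation of label names) scores 1.0.
    print(adjusted_rand_score(labels_true, labels_pred))
    print(adjusted_mutual_info_score(labels_true, labels_pred))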
@@ -875,7 +876,7 @@ calculated using a similar form to that of the adjusted Rand index:
   knowledge reuse framework for combining multiple partitions". Journal of
   Machine Learning Research 3: 583–617. doi:10.1162/153244303321897735

- * Vinh, Epps, and Bailey, (2009). "Information theoretic measures
+ * Vinh, Epps, and Bailey, (2009). "Information theoretic measures
   for clusterings comparison". Proceedings of the 26th Annual International
   Conference on Machine Learning - ICML '09.
   doi:10.1145/1553374.1553511. ISBN 9781605585161.
@@ -1045,7 +1046,7 @@ mean of homogeneity and completeness**:

 .. [B2011] `Identification and Characterization of Events in Social Media
    <http://www.cs.columbia.edu/~hila/hila-thesis-distributed.pdf>`_, Hila
-   Becker, PhD Thesis.
+   Becker, PhD Thesis.

 .. _silhouette_coefficient:

@@ -1073,7 +1074,7 @@ The Silhouette Coefficient *s* for a single sample is then given as:

 .. math:: s = \frac{b - a}{\max(a, b)}

-The Silhouette Coefficient for a set of samples is given as the mean of the
+The Silhouette Coefficient for a set of samples is given as the mean of the
 Silhouette Coefficient for each sample.

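A quick arithmetic check of the formula, with illustrative values of our choosing for one sample (`a` is its mean intra-cluster distance, `b` its mean distance to the nearest other cluster)::

    import numpy as np

    x = np.array([0.0, 0.0])                    # the sample itself
    own = np.array([[0.0, 1.0], [1.0, 0.0]])    # rest of its own cluster
    other = np.array([[4.0, 0.0], [5.0, 0.0]])  # nearest other cluster

    a = np.linalg.norm(own - x, axis=1).mean()
    b = np.linalg.norm(other - x, axis=1).mean()
    s = (b - a) / max(a, b)
    print(s)  # approaches 1 as the sample sits far from the other cluster
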
@@ -1091,7 +1092,7 @@ cluster analysis.

 >>> from sklearn.cluster import KMeans
 >>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
 >>> labels = kmeans_model.labels_
->>> metrics.silhouette_score(X, labels, metric='euclidean')
+>>> metrics.silhouette_score(X, labels, metric='euclidean')
 ... # doctest: +ELLIPSIS
 0.55...

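The doctest above relies on context outside this hunk that defines `X` and imports `metrics`; a self-contained variant, with synthetic blobs standing in for the dataset prepared earlier in the original docs::

    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic stand-in for the X defined earlier in the documentation.
    X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

    kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
    labels = kmeans_model.labels_
    print(metrics.silhouette_score(X, labels, metric='euclidean'))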