Commit 233c7dc

enhanced (hopefully) DBScan documentation; killed some whitespace along the way...
1 parent c012d4f

1 file changed: +28 -27 lines changed

doc/modules/clustering.rst

@@ -55,20 +55,20 @@ Overview of clustering methods
      - number of clusters
      - Very large `n_samples`, medium `n_clusters` with
        :ref:`MiniBatch code <mini_batch_kmeans>`
-     - General-purpose, even cluster size, flat geometry, not too many clusters
-     - Distances between points
+     - General-purpose, even cluster size, flat geometry, not too many clusters
+     - Distances between points

    * - :ref:`Affinity propagation <affinity_propagation>`
-     - damping, sample preference
+     - damping, sample preference
      - Not scalable with n_samples
      - Many clusters, uneven cluster size, non-flat geometry
      - Graph distance (e.g. nearest-neighbor graph)

    * - :ref:`Mean-shift <mean_shift>`
-     - bandwidth
+     - bandwidth
      - Not scalable with n_samples
      - Many clusters, uneven cluster size, non-flat geometry
-     - Distances between points
+     - Distances between points

    * - :ref:`Spectral clustering <spectral_clustering>`
      - number of clusters
@@ -80,13 +80,13 @@ Overview of clustering methods
      - number of clusters
      - Large `n_samples` and `n_clusters`
      - Many clusters, possibly connectivity constraints
-     - Distances between points
+     - Distances between points

    * - :ref:`DBSCAN <dbscan>`
      - neighborhood size
      - Very large `n_samples`, medium `n_clusters`
      - Non-flat geometry, uneven cluster sizes
-     - Distances between nearest points
+     - Distances between nearest points

    * - :ref:`Gaussian mixtures <mixture>`
      - many
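
As a minimal sketch of how the parameters named in this table map onto the corresponding estimator constructors (the blob data and parameter values are illustrative assumptions, not part of this documentation)::

    from sklearn.cluster import KMeans, AffinityPropagation, MeanShift, DBSCAN
    from sklearn.datasets import make_blobs

    # Illustrative data: three well-separated Gaussian blobs.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    # Each constructor argument matches a "parameters" entry in the table.
    km = KMeans(n_clusters=3).fit(X)               # number of clusters
    ap = AffinityPropagation(damping=0.9).fit(X)   # damping, sample preference
    ms = MeanShift(bandwidth=2.0).fit(X)           # bandwidth
    db = DBSCAN(eps=0.5, min_samples=5).fit(X)     # neighborhood size
    print(km.labels_[:10], db.labels_[:10])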
@@ -116,7 +116,7 @@ be specified. It scales well to large number of samples and has been used
 across a large range of application areas in many different fields. It is
 also equivalent to the expectation-maximization algorithm when setting the
 covariance matrix to be diagonal, equal and small. The K-means algorithm
-aims to choose centroids :math:`C` that minimise the within cluster sum of
+aims to choose centroids :math:`C` that minimise the within-cluster sum of
 squares objective function on a dataset :math:`X` of :math:`n` samples:

 .. math:: J(X, C) = \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2)
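
The objective can be evaluated directly; a short sketch (ours; `kmeans_objective` is a hypothetical helper for illustration, not a scikit-learn function)::

    import numpy as np

    def kmeans_objective(X, centroids):
        # J(X, C): squared distance from each sample to its nearest
        # centroid, summed over all n samples.
        sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return sq_dists.min(axis=1).sum()

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1]])
    C = np.array([[0.0, 0.1], [5.0, 5.0]])
    print(kmeans_objective(X, C))  # small, since every point is near a centroid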
@@ -156,8 +156,8 @@ centroids to be (generally) distant from each other, leading to provably better
 results than random initialisation.

 A parameter can be given to allow K-means to be run in parallel, called
-`n_jobs`. Giving this parameter a positive value uses that many processors
-(default=1). A value of -1 uses all processors, with -2 using one less, and so
+`n_jobs`. Giving this parameter a positive value uses that many processors
+(default=1). A value of -1 uses all processors, with -2 using one less, and so
 on. Parallelization generally speeds up computation at the cost of memory (in
 this case, multiple copies of centroids need to be stored, one for each job).

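A minimal sketch of the option described above, assuming a scikit-learn version in which `KMeans` still accepts `n_jobs` as documented here (the parameter was removed from `KMeans` in later releases)::

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10000, centers=8, random_state=0)

    # n_jobs=-1 uses all processors; each job stores its own copy of the
    # centroids, trading memory for speed as noted above.
    km = KMeans(n_clusters=8, n_init=10, n_jobs=-1).fit(X)
    print(km.inertia_)
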
@@ -500,16 +500,17 @@ separated by areas of low density. Due to this rather generic view, clusters
 found by DBSCAN can be any shape, as opposed to k-means which assumes that
 clusters are convex shaped. The central component to the DBSCAN is the concept
 of *core samples*, which are samples that are in areas of high density. A
-cluster is therefore a set of core samples, each highly similar to each other
-and a set of non-core samples that are similar to a core sample (but are not
+cluster is therefore a set of core samples, each close to the others
+(as measured by some distance measure),
+and a set of non-core samples that are close to a core sample (but are not
 themselves core samples). There are two parameters to the algorithm,
-`min_points` and `eps`, which define formally what we mean when we say *dense*.
-A higher `min_points` or lower `eps` indicate higher density necessary to form
+`min_samples` and `eps`, which formally define what we mean when we say *dense*.
+A higher `min_samples` or lower `eps` means a higher density is necessary to form
 a cluster.

 More formally, we define a core sample as being a sample in the dataset such
-that there exists `min_samples` other samples with a similarity higher than
-`eps` to it, which are defined as *neighbors* of the core sample. This tells
+that there exist `min_samples` other samples within a distance of
+`eps`, which are defined as *neighbors* of the core sample. This tells
 us that the core sample is in a dense area of the vector space. A cluster
 is a set of core samples that can be built by recursively taking a core
 sample, finding all of its neighbors that are core samples, finding all of
@@ -520,24 +521,24 @@ are on the fringes of a cluster.

 Any core sample is part of a cluster, by definition. Further, any cluster has
 at least `min_samples` points in it, following the definition of a core
-sample. For any sample that is not a core sample, and does not have a
-similarity higher than `eps` to a core sample, it is considered an outlier by
+sample. Any sample that is not a core sample, and is more than `eps` away
+from every core sample, is considered an outlier by
 the algorithm.

 The algorithm is non-deterministic; however, the core samples themselves will
 always belong to the same clusters (although the labels themselves may be
 different). The non-determinism comes from deciding on which cluster a
-non-core sample belongs to. A non-core sample can be have a similarity higher
+non-core sample belongs to. A non-core sample can have a distance lower
 than `eps` to two core samples in different classes. Following from the
-triangular inequality, those two core samples would be less similar than
+triangle inequality, those two core samples would be more distant than
 `eps` from each other -- else they would be in the same class. The non-core
 sample is simply assigned to whichever cluster is generated first, where
 the order is determined randomly within the code. Other than the ordering of
 the dataset, the algorithm is deterministic, making the results relatively
 stable between iterations on the same data.

 In the figure below, the color indicates cluster membership, with large circles
-indicating core samples found by the algorithm. Smaller circles are non-core
+indicating core samples found by the algorithm. Smaller circles are non-core
 samples that are still part of a cluster. Moreover, the outliers are indicated
 by black points below.

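To make `eps` and `min_samples` concrete, a short sketch (the two-moons data are an illustrative assumption, not the figure's dataset)::

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaved half-circles: non-convex clusters that k-means would
    # split incorrectly, but that DBSCAN can recover from density alone.
    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

    # A higher min_samples or a lower eps demands higher density.
    db = DBSCAN(eps=0.3, min_samples=5).fit(X)

    print(db.core_sample_indices_[:5])  # indices of the core samples
    print(np.unique(db.labels_))        # cluster labels; -1 marks outliers
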
@@ -819,7 +820,7 @@ Drawbacks

 * :ref:`example_cluster_plot_adjusted_for_chance_measures.py`: Analysis of
   the impact of the dataset size on the value of clustering measures
-  for random assignments. This example also includes the Adjusted Rand
+  for random assignments. This example also includes the Adjusted Rand
   Index.


@@ -864,7 +865,7 @@ following equation, from Vinh, Epps, and Bailey, (2009). In this equation,
           \frac{a_i!b_j!(N-a_i)!(N-b_j)!}{N!n_{ij}!(a_i-n_{ij})!(b_j-n_{ij})!
           (N-a_i-b_j+n_{ij})!}

-Using the expected value, the adjusted mutual information can then be
+Using the expected value, the adjusted mutual information can then be
 calculated using a similar form to that of the adjusted Rand index:

 .. math:: \text{AMI} = \frac{\text{MI} - E[\text{MI}]}{\max(H(U), H(V)) - E[\text{MI}]}
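
Both adjusted scores are available in `sklearn.metrics`; a brief usage sketch with made-up labelings::

    from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

    labels_true = [0, 0, 0, 1, 1, 1]
    labels_pred = [0, 0, 1, 1, 2, 2]

    # Adjusted for chance: random labelings score near 0.0 and a perfect
    # match (up to a permutation of label names) scores 1.0.
    print(adjusted_rand_score(labels_true, labels_pred))
    print(adjusted_mutual_info_score(labels_true, labels_pred))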
@@ -875,7 +876,7 @@ calculated using a similar form to that of the adjusted Rand index:
   knowledge reuse framework for combining multiple partitions". Journal of
   Machine Learning Research 3: 583–617. doi:10.1162/153244303321897735

- * Vinh, Epps, and Bailey, (2009). "Information theoretic measures
+ * Vinh, Epps, and Bailey, (2009). "Information theoretic measures
   for clusterings comparison". Proceedings of the 26th Annual International
   Conference on Machine Learning - ICML '09.
   doi:10.1145/1553374.1553511. ISBN 9781605585161.
@@ -1045,7 +1046,7 @@ mean of homogeneity and completeness**:

 .. [B2011] `Identification and Characterization of Events in Social Media
    <http://www.cs.columbia.edu/~hila/hila-thesis-distributed.pdf>`_, Hila
-   Becker, PhD Thesis.
+   Becker, PhD Thesis.

 .. _silhouette_coefficient:

@@ -1073,7 +1074,7 @@ The Silhouette Coefficient *s* for a single sample is then given as:

 .. math:: s = \frac{b - a}{\max(a, b)}

-The Silhouette Coefficient for a set of samples is given as the mean of the
+The Silhouette Coefficient for a set of samples is given as the mean of the
 Silhouette Coefficient for each sample.

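A quick arithmetic check of the formula, with illustrative values of our choosing for one sample (`a` is its mean intra-cluster distance, `b` its mean distance to the nearest other cluster)::

    import numpy as np

    x = np.array([0.0, 0.0])                    # the sample itself
    own = np.array([[0.0, 1.0], [1.0, 0.0]])    # rest of its own cluster
    other = np.array([[4.0, 0.0], [5.0, 0.0]])  # nearest other cluster

    a = np.linalg.norm(own - x, axis=1).mean()
    b = np.linalg.norm(other - x, axis=1).mean()
    s = (b - a) / max(a, b)
    print(s)  # approaches 1 as the sample sits far from the other cluster
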
@@ -1091,7 +1092,7 @@ cluster analysis.

 >>> from sklearn.cluster import KMeans
 >>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
 >>> labels = kmeans_model.labels_
->>> metrics.silhouette_score(X, labels, metric='euclidean')
+>>> metrics.silhouette_score(X, labels, metric='euclidean')
 ... # doctest: +ELLIPSIS
 0.55...

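The doctest above relies on context outside this hunk that defines `X` and imports `metrics`; a self-contained variant, with synthetic blobs standing in for the dataset prepared earlier in the original docs::

    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic stand-in for the X defined earlier in the documentation.
    X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

    kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
    labels = kmeans_model.labels_
    print(metrics.silhouette_score(X, labels, metric='euclidean'))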