Commit f84e646

Merge remote-tracking branch 'upstream/master' into dbscan-doc-enh

2 parents: 233c7dc + 4063d9e

25 files changed: +728 -347 lines

doc/modules/classes.rst (+1 -0)

@@ -217,6 +217,7 @@ Samples generator
    decomposition.KernelPCA
    decomposition.FactorAnalysis
    decomposition.FastICA
+   decomposition.TruncatedSVD
    decomposition.NMF
    decomposition.SparsePCA
    decomposition.MiniBatchSparsePCA

doc/modules/cross_validation.rst (+5 -2)

@@ -277,8 +277,11 @@ not waste much data as only one sample is removed from the learning set::
 Leave-P-Out - LPO
 -----------------
 
-:class:`LeavePOut` is very similar to *Leave-One-Out*, as it creates all the
-possible training/test sets by removing :math:`P` samples from the complete set.
+:class:`LeavePOut` is very similar to :class:`LeaveOneOut` as it creates all
+the possible training/test sets by removing :math:`p` samples from the complete
+set. For :math:`n` samples, this produces :math:`{n \choose p}` train-test
+pairs. Unlike :class:`LeaveOneOut` and :class:`KFold`, the test sets will
+overlap for :math:`p > 1`.
 
 Example of Leave-2-Out::
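
Note: to make the new :math:`{n \choose p}` claim concrete, here is a minimal
sketch (assuming the ``sklearn.cross_validation`` API of this release; the
"Example of Leave-2-Out" section in the doc itself shows the full split
listing):

    # Count the train-test pairs LeavePOut generates for n=4, p=2:
    # C(4, 2) = 6, and each sample appears in several test sets (they overlap).
    from sklearn.cross_validation import LeavePOut

    lpo = LeavePOut(4, 2)
    print(len(list(lpo)))  # 6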

doc/modules/decomposition.rst (+80 -0)

@@ -232,6 +232,86 @@ factorization, while larger values shrink many coefficients to zero.
     R. Jenatton, G. Obozinski, F. Bach, 2009
 
 
+.. _LSA:
+
+Truncated singular value decomposition and latent semantic analysis
+===================================================================
+
+:class:`TruncatedSVD` implements a variant of singular value decomposition
+(SVD) that only computes the :math:`k` largest singular values,
+where :math:`k` is a user-specified parameter.
+
+When truncated SVD is applied to term-document matrices
+(as returned by ``CountVectorizer`` or ``TfidfVectorizer``),
+this transformation is known as
+`latent semantic analysis <http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf>`_
+(LSA), because it transforms such matrices
+to a "semantic" space of low dimensionality.
+In particular, LSA is known to combat the effects of synonymy (several
+words sharing one meaning) and polysemy (one word carrying several
+meanings), which cause term-document matrices to be overly sparse
+and exhibit poor similarity under measures such as cosine similarity.
+
+.. note::
+    LSA is also known as latent semantic indexing, LSI,
+    though strictly that refers to its use in persistent indexes
+    for information retrieval purposes.
+
+Mathematically, truncated SVD applied to training samples :math:`X`
+produces a low-rank approximation :math:`X_k`:
+
+.. math::
+    X \approx X_k = U_k \Sigma_k V_k^\top
+
+After this operation, :math:`U_k \Sigma_k`
+is the transformed training set with :math:`k` features
+(called ``n_components`` in the API).
+
+To also transform a test set :math:`X`, we multiply it with :math:`V_k`:
+
+.. math::
+    X' = X V_k
+
+.. note::
+    Most treatments of LSA in the natural language processing (NLP)
+    and information retrieval (IR) literature
+    swap the axes of the matrix :math:`X` so that it has shape
+    ``n_features`` × ``n_samples``.
+    We present LSA in a different way that matches the scikit-learn API better,
+    but the singular values found are the same.
+
+:class:`TruncatedSVD` is very similar to :class:`PCA`, but differs
+in that it works on sample matrices :math:`X` directly
+instead of their covariance matrices.
+When the columnwise (per-feature) means of :math:`X`
+are subtracted from the feature values,
+truncated SVD on the resulting matrix is equivalent to PCA.
+In practical terms, this means
+that the :class:`TruncatedSVD` transformer accepts ``scipy.sparse``
+matrices without the need to densify them,
+as densifying may fill up memory even for medium-sized document collections.
+
+While the :class:`TruncatedSVD` transformer
+works with any (sparse) feature matrix,
+using it on tf–idf matrices is recommended over raw frequency counts
+in an LSA/document processing setting.
+In particular, sublinear scaling and inverse document frequency
+should be turned on (``sublinear_tf=True, use_idf=True``)
+to bring the feature values closer to a Gaussian distribution,
+compensating for LSA's erroneous assumptions about textual data.
+
+.. topic:: Examples:
+
+   * :ref:`example_document_clustering.py`
+
+.. topic:: References:
+
+   * Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2008),
+     *Introduction to Information Retrieval*, Cambridge University Press,
+     chapter 18: `Matrix decompositions & latent semantic indexing
+     <http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf>`_
+
+
 .. _DictionaryLearning:
 
 Dictionary Learning
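
Note: as a hedged companion to the new narrative above, a minimal LSA sketch
(the corpus, variable names, and ``n_components`` value are illustrative, not
part of the commit):

    # Reduce a sparse tf-idf matrix to k "semantic" features with TruncatedSVD.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["the cat sat on the mat",
              "a dog chased the cat",
              "stock markets fell sharply today"]

    # sublinear_tf/use_idf turned on, as the new doc section recommends for LSA
    vect = TfidfVectorizer(sublinear_tf=True, use_idf=True)
    X = vect.fit_transform(corpus)      # sparse, shape (n_samples, n_features)

    svd = TruncatedSVD(n_components=2)  # k = 2 (``n_components`` in the API)
    X_k = svd.fit_transform(X)          # dense (3, 2); this is U_k * Sigma_k
    # For unseen documents, transform(X_new) multiplies X_new by
    # svd.components_.T, i.e. the X' = X V_k projection from the formula above.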

doc/whats_new.rst (+8 -0)

@@ -102,6 +102,11 @@ Changelog
 - Refactored and vectorized implementation of :func:`metrics.roc_curve`
   and :func:`metrics.precision_recall_curve`. By `Joel Nothman`_.
 
+- The new estimator :class:`sklearn.decomposition.TruncatedSVD`
+  performs dimensionality reduction using SVD on sparse matrices,
+  and can be used for latent semantic analysis (LSA).
+  By `Lars Buitinck`_.
+
 
 API changes summary
 -------------------

@@ -121,6 +126,9 @@ API changes summary
 - ``gcv_mode="auto"`` no longer tries to perform SVD on a densified
   sparse matrix in :class:`sklearn.linear_model.RidgeCV`.
 
+- Sparse matrix support in :class:`sklearn.decomposition.RandomizedPCA`
+  is now deprecated in favor of the new ``TruncatedSVD``.
+
 
 .. _changes_0_13_1:

examples/document_clustering.py (+23 -3)

@@ -53,10 +53,12 @@
 from __future__ import print_function
 
 from sklearn.datasets import fetch_20newsgroups
+from sklearn.decomposition import TruncatedSVD
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.feature_extraction.text import HashingVectorizer
 from sklearn.feature_extraction.text import TfidfTransformer
 from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import Normalizer
 from sklearn import metrics
 
 from sklearn.cluster import KMeans, MiniBatchKMeans

@@ -75,6 +77,9 @@
 
 # parse commandline arguments
 op = OptionParser()
+op.add_option("--lsa",
+              dest="n_components", type="int",
+              help="Preprocess documents with latent semantic analysis.")
 op.add_option("--no-minibatch",
               action="store_false", dest="minibatch", default=True,
               help="Use ordinary k-means algorithm (in batch mode).")

@@ -87,6 +92,9 @@
 op.add_option("--n-features", type=int, default=10000,
               help="Maximum number of features (dimensions)"
                    "to extract from text.")
+op.add_option("--verbose",
+              action="store_true", dest="verbose", default=False,
+              help="Print progress reports inside k-means algorithm.")
 
 print(__doc__)
 op.print_help()

@@ -147,17 +155,29 @@
 print("n_samples: %d, n_features: %d" % X.shape)
 print()
 
+if opts.n_components:
+    print("Performing dimensionality reduction using LSA")
+    t0 = time()
+    lsa = TruncatedSVD(opts.n_components)
+    X = lsa.fit_transform(X)
+    # Vectorizer results are normalized, which makes KMeans behave as
+    # spherical k-means for better results. Since LSA/SVD results are
+    # not normalized, we have to redo the normalization.
+    X = Normalizer(copy=False).fit_transform(X)
+
+    print("done in %fs" % (time() - t0))
+    print()
+
 
 ###############################################################################
 # Do the actual clustering
 
 if opts.minibatch:
     km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
-                         init_size=1000,
-                         batch_size=1000, verbose=1)
+                         init_size=1000, batch_size=1000, verbose=opts.verbose)
 else:
     km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
-                verbose=1)
+                verbose=opts.verbose)
 
 print("Clustering sparse data with %s" % km)
 t0 = time()
sklearn/cluster/_feature_agglomeration.py (+3 -11)

@@ -9,6 +9,7 @@
 
 from ..base import TransformerMixin
 from ..utils import array2d
+from ..utils.fixes import unique
 
 
 ###############################################################################

@@ -60,14 +61,5 @@ def inverse_transform(self, Xred):
             A vector of size nb_samples with the values of Xred assigned to
             each of the cluster of samples.
         """
-        if np.size((Xred.shape)) == 1:
-            X = np.zeros([self.labels_.shape[0]])
-        else:
-            X = np.zeros([Xred.shape[0], self.labels_.shape[0]])
-        unil = np.unique(self.labels_)
-        for i in range(len(unil)):
-            if np.size((Xred.shape)) == 1:
-                X[self.labels_ == unil[i]] = Xred[i]
-            else:
-                X[:, self.labels_ == unil[i]] = array2d(Xred[:, i]).T
-        return X
+        unil, inverse = unique(self.labels_, return_inverse=True)
+        return Xred[..., inverse]
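
Note: the refactor above replaces the per-cluster loop with one fancy-indexing
expression; the ``...`` (Ellipsis) lets the same line handle both 1-D and 2-D
``Xred``. A standalone sketch with illustrative values:

    import numpy as np

    labels = np.array([1, 1, 0, 2, 0])  # cluster label of each original feature
    Xred = np.array([[10., 20., 30.]])  # one column per cluster
    unil, inverse = np.unique(labels, return_inverse=True)
    # inverse[j] indexes the cluster (and thus the Xred column) of feature j,
    # so one fancy-indexing step expands Xred back to the original features.
    print(Xred[..., inverse])  # -> [[20. 20. 10. 30. 10.]]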

sklearn/cluster/_hierarchical.pyx (+2 -2)

@@ -47,8 +47,8 @@ def _hc_get_descendent(int node, children, int n_leaves):
     n_leaves : int
         Number of leaves.
 
-    Return
-    ------
+    Returns
+    -------
     descendent : list of int
     """
     ind = [node]

sklearn/cluster/tests/test_hierarchical.py (+2 -0)

@@ -14,6 +14,7 @@
 from sklearn.utils.testing import assert_true
 from sklearn.utils.testing import assert_raises
 from sklearn.utils.testing import assert_equal
+from sklearn.utils.testing import assert_array_almost_equal
 
 from sklearn.cluster import Ward, WardAgglomeration, ward_tree
 from sklearn.cluster.hierarchical import _hc_cut

@@ -119,6 +120,7 @@ def test_ward_agglomeration():
     assert_true(Xred.shape[1] == 5)
     Xfull = ward.inverse_transform(Xred)
     assert_true(np.unique(Xfull[0]).size == 5)
+    assert_array_almost_equal(ward.transform(Xfull), Xred)
 
 
 def assess_same_labelling(cut1, cut2):

sklearn/covariance/empirical_covariance_.py (+4 -4)

@@ -23,8 +23,8 @@
 def log_likelihood(emp_cov, precision):
     """Computes the log_likelihood of the data
 
-    Params
-    ------
+    Parameters
+    ----------
     emp_cov: 2D ndarray (n_features, n_features)
         Maximum Likelihood Estimator of covariance
     precision: 2D ndarray (n_features, n_features)

@@ -101,8 +101,8 @@ def _set_covariance(self, covariance):
         Storage is done accordingly to `self.store_precision`.
         Precision stored only if invertible.
 
-        Params
-        ------
+        Parameters
+        ----------
         covariance: 2D ndarray, shape (n_features, n_features)
             Estimated covariance matrix to be stored, and from which precision
             is computed.
