ENH Add inverse_transform to random projection transformers #21701


Merged: 39 commits merged into scikit-learn:main on Mar 14, 2022

Conversation

@ageron (Contributor) commented Nov 17, 2021

Reference Issues/PRs

Fixes #21687

What does this implement/fix? Explain your changes.

Adds a fit_inverse_transform parameter to all transformers in the sklearn.random_projection module: GaussianRandomProjection and SparseRandomProjection. When set to True, the pseudo-inverse of the components is computed during fit() and stored in components_pinv_, and inverse_transform() becomes available.
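A minimal usage sketch (using the fit_inverse_transform parameter and inverse_components_/components_pinv_ attribute names as described in this PR; they were discussed during review and may differ in the merged release):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.randn(100, 500)

# fit_inverse_transform=True asks fit() to also compute the pseudo-inverse
# of components_, which enables inverse_transform().
rp = GaussianRandomProjection(n_components=50, fit_inverse_transform=True,
                              random_state=0)
X_proj = rp.fit_transform(X)           # shape (100, 50)
X_back = rp.inverse_transform(X_proj)  # shape (100, 500), always dense

# Round trip in the projected space: transform(inverse_transform(Y)) == Y
np.testing.assert_allclose(rp.transform(X_back), X_proj, atol=1e-7)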

Any other comments?

Using the pseudo-inverse makes sense to me, and it seems to work fine, in the sense that rnd_proj.transform(rnd_proj.inverse_transform(X)) equals X. However, this implementation uses scipy.linalg.pinv(), which scales very poorly to large matrices, even though large datasets are a major use case for random projections. Perhaps it would make sense to use another approach when we detect that the components_ array is large?

And perhaps there's a mathematical way to generate both a random matrix and its inverse more efficiently?

For the SparseRandomProjection transformer, computing the pseudo-inverse breaks sparsity. Perhaps there's a way to generate a sparse matrix that is "close enough" rather than using the pseudo-inverse?

In short: if there's a Random Projection expert in the room, please speak up!

That said, it seems to work fine now, so performance improvements could be pushed in follow-up PRs.

@ageron (Contributor, Author) commented Nov 17, 2021

The test failure in sklearn/tests/test_calibration.py looks unrelated to this PR.

@ageron (Contributor, Author) commented Nov 18, 2021

According to this SO answer:

> numpy.linalg.pinv() approximates the Moore-Penrose pseudo inverse using an SVD (the LAPACK method dgesdd to be precise), whereas scipy.linalg.pinv() solves a model linear system in the least squares sense to approximate the pseudo inverse (using dgelss).

They perform differently, both in terms of speed and memory usage, depending on the size of the input. So perhaps a follow-up PR should give the user the option to select the algorithm to use?

@adrinjalali (Member):

This looks pretty good and clean, thanks @ageron. The only thing I'd add is a section in the relevant user guide (the relevant .rst) explaining the inverse transform: how it's calculated and how it's enabled.

@ogrisel (Member) commented Nov 18, 2021

> numpy.linalg.pinv() approximates the Moore-Penrose pseudo inverse using an SVD (the LAPACK method dgesdd to be precise), whereas scipy.linalg.pinv() solves a model linear system in the least squares sense to approximate the pseudo inverse (using dgelss).

Are you sure this is still valid with recent versions of SciPy? https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html

It's not possible to select the LAPACK driver though.

@glemaitre changed the title from "Add inverse_transform to random projection transformers" to "ENH Add inverse_transform to random projection transformers" on Nov 18, 2021
# Reviewed hunk from fit(); assumes `from scipy import linalg` and
# `import scipy.sparse as sp` at module level:
components = self.components_
if sp.issparse(components):
    components = components.toarray()
self.components_pinv_ = linalg.pinv(components, check_finite=False)
Member: Is it useful to expose the pseudo-inverse publicly, or can we make it kind of private (i.e. self._components_pinv)?

Member: You mentioned that pinv does not scale with large matrices. I have 2 questions:

  • Should we use pinv or pinv2 depending on the shape of X?
  • Is there a way to approximate the inverse? I recall a PR where we wanted to do so with an approximation in Nystroem (I did not look at that PR, though). I don't know if we could use a similar trick to get such an approximation here.

Member: I think it's useful to expose publicly. Maybe we could change the name to inverse_components_, though, to make the name less dependent on an implementation detail.

@ogrisel (Member) commented Nov 18, 2021:

> Should we use pinv or pinv2 depending on the shape of X?

pinv2 is deprecated.

@ogrisel (Member) left a review:

Thanks for the PR, it's an interesting contrib. However, I have the following concerns + some suggestions:

(Inline review threads on sklearn/random_projection.py and sklearn/tests/test_random_projection.py, all since resolved.)
Commit: …o inverse_components_, and let inverse_transform() accept sparse arrays
@ageron (Contributor, Author) commented Nov 19, 2021

Thanks @adrinjalali, @glemaitre, and @ogrisel for taking such a thorough look at this PR, and for the interesting feedback.

@ogrisel, I tested both np.linalg.pinv() and scipy.linalg.pinv() on an array of shape [5_000, 20_000]: SciPy took 25.5 minutes while NumPy took only 11.6 minutes (on Colab). So NumPy's implementation was twice as fast! SciPy uses its decomp_svd() function, while NumPy uses its svd() function. I haven't looked further yet, so I'm still not sure why.
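For reference, a sketch of this kind of timing comparison (a smaller array than the one above so it runs quickly; timings vary by machine and BLAS, and this is not the exact benchmark script):

import time
import numpy as np
import scipy.linalg

rng = np.random.RandomState(42)
A = rng.randn(1_000, 4_000)  # the experiment above used shape [5_000, 20_000]

for name, pinv in [("numpy", np.linalg.pinv), ("scipy", scipy.linalg.pinv)]:
    start = time.perf_counter()
    pinv(A)
    print(f"{name}: {time.perf_counter() - start:.1f}s")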

I'll work on the pinv() implementation for sparse arrays soon, but in the meantime I made the rest of the changes you suggested, including:

  • Added a section in the User Guide in random_projection.rst
  • Renamed components_pinv_ to inverse_components_. I didn't like the name either!
  • Added a note to say that even if X is sparse, X_original is dense, which may use a lot of RAM.
  • Fixed: X : {array-like, sparse matrix} of shape (n_samples, n_components)
  • inverse_transform() now accepts both dense and sparse arrays
  • Added testing for both dense and sparse inputs in test_random_projection.py
  • Removed the ######### header in test_random_projection.py
  • Added the test that random_projection.components_ @ random_projection.inverse_components_ is the identity matrix

@ageron (Contributor, Author) commented Nov 21, 2021

Ok, SparseRandomProjection.fit() now computes the pseudo-inverse (when fit_inverse_transform is set to True) without converting the components_ to a dense array.

The pseudo-inverse implementation is based on scipy.sparse.linalg.svds(), as you suggested, @ogrisel. I had to call it twice, though: once for the first half of the components and once for the second half, because scipy.sparse.linalg.svds() does not support getting all the components at once.

The rest of the pseudo-inverse implementation was taken in large part from scipy.linalg.pinv(), but I simplified it a lot, removing every parameter except the sparse matrix. I just wanted to keep things as basic as possible, since this is meant to be a temporary workaround until SciPy's pinv() supports sparse matrices.
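To illustrate the approach, a simplified sketch of building a pseudo-inverse from a truncated sparse SVD (an illustration of the idea, not the PR's _svd_for_sparse_matrix() code; in particular, the two-call splitting described above is omitted):

import numpy as np
from scipy.sparse.linalg import svds

def sparse_pinv(a, k):
    # Pseudo-inverse of a sparse matrix from its k largest singular triplets.
    # svds() requires k < min(a.shape), hence the splitting/padding tricks
    # discussed in this thread.
    u, s, vt = svds(a, k=k)
    # Invert only the non-negligible singular values (the rcond idea in pinv).
    tol = max(a.shape) * np.finfo(float).eps * s.max()
    s_inv = np.zeros_like(s)
    s_inv[s > tol] = 1.0 / s[s > tol]
    return (vt.T * s_inv) @ u.T  # dense array of shape (a.shape[1], a.shape[0])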

@ogrisel (Member) left a review:

LGTM. Just a few minor improvements:

(Inline review threads on sklearn/random_projection.py, since resolved.)
@ageron (Contributor, Author) commented Nov 24, 2021

Done

@ageron (Contributor, Author) commented Nov 25, 2021

Yikes, something's weird: I'm getting no errors in my tests locally, but it's failing in CI. I'll have to investigate. I'm thinking it might be an error that happens occasionally, if the first k//2 components returned by the first call to scipy.sparse.linalg.svds() are not compatible with the k - k//2 components returned by the second call. If that's the case, I don't really see a way to fix it. Any ideas? If there's no solution, I'll just remove inverse_transform() for the sparse case. All this hard work for nothing! 😭

Edit: I've run some tests and I can confirm that my implementation of svd() based on calling svds() twice only worked about 90% of the time, depending on the dataset. Since the random seed was fixed, I was always testing on the same dataset, so I never ran into this issue. I wrote a new implementation that fixes this issue; see below.

@ageron (Contributor, Author) commented Nov 25, 2021

Okay, I've rewritten _svd_for_sparse_matrix(), and everything seems to work fine now. Instead of calling svds() twice, the code just calls it once. To work around the fact that svds() can only return min(a.shape) - 1 components (missing one), the code now adds an extra row or column (or both) full of zeros and crops U or Vt appropriately.
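A sketch of that padding trick (illustrative only; the PR's actual _svd_for_sparse_matrix() differs in its details):

import scipy.sparse as sp
from scipy.sparse.linalg import svds

def svd_all_components(a):
    # Full SVD of a sparse matrix in one svds() call: svds() returns at most
    # min(shape) - 1 triplets, so pad with a zero row and/or column first...
    m, n = a.shape
    padded = sp.vstack([a, sp.csr_matrix((1, n))]) if m <= n else a
    if n <= m:
        padded = sp.hstack([padded, sp.csr_matrix((padded.shape[0], 1))])
    u, s, vt = svds(padded, k=min(m, n))
    # ...then crop U and Vt back to the original shape.
    return u[:m], s, vt[:, :n]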

Note that I removed the test assert_array_almost_equal(random_projection.components_ @ random_projection.inverse_components_, np.eye(random_projection.n_components)). Indeed, a matrix multiplied by its pseudo-inverse is not always equal to the identity matrix, for example:

>>> import numpy as np
>>> from scipy.linalg import pinv
>>> a = np.array([[1, 2, 0], [2, 4, 0]])
>>> a @ pinv(a)
array([[0.2, 0.4],
       [0.4, 0.8]])

However, I have added a test that inverse_components_ is equal to scipy.linalg.pinv(components_), or equal to scipy.linalg.pinv(components_.toarray()) if components_ is sparse.

Also, the test now runs many times with different random states and different shapes: with n_rows < n_cols, or n_rows > n_cols, or n_rows == n_cols. I had to silence an unrelated warning in the case where n_components > n_cols.
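A hedged sketch of that equality check (illustrative names and tolerance, not the exact test code):

import scipy.sparse as sp
from scipy import linalg
from numpy.testing import assert_allclose

def check_inverse_components(fitted_transformer):
    # inverse_components_ should match SciPy's dense pseudo-inverse of components_.
    components = fitted_transformer.components_
    if sp.issparse(components):
        components = components.toarray()
    assert_allclose(fitted_transformer.inverse_components_,
                    linalg.pinv(components), atol=1e-7)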

@ageron (Contributor, Author) commented Mar 11, 2022

> Thanks @ageron!
>
> Since the inverted components are always dense, why do we bother to implement a pinv for sparse? Densifying the array first and calling NumPy's pinv could (maybe?) be more efficient. I understand that the goal was to mitigate memory usage, but we are creating the dense inverse components array anyway.

My pleasure @jeremiedbb!

I think there were two reasons: (1) as you pointed out, it saves memory (probably halving it), and (2) someone (perhaps @ogrisel? I can't recall) told me that it may be useful in other places until SciPy offers the same functionality, which I believe is on their roadmap.

@jeremiedbb (Member):

> I think there were two reasons: (1) as you pointed out, it saves memory (probably halving it), and (2) someone (perhaps @ogrisel? I can't recall) told me that it may be useful in other places until SciPy offers the same functionality, which I believe is on their roadmap.

Yes, the worst case is temporarily having 2 dense arrays like components instead of 1. After an IRL discussion with @ogrisel, we agreed that it's acceptable if it lets us avoid the burden of maintaining our own version of a pseudo-inverse for sparse matrices.
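A minimal sketch of the densify-then-invert approach agreed on here (an illustration of the trade-off, not the merged implementation):

import numpy as np
import scipy.sparse as sp

def dense_pinv(components):
    # Worst case: components and its dense copy coexist temporarily,
    # but the inverse components are dense anyway.
    if sp.issparse(components):
        components = components.toarray()
    return np.linalg.pinv(components)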

@jeremiedbb (Member):

> Tbh I don't have much time right now, I'm trying to finish writing the 3rd edition of my book, so if a kind soul wants to investigate this point, that would be great.

If you don't have much time, I can push these changes directly if you want.

@ageron (Contributor, Author) commented Mar 11, 2022

Hi @jeremiedbb,
Thanks for reviewing, and for offering to push the last changes, I really appreciate it. I understand the trade-off regarding the sparse pinv. There was a bit of a sunk-cost fallacy on my part: I spent so much time working on the sparse pinv that I was reluctant to just drop it! But no worries, please feel free to delete it.

@jeremiedbb (Member):

> I spent so much time working on the sparse pinv that I was reluctant to just drop it!

I understand the feeling :)

@jeremiedbb (Member) left a review:

LGTM

@ogrisel (Member) left a review:

LGTM again, including the changes suggested and implemented by @jeremiedbb.

Sorry for the back-and-forth review @ageron :)

I just pushed a merge with main and a new commit to leverage the newly introduced global_random_seed fixture in the new test, keeping it deterministic while making sure it is not seed-sensitive.

Will merge when green.
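For context, a sketch of how a test can consume scikit-learn's global_random_seed fixture (illustrative, not the exact test added in this PR; the fit_inverse_transform name follows the PR description above):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.random_projection import GaussianRandomProjection

def test_inverse_transform_round_trip(global_random_seed):
    # The fixture yields seeds controlled by SKLEARN_TESTS_GLOBAL_RANDOM_SEED,
    # so the test is deterministic per run but not tied to a single seed.
    rng = np.random.RandomState(global_random_seed)
    X = rng.randn(50, 200)
    rp = GaussianRandomProjection(n_components=20, fit_inverse_transform=True,
                                  random_state=global_random_seed)
    X_proj = rp.fit_transform(X)
    assert_allclose(rp.transform(rp.inverse_transform(X_proj)), X_proj,
                    atol=1e-7)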

@jeremiedbb merged commit 7723d93 into scikit-learn:main on Mar 14, 2022
@jeremiedbb (Member):

Thanks @ageron !

@ageron (Contributor, Author) commented Mar 14, 2022

Thanks @ogrisel, @jeremiedbb and @glemaitre for the thorough review, I'm impressed by how pro and helpful you all are, it's no wonder Scikit-Learn is so good! 👍

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
…earn#21701)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Linked issue (closed by this PR): Add inverse_transform() to random projection classes (#21687)