ENH Add inverse_transform to random projection transformers #21701


Merged: 39 commits merged into scikit-learn:main on Mar 14, 2022

Conversation

@ageron (Contributor) commented Nov 17, 2021

Reference Issues/PRs

Fixes #21687

What does this implement/fix? Explain your changes.

Adds a fit_inverse_transform parameter to all transformers in the sklearn.random_projection module: GaussianRandomProjection and SparseRandomProjection. When set to True, the pseudo-inverse of the components is computed during fit() and stored in components_pinv_, and inverse_transform() becomes available.
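A minimal usage sketch (using the fit_inverse_transform parameter and inverse_components_/components_pinv_ attribute names as described in this PR; they were discussed during review and may differ in the merged release):

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.randn(100, 500)

# fit_inverse_transform=True asks fit() to also compute the pseudo-inverse
# of components_, which enables inverse_transform().
rp = GaussianRandomProjection(n_components=50, fit_inverse_transform=True,
                              random_state=0)
X_proj = rp.fit_transform(X)           # shape (100, 50)
X_back = rp.inverse_transform(X_proj)  # shape (100, 500), always dense

# Round trip in the projected space: transform(inverse_transform(Y)) == Y
np.testing.assert_allclose(rp.transform(X_back), X_proj, atol=1e-7)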

Any other comments?

Using the pseudo-inverse makes sense to me, and it seems to work fine, in the sense that rnd_proj.transform(rnd_proj.inverse_transform(X)) equals X. However, this implementation uses scipy.linalg.pinv(), which scales very poorly to large matrices, even though large datasets are a major use case for random projections. Perhaps it would make sense to use another approach when we detect that the components_ array is large?

And perhaps there's a mathematical way to generate both a random matrix and its inverse more efficiently?

For the SparseRandomProjection transformer, computing the pseudo-inverse breaks sparsity. Perhaps there's a way to generate a sparse matrix that is "close enough" rather than using the pseudo-inverse?

In short: if there's a Random Projection expert in the room, please speak up!

That said, it seems to work fine now, so performance improvements could be pushed in follow-up PRs.

@ageron (Contributor, Author) commented Nov 17, 2021

The test failure in sklearn/tests/test_calibration.py looks unrelated to this PR.

@ageron (Contributor, Author) commented Nov 18, 2021

According to this SO answer:

> numpy.linalg.pinv() approximates the Moore-Penrose pseudo inverse using an SVD (the LAPACK method dgesdd to be precise), whereas scipy.linalg.pinv() solves a model linear system in the least squares sense to approximate the pseudo inverse (using dgelss).

They perform differently, both in terms of speed and memory usage, depending on the size of the input. So perhaps a follow-up PR should give the user the option to select the algorithm to use?

@adrinjalali (Member):

This looks pretty good and clean, thanks @ageron. The only thing I'd add is a section in the relevant user guide (the relevant .rst) explaining the inverse transform: how it's calculated and how it's enabled.

@ogrisel (Member) commented Nov 18, 2021

> numpy.linalg.pinv() approximates the Moore-Penrose pseudo inverse using an SVD (the LAPACK method dgesdd to be precise), whereas scipy.linalg.pinv() solves a model linear system in the least squares sense to approximate the pseudo inverse (using dgelss).

Are you sure this is still valid with recent versions of SciPy? https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html

It's not possible to select the LAPACK driver though.

@glemaitre changed the title from "Add inverse_transform to random projection transformers" to "ENH Add inverse_transform to random projection transformers" on Nov 18, 2021
# Reviewed hunk from fit(); assumes `from scipy import linalg` and
# `import scipy.sparse as sp` at module level:
components = self.components_
if sp.issparse(components):
    components = components.toarray()
self.components_pinv_ = linalg.pinv(components, check_finite=False)
Member: Is it useful to expose the pseudo-inverse publicly, or can we make it kind of private (i.e. self._components_pinv)?

Member: You mentioned that pinv does not scale with large matrices. I have 2 questions:

  • Should we use pinv or pinv2 depending on the shape of X?
  • Is there a way to approximate the inverse? I recall a PR where we wanted to do so with an approximation in Nystroem (I did not look at that PR, though). I don't know if we could use a similar trick to get such an approximation here.

Member: I think it's useful to expose publicly. Maybe we could change the name to inverse_components_, though, to make the name less dependent on an implementation detail.

@ogrisel (Member) commented Nov 18, 2021:

> Should we use pinv or pinv2 depending on the shape of X?

pinv2 is deprecated.

@ogrisel (Member) left a review:

Thanks for the PR, it's an interesting contrib. However, I have the following concerns + some suggestions:

(Inline review threads on sklearn/random_projection.py and sklearn/tests/test_random_projection.py, all since resolved.)
Commit: …o inverse_components_, and let inverse_transform() accept sparse arrays
@ageron (Contributor, Author) commented Nov 19, 2021

Thanks @adrinjalali, @glemaitre, and @ogrisel for taking such a thorough look at this PR, and for the interesting feedback.

@ogrisel, I tested both np.linalg.pinv() and scipy.linalg.pinv() on an array of shape [5_000, 20_000]: SciPy took 25.5 minutes while NumPy took only 11.6 minutes (on Colab). So NumPy's implementation was twice as fast! SciPy uses its decomp_svd() function, while NumPy uses its svd() function. I haven't looked further yet, so I'm still not sure why.
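For reference, a sketch of this kind of timing comparison (a smaller array than the one above so it runs quickly; timings vary by machine and BLAS, and this is not the exact benchmark script):

import time
import numpy as np
import scipy.linalg

rng = np.random.RandomState(42)
A = rng.randn(1_000, 4_000)  # the experiment above used shape [5_000, 20_000]

for name, pinv in [("numpy", np.linalg.pinv), ("scipy", scipy.linalg.pinv)]:
    start = time.perf_counter()
    pinv(A)
    print(f"{name}: {time.perf_counter() - start:.1f}s")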

I'll work on the pinv() implementation for sparse arrays soon, but in the meantime I made the rest of the changes you suggested, including:

  • Added a section in the User Guide in random_projection.rst
  • Renamed components_pinv_ to inverse_components_. I didn't like the name either!
  • Added a note to say that even if X is sparse, X_original is dense, which may use a lot of RAM.
  • Fixed: X : {array-like, sparse matrix} of shape (n_samples, n_components)
  • inverse_transform() now accepts both dense and sparse arrays
  • Added testing for both dense and sparse inputs in test_random_projection.py
  • Removed the ######### header in test_random_projection.py
  • Added the test that random_projection.components_ @ random_projection.inverse_components_ is the identity matrix

@ageron (Contributor, Author) commented Nov 21, 2021

Ok, SparseRandomProjection.fit() now computes the pseudo-inverse (when fit_inverse_transform is set to True) without converting the components_ to a dense array.

The pseudo-inverse implementation is based on scipy.sparse.linalg.svds(), as you suggested, @ogrisel. I had to call it twice, though: once for the first half of the components and once for the second half, because scipy.sparse.linalg.svds() does not support getting all the components at once.

The rest of the pseudo-inverse implementation was taken in large part from scipy.linalg.pinv(), but I simplified it a lot, removing every parameter except the sparse matrix. I just wanted to keep things as basic as possible, since this is meant to be a temporary workaround until SciPy's pinv() supports sparse matrices.
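To illustrate the approach, a simplified sketch of building a pseudo-inverse from a truncated sparse SVD (an illustration of the idea, not the PR's _svd_for_sparse_matrix() code; in particular, the two-call splitting described above is omitted):

import numpy as np
from scipy.sparse.linalg import svds

def sparse_pinv(a, k):
    # Pseudo-inverse of a sparse matrix from its k largest singular triplets.
    # svds() requires k < min(a.shape), hence the splitting/padding tricks
    # discussed in this thread.
    u, s, vt = svds(a, k=k)
    # Invert only the non-negligible singular values (the rcond idea in pinv).
    tol = max(a.shape) * np.finfo(float).eps * s.max()
    s_inv = np.zeros_like(s)
    s_inv[s > tol] = 1.0 / s[s > tol]
    return (vt.T * s_inv) @ u.T  # dense array of shape (a.shape[1], a.shape[0])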

@ogrisel (Member) left a review:

LGTM. Just a few minor improvements:

(Inline review threads on sklearn/random_projection.py, since resolved.)
@ageron (Contributor, Author) commented Nov 24, 2021

Done

@ageron (Contributor, Author) commented Nov 25, 2021

Yikes, something's weird: I'm getting no errors in my tests locally, but it's failing in CI. I'll have to investigate. I'm thinking it might be an error that happens occasionally, if the first k//2 components returned by the first call to scipy.sparse.linalg.svds() are not compatible with the k - k//2 components returned by the second call. If that's the case, I don't really see a way to fix it. Any ideas? If there's no solution, I'll just remove inverse_transform() for the sparse case. All this hard work for nothing! 😭

Edit: I've run some tests and I can confirm that my implementation of svd() based on calling svds() twice only worked about 90% of the time, depending on the dataset. Since the random seed was fixed, I was always testing on the same dataset, so I never ran into this issue. I wrote a new implementation that fixes this issue; see below.

@ageron (Contributor, Author) commented Nov 25, 2021

Okay, I've rewritten _svd_for_sparse_matrix(), and everything seems to work fine now. Instead of calling svds() twice, the code just calls it once. To work around the fact that svds() can only return min(a.shape) - 1 components (missing one), the code now adds an extra row or column (or both) full of zeros and crops U or Vt appropriately.
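A sketch of that padding trick (illustrative only; the PR's actual _svd_for_sparse_matrix() differs in its details):

import scipy.sparse as sp
from scipy.sparse.linalg import svds

def svd_all_components(a):
    # Full SVD of a sparse matrix in one svds() call: svds() returns at most
    # min(shape) - 1 triplets, so pad with a zero row and/or column first...
    m, n = a.shape
    padded = sp.vstack([a, sp.csr_matrix((1, n))]) if m <= n else a
    if n <= m:
        padded = sp.hstack([padded, sp.csr_matrix((padded.shape[0], 1))])
    u, s, vt = svds(padded, k=min(m, n))
    # ...then crop U and Vt back to the original shape.
    return u[:m], s, vt[:, :n]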

Note that I removed the test assert_array_almost_equal(random_projection.components_ @ random_projection.inverse_components_, np.eye(random_projection.n_components)). Indeed, a matrix multiplied by its pseudo-inverse is not always equal to the identity matrix, for example:

>>> import numpy as np
>>> from scipy.linalg import pinv
>>> a = np.array([[1, 2, 0], [2, 4, 0]])
>>> a @ pinv(a)
array([[0.2, 0.4],
       [0.4, 0.8]])

However, I have added a test that inverse_components_ is equal to scipy.linalg.pinv(components_), or equal to scipy.linalg.pinv(components_.toarray()) if components_ is sparse.

Also, the test now runs many times with different random states and different shapes: with n_rows < n_cols, or n_rows > n_cols, or n_rows == n_cols. I had to silence an unrelated warning in the case where n_components > n_cols.
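A hedged sketch of that equality check (illustrative names and tolerance, not the exact test code):

import scipy.sparse as sp
from scipy import linalg
from numpy.testing import assert_allclose

def check_inverse_components(fitted_transformer):
    # inverse_components_ should match SciPy's dense pseudo-inverse of components_.
    components = fitted_transformer.components_
    if sp.issparse(components):
        components = components.toarray()
    assert_allclose(fitted_transformer.inverse_components_,
                    linalg.pinv(components), atol=1e-7)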

@ageron (Contributor, Author) commented Mar 11, 2022

> Thanks @ageron!
>
> Since the inverted components are always dense, why do we bother to implement a pinv for sparse? Densifying the array first and calling NumPy's pinv could (maybe?) be more efficient. I understand that the goal was to mitigate memory usage, but we are creating the dense inverse components array anyway.

My pleasure @jeremiedbb!

I think there were two reasons: (1) as you pointed out, it saves memory (probably halving it), and (2) someone (perhaps @ogrisel? I can't recall) told me that it may be useful in other places until SciPy offers the same functionality, which I believe is on their roadmap.

@jeremiedbb (Member):

> I think there were two reasons: (1) as you pointed out, it saves memory (probably halving it), and (2) someone (perhaps @ogrisel? I can't recall) told me that it may be useful in other places until SciPy offers the same functionality, which I believe is on their roadmap.

Yes, the worst case is temporarily having 2 dense arrays like components instead of 1. After an IRL discussion with @ogrisel, we agreed that it's acceptable if it lets us avoid the burden of maintaining our own version of a pseudo-inverse for sparse matrices.
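A minimal sketch of the densify-then-invert approach agreed on here (an illustration of the trade-off, not the merged implementation):

import numpy as np
import scipy.sparse as sp

def dense_pinv(components):
    # Worst case: components and its dense copy coexist temporarily,
    # but the inverse components are dense anyway.
    if sp.issparse(components):
        components = components.toarray()
    return np.linalg.pinv(components)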

@jeremiedbb (Member):

> Tbh I don't have much time right now, I'm trying to finish writing the 3rd edition of my book, so if a kind soul wants to investigate this point, that would be great.

If you don't have much time, I can push these changes directly if you want.

@ageron (Contributor, Author) commented Mar 11, 2022

Hi @jeremiedbb,
Thanks for reviewing, and for offering to push the last changes, I really appreciate it. I understand the trade-off regarding the sparse pinv. There was a bit of a sunk-cost fallacy on my part: I spent so much time working on the sparse pinv that I was reluctant to just drop it! But no worries, please feel free to delete it.

@jeremiedbb (Member):

> I spent so much time working on the sparse pinv that I was reluctant to just drop it!

I understand the feeling :)

@jeremiedbb (Member) left a review:

LGTM

@ogrisel (Member) left a review:

LGTM again, including the changes suggested and implemented by @jeremiedbb.

Sorry for the back-and-forth review @ageron :)

I just pushed a merge with main and a new commit to leverage the newly introduced global_random_seed fixture in the new test, keeping it deterministic while making sure it is not seed-sensitive.

Will merge when green.
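For context, a sketch of how a test can consume scikit-learn's global_random_seed fixture (illustrative, not the exact test added in this PR; the fit_inverse_transform name follows the PR description above):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.random_projection import GaussianRandomProjection

def test_inverse_transform_round_trip(global_random_seed):
    # The fixture yields seeds controlled by SKLEARN_TESTS_GLOBAL_RANDOM_SEED,
    # so the test is deterministic per run but not tied to a single seed.
    rng = np.random.RandomState(global_random_seed)
    X = rng.randn(50, 200)
    rp = GaussianRandomProjection(n_components=20, fit_inverse_transform=True,
                                  random_state=global_random_seed)
    X_proj = rp.fit_transform(X)
    assert_allclose(rp.transform(rp.inverse_transform(X_proj)), X_proj,
                    atol=1e-7)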

@jeremiedbb merged commit 7723d93 into scikit-learn:main on Mar 14, 2022
@jeremiedbb (Member):

Thanks @ageron !

@ageron (Contributor, Author) commented Mar 14, 2022

Thanks @ogrisel, @jeremiedbb and @glemaitre for the thorough review, I'm impressed by how pro and helpful you all are, it's no wonder Scikit-Learn is so good! 👍

glemaitre pushed a commit to glemaitre/scikit-learn that referenced this pull request Apr 6, 2022
…earn#21701)

Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Linked issue (closed by this PR): Add inverse_transform() to random projection classes (#21687)