ENH Add inverse_transform to random projection transformers #21701
Conversation
The test failure in …
According to this SO answer, `pinv` and `pinv2` perform differently, both in terms of speed and memory usage, depending on the size of the input. So perhaps a follow-up PR should give the user the option to select the algorithm to use?
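For illustration, a minimal timing sketch of that comparison (not from the PR; it assumes a scipy version older than 1.9, where `pinv2` still exists, and the matrix shape is arbitrary):

```python
# Rough timing sketch, illustrative only; requires scipy < 1.9 where
# pinv2 is still available (it was deprecated in 1.7 and later removed).
import numpy as np
from timeit import timeit
from scipy.linalg import pinv, pinv2

A = np.random.RandomState(0).randn(500, 2000)
print("pinv :", timeit(lambda: pinv(A), number=3))
print("pinv2:", timeit(lambda: pinv2(A), number=3))
```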
This looks pretty good and clean, thanks @ageron. The only thing I'd add is a section in the relevant user guide (the corresponding .rst) explaining the inverse transform, how it's calculated, and how it's enabled.
Are you sure this is still valid with recent versions of scipy? https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.pinv.html It's not possible to select the LAPACK driver, though.
sklearn/random_projection.py
Outdated
```python
components = self.components_
if sp.issparse(components):
    components = components.toarray()
self.components_pinv_ = linalg.pinv(components, check_finite=False)
```
Is it useful to expose the pseudo-inverse publicly, or can we make it kind of private (i.e. `self._components_pinv`)?
You mentioned that `pinv` does not scale to large matrices. I have 2 questions:

- Should we use `pinv` or `pinv2` depending on the shape of `X`?
- Is there a way to approximate the inverse? I recall a PR where we wanted to do the same with an approximation in Nystroem (I did not look at that PR, though). I don't know if we could use a similar trick here. (One possible shape for such an approximation is sketched below.)
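A hedged sketch of that idea (a truncated pseudo-inverse built from `randomized_svd`; purely illustrative, not what this PR implements):

```python
# Illustrative sketch: approximate pinv(A) via randomized SVD, using
# pinv(A) = V @ diag(1/s) @ U.T. With k equal to the effective rank this
# matches scipy's pinv; a smaller k gives a cheaper approximation.
# The shapes and the rank k are arbitrary choices for the demo.
import numpy as np
from scipy.linalg import pinv
from sklearn.utils.extmath import randomized_svd

rng = np.random.RandomState(0)
A = rng.randn(100, 1000)   # a components_-like matrix (rank 100)

k = 100                    # truncation rank
U, s, Vt = randomized_svd(A, n_components=k, random_state=0)
A_pinv_approx = Vt.T @ np.diag(1.0 / s) @ U.T

print(np.abs(A_pinv_approx - pinv(A)).max())  # ~0 when k covers the full rank
```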
I think it's useful to expose it publicly. Maybe we could change the name to `inverse_components_`, though, to make it less dependent on an implementation detail.
> Should we use pinv or pinv2 depending on the shape of the X?

`pinv2` is deprecated.
Thanks for the PR, it's an interesting contribution. However, I have the following concerns plus some suggestions:
…o inverse_components_, and let inverse_transform() accept sparse arrays
Thanks @adrinjalali, @glemaitre, and @ogrisel for taking such a thorough look at this PR, and for the interesting feedback. @ogrisel, I tested both; I'll work on the rest.
OK, the pseudo-inverse implementation is based on …. The rest of the pseudo-inverse implementation was taken in large part from ….
LGTM. Just a few minor improvements:
Done
Edit: I've run some tests and I can confirm that my implementation of …
Okay, I've rewritten …. Note that I removed the test …: the pseudo-inverse is not an exact inverse when the matrix is rank-deficient:

```python
>>> import numpy as np
>>> from scipy.linalg import pinv
>>> a = np.array([[1, 2, 0], [2, 4, 0]])  # rank 1: second row = 2 * first row
>>> a @ pinv(a)
array([[0.2, 0.4],
       [0.4, 0.8]])
```

However, I have added a test that …. Also, the test now runs many times with different random states and different shapes: with …
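For contrast, a quick check (a sketch, not from the thread): when the matrix has full row rank, as a random projection matrix almost surely does, `a @ pinv(a)` is exactly the identity:

```python
>>> import numpy as np
>>> from scipy.linalg import pinv
>>> b = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]])  # full row rank
>>> np.allclose(b @ pinv(b), np.eye(2))
True
```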
My pleasure @jeremiedbb! I think there were two reasons: (1) as you pointed out, it saves memory (probably halving usage), and (2) someone (perhaps @ogrisel? I can't recall) told me that it may be useful in other places until scipy offers the same functionality, which I believe is on their roadmap.
… n_samples and n_features, clarify filter warning
Yes, the worst case is temporarily having two dense arrays the size of `components_` instead of one. After an IRL discussion with @ogrisel, we agreed that this is acceptable if it lets us avoid the burden of maintaining our own pseudo-inverse for sparse matrices.
If you don't have much time, I can push these changes directly if you want.
Hi @jeremiedbb,
I understand the feeling :)
LGTM
LGTM again, including the changes suggested and implemented by @jeremiedbb.
Sorry for the back-and-forth review @ageron :)
I just pushed a merge with main and a new commit to leverage the newly introduced `global_random_seed` fixture in the new test, to keep it deterministic while making sure it is not seed-sensitive.
Will merge when green.
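For illustration, a hedged sketch of how a test can consume that fixture (the test body and the `fit_inverse_transform` name follow this PR's description, not necessarily the merged code; the fixture itself is provided by sklearn's test suite under pytest):

```python
# Illustrative test sketch; global_random_seed is a pytest fixture from
# sklearn's conftest, and fit_inverse_transform is the parameter name
# used in this PR's description (it may differ in the merged API).
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

def test_inverse_transform_round_trip(global_random_seed):
    rng = np.random.RandomState(global_random_seed)
    X = rng.randn(100, 2000)
    rp = GaussianRandomProjection(n_components=50, fit_inverse_transform=True,
                                  random_state=global_random_seed)
    Z = rp.fit_transform(X)
    # Round trip in the projected space is exact for full-row-rank components.
    assert np.allclose(rp.transform(rp.inverse_transform(Z)), Z)
```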
Thanks @ageron!
Thanks @ogrisel, @jeremiedbb, and @glemaitre for the thorough review. I'm impressed by how professional and helpful you all are; it's no wonder Scikit-Learn is so good! 👍
…earn#21701)
Co-authored-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Reference Issues/PRs
Fixes #21687
What does this implement/fix? Explain your changes.
Adds a `fit_inverse_transform` parameter to both transformers in the `sklearn.random_projection` module: `GaussianRandomProjection` and `SparseRandomProjection`. When set to `True`, the pseudo-inverse of the components is computed during `fit()` and stored in `components_pinv_`, and `inverse_transform()` becomes available.
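A usage sketch (hedged: it follows the parameter names in this PR description, which may differ from the final merged API; data shapes are illustrative):

```python
# Illustrative usage based on this PR description's names
# (fit_inverse_transform, inverse_transform); shapes are arbitrary.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(42)
X = rng.randn(100, 10000)   # 100 samples, 10000 features

rp = GaussianRandomProjection(n_components=500, fit_inverse_transform=True,
                              random_state=42)
X_proj = rp.fit_transform(X)           # shape (100, 500)
X_rec = rp.inverse_transform(X_proj)   # shape (100, 10000), approximates X
```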
Any other comments?
Using the pseudo-inverse makes sense to me, and it seems to work fine, in the sense that `rnd_proj.transform(rnd_proj.inverse_transform(X))` equals `X` (see the check below). However, this implementation uses `scipy.linalg.pinv()`, which scales very poorly to large datasets, and large datasets are a major use case for random projections. Perhaps it would make sense to use another approach if we detect that the `components_` array is large? And perhaps there's a mathematical way to generate both a random matrix and its inverse more efficiently?
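That round-trip claim can be checked with plain numpy/scipy; a minimal sketch mirroring the transformer's matrix algebra (variable names are illustrative):

```python
# P plays the role of components_, P_pinv the role of its pseudo-inverse.
import numpy as np
from scipy.linalg import pinv

rng = np.random.RandomState(0)
P = rng.randn(20, 100)   # 20 projected dimensions, 100 original features
P_pinv = pinv(P)         # shape (100, 20)

Z = rng.randn(5, 20)     # points already in the projected space
X_rec = Z @ P_pinv.T     # mirrors inverse_transform
Z_back = X_rec @ P.T     # mirrors transform
print(np.allclose(Z_back, Z))  # True: P @ pinv(P) == I for full-row-rank P
```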
For the `SparseRandomProjection` transformer, computing the pseudo-inverse breaks sparsity. Perhaps there's a way to generate a sparse matrix that is "close enough" rather than using the exact pseudo-inverse?
In short: if there's a random projection expert in the room, please speak up!
That said, it seems to work fine now, so performance improvements could be pushed in follow-up PRs.