Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[MRG] Refactor feature_selection.f_regression and introduce feature_selection.r_regression #17169

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 51 commits into from
Apr 21, 2021

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented May 9, 2020

Reference Issues/PRs

Follow-up of #8353
Fixes #7195

What does this implement/fix? Explain your changes.

Refactors feature_selection.f_regression for convenience: feature_selection.f_regression now relies on feature_selection.r_regression, a public function which pre-computes Pearson's R

Other remarks

feature_selection.abs_r_regression, introduced in #8353, which returned the absolute values of feature_selection.r_regression was initially considered.

@jjerphan jjerphan changed the title R regression [WIP+1] r_regression and abs_r_regression added May 9, 2020
jjerphan added 4 commits May 9, 2020 16:15
I made this change to have the test ran as I previously got:

```
sklearn/feature_selection/tests/test_feature_select.py:None (sklearn/feature_selection/tests/test_feature_select.py)
/home/jsquared/.virtualenvs/sk/lib64/python3.8/site-packages/py/_path/local.py:701: in pyimport
    __import__(modname)
../__init__.py:27: in <module>
    from ._mutual_info import mutual_info_regression, mutual_info_classif
../_mutual_info.py:9: in <module>
    from ..neighbors import NearestNeighbors
../../neighbors/__init__.py:17: in <module>
    from ._nca import NeighborhoodComponentsAnalysis
../../neighbors/_nca.py:22: in <module>
    from ..decomposition import PCA
../../decomposition/__init__.py:17: in <module>
    from .dict_learning import dict_learning
E   ModuleNotFoundError: No module named
'sklearn.decomposition.dict_learning'
```

It seems that there are reason for this special handling
to exist according to the comment above.

This might need to be reverted.
@jjerphan
Copy link
Member Author

jjerphan commented May 9, 2020

In fact, it looks like this PR is just a really small re-factoring to have two new coefficients accessible.
@jnothman: regarding what you've said in the previous PR (#8353), is it still relevant to benchmark it?

@jjerphan
Copy link
Member Author

What do you think, @jnothman? :)

@jjerphan
Copy link
Member Author

Hi @jnothman, this is a friendly reminder for this stalled PR follow-up.
Do you think that it is still worth benchmarking it?

@cmarmo
Copy link
Contributor

cmarmo commented Nov 23, 2020

Hi @jjerphan thanks for your pull request. There are linting issues and a conflict that should be solved before review. Also if you think that this is ready for review, do you mind changing the title from [WIP] to [MRG]? Thanks a lot for your patience!

Base automatically changed from master to main January 22, 2021 10:52
@jjerphan jjerphan changed the title [WIP+1] r_regression and abs_r_regression added [MRG] Refactor f_regression and introduce r_regression and abs_r_regression Feb 9, 2021
@jjerphan
Copy link
Member Author

@cmarmo: are you interested in reviewing this PR? 🙂

@cmarmo
Copy link
Contributor

cmarmo commented Feb 15, 2021

Hi @jjerphan, thanks for your patience! The documentation for the two new functions is not generated: this is because a reference to the function should be added in the doc/modules/classes.rst file.
To bring more attention to your pull request I believe it could be a good idea to propose the requested benchmarks: scikit-learn contains a lot of lines of code, adding more features adds maintenance costs. Nice benchmarks provide more motivation to feature addition. Thanks for your collaboration.

jjerphan and others added 9 commits April 12, 2021 08:30
Use accurate descriptions for the parameters.

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@thomasjpfan
Copy link
Member

With this PR, r_regression and abs_r_regression would be the only functions in sklearn.feature_selection that does not return p-values. Given the computational cost of p-values is not high: #17169 (comment), I see the benefit of this PR is the convenience method for absolute value, which is is shorthand for:

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
import numpy as np

X, y = make_regression(random_state=0)

f_abs_regression = lambda *args: np.abs(f_regression(*args)[0])
X_new = SelectKBest(f_abs_regression, k=10).fit_transform(X, y)

If we want to make abs easier to use, we can add a parameter to SelectKBest:

X_new = SelectKBest(f_regression, k=10, transform_func=np.abs).fit_transform(X, y)

I am wondering if adding 2 new functions worth the maintenance cost for making SelectKBest easier to use.

Also check against numpy's implementation of the
correlation coefficient.

Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@jjerphan
Copy link
Member Author

Thanks for the comprehensive review, @glemaitre. I should have adressed all your requests.

IMO, the value of this PR is the introduction of r_regression, and I see no problem to remove the originally proposed abs_r_regression.

I would go for @thomasjpfan's first suggestion, hence assuming that it is up to users to define the score_func they want SelectKBest to be used with.

@thomasjpfan
Copy link
Member

r_regression is f_regression without the p value computation. For consistency, are we add going to add anova_classif which is f_classif without the p value computation?

There is also the option of adding a return_p_value to f_regression, which gives the option to compute the p value.

@jjerphan
Copy link
Member Author

r_regression is f_regression without the p value computation.

f_regression relies on r_regression but return different statistics.

Still, it would make sense allowing a way to specify the computation of the p-values as you propose.
I don't think that I am the best to know what the most suitable choice here is.

@glemaitre
Copy link
Member

I think that we could drop abs_f_regression. Then, I still think that we need r_regression and f_regression as proposed. Additionally, it might be nice to add a keyword return_p_values (True by default to be back-compatible) that allows bypassing the computation if p-values are not needed.

@glemaitre
Copy link
Member

So we can limit the scope for this PR to introduce r_regression only.

@jjerphan jjerphan changed the title [MRG] Refactor f_regression and introduce r_regression and abs_r_regression [MRG] Refactor feature_selection.f_regression and introduce feature_selection.r_regression Apr 16, 2021
Copy link
Member

@glemaitre glemaitre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few small improvement suggestions:

sklearn/feature_selection/_univariate_selection.py Outdated Show resolved Hide resolved
sklearn/feature_selection/_univariate_selection.py Outdated Show resolved Hide resolved
jjerphan and others added 2 commits April 20, 2021 18:54
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@ogrisel ogrisel merged commit d5ebdca into scikit-learn:main Apr 21, 2021
@ogrisel
Copy link
Member

ogrisel commented Apr 21, 2021

Merged! Thanks @jjerphan.

@jjerphan
Copy link
Member Author

'welcome!

Thanks to @DSLituiev for the original PR!

@jjerphan jjerphan deleted the r_regression branch April 21, 2021 09:20
@glemaitre glemaitre mentioned this pull request Apr 22, 2021
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[refactor] f_regression into r_regression and a wrapper for performance
6 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.