ENH add x and y to importance getter rfe #21935
Conversation
@ClaudioSalvatoreArcidiacono I think this is a great idea and will make RFE much more useful. How is the PR going? FWIW, this is a way to work around the limitation until this PR is merged (as you can see, it is not very pretty).
Thank you for the PR!
sklearn/feature_selection/_base.py
Outdated
```python
importances = getter(estimator)

elif callable(getter):
    if len(signature(getter).parameters) == 3 and X is not None and y is not None:
```
If the callable accepts 3 parameters, what do you think of passing in `X` and `y` directly, without checking `X` and `y`?
I think that it makes sense. Thanks for noticing it!
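For illustration, a minimal sketch of that simplified dispatch (the helper name `_call_importance_getter` is hypothetical, not from the PR):

```python
from inspect import signature

def _call_importance_getter(getter, estimator, X, y):
    # If the callable declares three parameters, forward X and y
    # unconditionally; otherwise fall back to the one-argument form.
    if len(signature(getter).parameters) == 3:
        return getter(estimator, X, y)
    return getter(estimator)
```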
doc/whats_new/v1.1.rst
Outdated
```rst
:mod:`sklearn.feature_selection`
................................

- |Enhancement| the initialization parameter `importance_getter` of
```
Now that 1.1 is released can you move this changelog to 1.2?
sklearn/feature_selection/_rfe.py
Outdated
```python
@@ -287,6 +288,8 @@ def _fit(self, X, y, step_score=None, **fit_params):
                estimator,
                self.importance_getter,
                transform_func="square",
                X=X[:, features],
```
If `X` is a sparse matrix, then `X[:, features]` will make a copy. Can you store `X[:, features]` in a variable before the `estimator.fit` call a few lines above and use it in both places?
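A sketch of the suggested refactor, with the surrounding `RFE._fit` lines assumed from the diff above (variable name `X_subset` is hypothetical):

```python
# Slice once and reuse: for sparse matrices, X[:, features]
# allocates a copy each time it is evaluated.
X_subset = X[:, features]
estimator.fit(X_subset, y, **fit_params)
importances = _get_feature_importances(
    estimator,
    self.importance_getter,
    transform_func="square",
    X=X_subset,
    y=y,
)
```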
```
The callable is passed with the fitted estimator and optionally the
training input samples `X` and the target labels `y`. The callable
```
I think supporting callables with two signatures leads to a more confusing API.
What do you think about deprecating the old signature that only accepts the estimator?
Hey @thomasjpfan, thanks for reviewing my PR!
I agree that accepting two signatures makes the API more complex. On the other hand, deprecating the old signature would lead to a breaking change. In my opinion, adding a little more complexity is better than introducing a breaking change. What is your view on it?
Deprecating means that the 1-parameter callable still works for the next two releases. This gives users two releases to update their code to the 3-parameter callable.
Concretely, let's say we release the 3-parameter callable in v1.2; then a warning is shown stating that in v1.4 we will pass `X` and `y` into the callable.
I prefer to deprecate the 1-parameter callable and move to the new 3-parameter callable.
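As an illustrative sketch of that timeline (helper name, warning text, and version numbers assumed, not taken from the PR):

```python
import warnings
from inspect import signature

def _get_importances(getter, estimator, X, y):
    # Deprecated path: one-parameter callables keep working but warn.
    if len(signature(getter).parameters) == 1:
        warnings.warn(
            "An `importance_getter` callable accepting only the estimator "
            "is deprecated; from version 1.4 on, `X` and `y` will also be "
            "passed to the callable.",
            FutureWarning,
        )
        return getter(estimator)
    # New path: three-parameter callables receive the training data.
    return getter(estimator, X, y)
```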
Alright, now I understand what you meant. In that case I think it makes sense. I will add the deprecation warning and refactor the tests to take care of that. Stay tuned!
Hey @mattiasAngqvist, I was waiting for someone to review it; now that @thomasjpfan has started to review the PR, I will continue working on it. Regarding your proposed temporary workaround, you should keep a reference to the training data inside a closure:

```python
from lightgbm import LGBMRegressor
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance

def make_custom_importance(df_train):
    # LightGBM names features Column_0, Column_1, ... internally;
    # map them back to the original DataFrame columns.
    feature_map = {f'Column_{i}': f for i, f in enumerate(df_train)}

    def custom_importance(model):
        features = model.feature_name_
        new_features = [feature_map[f] for f in features]
        r = permutation_importance(model,
                                   df_train[new_features],
                                   df_train['y'],
                                   n_repeats=1,
                                   random_state=0)
        return r.importances_mean

    return custom_importance

rfe = RFE(estimator=LGBMRegressor(),
          n_features_to_select=20,
          step=50,
          importance_getter=make_custom_importance(df_train))
selector = rfe.fit(X=df_train, y=df_train['y'])
```

But I agree that it looks a bit clunky, since you need to pass `df_train` twice. Moreover, this does not play well in situations where you want to use RFE within a cross-validation pipeline.
Store X[:, features] in a variable to prevent creating a copy twice
…importance_getter_rfe
Thanks for the update. Now that I see the implementation for deprecating, I am leaning toward going back to your original idea of supporting 1 and 3 parameters.
```python
if len(signature(getter).parameters) == 3:
    importances = getter(estimator, X, y)
else:
    importances = getter(estimator)
```
Unfortunately, `_get_feature_importances` is shared with `SelectFromModel`, which only uses one parameter. `SelectFromModel` can accept a prefitted estimator and use that for feature selection. In those cases, `SelectFromModel` never saw the data, which means it cannot use the three-parameter callable.
It feels like an inconsistent API for `SelectFromModel` to only accept a callable with 1 parameter and `RFE` to only accept a callable with 3 parameters. Also, there is some added complexity to `_get_feature_importances` if we want to restrict the number of parameters.
With that in mind, I am leaning toward your original idea of supporting both 1 and 3 parameters in `RFE`.
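For context, a minimal sketch of the prefit scenario described above (dataset and estimator chosen only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_features=10, random_state=0)
model = Lasso(alpha=0.1).fit(X, y)  # fitted outside of SelectFromModel

# With prefit=True, transform() works without SelectFromModel ever
# calling fit() or seeing the training data, so a three-parameter
# importance_getter would have no X/y to receive.
selector = SelectFromModel(model, prefit=True)
X_selected = selector.transform(X)
```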
Sounds good to me.
I have reverted the commits where I deprecated accepting a callable with 1 parameter.
What do you think about extending the documentation for the `importance_getter` argument, as it is in this PR, to `SelectFromModel` as well? Another option might be to deprecate accepting a callable with 1 parameter there as well.
As noted in https://github.com/scikit-learn/scikit-learn/pull/21935/files/5c81eef0390b0b96b6e0ca74256430d0a9e5a88a..4e74568e8912082647a302f548d541c821718c99#r940622865, the way `SelectFromModel` can accept prefitted estimators means that it will always need to support a callable with one parameter. Specifically, if `SelectFromModel` is configured with a prefitted model, then `SelectFromModel.transform` works without calling `fit` and seeing the training data.
For this PR, I prefer not to expand the scope to `SelectFromModel`. Usually, expanding scope makes a PR harder to merge.
Makes sense. In that case, I think the PR should be ready to be merged.
I also thought about adding an example of using a custom importance getter in the recursive feature elimination section, but I will open another PR about it once this one has been merged.
@thomasjpfan Would it make sense to potentially wrap the callable `importance_getter` passed to `SelectFromModel` internally so that it can match the 3-argument signature forced by the internal API? The user experience wouldn't change, and it would allow for a consistent internal API.
If we wanted to (in a future PR), we could even explore the path of allowing 3-argument callables for `SelectFromModel` when working with yet-unfit estimators.
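A minimal sketch of that wrapping idea (hypothetical helper, not part of this PR):

```python
from inspect import signature

def _adapt_importance_getter(getter):
    """Wrap a 1-parameter callable so the internal API can always
    call it as getter(estimator, X, y)."""
    if len(signature(getter).parameters) == 1:
        # Ignore X and y for legacy one-parameter callables.
        return lambda estimator, X, y: getter(estimator)
    return getter
```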
Hey there @ClaudioSalvatoreArcidiacono, thanks for the PR! I just had a couple of small suggestions, mostly focused on wording.
Also, could you remove the cosmetic changes made to the import statements? We'll probably improve them (e.g. via `isort`) separately, but keeping them as they are now helps preserve `git blame` history.
sklearn/feature_selection/_base.py
Outdated
```
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    The training input samples.

y : array-like of shape (n_samples,)
    The target values.
```
Let's add documentation of their default values:
```diff
-X : {array-like, sparse matrix} of shape (n_samples, n_features)
-    The training input samples.
-y : array-like of shape (n_samples,)
-    The target values.
+X : {array-like, sparse matrix} of shape (n_samples, n_features), default=None
+    The training input samples.
+y : array-like of shape (n_samples,), default=None
+    The target values.
```
sklearn/feature_selection/_rfe.py
Outdated
```
Added support for custom importance getter with estimator, training input
samples `X` and the target labels `y`.
```
I feel like we could improve wording here. What do you think? (May need to format w/ black)
```diff
-Added support for custom importance getter with estimator, training input
-samples `X` and the target labels `y`.
+Added support for a callable `importance_getter` which accepts estimator, training input
+samples `X` and the target labels `y` as arguments.
```
sklearn/feature_selection/_rfe.py
Outdated
```
Added support for custom importance getter with estimator, training input
samples `X` and the target labels `y`.
```
Same comment as above:
```diff
-Added support for custom importance getter with estimator, training input
-samples `X` and the target labels `y`.
+Added support for a callable `importance_getter` which accepts estimator, training input
+samples `X` and the target labels `y` as arguments.
```
Hey @Micky774, thanks a lot for reviewing my PR and for leaving your suggestions! I have updated the PR accordingly.
sklearn/feature_selection/_base.py
Outdated
```python
else:
    raise ValueError("`importance_getter` has to be a string or `callable`")
```
Hi @ClaudioSalvatoreArcidiacono, codecov is complaining that this ValueError is not checked in the tests.
Do you mind having a look? Thanks!
Hi @cmarmo! Thanks for your comment. I have removed the line not covered by tests; for more details, check my comment below.
…importance_getter_rfe
This is already being checked in BaseEstimator._validate_params
```python
@@ -214,10 +223,14 @@ def _get_feature_importances(estimator, getter, transform_func=None, norm_order=
            )
        else:
            getter = attrgetter(getter)
    elif not callable(getter):
```
This check has been removed because, after `self._validate_params()`, `importance_getter` is either a string or a callable. Furthermore, I have added a test case here to verify that a ValueError is raised when `RFE.fit` is called with an importance getter that is not a callable or a string.
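For reference, such a test could look roughly like this (test name and estimator assumed, not copied from the PR):

```python
import pytest
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def test_rfe_importance_getter_validation():
    X, y = make_classification(random_state=0)
    # Neither a string nor a callable: parameter validation should raise.
    rfe = RFE(LogisticRegression(), importance_getter=42)
    with pytest.raises(ValueError):
        rfe.fit(X, y)
```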
…importance_getter_rfe
Reference Issues/PRs
Fixes #21934
What does this implement/fix? Explain your changes.
Adds the possibility of also passing training instances and labels to `importance_getter` in Recursive Feature Elimination.
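For illustration, assuming the behavior implemented in this PR, a 3-parameter callable based on `permutation_importance` could then be passed directly, without the closure workaround shown earlier:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

def importance_getter(estimator, X, y):
    # With this PR, RFE passes the training data of the current
    # elimination round to a three-parameter callable.
    result = permutation_importance(estimator, X, y, n_repeats=2,
                                    random_state=0)
    return result.importances_mean

X, y = make_classification(n_features=20, random_state=0)
rfe = RFE(LogisticRegression(), n_features_to_select=5,
          importance_getter=importance_getter)
rfe.fit(X, y)
```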