ENH add x and y to importance getter rfe #21935

Open · wants to merge 28 commits into base: main
Conversation

ClaudioSalvatoreArcidiacono (Contributor)

Reference Issues/PRs

Fixes #21934

What does this implement/fix? Explain your changes.

Adds the possibility of also passing the training instances `X` and labels `y` to the `importance_getter` callable in Recursive Feature Elimination.

@mattiasAngqvist

@ClaudioSalvatoreArcidiacono I think this is a great idea and will make RFE much more useful. How is the PR going?

FWIW, this is a way to work around it until this PR is merged (as you can see, it is not very pretty):

from lightgbm import LGBMRegressor
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance

feature_map = {f'Column_{i}': f for i, f in enumerate(df_train)}

def custom_importance(model):
    features = model.feature_name_
    new_features = [feature_map[f] for f in features]
    r = permutation_importance(model,
                               df_train[new_features],
                               df_train['y'],
                               n_repeats=1,
                               random_state=0)
    return r.importances_mean  # note: the attribute is `importances_mean`

rfe = RFE(estimator=LGBMRegressor(),
          n_features_to_select=20,
          step=50,
          importance_getter=custom_importance)
selector = rfe.fit(X=df_train, y=df_train['y'])

thomasjpfan (Member) left a comment:

Thank you for the PR!

importances = getter(estimator)

elif callable(getter):
    if len(signature(getter).parameters) == 3 and X is not None and y is not None:
Member:

If the callable accepts 3 parameters, what do you think of passing in X and y directly without checking X and y?
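This suggestion can be sketched in a few lines (illustrative only, not scikit-learn's actual implementation): inspect the callable's signature and pass `X` and `y` whenever it declares three parameters, without first checking that they are not `None`.

```python
from inspect import signature

def call_importance_getter(getter, estimator, X=None, y=None):
    # Dispatch on the number of declared parameters: three-parameter
    # callables receive X and y, one-parameter callables do not.
    if len(signature(getter).parameters) == 3:
        return getter(estimator, X, y)
    return getter(estimator)

# Hypothetical getters, for illustration only:
one_arg = lambda est: [0.5, 0.5]
three_arg = lambda est, X, y: [float(len(X))] * 2

print(call_importance_getter(one_arg, None))                          # [0.5, 0.5]
print(call_importance_getter(three_arg, None, [1, 2, 3], [0, 1, 0]))  # [3.0, 3.0]
```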

Contributor Author:

I think that it makes sense. Thanks for noticing it!

:mod:`sklearn.feature_selection`
................................

- |Enhancement| the initialization parameter `importance_getter` of
Member:

Now that 1.1 is released can you move this changelog to 1.2?

@@ -287,6 +288,8 @@ def _fit(self, X, y, step_score=None, **fit_params):
estimator,
self.importance_getter,
transform_func="square",
X=X[:, features],
Member:

If X is a sparse matrix, then `X[:, features]` will make a copy. Can you store `X[:, features]` in a variable before the `estimator.fit` call a few lines above and use it in both places?
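A small sketch of the reviewer's point (names are illustrative): with NumPy fancy indexing, and likewise with SciPy sparse column selection, each `X[:, features]` allocates a fresh copy, so slicing once into a variable and reusing it halves the copying.

```python
import numpy as np

X = np.arange(12).reshape(3, 4)
features = np.array([0, 2])

# Slice once and reuse the result; computing X[:, features] twice would
# allocate two copies for sparse (or fancy-indexed) input.
X_sub = X[:, features]
# estimator.fit(X_sub, y)                         # first use
# _get_feature_importances(..., X=X_sub, y=y)     # second use, no new copy
print(X_sub.tolist())  # [[0, 2], [4, 6], [8, 10]]
```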

Comment on lines +518 to +519
The callable is passed with the fitted estimator and optionally the
training input samples `X` and the target labels `y`. The callable
Member:

I think supporting callables with two signatures leads to a more confusing API.

What do you think about deprecating the old signature that only accepts the estimator?

Contributor Author:

Hey @thomasjpfan, thanks for reviewing my PR!

I agree that accepting two signatures makes the API more complex. On the other hand, deprecating the old signature would be a breaking change. In my opinion, adding a little more complexity is better than introducing a breaking change. What is your view on it?

thomasjpfan (Member) · Aug 4, 2022:

Deprecating means that the 1-parameter callable still works for the next two releases. This gives users two releases to update their code to the 3-parameter callable.

Concretely, let's say we release the 3-parameter callable in v1.2; then a warning is shown stating that in v1.4 we will pass X and y into the callable.

I prefer to deprecate the 1-parameter callable and move to the new 3-parameter callable.
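The deprecation path described here could look roughly like this (a hedged sketch; the helper name and warning text are invented for illustration):

```python
import warnings
from inspect import signature

def _resolve_getter(getter, estimator, X, y):
    # New-style callables receive (estimator, X, y); old-style one-parameter
    # callables still work but trigger a FutureWarning during the transition.
    if len(signature(getter).parameters) == 3:
        return getter(estimator, X, y)
    warnings.warn(
        "importance_getter callables accepting only the estimator are "
        "deprecated; accept (estimator, X, y) instead.",
        FutureWarning,
    )
    return getter(estimator)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = _resolve_getter(lambda est: [1.0], object(), None, None)
print(result, len(caught))  # [1.0] 1
```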

Contributor Author:

Alright, now I understand what you meant. In that case, I think it makes sense. I will add the deprecation warning and refactor the tests to take care of that. Stay tuned!

@ClaudioSalvatoreArcidiacono (Contributor Author)

Hey @mattiasAngqvist, I was waiting for someone to review it; now that @thomasjpfan has started reviewing the PR, I will continue working on it.

Regarding your proposed temporary workaround, you should keep a reference to df_train in your custom_importance function. You could use a factory function (a closure) for that:

def make_custom_importance(df_train):
    feature_map = {f'Column_{i}': f for i, f in enumerate(df_train)}

    def custom_importance(model):
        features = model.feature_name_
        new_features = [feature_map[f] for f in features]
        r = permutation_importance(model,
                                   df_train[new_features],
                                   df_train['y'],
                                   n_repeats=1,
                                   random_state=0)
        return r.importances_mean  # note: `importances_mean`, not `importance_mean`

    return custom_importance


rfe = RFE(estimator=LGBMRegressor(),
          n_features_to_select=20,
          step=50,
          importance_getter=make_custom_importance(df_train))
selector = rfe.fit(X=df_train, y=df_train['y'])

But I agree that it looks a bit clunky, since you need to pass df_train twice. Moreover, this does not play well in situations where you want to use RFE within a cross-validation pipeline.

thomasjpfan (Member) left a comment:

Thanks for the update. Now that I see the implementation for deprecating, I am leaning toward going back to your original idea of supporting 1 and 3 parameters.

if len(signature(getter).parameters) == 3:
    importances = getter(estimator, X, y)
else:
    importances = getter(estimator)
Member:

Unfortunately, _get_feature_importances is shared with SelectFromModel, which only uses one parameter. SelectFromModel can accept a prefitted estimator and use that for feature selection. In those cases, SelectFromModel never saw the data, which means it cannot use the three-parameter callable.

It feels like an inconsistent API for SelectFromModel to only accept a callable with 1 parameter while RFE only accepts a callable with 3 parameters. Also, there is some added complexity in _get_feature_importances if we want to restrict the number of parameters.

With that in mind, I am leaning toward your original idea of supporting both 1 and 3 parameters in RFE.

Contributor Author:

Sounds good to me.

I have reverted the commits where I deprecated accepting a callable with 1 parameter.

What do you think about extending the documentation for the argument importance_getter as it is in this PR to SelectFromModel as well? Another option might be to deprecate accepting a callable with 1 parameter there as well.

Member:

As noted in https://github.com/scikit-learn/scikit-learn/pull/21935/files/5c81eef0390b0b96b6e0ca74256430d0a9e5a88a..4e74568e8912082647a302f548d541c821718c99#r940622865, the way SelectFromModel can accept prefitted estimators means that it will always need to support a callable with one parameter. Specifically, if SelectFromModel is configured with a prefitted model, then SelectFromModel.transform works without calling fit and seeing the training data.

For this PR, I prefer not to expand scope to SelectFromModel. Usually expanding scope makes a PR harder to merge.

Contributor Author:

Makes sense. In that case, I think the PR should be ready to be merged.

I also thought about adding an example about using a custom importance getter in the recursive feature elimination section, but I will open another PR about it once this one has been merged.


claudio.arcidiacono added 3 commits August 10, 2022 13:42
ClaudioSalvatoreArcidiacono changed the title from "21934 add x and y to importance getter rfe" to "ENH add x and y to importance getter rfe" on Aug 13, 2022
Micky774 (Contributor) left a comment:

Hey there @ClaudioSalvatoreArcidiacono, thanks for the PR! Just had a couple small suggestions, mostly focused on wording.

Also, could you remove the cosmetic changes made to the import statements? We'll probably improve them (e.g. via isort) separately, but keeping them as they are for now helps preserve the git blame history.

Comment on lines 184 to 188
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    The training input samples.

y : array-like of shape (n_samples,)
    The target values.
Contributor:

Let's add documentation of their default values:

Suggested change:

    X : {array-like, sparse matrix} of shape (n_samples, n_features), default=None
        The training input samples.

    y : array-like of shape (n_samples,), default=None
        The target values.

Comment on lines 113 to 114
Added support for custom importance getter with estimator, training input
samples `X` and the target labels `y`.
Contributor:

I feel like we could improve wording here. What do you think? (May need to format w/ black)

Suggested change:

    Added support for a callable `importance_getter` which accepts estimator, training input
    samples `X` and the target labels `y` as arguments.

Comment on lines 538 to 539
Added support for custom importance getter with estimator, training input
samples `X` and the target labels `y`.
Contributor:

Same comment as above:

Suggested change:

    Added support for a callable `importance_getter` which accepts estimator, training input
    samples `X` and the target labels `y` as arguments.

if len(signature(getter).parameters) == 3:
    importances = getter(estimator, X, y)
else:
    importances = getter(estimator)
Contributor:

@thomasjpfan
Would it make sense to potentially wrap the callable importance_getter passed to SelectFromModel internally so that it can match the 3-argument signature forced by the internal API? The user experience wouldn't change and it would allow for a consistent internal API.

If we wanted to (in a future PR) we could even explore the path of allowing 3-argument callables for SelectFromModel when working with yet-unfit estimators.
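The wrapping idea floated here could be sketched as follows (the helper name is hypothetical and not part of this PR): normalize any user callable to the three-argument internal signature, with one-argument callables simply ignoring `X` and `y`.

```python
from inspect import signature

def adapt_importance_getter(getter):
    # Pass three-parameter callables through unchanged; wrap one-parameter
    # callables so they match the (estimator, X, y) internal signature.
    if len(signature(getter).parameters) == 3:
        return getter

    def wrapped(estimator, X, y):  # X and y are accepted but ignored
        return getter(estimator)

    return wrapped

g = adapt_importance_getter(lambda est: est * 2)
print(g(21, None, None))  # 42
```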

@ClaudioSalvatoreArcidiacono (Contributor Author)

Hey @Micky774, thanks a lot for reviewing my PR and for leaving your suggestions!

I have updated the PR accordingly.

Comment on lines 234 to 235
else:
    raise ValueError("`importance_getter` has to be a string or `callable`")
Contributor:

Hi @ClaudioSalvatoreArcidiacono, codecov is complaining that this ValueError is not checked in the tests.
Do you mind having a look? Thanks!

Contributor Author:

Hi @cmarmo! thanks for your comment. I have removed the line not covered by tests, for more details check my comment below.

claudio.arcidiacono added 3 commits October 9, 2022 16:49
@@ -214,10 +223,14 @@ def _get_feature_importances(estimator, getter, transform_func=None, norm_order=
)
else:
getter = attrgetter(getter)
elif not callable(getter):
Contributor Author:

This check has been removed because `self._validate_params()` already checks whether `importance_getter` is either a string or a callable. Furthermore, I have added a test case here to verify that a ValueError is raised when `RFE.fit` is called with an importance getter that is not a callable or a string.


Successfully merging this pull request may close these issues.

Optionally passing X and y to importance_getter callable in Recursive Feature Elimination
6 participants