ENH add x and y to importance getter rfe #21935
Conversation
@ClaudioSalvatoreArcidiacono I think this is a great idea and will make RFE much more useful. How is the PR going? FWIW, this is a way to work around the limitation until this PR is merged (as you can see, it is not very pretty).
Thank you for the PR!
sklearn/feature_selection/_base.py
Outdated
```python
importances = getter(estimator)

elif callable(getter):
    if len(signature(getter).parameters) == 3 and X is not None and y is not None:
```
If the callable accepts 3 parameters, what do you think of passing in `X` and `y` directly, without checking `X` and `y`?
I think that it makes sense. Thanks for noticing it!
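For illustration, a minimal sketch of that simplified dispatch (the helper name `_call_importance_getter` is hypothetical, not from the PR):

```python
from inspect import signature

def _call_importance_getter(getter, estimator, X, y):
    # If the callable declares three parameters, forward X and y
    # unconditionally; otherwise fall back to the one-argument form.
    if len(signature(getter).parameters) == 3:
        return getter(estimator, X, y)
    return getter(estimator)
```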
doc/whats_new/v1.1.rst
Outdated
```rst
:mod:`sklearn.feature_selection`
................................

- |Enhancement| the initialization parameter `importance_getter` of
```
Now that 1.1 is released can you move this changelog to 1.2?
sklearn/feature_selection/_rfe.py
Outdated
```python
@@ -287,6 +288,8 @@ def _fit(self, X, y, step_score=None, **fit_params):
                estimator,
                self.importance_getter,
                transform_func="square",
                X=X[:, features],
```
If `X` is a sparse matrix, then `X[:, features]` will make a copy. Can you store `X[:, features]` in a variable before the `estimator.fit` call a few lines above and use it in both places?
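A sketch of the suggested refactor, with the surrounding `RFE._fit` lines assumed from the diff above (variable name `X_subset` is hypothetical):

```python
# Slice once and reuse: for sparse matrices, X[:, features]
# allocates a copy each time it is evaluated.
X_subset = X[:, features]
estimator.fit(X_subset, y, **fit_params)
importances = _get_feature_importances(
    estimator,
    self.importance_getter,
    transform_func="square",
    X=X_subset,
    y=y,
)
```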
```
The callable is passed with the fitted estimator and optionally the
training input samples `X` and the target labels `y`. The callable
```
I think supporting callables with two signatures leads to a more confusing API.
What do you think about deprecating the old signature that only accepts the estimator?
Hey @thomasjpfan, thanks for reviewing my PR!
I agree that accepting two signatures makes the API more complex. On the other hand, deprecating the old signature would lead to a breaking change. In my opinion, adding a little more complexity is better than introducing a breaking change. What is your view on it?
Deprecating means that the 1-parameter callable still works for the next two releases. This gives users two releases to update their code to the 3-parameter callable.
Concretely, let's say we release the 3-parameter callable in v1.2; then a warning is shown stating that in v1.4 we will pass `X` and `y` into the callable.
I prefer to deprecate the 1-parameter callable and move to the new 3-parameter callable.
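As an illustrative sketch of that timeline (helper name, warning text, and version numbers assumed, not taken from the PR):

```python
import warnings
from inspect import signature

def _get_importances(getter, estimator, X, y):
    # Deprecated path: one-parameter callables keep working but warn.
    if len(signature(getter).parameters) == 1:
        warnings.warn(
            "An `importance_getter` callable accepting only the estimator "
            "is deprecated; from version 1.4 on, `X` and `y` will also be "
            "passed to the callable.",
            FutureWarning,
        )
        return getter(estimator)
    # New path: three-parameter callables receive the training data.
    return getter(estimator, X, y)
```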
Alright, now I understand what you meant. In that case I think it makes sense. I will add the deprecation warning and refactor the tests to take care of that. Stay tuned!
Hey @mattiasAngqvist, I was waiting for someone to review it; now that @thomasjpfan has started to review the PR, I will continue working on it. Regarding your proposed temporary workaround, you should keep a reference to the training data inside a closure:

```python
from lightgbm import LGBMRegressor
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance

def make_custom_importance(df_train):
    # LightGBM names features Column_0, Column_1, ... internally;
    # map them back to the original DataFrame columns.
    feature_map = {f'Column_{i}': f for i, f in enumerate(df_train)}

    def custom_importance(model):
        features = model.feature_name_
        new_features = [feature_map[f] for f in features]
        r = permutation_importance(model,
                                   df_train[new_features],
                                   df_train['y'],
                                   n_repeats=1,
                                   random_state=0)
        return r.importances_mean

    return custom_importance

rfe = RFE(estimator=LGBMRegressor(),
          n_features_to_select=20,
          step=50,
          importance_getter=make_custom_importance(df_train))
selector = rfe.fit(X=df_train, y=df_train['y'])
```

But I agree that it looks a bit clunky, since you need to pass `df_train` twice. Moreover, this does not play well in situations where you want to use RFE within a cross-validation pipeline.
Store X[:, features] in a variable to prevent creating a copy twice
…importance_getter_rfe
Thanks for the update. Now that I see the implementation for deprecating, I am leaning toward going back to your original idea of supporting 1 and 3 parameters.
```python
if len(signature(getter).parameters) == 3:
    importances = getter(estimator, X, y)
else:
    importances = getter(estimator)
```
Unfortunately, `_get_feature_importances` is shared with `SelectFromModel`, which only uses one parameter. `SelectFromModel` can accept a prefitted estimator and use that for feature selection. In those cases, `SelectFromModel` never saw the data, which means it cannot use the three-parameter callable.
It feels like an inconsistent API for `SelectFromModel` to only accept a callable with 1 parameter and `RFE` to only accept a callable with 3 parameters. Also, there is some added complexity to `_get_feature_importances` if we want to restrict the number of parameters.
With that in mind, I am leaning toward your original idea of supporting both 1 and 3 parameters in `RFE`.
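For context, a minimal sketch of the prefit scenario described above (dataset and estimator chosen only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_features=10, random_state=0)
model = Lasso(alpha=0.1).fit(X, y)  # fitted outside of SelectFromModel

# With prefit=True, transform() works without SelectFromModel ever
# calling fit() or seeing the training data, so a three-parameter
# importance_getter would have no X/y to receive.
selector = SelectFromModel(model, prefit=True)
X_selected = selector.transform(X)
```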
Sounds good to me.
I have reverted the commits where I deprecated accepting a callable with 1 parameter.
What do you think about extending the documentation for the `importance_getter` argument, as it is in this PR, to `SelectFromModel` as well? Another option might be to deprecate accepting a callable with 1 parameter there as well.
As noted in https://github.com/scikit-learn/scikit-learn/pull/21935/files/5c81eef0390b0b96b6e0ca74256430d0a9e5a88a..4e74568e8912082647a302f548d541c821718c99#r940622865, the way `SelectFromModel` can accept prefitted estimators means that it will always need to support a callable with one parameter. Specifically, if `SelectFromModel` is configured with a prefitted model, then `SelectFromModel.transform` works without calling `fit` and seeing the training data.
For this PR, I prefer not to expand the scope to `SelectFromModel`. Usually, expanding scope makes a PR harder to merge.
Makes sense. In that case, I think the PR should be ready to be merged.
I also thought about adding an example of using a custom importance getter in the recursive feature elimination section, but I will open another PR about it once this one has been merged.
@thomasjpfan Would it make sense to potentially wrap the callable `importance_getter` passed to `SelectFromModel` internally so that it can match the 3-argument signature forced by the internal API? The user experience wouldn't change, and it would allow for a consistent internal API.
If we wanted to (in a future PR), we could even explore the path of allowing 3-argument callables for `SelectFromModel` when working with yet-unfit estimators.
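A minimal sketch of that wrapping idea (hypothetical helper, not part of this PR):

```python
from inspect import signature

def _adapt_importance_getter(getter):
    """Wrap a 1-parameter callable so the internal API can always
    call it as getter(estimator, X, y)."""
    if len(signature(getter).parameters) == 1:
        # Ignore X and y for legacy one-parameter callables.
        return lambda estimator, X, y: getter(estimator)
    return getter
```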
Hey there @ClaudioSalvatoreArcidiacono, thanks for the PR! I just had a couple of small suggestions, mostly focused on wording.
Also, could you remove the cosmetic changes made to the import statements? We'll probably improve them (e.g. via `isort`) separately, but keeping them as they are now helps preserve `git blame` history.
sklearn/feature_selection/_base.py
Outdated
```
X : {array-like, sparse matrix} of shape (n_samples, n_features)
    The training input samples.

y : array-like of shape (n_samples,)
    The target values.
```
Let's add documentation of their default values:
```diff
-X : {array-like, sparse matrix} of shape (n_samples, n_features)
-    The training input samples.
-y : array-like of shape (n_samples,)
-    The target values.
+X : {array-like, sparse matrix} of shape (n_samples, n_features), default=None
+    The training input samples.
+y : array-like of shape (n_samples,), default=None
+    The target values.
```
sklearn/feature_selection/_rfe.py
Outdated
```
Added support for custom importance getter with estimator, training input
samples `X` and the target labels `y`.
```
I feel like we could improve wording here. What do you think? (May need to format w/ black)
```diff
-Added support for custom importance getter with estimator, training input
-samples `X` and the target labels `y`.
+Added support for a callable `importance_getter` which accepts estimator, training input
+samples `X` and the target labels `y` as arguments.
```
sklearn/feature_selection/_rfe.py
Outdated
```
Added support for custom importance getter with estimator, training input
samples `X` and the target labels `y`.
```
Same comment as above:
```diff
-Added support for custom importance getter with estimator, training input
-samples `X` and the target labels `y`.
+Added support for a callable `importance_getter` which accepts estimator, training input
+samples `X` and the target labels `y` as arguments.
```
Hey @Micky774, thanks a lot for reviewing my PR and for leaving your suggestions! I have updated the PR accordingly.
sklearn/feature_selection/_base.py
Outdated
```python
else:
    raise ValueError("`importance_getter` has to be a string or `callable`")
```
Hi @ClaudioSalvatoreArcidiacono, codecov is complaining that this ValueError is not checked in the tests.
Do you mind having a look? Thanks!
Hi @cmarmo! Thanks for your comment. I have removed the line not covered by tests; for more details, check my comment below.
…importance_getter_rfe
This is already being checked in BaseEstimator._validate_params
```python
@@ -214,10 +223,14 @@ def _get_feature_importances(estimator, getter, transform_func=None, norm_order=
            )
        else:
            getter = attrgetter(getter)
    elif not callable(getter):
```
This check has been removed because, after `self._validate_params()`, `importance_getter` is either a string or a callable. Furthermore, I have added a test case here to verify that a ValueError is raised when `RFE.fit` is called with an importance getter that is not a callable or a string.
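For reference, such a test could look roughly like this (test name and estimator assumed, not copied from the PR):

```python
import pytest
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

def test_rfe_importance_getter_validation():
    X, y = make_classification(random_state=0)
    # Neither a string nor a callable: parameter validation should raise.
    rfe = RFE(LogisticRegression(), importance_getter=42)
    with pytest.raises(ValueError):
        rfe.fit(X, y)
```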
…importance_getter_rfe
Reference Issues/PRs
Fixes #21934
What does this implement/fix? Explain your changes.
Adds the possibility of also passing training instances and labels to `importance_getter` in Recursive Feature Elimination.
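For illustration, assuming the behavior implemented in this PR, a 3-parameter callable based on `permutation_importance` could then be passed directly, without the closure workaround shown earlier:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

def importance_getter(estimator, X, y):
    # With this PR, RFE passes the training data of the current
    # elimination round to a three-parameter callable.
    result = permutation_importance(estimator, X, y, n_repeats=2,
                                    random_state=0)
    return result.importances_mean

X, y = make_classification(n_features=20, random_state=0)
rfe = RFE(LogisticRegression(), n_features_to_select=5,
          importance_getter=importance_getter)
rfe.fit(X, y)
```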