FIX Feature Selectors fail to route metadata when inside a Pipeline #30529

Open · wants to merge 31 commits into main

Conversation


@kschluns kschluns commented Dec 22, 2024

Reference Issues/PRs

Towards #30527

What does this implement/fix? Explain your changes.

See #30527 for more context on the root issue.

The problem appears to stem from the fact that Pipeline's get_metadata_routing method checks whether each step's transformer (trans) has a fit_transform method and, if it does, adds the following router mappings (source):

            if hasattr(trans, "fit_transform"):
                (
                    method_mapping.add(caller="fit", callee="fit_transform")
                    .add(caller="fit_transform", callee="fit_transform")
                    .add(caller="fit_predict", callee="fit_transform")
                )

All four impacted feature selector classes have a fit_transform method, which therefore requires a .add(caller="fit_transform", callee=<???>) mapping to exist in the downstream feature selector's get_metadata_routing. The absence of this mapping prevents metadata from being routed when feature_selector.fit_transform() is called, whereas metadata is routed successfully when feature_selector.fit() is called.

Here is the current code for the SelectFromModel class which demonstrates the missing mapping (source):

    def get_metadata_routing(self):
        router = MetadataRouter(owner=self.__class__.__name__).add(
            estimator=self.estimator,
            method_mapping=MethodMapping()
            .add(caller="partial_fit", callee="partial_fit")
            .add(caller="fit", callee="fit"),
        )
        return router

I have verified that adding .add(caller="fit_transform", callee="fit_transform") to the method_mapping definition fixes the issue. It turns out that it doesn't matter which value you use for callee, for the reasons explained in this comment. The only caveat is that the callee value has to be a method that supports metadata routing.
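
For illustration, here is a sketch of how the method_mapping could look with the extra mapping added (the exact placement in the final diff may differ slightly):

    def get_metadata_routing(self):
        router = MetadataRouter(owner=self.__class__.__name__).add(
            estimator=self.estimator,
            method_mapping=MethodMapping()
            # new mapping: route metadata when the selector is driven through
            # fit_transform, e.g. as a step inside a Pipeline
            .add(caller="fit_transform", callee="fit_transform")
            .add(caller="partial_fit", callee="partial_fit")
            .add(caller="fit", callee="fit"),
        )
        return router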

Task List (from the Pull Request Checklist)

  • Give your pull request a helpful title
  • Make sure your code passes the tests
  • Make sure your code is properly commented and documented, and make sure the documentation renders properly
  • Add non-regression tests specific to the issue and the bug fix
  • Add a changelog entry describing your PR changes (if necessary)


github-actions bot commented Dec 22, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 16612e6. Link to the linter CI: here

Member

@glemaitre glemaitre left a comment

So we should also change the sklearn/tests/test_metaestimators_metadata_routing.py file. I assume that we forgot a specific method there for those selectors.

@StefanieSenger @adrinjalali Do we have something in the tests where we nest those meta-estimators inside another meta-estimator, e.g. put them inside a Pipeline, and make sure that the metadata are routed?

@@ -494,6 +494,7 @@ def get_metadata_routing(self):
         router = MetadataRouter(owner=self.__class__.__name__).add(
             estimator=self.estimator,
             method_mapping=MethodMapping()
+            .add(caller="fit_transform", callee="fit_transform")
Member

Suggested change:
-            .add(caller="fit_transform", callee="fit_transform")
+            .add(caller="fit_transform", callee="fit")

Off the top of my head, I would expect callee to be fit because this is the step that requires metadata for this specific selector.

However, I'm always confused by the routing/caller/callee :).

@StefanieSenger @adrinjalali would you mind taking a look?

Member

We shouldn't have to add these composite methods here; they should be auto-generated from the simple methods. This might be an issue in _metadata_requests.py that I need to debug to see where the problem is.

Author

@kschluns kschluns Dec 23, 2024

I was also curious about @glemaitre's question. It actually doesn't matter whether you use callee="fit_transform" or callee="fit" for the fix. Both values result in successful metadata routing.

I think this is because the feature selector's fit_transform method passes the **fit_params to self.fit() without explicitly using the metadata routing functionality (because technically it doesn't have to). The code snippet below (source) shows that it only uses metadata routing for validation, with no impact on the **fit_params.

    def fit_transform(self, X, y=None, **fit_params):
        if _routing_enabled():
            transform_params = self.get_metadata_routing().consumes(
                method="transform", params=fit_params.keys()
            )
            if transform_params:
                warnings.warn(
                    (
                        f"This object ({self.__class__.__name__}) has a `transform`"
                        " method which consumes metadata, but `fit_transform` does not"
                        " forward metadata to `transform`. Please implement a custom"
                        " `fit_transform` method to forward metadata to `transform` as"
                        " well. Alternatively, you can explicitly do"
                        " `set_transform_request`and set all values to `False` to"
                        " disable metadata routed to `transform`, if that's an option."
                    ),
                    UserWarning,
                )
        if y is None:
            # fit method of arity 1 (unsupervised transformation)
            return self.fit(X, **fit_params).transform(X)
        else:
            # fit method of arity 2 (supervised transformation)
            return self.fit(X, y, **fit_params).transform(X)

To illustrate the function calls, in order:

  1. pipeline.fit():
    • First calls process_routing(_method='fit') to request the metadata to pass forward
    • Then calls feature_selector.fit_transform() and includes the routed metadata
  2. feature_selector.fit_transform():
    • does NOT call process_routing()
    • calls self.fit(X, y, **fit_params)
  3. feature_selector.fit():
    • First calls process_routing(_method='fit') to request the metadata to pass forward
    • Then calls estimator.fit() and includes the routed metadata

Because step 2 doesn't participate in the metadata routing, it doesn't matter what the callee value is for the feature_selector. It just has to be one of the generically valid values defined by the class (e.g., fit, transform, fit_transform, etc).
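
To make the scenario concrete, here is a minimal sketch (not part of this PR) of the routing path described above, assuming metadata routing is enabled and using sample_weight as the routed metadata:

    import numpy as np
    import sklearn
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    sklearn.set_config(enable_metadata_routing=True)

    X, y = make_classification(random_state=0)
    sample_weight = np.ones(len(y))

    selector = SelectFromModel(
        LogisticRegression().set_fit_request(sample_weight=True)
    )

    # Direct call: SelectFromModel.fit() runs process_routing() itself, so the
    # inner estimator receives sample_weight.
    selector.fit(X, y, sample_weight=sample_weight)

    # Inside a Pipeline, Pipeline.fit() calls selector.fit_transform() with the
    # routed metadata; without the caller="fit_transform" mapping, this is where
    # the metadata failed to reach the inner estimator (see #30527).
    pipe = Pipeline(
        [
            ("select", selector),
            ("clf", LogisticRegression().set_fit_request(sample_weight=True)),
        ]
    )
    pipe.fit(X, y, sample_weight=sample_weight)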

@StefanieSenger
Contributor

StefanieSenger commented Dec 23, 2024

@StefanieSenger @adrinjalali Do we have something in the tests where we nest those meta-estimators inside another meta-estimator, e.g. put them inside a Pipeline, and make sure that the metadata are routed?

No, I don't think we have tests that go over more than two levels. It's rather step-by-step testing.

@adrinjalali
Member

So our MetadataRequest object (used by non-routing estimators) automatically creates composite methods (such as fit_transform). However, MetadataRouter (used by routing estimators) does not.

Pipeline, for instance, has this code:

            if hasattr(trans, "fit_transform"):
                (
                    method_mapping.add(caller="fit", callee="fit_transform")
                    .add(caller="fit_transform", callee="fit_transform")
                    .add(caller="fit_predict", callee="fit_transform")
                )
            else:
                (
                    method_mapping.add(caller="fit", callee="fit")
                    .add(caller="fit", callee="transform")
                    .add(caller="fit_transform", callee="fit")
                    .add(caller="fit_transform", callee="transform")
                    .add(caller="fit_predict", callee="fit")
                    .add(caller="fit_predict", callee="transform")
                )

And looking back at it, I'm wondering if MetadataRouter should also create the composite requests from existing data the same way that MetadataRequest does.
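
A conceptual sketch, purely to illustrate the idea and not sklearn's actual implementation, of deriving composite-method mappings from the simple ones a router already knows about:

    # Hypothetical names and structure; only meant to show the expansion logic.
    COMPOSITE_METHODS = {
        "fit_transform": ("fit", "transform"),
        "fit_predict": ("fit", "predict"),
    }

    def expand_composites(pairs):
        """Given (caller, callee) mappings for simple methods, add composite callers."""
        expanded = set(pairs)
        for composite, simple in COMPOSITE_METHODS.items():
            for caller, callee in pairs:
                if caller in simple:
                    # e.g. a ("fit", "fit") mapping implies ("fit_transform", "fit")
                    expanded.add((composite, callee))
        return sorted(expanded)

    print(expand_composites([("fit", "fit"), ("partial_fit", "partial_fit")]))
    # [('fit', 'fit'), ('fit_predict', 'fit'), ('fit_transform', 'fit'),
    #  ('partial_fit', 'partial_fit')]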

@kschluns
Author

And looking back at it, I'm wondering if MetadataRouter should also create the composite requests from existing data the same way that MetadataRequest does.

Do I understand correctly that both Pipeline and SelectFromModel are routing estimators? If so, that doesn't take away from the fact that SelectFromModel still needs to supply a .add(caller="fit_transform", callee=<???>) mapping in its get_metadata_routing(), right? (i.e., regardless of whether it's auto-generated or not).

I ask because I'm wondering if we can allow this PR to go through with the short-term manual fix and then leave it as a future enhancement to automate the composite routing in the MetadataRouter class?

@adrinjalali
Member

I ask because I'm wondering if we can allow this PR to go through with the short-term manual fix and then leave it as a future enhancement to automate the composite routing in the MetadataRouter class?

That's true. I wouldn't mind that. This would need tests though, and ideally in the metadata routing common tests.

@kschluns
Author

That's true. I wouldn't mind that. This would need tests though, and ideally in the metadata routing common tests.

Sounds good! I can help with that. I'm new to contributing to sklearn though, so I may be in a bit over my head, but I'd like to give it a try first if that's okay. Is this the only file we would need to add the tests to? --> sklearn/tests/test_metaestimators_metadata_routing.py

@adrinjalali
Member

Yep, that's the file. And thanks for contributing 😊

@kschluns kschluns requested a review from adrinjalali January 1, 2025 03:18
@kschluns
Author

kschluns commented Jan 1, 2025

Hey @adrinjalali I just finished the remaining tasks. Could you please review and let me know if this is satisfactory?

  • Make sure your code passes the tests
    Ran the following tests. Let me know if I should run any other tests.
    pytest sklearn/feature_selection
    pytest sklearn/tests/test_metadata_routing.py
    pytest sklearn/tests/test_metaestimators_metadata_routing.py
    pytest sklearn/tests/test_metaestimators.py
    pytest doc/modules/feature_selection.rst
  • Add non-regression tests specific to the issue and the bug fix
Let me know if the new test_feature_selectors_in_pipeline test is sufficient!
  • Add a changelog entry describing your PR changes (if necessary)

Note: I changed the PR description to be Towards #30527 instead of Fixes #30527 because of our discussion about this being a short-term fix. My plan is to leave a comment on the issue after this PR is merged explaining the short- vs long-term fix. Let me know if this sounds good.

@kschluns kschluns marked this pull request as ready for review January 1, 2025 17:41
@kschluns kschluns requested a review from glemaitre January 3, 2025 16:48
@kschluns
Author

kschluns commented Jan 6, 2025

@glemaitre or @adrinjalali just pinging on the above, as the PR is ready for review. Thanks!

sklearn/tests/test_metaestimators_metadata_routing.py (outdated comment, resolved)
sklearn/tests/test_metaestimators_metadata_routing.py (outdated comment, resolved)
@kschluns
Author

@adrinjalali thank you for your review! I agreed with all the suggestions and made improvements to the tests accordingly. Can you please review again?

Note that in the process of making these changes, I realized that the prior implementation of the check_recorded_metadata function results in many tests passing trivially, particularly in test_setting_request_on_sub_estimator_removes_error. Please see the note left in the check_recorded_metadata docstring for an explanation.

I added functionality to check_recorded_metadata that allows the parent parameter to be None, which successfully works around the issue, but it is only valid if the test doesn't care about the value of parent (which is true for the majority of tests, but not all). I tested setting parent=None for all tests and there are only three failures to deal with, but I don't want to increase the scope of this PR beyond what's intended. Therefore, I propose only setting parent=None for test_metaestimators_in_pipeline and making a future PR to fix the issue of trivially passing tests in the other test sets.
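
To make the failure mode concrete, here is a hypothetical, simplified stand-in for the helper (not sklearn's actual check_recorded_metadata; names and structure are assumptions) showing how a misspecified parent leads to a trivially-passing check and how parent=None avoids it:

    # Hypothetical, simplified stand-in to illustrate the behaviour described above.
    def check_recorded(records, method, parent, **expected):
        """records: list of dicts like {"method": "fit", "caller": "_fit", "metadata": {...}}."""
        for record in records:
            if record["method"] != method:
                continue
            if parent is not None and record["caller"] != parent:
                # A misspecified parent skips every record, so the loop checks
                # nothing and the test passes trivially.
                continue
            for key, value in expected.items():
                assert record["metadata"].get(key) == value

    records = [{"method": "fit", "caller": "_fit", "metadata": {"sample_weight": "sw"}}]
    check_recorded(records, method="fit", parent="fit", sample_weight="other")  # passes trivially
    check_recorded(records, method="fit", parent=None, sample_weight="sw")      # actually checks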

What do you think?

@kschluns kschluns requested a review from adrinjalali January 15, 2025 15:58
@kschluns
Author

@adrinjalali @glemaitre pinging on the above request for a review. Thanks!

@kschluns
Author

Hey @adrinjalali, are you able to review this PR sometime this week?

@adrinjalali
Member

@kschluns I'll have a look this week.

@kschluns
Author

kschluns commented Feb 3, 2025

Hey @adrinjalali any update here? Is it normal for a PR to take this long to get approved or is there something about the proposed changes here that is making it difficult to review and approve? If you're just too busy, can we see if someone else is able to review this PR?

@kschluns
Author

Hello? It's me.

@kschluns
Author

Hey @adrinjalali! Can you please take a look at this PR?

Member

@adrinjalali adrinjalali left a comment

Thanks a lot @kschluns, and sorry for the late review. This is looking much better now.

Comment on lines +66 to +76
Sub-estimator metadata is only checked if the `caller` method matches the value defined by `parent`. If `parent` is None, the target sub-estimator metadata is checked regardless of the `caller` method.

NOTE: many metaestimators call the subestimator in roundabout ways and this makes it very difficult to know what method name to use for `parent`. If misspecified, it results in tests passing trivially. For example, when fitting the RFE metaestimator, RFE.fit() calls RFE._fit(), which then calls subestimator.fit(). In this case, the user configuring the test should set method="fit" and parent="_fit", otherwise the test will pass trivially.
Member

I'm sure you've done a great job going through this bit of code figuring it out, but from the change in the docstring, it's not clear to me exactly what the issue is. Could you please add a test specifically for the change here, to make it clear for future developers?

Author

Should I actually add a test in the sklearn/tests/test_metaestimators_metadata_routing.py, mark it with @pytest.mark.skip since it is expected to fail, then just add a note in the docstring referencing the test?

Member

Since I don't really understand this comment, seeing that test helps.

Author

My question was more around the logistics of adding a test that I know is going to fail, without it messing up any CI pipelines. For example, at work we require 100% of our tests to pass in order for PRs to get merged, so I just haven't encountered this situation before. I wasn't sure if that would be a problem here and, if it is, is @pytest.mark.skip the right way to handle it?

Member

We won't be merging w/o CI being green. Once you write a test, I'd know what needs to be done to fix it. Here I just don't understand what the comment is saying.

Author

got it, will write up a test shortly
