FIX Feature Selectors fail to route metadata when inside a Pipeline #30529

Open · wants to merge 31 commits into main

Conversation


@kschluns kschluns commented Dec 22, 2024

Reference Issues/PRs

Towards #30527

What does this implement/fix? Explain your changes.

See #30527 for more context on the root issue.

The problem appears to stem from the fact that Pipeline's get_metadata_routing method checks whether each step's transformer (trans) has a fit_transform method and, if it does, adds the following router mappings (source):

            if hasattr(trans, "fit_transform"):
                (
                    method_mapping.add(caller="fit", callee="fit_transform")
                    .add(caller="fit_transform", callee="fit_transform")
                    .add(caller="fit_predict", callee="fit_transform")
                )

All four impacted feature selector classes have a fit_transform method, which therefore requires a .add(caller="fit_transform", callee=<???>) mapping to exist in the downstream feature selector's get_metadata_routing. The absence of this mapping prevents metadata from being routed when feature_selector.fit_transform() is called, whereas metadata is routed successfully when feature_selector.fit() is called.

Here is the current code for the SelectFromModel class which demonstrates the missing mapping (source):

    def get_metadata_routing(self):
        router = MetadataRouter(owner=self.__class__.__name__).add(
            estimator=self.estimator,
            method_mapping=MethodMapping()
            .add(caller="partial_fit", callee="partial_fit")
            .add(caller="fit", callee="fit"),
        )
        return router

I have verified that adding .add(caller="fit_transform", callee="fit_transform") to the method_mapping definition fixes the issue. It turns out that it doesn't matter which value you use for callee, for the reasons explained in this comment. The only caveat is that the callee value has to be a method that supports metadata routing.
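
For illustration, here is a sketch of how the method_mapping could look with the extra mapping added (the exact placement in the final diff may differ slightly):

    def get_metadata_routing(self):
        router = MetadataRouter(owner=self.__class__.__name__).add(
            estimator=self.estimator,
            method_mapping=MethodMapping()
            # new mapping: route metadata when the selector is driven through
            # fit_transform, e.g. as a step inside a Pipeline
            .add(caller="fit_transform", callee="fit_transform")
            .add(caller="partial_fit", callee="partial_fit")
            .add(caller="fit", callee="fit"),
        )
        return router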

Task List (from the Pull Request Checklist)

  • Give your pull request a helpful title
  • Make sure your code passes the tests
  • Make sure your code is properly commented and documented, and make sure the documentation renders properly
  • Add non-regression tests specific to the issue and the bug fix
  • Add a changelog entry describing your PR changes (if necessary)


github-actions bot commented Dec 22, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 16612e6. Link to the linter CI: here

Member

@glemaitre glemaitre left a comment

So we should also change the sklearn/tests/test_metaestimators_metadata_routing.py file. I assume that we forgot a specific method there for those selectors.

@StefanieSenger @adrinjalali Do we have something in the tests where we nest those meta-estimators inside another meta-estimator, e.g. put them inside a Pipeline, and make sure that the metadata are routed?

@@ -494,6 +494,7 @@ def get_metadata_routing(self):
         router = MetadataRouter(owner=self.__class__.__name__).add(
             estimator=self.estimator,
             method_mapping=MethodMapping()
+            .add(caller="fit_transform", callee="fit_transform")
Member

Suggested change:
-            .add(caller="fit_transform", callee="fit_transform")
+            .add(caller="fit_transform", callee="fit")

Off the top of my head, I would expect callee to be fit because this is the step that requires metadata for this specific selector.

However, I'm always confused by the routing/caller/callee :).

@StefanieSenger @adrinjalali would you mind taking a look?

Member

We shouldn't have to add these composite methods here; they should be auto-generated from the simple methods. This might be an issue in _metadata_requests.py that I need to debug to see where the problem is.

Author

@kschluns kschluns Dec 23, 2024

I was also curious about @glemaitre's question. It actually doesn't matter whether you use callee="fit_transform" or callee="fit" for the fix. Both values result in successful metadata routing.

I think this is because the feature selector's fit_transform method passes the **fit_params to self.fit() without explicitly using the metadata routing functionality (because technically it doesn't have to). The code snippet below (source) shows that it only uses metadata routing for validation, with no impact on the **fit_params.

    def fit_transform(self, X, y=None, **fit_params):
        if _routing_enabled():
            transform_params = self.get_metadata_routing().consumes(
                method="transform", params=fit_params.keys()
            )
            if transform_params:
                warnings.warn(
                    (
                        f"This object ({self.__class__.__name__}) has a `transform`"
                        " method which consumes metadata, but `fit_transform` does not"
                        " forward metadata to `transform`. Please implement a custom"
                        " `fit_transform` method to forward metadata to `transform` as"
                        " well. Alternatively, you can explicitly do"
                        " `set_transform_request`and set all values to `False` to"
                        " disable metadata routed to `transform`, if that's an option."
                    ),
                    UserWarning,
                )
        if y is None:
            # fit method of arity 1 (unsupervised transformation)
            return self.fit(X, **fit_params).transform(X)
        else:
            # fit method of arity 2 (supervised transformation)
            return self.fit(X, y, **fit_params).transform(X)

To illustrate the function calls, in order:

  1. pipeline.fit():
    • First calls process_routing(_method='fit') to request the metadata to pass forward
    • Then calls feature_selector.fit_transform() and includes the routed metadata
  2. feature_selector.fit_transform():
    • does NOT call process_routing()
    • calls self.fit(X, y, **fit_params)
  3. feature_selector.fit():
    • First calls process_routing(_method='fit') to request the metadata to pass forward
    • Then calls estimator.fit() and includes the routed metadata

Because step 2 doesn't participate in the metadata routing, it doesn't matter what the callee value is for the feature_selector. It just has to be one of the generically valid values defined by the class (e.g., fit, transform, fit_transform, etc).
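
To make the scenario concrete, here is a minimal sketch (not part of this PR) of the routing path described above, assuming metadata routing is enabled and using sample_weight as the routed metadata:

    import numpy as np
    import sklearn
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    sklearn.set_config(enable_metadata_routing=True)

    X, y = make_classification(random_state=0)
    sample_weight = np.ones(len(y))

    selector = SelectFromModel(
        LogisticRegression().set_fit_request(sample_weight=True)
    )

    # Direct call: SelectFromModel.fit() runs process_routing() itself, so the
    # inner estimator receives sample_weight.
    selector.fit(X, y, sample_weight=sample_weight)

    # Inside a Pipeline, Pipeline.fit() calls selector.fit_transform() with the
    # routed metadata; without the caller="fit_transform" mapping, this is where
    # the metadata failed to reach the inner estimator (see #30527).
    pipe = Pipeline(
        [
            ("select", selector),
            ("clf", LogisticRegression().set_fit_request(sample_weight=True)),
        ]
    )
    pipe.fit(X, y, sample_weight=sample_weight)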

@StefanieSenger
Contributor

StefanieSenger commented Dec 23, 2024

@StefanieSenger @adrinjalali Do we have something in the tests where we nest those meta-estimators inside another meta-estimator, e.g. put them inside a Pipeline, and make sure that the metadata are routed?

No, I don't think we have tests that go over more than two levels. It's rather step-by-step testing.

@adrinjalali
Member

So our MetadataRequest object (used by non-routing estimators) automatically creates composite methods (such as fit_transform). However, MetadataRouter (used by routing estimators) does not.

Pipeline, for instance, has this code:

            if hasattr(trans, "fit_transform"):
                (
                    method_mapping.add(caller="fit", callee="fit_transform")
                    .add(caller="fit_transform", callee="fit_transform")
                    .add(caller="fit_predict", callee="fit_transform")
                )
            else:
                (
                    method_mapping.add(caller="fit", callee="fit")
                    .add(caller="fit", callee="transform")
                    .add(caller="fit_transform", callee="fit")
                    .add(caller="fit_transform", callee="transform")
                    .add(caller="fit_predict", callee="fit")
                    .add(caller="fit_predict", callee="transform")
                )

And looking back at it, I'm wondering if MetadataRouter should also create the composite requests from existing data the same way that MetadataRequest does.
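
A conceptual sketch, purely to illustrate the idea and not sklearn's actual implementation, of deriving composite-method mappings from the simple ones a router already knows about:

    # Hypothetical names and structure; only meant to show the expansion logic.
    COMPOSITE_METHODS = {
        "fit_transform": ("fit", "transform"),
        "fit_predict": ("fit", "predict"),
    }

    def expand_composites(pairs):
        """Given (caller, callee) mappings for simple methods, add composite callers."""
        expanded = set(pairs)
        for composite, simple in COMPOSITE_METHODS.items():
            for caller, callee in pairs:
                if caller in simple:
                    # e.g. a ("fit", "fit") mapping implies ("fit_transform", "fit")
                    expanded.add((composite, callee))
        return sorted(expanded)

    print(expand_composites([("fit", "fit"), ("partial_fit", "partial_fit")]))
    # [('fit', 'fit'), ('fit_predict', 'fit'), ('fit_transform', 'fit'),
    #  ('partial_fit', 'partial_fit')]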

@kschluns
Author

And looking back at it, I'm wondering if MetadataRouter should also create the composite requests from existing data the same way that MetadataRequest does.

Do I understand correctly that both Pipeline and SelectFromModel are routing estimators? If so, that doesn't take away from the fact that SelectFromModel still needs to supply a .add(caller="fit_transform", callee=<???>) mapping in its get_metadata_routing(), right? (i.e., regardless of whether it's auto-generated or not).

I ask because I'm wondering if we can allow this PR to go through with the short-term manual fix and then leave it as a future enhancement to automate the composite routing in the MetadataRouter class?

@adrinjalali
Member

I ask because I'm wondering if we can allow this PR to go through with the short-term manual fix and then leave it as a future enhancement to automate the composite routing in the MetadataRouter class?

That's true. I wouldn't mind that. This would need tests though, and ideally in the metadata routing common tests.

@kschluns
Author

That's true. I wouldn't mind that. This would need tests though, and ideally in the metadata routing common tests.

Sounds good! I can help with that. I'm new to contributing to sklearn though, so I may be in a bit over my head, but I'd like to give it a try first if that's okay. Is this the only file we would need to add the tests to? --> sklearn/tests/test_metaestimators_metadata_routing.py

@adrinjalali
Member

Yep, that's the file. And thanks for contributing 😊

@kschluns kschluns requested a review from adrinjalali January 1, 2025 03:18
@kschluns
Author

kschluns commented Jan 1, 2025

Hey @adrinjalali I just finished the remaining tasks. Could you please review and let me know if this is satisfactory?

  • Make sure your code passes the tests
    Ran the following tests. Let me know if I should run any other tests.
    pytest sklearn/feature_selection
    pytest sklearn/tests/test_metadata_routing.py
    pytest sklearn/tests/test_metaestimators_metadata_routing.py
    pytest sklearn/tests/test_metaestimators.py
    pytest doc/modules/feature_selection.rst
  • Add non-regression tests specific to the issue and the bug fix
Let me know if the new test_feature_selectors_in_pipeline test is sufficient!
  • Add a changelog entry describing your PR changes (if necessary)

Note: I changed the PR description to be Towards #30527 instead of Fixes #30527 because of our discussion about this being a short-term fix. My plan is to leave a comment on the issue after this PR is merged explaining the short- vs long-term fix. Let me know if this sounds good.

@kschluns kschluns marked this pull request as ready for review January 1, 2025 17:41
@kschluns kschluns requested a review from glemaitre January 3, 2025 16:48
@kschluns
Author

kschluns commented Jan 6, 2025

@glemaitre or @adrinjalali just pinging on the above, as the PR is ready for review. Thanks!

sklearn/tests/test_metaestimators_metadata_routing.py (outdated comment, resolved)
sklearn/tests/test_metaestimators_metadata_routing.py (outdated comment, resolved)
@kschluns
Author

@adrinjalali thank you for your review! I agreed with all the suggestions and made improvements to the tests accordingly. Can you please review again?

Note that in the process of making these changes, I realized that the prior implementation of the check_recorded_metadata function results in many tests passing trivially, particularly in test_setting_request_on_sub_estimator_removes_error. Please see the note left in the check_recorded_metadata docstring for an explanation.

I added functionality to check_recorded_metadata that allows the parent parameter to be None, which successfully works around the issue, but it is only valid if the test doesn't care about the value of parent (which is true for the majority of tests, but not all). I tested setting parent=None for all tests and there are only three failures to deal with, but I don't want to increase the scope of this PR beyond what's intended. Therefore, I propose only setting parent=None for test_metaestimators_in_pipeline and making a future PR to fix the issue of trivially passing tests in the other test sets.
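
To make the failure mode concrete, here is a hypothetical, simplified stand-in for the helper (not sklearn's actual check_recorded_metadata; names and structure are assumptions) showing how a misspecified parent leads to a trivially-passing check and how parent=None avoids it:

    # Hypothetical, simplified stand-in to illustrate the behaviour described above.
    def check_recorded(records, method, parent, **expected):
        """records: list of dicts like {"method": "fit", "caller": "_fit", "metadata": {...}}."""
        for record in records:
            if record["method"] != method:
                continue
            if parent is not None and record["caller"] != parent:
                # A misspecified parent skips every record, so the loop checks
                # nothing and the test passes trivially.
                continue
            for key, value in expected.items():
                assert record["metadata"].get(key) == value

    records = [{"method": "fit", "caller": "_fit", "metadata": {"sample_weight": "sw"}}]
    check_recorded(records, method="fit", parent="fit", sample_weight="other")  # passes trivially
    check_recorded(records, method="fit", parent=None, sample_weight="sw")      # actually checks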

What do you think?

@kschluns kschluns requested a review from adrinjalali January 15, 2025 15:58
@kschluns
Author

@adrinjalali @glemaitre pinging on the above request for a review. Thanks!

@kschluns
Author

Hey @adrinjalali, are you able to review this PR sometime this week?

@adrinjalali
Member

@kschluns I'll have a look this week.

@kschluns
Author

kschluns commented Feb 3, 2025

Hey @adrinjalali any update here? Is it normal for a PR to take this long to get approved or is there something about the proposed changes here that is making it difficult to review and approve? If you're just too busy, can we see if someone else is able to review this PR?

@kschluns
Author

Hello? It's me.

@kschluns
Author

Hey @adrinjalali! Can you please take a look at this PR?

Member

@adrinjalali adrinjalali left a comment

Thanks a lot @kschluns, and sorry for the late review. This is looking much better now.

Comment on lines +66 to +76
Sub-estimator metadata is only checked if the `caller` method matches the value defined by `parent`. If `parent` is None, the target sub-estimator metadata is checked regardless of the `caller` method.

NOTE: many metaestimators call the subestimator in roundabout ways and this makes it very difficult to know what method name to use for `parent`. If misspecified, it results in tests passing trivially. For example, when fitting the RFE metaestimator, RFE.fit() calls RFE._fit(), which then calls subestimator.fit(). In this case, the user configuring the test should set method="fit" and parent="_fit", otherwise the test will pass trivially.
Member

I'm sure you've done a great job going through this bit of code figuring it out, but from the change in the docstring, it's not clear to me exactly what the issue is. Could you please add a test specifically for the change here, to make it clear for future developers?

Author

Should I actually add a test in the sklearn/tests/test_metaestimators_metadata_routing.py, mark it with @pytest.mark.skip since it is expected to fail, then just add a note in the docstring referencing the test?

Member

Since I don't really understand this comment, seeing that test helps.

Author

My question was more around the logistics of adding a test that I know is going to fail, without it messing up any CI pipelines. For example, at work we require 100% of our tests to pass in order for PRs to get merged, so I just haven't encountered this situation before. I wasn't sure if that would be a problem here and, if it is, is @pytest.mark.skip the right way to handle it?

Member

We won't be merging w/o CI being green. Once you write a test, I'd know what needs to be done to fix it. Here I just don't understand what the comment is saying.

Author

got it, will write up a test shortly
