FIX Feature Selectors fail to route metadata when inside a Pipeline #30529
Conversation
So we should also change the sklearn/tests/test_metaestimators_metadata_routing.py file. I assume that we forgot a specific method there for those selectors.
@StefanieSenger @adrinjalali Is there anything in the tests where we nest those meta-estimators inside another meta-estimator, e.g. put them inside a Pipeline, and make sure that the metadata is routed?
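For illustration, a minimal sketch (mine, not from the test suite) of the kind of nesting being asked about, assuming sample_weight as the routed metadata:

import numpy as np
from sklearn import set_config
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

set_config(enable_metadata_routing=True)

rng = np.random.RandomState(0)
X, y = rng.rand(20, 5), rng.randint(0, 2, 20)
sample_weight = rng.rand(20)

# Nest a selector (a meta-estimator) inside a Pipeline (another
# meta-estimator) and request sample_weight in the inner estimators.
selector = SelectFromModel(
    LogisticRegression().set_fit_request(sample_weight=True)
)
clf = LogisticRegression().set_fit_request(sample_weight=True)
pipe = Pipeline([("select", selector), ("clf", clf)])

# The Pipeline calls selector.fit_transform(); before the fix, the
# missing caller="fit_transform" mapping meant sample_weight did not
# reach the selector's inner estimator.
pipe.fit(X, y, sample_weight=sample_weight)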
@@ -494,6 +494,7 @@ def get_metadata_routing(self):
         router = MetadataRouter(owner=self.__class__.__name__).add(
             estimator=self.estimator,
             method_mapping=MethodMapping()
+            .add(caller="fit_transform", callee="fit_transform")
Suggested change:
-            .add(caller="fit_transform", callee="fit_transform")
+            .add(caller="fit_transform", callee="fit")
Off the top of my head, I would expect `callee` to be `fit`, because this is the step that requires metadata for this specific selector. However, I'm always confused by the routing/caller/callee :).
@StefanieSenger @adrinjalali would you mind having a look?
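For what it's worth, my reading of the caller/callee vocabulary (a sketch, not an authoritative definition): `caller` is the method invoked on the meta-estimator itself, and `callee` is the child's method whose metadata request determines what gets routed.

from sklearn.utils.metadata_routing import MethodMapping

# Reads as: "when fit_transform is called on the owner, route whatever
# the child's fit has requested."
mapping = MethodMapping().add(caller="fit_transform", callee="fit")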
We shouldn't have to add these composite methods here; they should be auto-generated from the simple methods. This might be an issue in `_metadata_requests.py` that I need to debug to see where the issue is.
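A rough sketch of what such auto-generation could look like (a hypothetical helper I made up for illustration, not actual `_metadata_requests.py` code):

from sklearn.utils.metadata_routing import MethodMapping

def add_composite_mappings(mapping, callee="fit"):
    # Hypothetical: mirror the simple "fit" mapping onto the composite
    # methods that delegate to fit, so individual get_metadata_routing
    # implementations don't have to add them manually.
    for composite in ("fit_transform", "fit_predict"):
        mapping.add(caller=composite, callee=callee)
    return mapping

mapping = add_composite_mappings(
    MethodMapping().add(caller="fit", callee="fit")
)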
I was also curious about @glemaitre's question. It actually doesn't matter whether you use `callee="fit_transform"` or `callee="fit"` for the fix; both values result in successful metadata routing.
I think this is because the Feature Selector's `fit_transform` function passes the `**fit_params` to `self.fit()` without explicitly using the metadata routing functionality (because technically it doesn't have to). The code snippet below (source) shows that it only uses the metadata routing for validation, with no impact on the `**fit_params`.
def fit_transform(self, X, y=None, **fit_params):
    if _routing_enabled():
        transform_params = self.get_metadata_routing().consumes(
            method="transform", params=fit_params.keys()
        )
        if transform_params:
            warnings.warn(
                (
                    f"This object ({self.__class__.__name__}) has a `transform`"
                    " method which consumes metadata, but `fit_transform` does not"
                    " forward metadata to `transform`. Please implement a custom"
                    " `fit_transform` method to forward metadata to `transform` as"
                    " well. Alternatively, you can explicitly do"
                    " `set_transform_request` and set all values to `False` to"
                    " disable metadata routed to `transform`, if that's an option."
                ),
                UserWarning,
            )
    if y is None:
        # fit method of arity 1 (unsupervised transformation)
        return self.fit(X, **fit_params).transform(X)
    else:
        # fit method of arity 2 (supervised transformation)
        return self.fit(X, y, **fit_params).transform(X)
To illustrate the function calls, in order:

1. `pipeline.fit()`:
   - First calls `process_routing(_method='fit')` to request the metadata to pass forward
   - Then calls `feature_selector.fit_transform()` and includes the routed metadata
2. `feature_selector.fit_transform()`:
   - does NOT call `process_routing()`
   - calls `self.fit(X, y, **fit_params)`
3. `feature_selector.fit()`:
   - First calls `process_routing(_method='fit')` to request the metadata to pass forward
   - Then calls `estimator.fit()` and includes the routed metadata
Because step 2 doesn't participate in the metadata routing, it doesn't matter what the `callee` value is for the `feature_selector`; it just has to be one of the generically valid values defined by the class (e.g., `fit`, `transform`, `fit_transform`, etc.).
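To make steps 1 and 3 concrete, here is a simplified sketch of the `process_routing` pattern (`MetaEstimator` is a hypothetical stand-in, not the actual Pipeline or selector code):

from sklearn.base import BaseEstimator
from sklearn.utils.metadata_routing import (
    MetadataRouter,
    MethodMapping,
    process_routing,
)

class MetaEstimator(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def get_metadata_routing(self):
        return MetadataRouter(owner=self.__class__.__name__).add(
            estimator=self.estimator,
            method_mapping=MethodMapping().add(caller="fit", callee="fit"),
        )

    def fit(self, X, y, **fit_params):
        # The step-1/step-3 pattern: validate and split the params, then
        # forward only what the sub-estimator actually requested.
        routed = process_routing(self, "fit", **fit_params)
        self.estimator.fit(X, y, **routed.estimator.fit)
        return self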
No, I don't think that we have tests that go over more than two levels. It's rather step-by-step testing.
So our `Pipeline.get_metadata_routing` has:

if hasattr(trans, "fit_transform"):
    (
        method_mapping.add(caller="fit", callee="fit_transform")
        .add(caller="fit_transform", callee="fit_transform")
        .add(caller="fit_predict", callee="fit_transform")
    )
else:
    (
        method_mapping.add(caller="fit", callee="fit")
        .add(caller="fit", callee="transform")
        .add(caller="fit_transform", callee="fit")
        .add(caller="fit_transform", callee="transform")
        .add(caller="fit_predict", callee="fit")
        .add(caller="fit_predict", callee="transform")
    )

And looking back at it, I'm wondering if […]
Do I understand right that both […]? I ask because I'm wondering if we can allow this PR to go through with the short-term manual fix, and then leave it as a future enhancement to automate the composite routing in `_metadata_requests.py`.
That's true. I wouldn't mind that. This would need tests though, and ideally in the metadata routing common tests.
Sounds good! I can help with that. I'm new to contributing to sklearn though, so I may be in a bit over my head. But I'd like to give it a try first if that's okay. Is this the only file that we would need to add the tests to?
Yep, that's the file. And thanks for contributing 😊
Hey @adrinjalali, I just finished the remaining tasks. Could you please review and let me know if this is satisfactory?
Note: I changed the PR description to be […]
@glemaitre or @adrinjalali just pinging on the above, as the PR is ready for review. Thanks!
@adrinjalali thank you for your review! I agreed with all the suggestions and made improvements to the tests accordingly. Can you please review again?
Note that in the process of making these changes, I realized that the prior implementation of the […]. I added functionality to […]. What do you think?
@adrinjalali @glemaitre pinging on the above request for a review. Thanks!
Hey @adrinjalali, are you able to review this PR sometime this week?
@kschluns I'll have a look this week.
Hey @adrinjalali, any update here? Is it normal for a PR to take this long to get approved, or is there something about the proposed changes here that is making it difficult to review and approve? If you're just too busy, can we see if someone else is able to review this PR?
Hello? It's me.
Hey @adrinjalali! Can you please take a look at this PR? |
Thanks a lot @kschluns, and sorry for the late review. This is looking much better now.
Sub-estimator metadata is only checked if the `caller` method matches
the value defined by `parent`. If `parent` is None, the target
sub-estimator metadata is checked regardless of the `caller` method.

NOTE: many metaestimators call the subestimator in roundabout ways,
and this makes it very difficult to know what method name to use for
`parent`. If misspecified, it results in tests passing trivially. For
example, when fitting the RFE metaestimator, RFE.fit() calls RFE._fit(),
which then calls subestimator.fit(). In this case, the user
configuring the test should set method="fit" and parent="_fit",
otherwise the test will pass trivially.
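A stripped-down illustration of the indirection the NOTE describes (`RFELike` is a made-up stand-in, not the real RFE):

class RFELike:
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y, **fit_params):
        # The public method delegates to a private helper...
        return self._fit(X, y, **fit_params)

    def _fit(self, X, y, **fit_params):
        # ...and the helper is what actually calls the sub-estimator, so
        # a test recording the calling method sees "_fit", not "fit";
        # hence method="fit" with parent="_fit" in the test configuration.
        self.estimator.fit(X, y, **fit_params)
        return self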
I'm sure you've done a great job going through this bit of code figuring it out, but from the change in the docstring, it's not clear to me exactly what the issue is. Could you please add a test specifically for the change here, to make it clear for future developers?
Should I actually add a test in sklearn/tests/test_metaestimators_metadata_routing.py, mark it with `@pytest.mark.skip` since it is expected to fail, and then just add a note in the docstring referencing the test?
Since I don't really understand this comment, seeing that test helps.
My question was more around the logistics of adding a test that I know is going to fail, without it messing up any CI pipelines. For example, at work we require 100% of our tests to pass in order for PRs to get merged, so I just haven't encountered this situation before. I wasn't sure if that would be a problem here and, if it is, whether `@pytest.mark.skip` is the right way to handle it?
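For what it's worth, one common pytest pattern (a suggestion on my part, not project policy) is `xfail` rather than `skip`: the test still runs, CI stays green while it fails, and `strict=True` flags it if it ever starts passing unexpectedly.

import pytest

@pytest.mark.xfail(
    reason="parent misspecification makes the routing check pass trivially",
    strict=True,
)
def test_trivial_pass_pitfall():
    # Placeholder body standing in for the real failing scenario.
    raise AssertionError("expected failure until the routing check is fixed")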
We won't be merging w/o CI being green. Once you write a test, I'd know what needs to be done to fix it. Here I just don't understand what the comment is saying.
got it, will write up a test shortly
Reference Issues/PRs
Towards #30527
What does this implement/fix? Explain your changes.
See #30527 for more context on the root issue.
The problem appears to stem from the fact that the Pipeline's `get_metadata_routing` function checks whether each step's transformer (`trans`) has a `fit_transform` function, and if it does, it adds the router mappings quoted earlier in this thread (source), i.e. the `if hasattr(trans, "fit_transform")` block. All four impacted Feature Selector classes have a `fit_transform` function, which therefore requires a `.add(caller="fit_transform", callee=<???>)` mapping to exist in the downstream feature selector's `get_metadata_routing` function. The lack of this mapping prevents metadata from being routed when `feature_selector.fit_transform()` is called, whereas metadata is successfully routed when `feature_selector.fit()` is called.
Here is the current code for the `SelectFromModel` class, which demonstrates the missing mapping (source):
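The embedded snippet did not survive here; below is a sketch of the pre-fix method, reconstructed from the diff fragment quoted above (surrounding details may differ from the actual source):

from sklearn.utils.metadata_routing import MetadataRouter, MethodMapping

class SelectFromModelSketch:
    """Minimal stand-in showing SelectFromModel's pre-fix routing."""

    def __init__(self, estimator):
        self.estimator = estimator

    def get_metadata_routing(self):
        # Note: no caller="fit_transform" entry, so metadata passed to
        # fit_transform() (e.g. by a Pipeline) is not routed onward.
        return MetadataRouter(owner=self.__class__.__name__).add(
            estimator=self.estimator,
            method_mapping=MethodMapping()
            .add(caller="fit", callee="fit")
            .add(caller="partial_fit", callee="partial_fit"),
        )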
I have tested that the issue is fixed by adding `.add(caller="fit_transform", callee="fit_transform")` to the `method_mapping` definition. It turns out that it doesn't matter what value you put for `callee`, for reasons explained in this comment. The only caveat is that the `callee` value has to be a method that supports metadata routing.

Task List (from the Pull Request Checklist)