-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
FIX Feature Selectors fail to route metadata when inside a Pipeline #30529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
2d93398
4a2e0f5
015acc9
9da6d43
eef1ca2
fc070f0
4e9650a
76d017e
bf56193
0460e8d
bd96b65
55fadc4
00d505d
db637a3
2e89ed4
eb6edf3
a67fb95
bd16800
3cbf0c4
e2d8b3f
23a66c7
e2538a9
f64995a
64b4050
0ac3aa0
6a6be39
de084e5
b109e13
abaf65e
b53f05c
16612e6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
- , :class:`impute.IterativeImputer`, :class:`feature_selection.RFE`, | ||
:class:`feature_selection.RFECV`, :class:`feature_selection.SelectFromModel`, and | ||
:class:`feature_selection.SequentialFeatureSelector` now properly route metadata when | ||
used inside a Pipeline object. Prior to this fix, when sample_weight was provided in | ||
the Pipeline's `**fit_params`, the `sample_weight` was not being routed to the feature | ||
selector's estimator. In certain cases, this would result in incorrect feature | ||
selection. By :user:`Kyle Schluns <kschluns>` |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -57,20 +57,41 @@ def check_recorded_metadata(obj, method, parent, split_params=tuple(), **kwargs) | |
obj : estimator object | ||
sub-estimator to check routed params for | ||
method : str | ||
sub-estimator's method where metadata is routed to, or otherwise in | ||
the context of metadata routing referred to as 'callee' | ||
parent : str | ||
the parent method which should have called `method`, or otherwise in | ||
the context of metadata routing referred to as 'caller' | ||
The target sub-estimator method for which to check recorded metadata, | ||
otherwise in the context of metadata routing referred to as 'callee'. | ||
parent : str or None | ||
The parent method which should have called `method`, or otherwise in | ||
the context of metadata routing referred to as 'caller'. | ||
|
||
Sub-estimator metadata is only checked if the `caller` method matches | ||
the value defined by `parent`. If `parent` is None, the target | ||
sub-estimator metadata is checked regardless of the `caller` method. | ||
|
||
NOTE: many metaestimators call the subestimator in roundabout ways | ||
and this makes it very difficult to know what method name to use for | ||
`parent`. If misspecified, it results in tests passing trivially. For | ||
example, when fitting the RFE metaestimator, RFE.fit() calls RFE._fit() | ||
, which then calls subestimator.fit(). In this case, the user | ||
configuring the test should set method="fit" and parent="_fit", | ||
otherwise the test will pass trivially. | ||
Comment on lines
+66
to
+76
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm sure you've done a great job going through this bit of code figuring it out, but from the change in the docstring, it's not clear to me exactly what the issue is. Could you please add a test specifically for the change here, to make it clear for future developers? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Should I actually add a test in the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since I don't really understand this comment, seeing that test helps. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My question was more around the logistics of adding a test that I know is going to fail, without it messing up any CI pipelines. For example, at work we enforce 100% of our tests to pass in order for PRs to get merged, so I just haven't encountered this situation before. Wasn't sure if that would be a problem here and if it is, is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We won't be merging w/o CI being green. Once you write a test, I'd know what needs to be done to fix it. Here I just don't understand what the comment is saying. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. got it, will write up a test shortly |
||
split_params : tuple, default=empty | ||
specifies any parameters which are to be checked as being a subset | ||
of the original values | ||
**kwargs : dict | ||
passed metadata | ||
""" | ||
all_records = ( | ||
getattr(obj, "_records", dict()).get(method, dict()).get(parent, list()) | ||
) | ||
|
||
if parent: | ||
all_records = ( | ||
getattr(obj, "_records", dict()).get(method, dict()).get(parent, list()) | ||
) | ||
else: | ||
all_records = [ | ||
r | ||
for record in getattr(obj, "_records", dict()).get(method, dict()).values() | ||
for r in record | ||
] | ||
|
||
for record in all_records: | ||
# first check that the names of the metadata passed are the same as | ||
# expected. The names are stored as keys in `record`. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the top of the head, I would expect
callee
to befit
because this is the step that requires metadata for this specific selector.However, I'm always confused with the routing/caller/callee :).
@StefanieSenger @adrinjalali would you mind to have a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we shouldn't have to add these composite methods here, they should be auto generated from the simple methods. This might be an issue in the
_metadata_requests.py
that I need to debug to see where the issue is.Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was also curious about @glemaitre's question. It actually doesn't matter whether you use
callee="fit_transform"
orcallee="fit"
for the fix. Both values will result in successful metadata routing.I think this is because the Feature Selector's
fit_transform
function passes the**fit_params
toself.fit()
without explicitly using the metadata routing functionality (because technically it doesn't have to). Code snippet below (source) shows that its only using the metadata routing for validation and has no impact on the**fit_params
.To illustrate the function calls, in order:
pipeline.fit()
:process_routing(_method='fit')
to request the metadata to pass forwardfeature_selector.fit_transform()
and includes the routed metadatafeature_selector.fit_transform()
:process_routing()
self.fit(X, y, **fit_params)
feature_selector.fit()
:process_routing(_method='fit')
to request the metadata to pass forwardestimator.fit()
and includes the routed metadataBecause step 2 doesn't participate in the metadata routing, it doesn't matter what the
callee
value is for thefeature_selector
. It just has to be one of the generically valid values defined by the class (e.g.,fit
,transform
,fit_transform
, etc).