FIX Feature Selectors fail to route metadata when inside a Pipeline #30529


Open · wants to merge 31 commits into main
Commits (31)
2d93398
fixes metadata routing for feature selector fit_transform
kschluns Dec 22, 2024
4a2e0f5
added non-regression tests for feature_selector fix
kschluns Jan 1, 2025
015acc9
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 1, 2025
9da6d43
linter fix
kschluns Jan 1, 2025
eef1ca2
simplified feature_selector metadata routing test
kschluns Jan 1, 2025
fc070f0
Add change log for PR 30529
kschluns Jan 1, 2025
4e9650a
fix test coverage issue
kschluns Jan 1, 2025
76d017e
flake8 fix
kschluns Jan 1, 2025
bf56193
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 2, 2025
0460e8d
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 3, 2025
bd96b65
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 3, 2025
55fadc4
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 3, 2025
00d505d
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 6, 2025
db637a3
Update doc/whats_new/upcoming_changes/sklearn.feature_selection/30529…
kschluns Jan 7, 2025
2e89ed4
Added metadata routing via fit_transform for IterativeImputer
kschluns Jan 14, 2025
eb6edf3
Improved test_metaestimators_in_pipeline
kschluns Jan 14, 2025
a67fb95
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 14, 2025
bd16800
Update change log 30529.fix.rst
kschluns Jan 14, 2025
3cbf0c4
removing unused variables
kschluns Jan 14, 2025
e2d8b3f
ruff linter fix
kschluns Jan 14, 2025
23a66c7
fixed test coverage issue
kschluns Jan 14, 2025
e2538a9
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 15, 2025
f64995a
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 16, 2025
64b4050
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 17, 2025
0ac3aa0
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 20, 2025
6a6be39
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 27, 2025
de084e5
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Jan 31, 2025
b109e13
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Feb 3, 2025
abaf65e
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Feb 3, 2025
b53f05c
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Feb 4, 2025
16612e6
Merge branch 'main' into fix_feature_selector_metadata_routing
kschluns Mar 27, 2025
@@ -0,0 +1,7 @@
- :class:`impute.IterativeImputer`, :class:`feature_selection.RFE`,
  :class:`feature_selection.RFECV`, :class:`feature_selection.SelectFromModel`, and
  :class:`feature_selection.SequentialFeatureSelector` now properly route metadata when
  used inside a Pipeline object. Prior to this fix, when `sample_weight` was provided in
  the Pipeline's `**fit_params`, the `sample_weight` was not being routed to the feature
  selector's estimator. In certain cases, this would result in incorrect feature
  selection. By :user:`Kyle Schluns <kschluns>`
sklearn/feature_selection/_from_model.py (1 addition, 0 deletions)
@@ -494,6 +494,7 @@ def get_metadata_routing(self):
router = MetadataRouter(owner=self.__class__.__name__).add(
estimator=self.estimator,
method_mapping=MethodMapping()
.add(caller="fit_transform", callee="fit_transform")
Member:

Suggested change:
-    .add(caller="fit_transform", callee="fit_transform")
+    .add(caller="fit_transform", callee="fit")

Off the top of my head, I would expect callee to be "fit" because this is the step that requires metadata for this specific selector.

However, I'm always confused by the routing/caller/callee :).

@StefanieSenger @adrinjalali would you mind having a look?

Member:

We shouldn't have to add these composite methods here; they should be auto-generated from the simple methods. This might be an issue in `_metadata_requests.py` that I need to debug to see where the problem is.

Author (@kschluns), Dec 23, 2024:

I was also curious about @glemaitre's question. It actually doesn't matter whether you use callee="fit_transform" or callee="fit" for the fix; both values result in successful metadata routing.

I think this is because the feature selector's fit_transform method passes the **fit_params to self.fit() without explicitly using the metadata routing functionality (because technically it doesn't have to). The code snippet below (source) shows that it only uses metadata routing for validation, which has no impact on the **fit_params.

    def fit_transform(self, X, y=None, **fit_params):
        if _routing_enabled():
            transform_params = self.get_metadata_routing().consumes(
                method="transform", params=fit_params.keys()
            )
            if transform_params:
                warnings.warn(
                    (
                        f"This object ({self.__class__.__name__}) has a `transform`"
                        " method which consumes metadata, but `fit_transform` does not"
                        " forward metadata to `transform`. Please implement a custom"
                        " `fit_transform` method to forward metadata to `transform` as"
                        " well. Alternatively, you can explicitly do"
                        " `set_transform_request`and set all values to `False` to"
                        " disable metadata routed to `transform`, if that's an option."
                    ),
                    UserWarning,
                )
        if y is None:
            # fit method of arity 1 (unsupervised transformation)
            return self.fit(X, **fit_params).transform(X)
        else:
            # fit method of arity 2 (supervised transformation)
            return self.fit(X, y, **fit_params).transform(X)

To illustrate the function calls, in order:

  1. pipeline.fit():
    • First calls process_routing(_method='fit') to request the metadata to pass forward
    • Then calls feature_selector.fit_transform() and includes the routed metadata
  2. feature_selector.fit_transform():
    • does NOT call process_routing()
    • calls self.fit(X, y, **fit_params)
  3. feature_selector.fit():
    • First calls process_routing(_method='fit') to request the metadata to pass forward
    • Then calls estimator.fit() and includes the routed metadata

Because step 2 doesn't participate in the metadata routing, it doesn't matter what the callee value is for the feature_selector. It just has to be one of the generically valid values defined by the class (e.g., fit, transform, fit_transform, etc.).

.add(caller="partial_fit", callee="partial_fit")
.add(caller="fit", callee="fit"),
)
sklearn/feature_selection/_rfe.py (4 additions, 1 deletion)
@@ -542,6 +542,7 @@ def get_metadata_routing(self):
router = MetadataRouter(owner=self.__class__.__name__).add(
estimator=self.estimator,
method_mapping=MethodMapping()
.add(caller="fit_transform", callee="fit_transform")
.add(caller="fit", callee="fit")
.add(caller="predict", callee="predict")
.add(caller="score", callee="score"),
@@ -968,7 +969,9 @@ def get_metadata_routing(self):
router = MetadataRouter(owner=self.__class__.__name__)
router.add(
estimator=self.estimator,
method_mapping=MethodMapping().add(caller="fit", callee="fit"),
method_mapping=MethodMapping()
.add(caller="fit", callee="fit")
.add(caller="fit_transform", callee="fit_transform"),
)
router.add(
splitter=check_cv(self.cv),
sklearn/feature_selection/_sequential.py (3 additions, 1 deletion)
@@ -350,7 +350,9 @@ def get_metadata_routing(self):
router = MetadataRouter(owner=self.__class__.__name__)
router.add(
estimator=self.estimator,
method_mapping=MethodMapping().add(caller="fit", callee="fit"),
method_mapping=MethodMapping()
.add(caller="fit", callee="fit")
.add(caller="fit_transform", callee="fit_transform"),
)
router.add(
splitter=check_cv(self.cv, classifier=is_classifier(self.estimator)),
sklearn/impute/_iterative.py (3 additions, 1 deletion)
@@ -1025,6 +1025,8 @@ def get_metadata_routing(self):
"""
router = MetadataRouter(owner=self.__class__.__name__).add(
estimator=self.estimator,
method_mapping=MethodMapping().add(callee="fit", caller="fit"),
method_mapping=MethodMapping()
.add(callee="fit", caller="fit")
.add(callee="fit_transform", caller="fit_transform"),
)
return router
sklearn/tests/metadata_routing_common.py (29 additions, 8 deletions)
@@ -57,20 +57,41 @@ def check_recorded_metadata(obj, method, parent, split_params=tuple(), **kwargs)
obj : estimator object
sub-estimator to check routed params for
method : str
sub-estimator's method where metadata is routed to, or otherwise in
the context of metadata routing referred to as 'callee'
parent : str
the parent method which should have called `method`, or otherwise in
the context of metadata routing referred to as 'caller'
The target sub-estimator method for which to check recorded metadata,
otherwise in the context of metadata routing referred to as 'callee'.
parent : str or None
The parent method which should have called `method`, or otherwise in
the context of metadata routing referred to as 'caller'.

Sub-estimator metadata is only checked if the `caller` method matches
the value defined by `parent`. If `parent` is None, the target
sub-estimator metadata is checked regardless of the `caller` method.

NOTE: many metaestimators call the subestimator in roundabout ways
and this makes it very difficult to know what method name to use for
`parent`. If misspecified, it results in tests passing trivially. For
example, when fitting the RFE metaestimator, RFE.fit() calls RFE._fit(),
which then calls subestimator.fit(). In this case, the user
configuring the test should set method="fit" and parent="_fit",
otherwise the test will pass trivially.
Comment on lines +66 to +76
Member:

I'm sure you've done a great job going through this bit of code figuring it out, but from the change in the docstring, it's not clear to me exactly what the issue is. Could you please add a test specifically for the change here, to make it clear for future developers?

Author:

Should I actually add a test in the sklearn/tests/test_metaestimators_metadata_routing.py, mark it with @pytest.mark.skip since it is expected to fail, then just add a note in the docstring referencing the test?

Member:

Since I don't really understand this comment, seeing that test helps.

Author:

My question was more around the logistics of adding a test that I know is going to fail, without it messing up any CI pipelines. For example, at work we enforce 100% of our tests to pass in order for PRs to get merged, so I just haven't encountered this situation before. Wasn't sure if that would be a problem here and if it is, is @pytest.mark.skip the right way to handle it?

Member:

We won't be merging w/o CI being green. Once you write a test, I'd know what needs to be done to fix it. Here I just don't understand what the comment is saying.

Author:

got it, will write up a test shortly

split_params : tuple, default=empty
specifies any parameters which are to be checked as being a subset
of the original values
**kwargs : dict
passed metadata
"""
all_records = (
getattr(obj, "_records", dict()).get(method, dict()).get(parent, list())
)

if parent:
all_records = (
getattr(obj, "_records", dict()).get(method, dict()).get(parent, list())
)
else:
all_records = [
r
for record in getattr(obj, "_records", dict()).get(method, dict()).values()
for r in record
]

for record in all_records:
# first check that the names of the metadata passed are the same as
# expected. The names are stored as keys in `record`.
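The record lookup above can be sketched in isolation: `_records` maps a callee method to the caller methods that invoked it, each holding a list of recorded kwargs, and `parent=None` flattens across all callers. The data below is illustrative, not taken from the test suite:

```python
# Illustrative stand-in for the `_records` structure used by
# check_recorded_metadata: callee method -> caller method -> records.
records = {
    "fit": {
        "fit": [{"sample_weight": [1.0, 2.0]}],
        "_fit": [{"sample_weight": [3.0, 4.0]}],
    }
}

def lookup(obj_records, method, parent):
    # Mirrors the branch in the diff: exact caller match when `parent`
    # is given, otherwise flatten the records of every caller.
    if parent:
        return obj_records.get(method, {}).get(parent, [])
    return [r for recs in obj_records.get(method, {}).values() for r in recs]

print(len(lookup(records, "fit", "_fit")))  # records from the "_fit" caller only
print(len(lookup(records, "fit", None)))    # records from every caller
```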
sklearn/tests/test_metaestimators_metadata_routing.py (89 additions, 0 deletions)
@@ -63,12 +63,14 @@
MultiOutputRegressor,
RegressorChain,
)
from sklearn.pipeline import Pipeline
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tests.metadata_routing_common import (
ConsumingClassifier,
ConsumingRegressor,
ConsumingScorer,
ConsumingSplitter,
ConsumingTransformer,
NonConsumingClassifier,
NonConsumingRegressor,
_Registry,
@@ -421,6 +423,7 @@
"estimator": "classifier",
"X": X,
"y": y,
"preserves_metadata": "subset",
"estimator_routing_methods": ["fit"],
"scorer_name": "scoring",
"scorer_routing_methods": ["fit"],
@@ -439,6 +442,7 @@
"metaestimator": RFECV,
"estimator": "classifier",
"estimator_name": "estimator",
"preserves_metadata": "subset",
"estimator_routing_methods": ["fit"],
"cv_name": "cv",
"cv_routing_methods": ["fit"],
@@ -708,6 +712,91 @@ def test_error_on_missing_requests_for_sub_estimator(metaestimator):
method(X, **method_kwargs)


@pytest.mark.parametrize("metaestimator", METAESTIMATORS, ids=METAESTIMATOR_IDS)
@config_context(enable_metadata_routing=True)
def test_metaestimators_in_pipeline(metaestimator):
# Check that metadata is routed correctly to the sub-estimator when the
# metaestimator is an intermediate step within a Pipeline.
if "estimator" not in metaestimator:
# This test only makes sense for metaestimators which have a
# sub-estimator, e.g. MyMetaEstimator(estimator=MySubEstimator())
return

metaestimator_class = metaestimator["metaestimator"]
X = metaestimator["X"]
y = metaestimator["y"]
method_name = "fit"
method_mapping = metaestimator.get("method_mapping", {})
preserves_metadata = metaestimator.get("preserves_metadata", True)

for key in ["sample_weight", "metadata"]:
val = {"sample_weight": sample_weight, "metadata": metadata}[key]
method_kwargs = {key: val}

kwargs, (estimator, registry), (scorer, _), (cv, _) = get_init_args(
metaestimator, sub_estimator_consumes=True
)
if scorer:
set_requests(
scorer, method_mapping={}, methods=["score"], metadata_name=key
)
if cv:
cv.set_split_request(groups=True, metadata=True)

final_estimator = ConsumingTransformer()

# `set_{method}_request({metadata}==True)` on the underlying estimator
set_requests(
estimator,
method_mapping=method_mapping,
methods=[method_name],
metadata_name=key,
)

# `set_{method}_request({metadata}==True)` on the final estimator
# (required to avoid UnsetMetadataPassedError on final estimator)
set_requests(
final_estimator,
method_mapping={},
methods=[method_name],
metadata_name=key,
)

metaestimator_instance = metaestimator_class(**kwargs)
extra_method_args = metaestimator.get("method_args", {}).get(method_name, {})

if not hasattr(metaestimator_instance, "transform"):
# This test only makes sense for metaestimators that can be
# intermediate steps in a Pipeline (e.g. transform method exists)
return

pipe = Pipeline(
[
("feature_selector", metaestimator_instance),
("final_estimator", final_estimator),
]
)

pipe_method = getattr(pipe, method_name)
pipe_method(X, y, **method_kwargs, **extra_method_args)

# sanity check that registry is not empty, or else the test passes
# trivially
for registry_ in [registry]:
assert registry_
split_params = (
method_kwargs.keys() if preserves_metadata == "subset" else ()
)
for estimator in registry_:
check_recorded_metadata(
estimator,
method=method_name,
parent=None,
split_params=split_params,
**method_kwargs,
)


@pytest.mark.parametrize("metaestimator", METAESTIMATORS, ids=METAESTIMATOR_IDS)
@config_context(enable_metadata_routing=True)
def test_setting_request_on_sub_estimator_removes_error(metaestimator):