Describe the bug
According to the metadata routing docs, only four feature selection classes support metadata routing (as of v1.6):
- sklearn.feature_selection.RFE
- sklearn.feature_selection.RFECV
- sklearn.feature_selection.SelectFromModel
- sklearn.feature_selection.SequentialFeatureSelector
Each of these classes fails to route metadata when used inside a Pipeline object. When sample_weight is provided in the Pipeline's **fit_params, it is not passed to the feature selector's estimator, which may result in incorrect feature selection (e.g., when the relationship between the features and the response is materially impacted by sample_weight).
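To illustrate why silently dropping sample_weight matters, here is a minimal sketch in plain NumPy. It uses synthetic data rather than the iris example from this report, and the helper name wls_coef is hypothetical. It fits a linear model by least squares with and without non-uniform weights, and the two sets of coefficients can differ substantially.

```python
import numpy as np


def wls_coef(X, y, w):
    """Weighted least-squares slopes (an intercept is fitted, then dropped)."""
    A = np.column_stack([np.ones(len(X)), X])
    sw = np.sqrt(w)  # scaling rows by sqrt(w) turns WLS into ordinary LS
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[1:]


rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))

# Synthetic setup: the response follows feature 0 in the first half of the
# samples and feature 1 in the second half.
first_half = np.arange(n) < n // 2
y = np.where(first_half, X[:, 0], X[:, 1])

# The weights emphasise the first half heavily.
w = np.where(first_half, 10.0, 0.1)

coef_unweighted = wls_coef(X, y, np.ones(n))
coef_weighted = wls_coef(X, y, w)
print("unweighted:", coef_unweighted)
print("weighted:  ", coef_weighted)
```

A selector that ranks features by coefficient magnitude could therefore keep different features depending on whether the weights reach the inner estimator.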
Steps/Code to Reproduce
import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

sklearn.set_config(enable_metadata_routing=True)

X, y = load_iris(return_X_y=True, as_frame=True)
w = np.arange(len(X)) + 1  # non-uniform sample weights 1..n

# Both estimators request sample_weight in fit.
reg = LinearRegression().set_fit_request(sample_weight=True)
pipeline_reg = LinearRegression().set_fit_request(sample_weight=True)

# threshold=-np.inf keeps every feature, so only the inner fit is under test.
pipeline_fs = SelectFromModel(
    reg,
    threshold=-np.inf,
    prefit=False,
    max_features=len(X.columns),
)
pipeline = Pipeline(
    [
        ("feature_selector", pipeline_fs),
        ("regressor", pipeline_reg),
    ]
)

pipeline.fit(X, y, sample_weight=w)
reg.fit(X, y, sample_weight=w)  # reference fit outside the pipeline

test_passed = (
    pipeline["feature_selector"].estimator_.coef_.tolist()
    == reg.coef_.tolist()
)
Expected Results
The expected result is test_passed = True, i.e., the internal estimator of the pipeline's feature_selector should have coef_ that exactly matches the coef_ of a copy of the estimator fit on the same input, (X, y, sample_weight).
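Exact equality of floating-point lists can be brittle in general, so a tolerance-based comparison may be a more robust way to express this expectation. The following is a minimal sketch, not part of the original reproduction: the helper name coefs_match is hypothetical, and the arrays are copied from the output reported in this issue.

```python
import numpy as np


def coefs_match(a, b, rtol=1e-7, atol=0.0):
    """Return True if two coefficient vectors agree within a relative tolerance."""
    return bool(np.allclose(a, b, rtol=rtol, atol=atol))


# Values copied from the output reported in this issue.
pipeline_coef = np.array([-0.11190585, -0.04007949, 0.22864503, 0.60925205])
weighted_coef = np.array([-0.14681895, -0.07652903, 0.28196639, 0.5732906])
unweighted_coef = np.array([-0.11190585, -0.04007949, 0.22864503, 0.60925205])

print(coefs_match(pipeline_coef, weighted_coef))    # False
print(coefs_match(pipeline_coef, unweighted_coef))  # True
```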
Actual Results
The coefficients of the pipeline's feature_selector.estimator_ don't match those of the copied estimator trained on the same input (X, y, sample_weight).
>>> pipeline["feature_selector"].estimator_.coef_.tolist() == reg.coef_.tolist()
False
>>> pipeline["feature_selector"].estimator_.coef_
array([-0.11190585, -0.04007949, 0.22864503, 0.60925205])
>>> reg.coef_
array([-0.14681895, -0.07652903, 0.28196639, 0.5732906 ])
Rather, the coefficients of the pipeline's feature_selector.estimator_ match those of a copied estimator fit only on (X, y), without sample_weight.
>>> reg.fit(X,y).coef_
array([-0.11190585, -0.04007949, 0.22864503, 0.60925205])
Versions
System:
python: 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]
executable: /Users/kschluns/Library/Caches/pypoetry/virtualenvs/ds-sbraf-edgrceiw-py3.11/bin/python
machine: macOS-15.1.1-arm64-arm-64bit
Python dependencies:
sklearn: 1.6.0
pip: 23.1.2
setuptools: 75.6.0
numpy: 1.26.4
scipy: 1.14.1
Cython: None
pandas: 2.2.3
matplotlib: 3.10.0
joblib: 1.4.2
threadpoolctl: 3.5.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 8
prefix: libopenblas
filepath: /Users/kschluns/Library/Caches/pypoetry/virtualenvs/ds-sbraf-edgrceiw-py3.11/lib/python3.11/site-packages/numpy/.dylibs/libopenblas64_.0.dylib
version: 0.3.23.dev
threading_layer: pthreads
architecture: armv8
user_api: openmp
internal_api: openmp
num_threads: 8
prefix: libomp
filepath: /Users/kschluns/Library/Caches/pypoetry/virtualenvs/ds-sbraf-edgrceiw-py3.11/lib/python3.11/site-packages/sklearn/.dylibs/libomp.dylib
version: None