TST Add a test for meta-estimators with non tabular data #19755
Conversation
Thank you for working on this @jeremiedbb! I am +1 on enforcing check_meta_estimators_validation for our meta-estimators, but it may be too much of a requirement for third-party meta-estimators.
Unless I'm wrong, I only enabled this check for our meta-estimators. Is there something I didn't catch?
This PR looks okay. I was concerned with adding a "public function",
I see, thanks for the clarification. I'll move everything to test_common only.
LGTM. I think it's best to have meta-estimators delegate input validation by default. We might still want to have exceptions to this rule on a case-by-case basis, but they should be motivated.
Once this PR is merged, we should open an issue in the tracker with a TODO list of meta-estimators to update. Some meta-estimators might require some code change, for instance by using …
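To make the expectation concrete, here is a hedged sketch (illustrative only, not code from this PR) of a meta-estimator that delegates validation: fit does not call check_X_y itself and simply hands X and y to the wrapped estimator.

from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone


class PassthroughWrapper(MetaEstimatorMixin, BaseEstimator):
    """Toy meta-estimator that leaves all input validation to the inner estimator."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        # No check_array / check_X_y here: X can be a list of strings, a
        # dataframe, a dask or cupy array, ... as long as the inner
        # estimator accepts it.
        self.estimator_ = clone(self.estimator).fit(X, y, **fit_params)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)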
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
sklearn/tests/test_metaestimators.py (outdated diff)

        if "param_grid" in sig or "param_distributions" in sig:
            # SearchCV estimators
            yield Estimator(estimator, param_grid)
This raises a UserWarning for RandomizedSearchCV:

    UserWarning: The total space of parameters 2 is smaller than n_iter=10. Running 2 iterations. For exhaustive searches, use GridSearchCV.

I think this is okay to ignore with an ignore_warnings(category=UserWarning) decorator on test_meta_estimators_delegate_data_validation.
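For reference, a sketch of the decorator approach suggested here, assuming sklearn.utils._testing.ignore_warnings (which accepts a category argument); it would silence UserWarnings for the whole test:

from sklearn.utils._testing import ignore_warnings


@ignore_warnings(category=UserWarning)
def test_meta_estimators_delegate_data_validation(estimator):
    # body of the common test; any UserWarning raised inside is suppressed
    ...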
Done, but I set n_iter=2 for RandomizedSearchCV instead of ignoring the warnings for the whole test.
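Something along these lines (a sketch, parameter names are illustrative): with only two candidate values, n_iter=2 keeps RandomizedSearchCV from warning that the search space is smaller than the default n_iter=10.

from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {"alpha": [0.1, 1.0]}  # only 2 candidates
search = RandomizedSearchCV(Ridge(), param_distributions, n_iter=2)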
Thinking about this more, I think there are some benefits to having the meta-estimator validate:
…eremiedbb/scikit-learn into common-test-pipeline-in-meta-estimator
I agree that there are benefits to validating in the meta-estimator. Based on our previous discussions, I think the main goal of not doing any validation is to make a step towards being able to pass estimators that can process cupy arrays, dask arrays, ... However, as you pointed out, we could check the inner estimator's tags to decide if we can safely do some partial (or complete) validation.
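A rough sketch of what such a tag check could look like (illustrative only; _safe_tags is a private helper and the exact mechanism is not part of this PR):

from sklearn.utils._tags import _safe_tags
from sklearn.utils.validation import check_array


def _maybe_validate(inner_estimator, X):
    # If the inner estimator only accepts plain numeric (dense or sparse)
    # input, validating in the meta-estimator is safe and gives earlier,
    # clearer error messages.
    x_types = _safe_tags(inner_estimator, key="X_types")
    if set(x_types) <= {"2darray", "sparse"}:
        return check_array(X, accept_sparse=True)
    # Otherwise pass X through untouched (lists of strings, dask/cupy arrays, ...).
    return X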
Yeah, I think the benefit of supporting more array types outweighs the advantages I mentioned in #19755 (comment).
Minor comments, otherwise LGTM
sklearn/tests/test_metaestimators.py (outdated diff)

    "base_estimator" or "estimators".
    """
    for _, Estimator in sorted(all_estimators()):
        sig = list(signature(Estimator).parameters)
Small nit, suggested change:

    - sig = list(signature(Estimator).parameters)
    + sig = set(signature(Estimator).parameters)
sklearn/tests/test_metaestimators.py (outdated diff)

        sig = list(signature(Estimator).parameters)

        if "estimator" in sig or "base_estimator" in sig:
            if issubclass(Estimator, RegressorMixin):
Was there an issue with using is_regressor here?

Suggested change:

    - if issubclass(Estimator, RegressorMixin):
    + if is_regressor(Estimator):
sklearn/tests/test_metaestimators.py (outdated diff)

        elif "estimators" in sig:
            # stacking, voting
            if issubclass(Estimator, RegressorMixin):
Same here:

    - if issubclass(Estimator, RegressorMixin):
    + if is_regressor(Estimator):
…n#19755) Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Meta-estimators should delegate data validation to their inner estimator(s), which is currently not the case (only the SearchCV estimators do).
This PR proposes to introduce a new common test, for meta-estimators only, to check that they work with non-tabular data as long as the inner estimator does (for instance a pipeline with text preprocessing followed by ridge or logistic regression).
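As an illustration of the kind of check being added (a hedged sketch, not the exact test in this PR), one can fit a SearchCV meta-estimator on raw text and rely on the inner pipeline to turn it into a matrix:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline


def test_search_cv_works_on_text_data():
    X = ["this is spam", "free money now", "meeting at noon", "lunch tomorrow"]
    y = np.array([1, 1, 0, 0])
    inner = make_pipeline(CountVectorizer(), LogisticRegression())
    search = GridSearchCV(inner, {"logisticregression__C": [0.1, 1.0]}, cv=2)
    # Must not raise: GridSearchCV passes the list of strings through to the
    # pipeline instead of calling check_array on it.
    search.fit(X, y)
    search.predict(X)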
This is also related to n_features_in_, since meta-estimators should delegate setting n_features_in_ to their inner estimator when applicable (see #19333). There's a long blacklist for now, from which meta-estimators should be removed as they are fixed.
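As a hedged illustration of that delegation (not code from this PR), a meta-estimator can expose n_features_in_ as a property forwarding to its fitted inner estimator, so the attribute only exists when the inner estimator actually saw tabular input:

from sklearn.base import BaseEstimator, MetaEstimatorMixin, clone


class DelegatingWrapper(MetaEstimatorMixin, BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Validation is left to the inner estimator, as in the rest of this PR.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    @property
    def n_features_in_(self):
        # Raises AttributeError before fit, or when the inner estimator did not
        # set n_features_in_ (e.g. it was fitted on non-tabular data).
        return self.estimator_.n_features_in_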
CC @ogrisel