TST Add a test for meta-estimators with non tabular data #19755

Conversation

@jeremiedbb (Member) commented Mar 23, 2021

Meta-estimators should delegate data validation to their inner estimator(s), which is currently not the case for most of them (only the SearchCV estimators delegate).

This PR proposes to introduce a new common test, for meta-estimators only, checking that they work with non-tabular data as long as the inner estimator does (for instance, a pipeline with text preprocessing followed by ridge or logistic regression).
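
For context, here is a minimal sketch of the scenario the new test covers, using GridSearchCV (which, per the description above, already delegates validation) around an illustrative text pipeline; the toy data and parameter grid are made up for the example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline

    # Non-tabular input: a plain list of strings.
    X = ["good movie", "bad movie", "great film", "awful film"] * 5
    y = [1, 0, 1, 0] * 5

    # The inner pipeline turns text into features, so the meta-estimator
    # must not reject X for not being a 2-D numeric array.
    inner = make_pipeline(TfidfVectorizer(), LogisticRegression())
    search = GridSearchCV(inner, {"logisticregression__C": [0.1, 1.0]}, cv=2)
    search.fit(X, y)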

This is also related to n_features_in_, since meta-estimators should delegate setting n_features_in_ to their inner estimator when applicable (see #19333).
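
For instance (an illustrative sketch, not code from the PR), a pipeline whose first step is a text vectorizer exposes no n_features_in_ at all, and a meta-estimator wrapping such a pipeline should not invent one by validating X itself:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    pipe = make_pipeline(TfidfVectorizer(), Ridge())
    pipe.fit(["some text", "more text", "other words"], [0.0, 1.0, 0.5])

    # Pipeline delegates n_features_in_ to its first step, and a text
    # vectorizer does not define it, so the attribute is absent here.
    assert not hasattr(pipe, "n_features_in_")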

For now there is a long blacklist; meta-estimators should be removed from it as they are fixed.

CC @ogrisel

@thomasjpfan (Member) left a comment

Thank you for working on this @jeremiedbb!

I am +1 on enforcing check_meta_estimators_validation for our meta-estimators, but it may be too much of a requirement for third-party meta-estimators.

Resolved review thread on sklearn/utils/estimator_checks.py (outdated).
@jeremiedbb (Member, Author)

> but it may be too much of a requirement for third-party meta-estimators.

Unless I'm wrong, I only enabled this check for our meta-estimators. Is there something I didn't catch?

@thomasjpfan (Member)

> Unless I'm wrong, I only enabled this check for our meta-estimators. Is there something I didn't catch?

This PR looks okay. I was concerned about adding a "public function", check_meta_estimators_validation, to estimator_checks.py when it would only be used to validate our own meta-estimators. I think most of the checks in estimator_checks.py were meant to be placed in check_estimator at some point.

@jeremiedbb (Member, Author) commented Mar 23, 2021

> I think most of the checks in estimator_checks.py were meant to be placed in check_estimator at some point.

I see, thanks for the clarification. I'll move everything to test_common only.
EDIT: Actually maybe it's more appropriate to put this test in test_metaestimators.

@jeremiedbb changed the title from "TST Add a common test for meta-estimators with non tabular data" to "TST Add a test for meta-estimators with non tabular data" on Mar 23, 2021.
@ogrisel (Member) left a comment

LGTM. I think it's best to have meta-estimators delegate input validation by default. We might still want to have exceptions to this rule on a case-by-case basis, but they should be motivated.

Resolved review thread on sklearn/tests/test_metaestimators.py.
@ogrisel (Member) commented Mar 24, 2021

Once this PR is merged, we should open an issue in the tracker with a TODO list of meta-estimators to update.

Some meta-estimators might require some code change, for instance by using safe_indexing on X and y instead of relying on sample-wise fancy indexing.
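
For reference, a minimal sketch of that pattern (using sklearn.utils._safe_indexing, the private helper available at the time; the toy data is made up):

    from sklearn.utils import _safe_indexing

    # Non-tabular X: a plain Python list of documents.
    X = ["first document", "second one", "third text", "fourth sample"]
    y = [0, 1, 0, 1]
    train_idx = [0, 2, 3]

    # _safe_indexing handles lists, numpy arrays, sparse matrices and
    # DataFrames alike; fancy indexing like X[train_idx] would fail
    # on a plain list.
    X_train = _safe_indexing(X, train_idx)
    y_train = _safe_indexing(y, train_idx)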


if "param_grid" in sig or "param_distributions" in sig:
# SearchCV estimators
yield Estimator(estimator, param_grid)
Review comment (Member):

This raises a UserWarning for RandomizedSearchCV:

UserWarning: The total space of parameters 2 is smaller than n_iter=10.
Running 2 iterations. For exhaustive searches, use GridSearchCV.

I think this is okay to ignore with an ignore_warnings(category=UserWarning) decorator on test_meta_estimators_delegate_data_validation.

@jeremiedbb (Member, Author) replied:

Done, but I set n_iter=2 for RandomizedSearchCV instead of ignoring the warnings for the whole test.
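
A sketch of that fix (the estimator and parameter grid are illustrative): with only two candidate values, n_iter=2 makes the random search exhaustive, so the warning is never raised.

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import RandomizedSearchCV

    # n_iter=2 matches the size of the parameter space, so the
    # "total space of parameters ... smaller than n_iter" warning
    # from the default n_iter=10 never fires.
    search = RandomizedSearchCV(
        LogisticRegression(),
        param_distributions={"C": [0.1, 1.0]},
        n_iter=2,
    )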

@thomasjpfan (Member)

Thinking about this more, I think there are some benefits to having the meta-estimator validate:

  1. If the meta-estimator validates for finite values, then it can set config_context(assume_finite=True) for the inner estimators (see the sketch after this list). Deciding when to validate for finite values would mean checking the 'allow_nan' tag of the inner estimators.
  2. If the meta-estimator validates and converts the input into a numpy array, that can take advantage of the memmapping in joblib.
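
A minimal sketch of point 1 (the fit_meta helper is hypothetical, and the private _get_tags() API is an assumption based on the tag mechanism of the time; this is not code from the PR):

    from sklearn import config_context
    from sklearn.utils.validation import check_array

    def fit_meta(meta, X, y):
        # Hypothetical meta-estimator fit: validate finiteness once up
        # front, then let the inner estimator skip its own check.
        inner = meta.estimator
        # _get_tags() was the (private) estimator tag API at the time.
        if not inner._get_tags().get("allow_nan", False):
            X = check_array(X, force_all_finite=True)
            with config_context(assume_finite=True):
                return inner.fit(X, y)
        return inner.fit(X, y)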

@jeremiedbb (Member, Author)

> Thinking about this more, I think there are some benefits to having the meta-estimator validate

I agree that there are benefits to validating in the meta-estimator. Based on our previous discussions, I think the main goal of not doing any validation is to take a step towards being able to pass estimators that can process cupy arrays, dask arrays, ...

However, as you pointed out, we could check the inner estimators' tags to decide whether we can safely do some partial (or complete) validation.

@thomasjpfan (Member)

> Based on our previous discussions, I think the main goal of not doing any validation is to take a step towards being able to pass estimators that can process cupy arrays, dask arrays, ...

Yeah, I think the benefit of supporting more array types outweighs the advantages I mentioned in #19755 (comment).

@thomasjpfan (Member) left a comment

Minor comments, otherwise LGTM

"base_estimator" or "estimators".
"""
for _, Estimator in sorted(all_estimators()):
sig = list(signature(Estimator).parameters)
Review comment (Member):

Small nit:

Suggested change:
    - sig = list(signature(Estimator).parameters)
    + sig = set(signature(Estimator).parameters)

        sig = list(signature(Estimator).parameters)

        if "estimator" in sig or "base_estimator" in sig:
            if issubclass(Estimator, RegressorMixin):
Review comment (Member):

Was there an issue with using is_regressor here?

Suggested change:
    - if issubclass(Estimator, RegressorMixin):
    + if is_regressor(Estimator):


elif "estimators" in sig:
# stacking, voting
if issubclass(Estimator, RegressorMixin):
Review comment (Member):

Same here:

Suggested change:
    - if issubclass(Estimator, RegressorMixin):
    + if is_regressor(Estimator):

@thomasjpfan merged commit 1ce1715 into scikit-learn:main on Apr 8, 2021.
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request on Apr 19, 2021:
…n#19755)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@glemaitre mentioned this pull request on Apr 22, 2021.