ENH Uses _validate_data in other methods in the neural_network module #18514


Merged

Conversation

thomasjpfan (Member)

Reference Issues/PRs

Related to #18010

What does this implement/fix? Explain your changes.

This PR adds the use of _validate_data to non-fit methods. N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE is used to ignore modules so that we can work on these issues in smaller PRs. Currently, this PR only adds _validate_data to the neural_network module.
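For context, this is the calling pattern the PR moves non-fit methods to. A minimal sketch, assuming a made-up HypotheticalEstimator; _validate_data is the private helper on BaseEstimator:

    import numpy as np
    from sklearn.base import BaseEstimator

    class HypotheticalEstimator(BaseEstimator):
        def fit(self, X, y=None):
            # reset=True (the default): records n_features_in_ = X.shape[1]
            X = self._validate_data(X, reset=True)
            return self

        def predict(self, X):
            # reset=False: validates X and checks its feature count
            # against the n_features_in_ recorded during fit
            X = self._validate_data(X, reset=False)
            return np.zeros(X.shape[0])

With this in place, fitting on a (5, 3) array and then calling predict on a (2, 4) array raises a ValueError about the inconsistent number of features.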

Other comments?

While we work out the details of #18010, we can concurrently add _validate_data to non-fit methods. Once that is done everywhere, #18010 will be "free".

CC @NicolasHug @adrinjalali @amueller

@adrinjalali (Member)

interesting choice of a first module to fix :D

tests failing :)

sklearn/base.py Outdated
@@ -360,6 +360,10 @@ def _check_n_features(self, X, reset):
If True, the `n_features_in_` attribute is set to `X.shape[1]`.
Else, the attribute must already exist and the function checks
that it is equal to `X.shape[1]`.
.. note::
It is recommended to call reset=True in `fit` and in the first
call to `partial_fit`. All other methods that validates `X`
Member

Suggested change
call to `partial_fit`. All other methods that validates `X`
call to `partial_fit`. All other methods that validate `X`

same below

@@ -3121,6 +3121,66 @@ def check_requires_y_none(name, estimator_orig, strict_mode=True):
warnings.warn(warning_msg, FutureWarning)


def check_n_features_in_after_fitting(name, estimator_orig, strict_mode=True):
Member

Once all modules are supported, should this be merged with the already-existing check_n_features_in check?

We also have another check that specifically checks for an error when the number of features is inconsistent (I don't remember the name). Should this one be removed then? (If yes, let's document it here and next to N_FEATURES_IN_AFTER_FIT_MODULES_TO_IGNORE please)

Member Author

Yeah, we should merge it into check_n_features_in.

We also have another check that specifically checks for an error when the number of features is inconsistent (I don't remember the name). Should this one be removed then?

I think you are referring to check_estimators_partial_fit_n_features. This new check adds two new requirements on top of check_estimators_partial_fit_n_features:

  1. n_features_in_ is set during the first call to partial_fit.
  2. It is stricter about the error message.

I updated the comment with the above message.
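For illustration, a condensed, hypothetical sketch of what those two requirements amount to (the real check lives in sklearn/utils/estimator_checks.py; the asserted message fragment below is an assumption):

    import numpy as np
    from sklearn.base import clone

    def sketch_of_the_new_check(estimator_orig):
        # hypothetical condensed version of check_n_features_in_after_fitting
        est = clone(estimator_orig)
        X, y = np.random.rand(10, 4), np.arange(10) % 2

        # requirement 1: n_features_in_ is set by the first partial_fit call
        est.partial_fit(X, y, classes=[0, 1])
        assert est.n_features_in_ == 4

        # requirement 2: non-fit methods must raise a specific message when
        # the feature count is inconsistent, not just any error
        X_bad = np.random.rand(10, 5)
        try:
            est.predict(X_bad)
            raise AssertionError("predict should have raised ValueError")
        except ValueError as exc:
            assert "features" in str(exc)  # assumed message fragment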

Member

I was more referring to e.g. check_classifiers_train:

        msg = ("The classifier {} does not raise an error when the number of "
               "features in {} is different from the number of features in "
               "fit.")

but we can keep it as-is and remove later (or not, as long as it passes)


estimator = clone(estimator_orig)

has_classes = 'classes' in signature(estimator.partial_fit).parameters
Member

why is this needed?

Member Author

I was using this to detect whether partial_fit accepted classes. It turns out all classifiers have this in partial_fit, so I updated it to just use is_classifier.

Comment on lines 3158 to 3160
func = getattr(estimator, method, None)
if func is None:
continue
Member

Nit, but I find that using hasattr is cleaner.
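For instance, the quoted lines could read:

    # pattern currently in the diff
    func = getattr(estimator, method, None)
    if func is None:
        continue

    # hasattr-based alternative suggested above
    if not hasattr(estimator, method):
        continue
    func = getattr(estimator, method)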

sklearn/base.py Outdated
@@ -410,7 +418,7 @@ def _validate_data(self, X, y=None, reset=True,
"""

if y is None:
if self._get_tags()['requires_y']:
if reset and self._get_tags()['requires_y']:
Member

This means that if y isn't passed the second time partial_fit is called, the error won't be properly raised. Do we want that?

Member Author

This makes it somewhat inconvenient to call in non-fit methods. For predict, we would set reset=False and not provide a y. This if statement almost assumes that _validate_data is only called in fit or partial_fit.

Member Author

We spoke about it here: #18001

Member Author

A solution would be to add a requires_y kwarg to _validate_data:

    def _validate_data(self, X, y=None, reset=True, requires_y='auto',
                       validate_separately=False, **check_params):

where requires_y='auto' means using the tag; it could also be set explicitly to a bool.
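A sketch of how that could look (hypothetical; this kwarg was ultimately dropped, see below):

    def _validate_data(self, X, y=None, reset=True, requires_y='auto',
                       validate_separately=False, **check_params):
        # hypothetical: 'auto' defers to the estimator tag, while an
        # explicit bool from the caller overrides it
        if requires_y == 'auto':
            requires_y = self._get_tags()['requires_y']
        if y is None and requires_y:
            raise ValueError(
                f"This {self.__class__.__name__} estimator "
                "requires y to be passed, but the target y is None."
            )
        # ... remaining X/y validation as before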

Member

IIRC from our chat with @amueller a few months ago, the cleanest solution we had agreed on was to call _validate_data only in fit and in the first partial_fit, and to define another method for the rest (predict, transform, subsequent partial_fit, etc.). This might require changing _validate_data a little, which is fine since it's private.
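Roughly, that earlier idea would have split the helper in two, something like this (hypothetical names and bodies):

    from sklearn.utils.validation import check_array

    class HypotheticalBase:
        def _validate_data(self, X, y=None, **check_params):
            # only called from fit / the first partial_fit: always resets
            # n_features_in_ and can unconditionally consult requires_y
            X = check_array(X, **check_params)
            self.n_features_in_ = X.shape[1]
            return X

        def _check_data(self, X, **check_params):
            # companion for predict, transform, and subsequent partial_fit:
            # validates X and verifies the feature count without resetting it
            X = check_array(X, **check_params)
            if X.shape[1] != self.n_features_in_:
                raise ValueError(
                    f"X has {X.shape[1]} features, but this estimator was "
                    f"fitted with {self.n_features_in_} features."
                )
            return X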

Member

This means all fit calls must always explicitly pass y.

You mean that y must be explicitly passed to fit, or that y must be explicitly passed to _validate_data in fit?

Either way, I'm not sure I understand the reason.

thomasjpfan (Member Author), Oct 5, 2020

I updated the PR to use the __NO_OP placeholder. This means it is the caller's responsibility to pass y to get the requires_y check:

self._validate_data(X, None)  # will check that `y` is consistent with the `requires_y` tag

self._validate_data(X)  # will ignore `requires_y` tag

The alternative is to recommend setting y (in _validate_data) everywhere in fit. In other words:

def fit(self, X, y=None):
    # note that `y` is set explicitly to `None`, because the docstring
    # says `y` is ignored
    X = self._validate_data(X, None)

This way, the requires_y tag will always be checked against y.
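A condensed sketch of that sentinel pattern, with an object() sentinel standing in for the PR's string placeholder (the _check_n_features bookkeeping is elided):

    from sklearn.utils.validation import check_array, check_X_y

    _NO_Y = object()  # any unique sentinel distinguishes "y not passed" from y=None

    def _validate_data(self, X, y=_NO_Y, reset=True, **check_params):
        if y is _NO_Y:
            # caller passed only X: skip all y handling, tag check included
            return check_array(X, **check_params)
        if y is None:
            if self._get_tags()['requires_y']:
                # y was explicitly passed as None: enforce the requires_y tag
                raise ValueError(
                    f"This {self.__class__.__name__} estimator "
                    "requires y to be passed, but the target y is None."
                )
            return check_array(X, **check_params)
        return check_X_y(X, y, **check_params)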

Member

This means it is the caller's responsibility to pass y to get the requires_y check

I think that's fine because the reason we introduced the tag was to give a decent error message when y should be passed (e.g. for supervised estimators) but the user left it to None. Without the tag, they would get a weird tuple unpacking error with X, y = _validate_data(X, y). In other words, the tag is only useful when we pass both X and y, so it's fine to ignore it when we just pass X.
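To illustrate that failure mode with an invented helper: without the tag, passing y=None returns X alone, and the caller's unpacking fails confusingly.

    import numpy as np
    from sklearn.utils.validation import check_array

    def validate_without_tag(X, y=None):
        # hypothetical helper with no requires_y check: y=None silently
        # falls back to validating and returning X alone
        if y is None:
            return check_array(X)
        # (the check_X_y branch is elided)

    X = np.ones((5, 2))
    # the caller expects (X, y) back, so unpacking the lone 5x2 array
    # raises "ValueError: too many values to unpack (expected 2)"
    X, y = validate_without_tag(X, None)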

Could you remove the use of the requires_y parameter? I still see it in a few places ;)

thomasjpfan (Member Author), Oct 5, 2020

I think I am okay with delegating the responsibility of passing y to _validate_data's caller:

  • If the caller passes y, then _validate_data will validate y.
  • If the caller does not pass y, then do nothing (in terms of y).

Edit: I wrote this before GitHub updated my UI with your message above.

Member Author

Could you remove the use of the requires_y parameter? I still see it in a few places ;)

All gone now!

@thomasjpfan (Member Author)

interesting choice of a first module to fix :D tests failing :)

It was a module with a transformer + regressor + classifier and something with partial_fit. :)

@ogrisel (Member) left a comment

LGTM @thomasjpfan! The new check is nice.

@NicolasHug (Member) left a comment

Thanks @thomasjpfan , I think this LGTM.

The approach is a bit unconventional, so maybe we should wait for more eyeballs. In particular, I think the approval from @ogrisel was for a different implementation.

sklearn/base.py Outdated
y : array-like of shape (n_samples,), default=None
The targets. If None, `check_array` is called on `X` and
`check_X_y` is called otherwise.
y : array-like of shape (n_samples,), default=__NO_Y
Member

lol I think this might be the only case where it would make sense to use optional instead of indicating default=.... But we can leave it as-is since it's still private.

sklearn/base.py Outdated
requires_y tag is ignored. This is a default placeholder and is
never meant to be explicitly set.
- Otherwise, both `X` and `y` are checked with either `check_array`
or `check_X_y`.
Member

Suggested change
or `check_X_y`.
or `check_X_y` depending on `validate_separately`
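For reference, validate_separately is either False or a tuple of two dicts of check_array parameters; a usage sketch inside an estimator's fit:

    # validate X and y with different check_array parameters;
    # each dict is forwarded to its own check_array call
    check_X_params = dict(accept_sparse='csr', dtype='numeric')
    check_y_params = dict(ensure_2d=False, dtype=None)
    X, y = self._validate_data(
        X, y, validate_separately=(check_X_params, check_y_params))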

@glemaitre (Member)

Using __NO_Y looks good as well. It would be super nice to have a developer guide documenting these patterns in narrative form, but that is outside the scope here.

@ogrisel (Member) left a comment

What do you think of implementing the following slight variation to make this solution clearer? Other than that, LGTM.

sklearn/base.py Outdated
@@ -156,6 +156,8 @@ class BaseEstimator:
at the class level in their ``__init__`` as explicit keyword
arguments (no ``*args`` or ``**kwargs``).
"""
# used by _validate_data when `y` is not validated
__NO_Y = '__NO_Y'
Member

If you really want to keep a module-level constant as a default placeholder, I would rather use a single underscore. It's enough to tell it's private. But I think the code would be clearer without defining a module-level constant.

sklearn/base.py Outdated
@@ -378,7 +384,7 @@ def _check_n_features(self, X, reset):
self.n_features_in_)
)

def _validate_data(self, X, y=None, reset=True,
def _validate_data(self, X, y=__NO_Y, reset=True,
Member

I would just set y="no_validation" instead of defining a module-level constant as a default placeholder with a rather implicit name.

@ogrisel (Member), Oct 6, 2020

That would also make the docstring easier to understand.
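The merged version went with that suggestion. A paraphrased sketch of the resulting branching in _validate_data (details may differ from the actual commit):

    if y is None:
        if self._get_tags()['requires_y']:
            raise ValueError(
                f"This {self.__class__.__name__} estimator "
                "requires y to be passed, but the target y is None."
            )
        X = check_array(X, **check_params)
        out = X
    elif isinstance(y, str) and y == 'no_validation':
        # default placeholder: validate X only, skipping the tag check
        X = check_array(X, **check_params)
        out = X
    else:
        # both X and y were genuinely passed
        X, y = check_X_y(X, y, **check_params)
        out = X, y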

@ogrisel (Member)

ogrisel commented Oct 7, 2020

Thanks @thomasjpfan! Merging.

@ogrisel ogrisel merged commit 8731cd7 into scikit-learn:master Oct 7, 2020
@ogrisel (Member)

ogrisel commented Oct 7, 2020

So I guess now we have to open as many PRs as there are modules to update :)
