ENH Adds _num_features for array-likes #19633


Merged: 14 commits, Mar 16, 2021

Conversation

thomasjpfan
Member

Reference Issues/PRs

Related to #19555

What does this implement/fix? Explain your changes.

This PR adds a _num_features helper that is similar to _num_samples, but for features:

  1. _num_features does not actually check that every single row has the same number of elements. That validation is either done beforehand, e.g. by _validate_data, or delegated to another estimator.
  2. _validate_data will always try to get the number of features from X without needing ensure_2d. In other words, _check_n_features will validate whenever it can.
  3. The checking here is very light: technically, 3D arrays would work with _num_features, but I am assuming the validation would be done with check_array at some point.

Any other comments?

This PR enables #19555 to call _check_n_features when cv="prefit" to make sure that the prefitted estimator and the CalibratedClassifierCV are compatible.

In the end, I am thinking about feature names. I am working toward a simple method or function, such as _check_features(X), that any estimator can call and that would do the "correct" thing when it comes to n_features_in_ and feature_names_in_.

@thomasjpfan
Member Author

CC @ogrisel This PR implements the idea from #19555 (comment)

Member

@ogrisel ogrisel left a comment

Thanks for taking care of this @thomasjpfan. Here are some comments. Let me know what you think.

It would be great to improve the tests to check that the extended exception messages work as expected.

    if len(X.shape) <= 1:
        raise TypeError(message)
    if isinstance(X.shape[1], numbers.Integral):
        return X.shape[1]
Member

Shall we raise a ValueError in the else branch of this condition? I am not even sure how we could meaningfully trigger it in tests or in a legitimate usage scenario.

Maybe we should always return X.shape[1], without the isinstance(X.shape[1], numbers.Integral) condition.

Member

@ogrisel ogrisel Mar 10, 2021

I believe this code was inspired by the _num_samples handling of .shape[0], which is not always a known integer for dask dataframes, where the first dimension can be delayed:

>>> df.shape
(Delayed('int-1d60df59-017f-49c3-80e5-f313b06e4c1d'), 120)

Furthermore, in this case _num_samples would fall back to calling len(df), which would naturally trigger the delayed computation.

For the _num_features case, we would fall back to doing len(df[0]). In the case of a dask dataframe, df[0] would be a series for the first column, and therefore len(df[0]) would return the number of samples instead of the number of features.

Finally, I don't see a valid common case where this would happen for the second dimension. I think we can remove this defensive coding condition.
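Since dask may not be installed, the situation above can be reproduced with a stand-in object (FakeDelayed and FakeDaskDataFrame are illustrative names, not dask code): the first dimension fails a numbers.Integral check while the second passes, which is the basis of the argument for dropping the check on shape[1].

```python
import numbers


class FakeDelayed:
    """Stand-in for dask's Delayed placeholder (not an integer)."""


class FakeDaskDataFrame:
    """Mimics a dask dataframe whose row count is lazily computed."""
    shape = (FakeDelayed(), 120)


df = FakeDaskDataFrame()
# The first dimension is not a concrete integer until computed...
assert not isinstance(df.shape[0], numbers.Integral)
# ...while the second dimension is always a plain int.
assert isinstance(df.shape[1], numbers.Integral)
```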

Member Author

@thomasjpfan thomasjpfan Mar 10, 2021

Just for dask dataframes, should we get the first row with the following?

    if hasattr(X, 'iloc'):
        first_sample = X.iloc[0, :]
    else:
        first_sample = X[0]

(Renamed to first_sample to be a little clearer.)
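To keep the example dependency-free, a minimal stand-in shows why this iloc dispatch matters: on a dataframe-like object, plain integer indexing selects a column, while .iloc[0, :] selects the first row. FakeFrame and first_sample are illustrative names, not scikit-learn or pandas code.

```python
class FakeFrame:
    """Minimal dataframe stand-in: df[0] selects a *column*,
    while df.iloc[0, :] selects the first *row*."""

    def __init__(self, rows):
        self._rows = rows

    def __getitem__(self, col):          # df[0] -> first column
        return [row[col] for row in self._rows]

    @property
    def iloc(self):
        rows = self._rows

        class _ILoc:
            def __getitem__(self, key):  # df.iloc[0, :] -> first row
                i, _ = key
                return rows[i]

        return _ILoc()


def first_sample(X):
    # The dispatch suggested above: prefer positional row access
    # when an ``iloc`` accessor is present.
    if hasattr(X, "iloc"):
        return X.iloc[0, :]
    return X[0]
```

With X = FakeFrame([[1, 2, 3], [4, 5, 6]]), first_sample(X) returns the first row [1, 2, 3], whereas bare X[0] would return the first column [1, 4].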

Member

@ogrisel ogrisel Mar 11, 2021

I don't think that's necessary: I see no reason why X.shape[1] would be Delayed in a dask dataframe. Only the first dimension can lazily depend on the result of index-based partitioning, for instance after a groupby operation.

@ogrisel
Member

ogrisel commented Mar 10, 2021

I applied my own suggestions to move this PR forward. I will fix the CI if they broke anything... which they did :)

Member

@ogrisel ogrisel left a comment

LGTM.

@thomasjpfan @NicolasHug let me know if you agree with the changes I pushed. In particular: #19633 (comment)

@thomasjpfan
Member Author

I updated the error message to display the type when the first sample is a string or bytes, at 5c010dc (#19633)

Member

@lorentzenchr lorentzenchr left a comment

Mostly questions for my own better understanding.

Member

@lorentzenchr lorentzenchr left a comment

LGTM

@lorentzenchr
Member

@ogrisel As there have been changes since your approval: are we good to merge?

Member

@ogrisel ogrisel left a comment

LGTM with the latest changes. Let's merge.

@ogrisel ogrisel merged commit d996eaf into scikit-learn:main Mar 16, 2021
marrodion pushed a commit to marrodion/scikit-learn that referenced this pull request Mar 17, 2021
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@glemaitre glemaitre mentioned this pull request Apr 22, 2021