ENH Adds _num_features for array-likes #19633
Conversation
CC @ogrisel This PR implements the idea from #19555 (comment)
Thanks for taking care of this @thomasjpfan. Here are some comments. Let me know what you think.
It would be great to improve the tests to check that the extended exception messages work as expected.
sklearn/utils/validation.py (outdated)

```python
if len(X.shape) <= 1:
    raise TypeError(message)
if isinstance(X.shape[1], numbers.Integral):
    return X.shape[1]
```
Shall we raise `ValueError` in the `else` branch of this condition? I am not even sure how we could meaningfully trigger this in tests or in a legit usage scenario.
Maybe we should always return `X.shape[1]` without the `if isinstance(X.shape[1], numbers.Integral)` condition.
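To illustrate the simplification being discussed, here is a minimal, hypothetical sketch (not the code in this PR; `num_features_from_shape` and `FakeArray` are made-up names) of the shape-based branch that always trusts `X.shape[1]`:

```python
def num_features_from_shape(X):
    """Hypothetical sketch of the shape-based branch discussed above:
    trust X.shape[1] directly, without the numbers.Integral guard."""
    if len(X.shape) <= 1:
        # 1d inputs have no feature axis to report
        raise TypeError(
            f"Unable to find the number of features from X of type "
            f"{type(X).__name__}"
        )
    return X.shape[1]


class FakeArray:
    """Minimal stand-in for any array-like exposing a .shape attribute."""
    def __init__(self, shape):
        self.shape = shape
```

For example, `num_features_from_shape(FakeArray((10, 3)))` returns 3, while a 1d shape like `(10,)` raises a `TypeError`.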
I believe this code was inspired by the `_num_samples` case for `.shape[0]`, where the first dimension is not always known: dask dataframes can have a delayed first dimension:

```python
>>> df.shape
(Delayed('int-1d60df59-017f-49c3-80e5-f313b06e4c1d'), 120)
```

Furthermore, in this case `_num_samples` would fall back to calling `len(df)`, which would naturally trigger the delayed computation.
For the `_num_features` case, we would fall back to doing `len(df[0])`. In the case of a dask dataframe, `df[0]` would be a series holding the first column, and therefore `len(df[0])` would return the number of samples instead of the number of features.
Finally, I don't see a valid common case where this would happen for the second dimension. I think we can remove this defensive coding condition.
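To make this behaviour concrete without depending on dask, here is a small simulation (the `FakeDelayed` and `FakeDaskFrame` names are made up for illustration) of a frame whose first dimension is lazy while the second is a plain integer:

```python
import numbers


class FakeDelayed:
    """Stand-in for dask's Delayed: the value is only known on compute()."""
    def __init__(self, fn):
        self._fn = fn

    def compute(self):
        return self._fn()


class FakeDaskFrame:
    # The number of rows is lazy (it depends on partitioning); the number
    # of columns is known eagerly, so shape[1] is a concrete integer.
    shape = (FakeDelayed(lambda: 120), 4)


X = FakeDaskFrame()
# The first dimension is *not* an integer until computed...
assert not isinstance(X.shape[0], numbers.Integral)
assert X.shape[0].compute() == 120
# ...but the second dimension already is, so returning X.shape[1] is safe.
assert isinstance(X.shape[1], numbers.Integral)
```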
Just for dask dataframes, should we get the first row with the following?

```python
if hasattr(X, 'iloc'):
    first_sample = X.iloc[0, :]
else:
    first_sample = X[0]
```

(Renaming to `first_sample` to be a little more clear.)
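The distinction matters because `X[0]` on a pandas-style frame selects a *column*, while `X.iloc[0, :]` selects a *row*. A toy column-major frame (pure stdlib; `ColumnMajorFrame` is a made-up name, not a real pandas or dask class) shows the difference:

```python
class ColumnMajorFrame:
    """Toy frame with pandas-like semantics: X[key] returns a column,
    X.iloc[row, :] returns a row."""

    def __init__(self, columns):
        self._columns = columns  # dict mapping column name -> list of values

    def __getitem__(self, key):
        # Column access, like pandas: returns a whole column
        return self._columns[key]

    class _ILoc:
        def __init__(self, frame):
            self._frame = frame

        def __getitem__(self, idx):
            row, _cols = idx  # ignore the column slice in this toy version
            return [col[row] for col in self._frame._columns.values()]

    @property
    def iloc(self):
        return self._ILoc(self)


X = ColumnMajorFrame({0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0]})

# The pattern suggested above:
if hasattr(X, 'iloc'):
    first_sample = X.iloc[0, :]
else:
    first_sample = X[0]
```

Here `len(X[0])` would be 3 (the number of samples), while `len(first_sample)` is 2 (the number of features), which is what `_num_features` actually wants.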
I don't think that's necessary: I see no reason why `X.shape[1]` would be `Delayed` in a dask dataframe. Only the first dimension can lazily depend on the result of index-based partitioning, for instance after a groupby operation.
I applied my own suggestions to move this PR forward. I will fix the CI if they broke anything... which they did :)
LGTM.
@thomasjpfan @NicolasHug let me know if you agree with the changes I pushed. In particular: #19633 (comment)
I updated the error message to display the type when the first sample is a string or bytes.
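A hedged sketch of how such a type-aware error message might look (an illustration, not the exact code merged in the PR; `first_sample_num_features` is a made-up name): when the first sample is a `str` or `bytes`, its "length" would be meaningless as a feature count, so the types are surfaced in the message instead:

```python
def first_sample_num_features(X):
    """Illustrative sketch: infer n_features from the first sample of a
    list-like, rejecting str/bytes samples with a type-aware error."""
    first_sample = X[0]
    if isinstance(first_sample, (str, bytes)):
        raise TypeError(
            "Unable to find the number of features from X of type "
            f"{type(X).__name__}, where the samples are of type "
            f"{type(first_sample).__name__}"
        )
    return len(first_sample)
```

For example, `first_sample_num_features([[1, 2, 3], [4, 5, 6]])` returns 3, while `first_sample_num_features(["abc", "def"])` raises a `TypeError` mentioning `str`.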
Mostly questions for my own better understanding.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
LGTM
@ogrisel As there have been changes since your approval: are we good to merge?
LGTM with the latest changes. Let's merge.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Reference Issues/PRs
Related to #19555
What does this implement/fix? Explain your changes.
This PR adds a `_num_features` that is similar to `_num_samples` but for features: `_num_features` does not actually check that every single row has the same number of elements. The validation would either be done beforehand, e.g. by `_validate_data`, or be delegated to another estimator. `_validate_data` will always try to get the number of features based on `X` without needing `ensure_2d`. In other words, `_check_n_features` will validate when it can. `_num_features` itself does not validate, but I am assuming the validation would be done with `check_array` at some point.
Any other comments?
This PR enables #19555 to call `_check_n_features` when `cv="prefit"` to make sure that the prefitted estimator and the `CalibratedClassifierCV` are compatible.
In the end, I am thinking about feature names. I am working toward a simple method or function such as `_check_features(X)` that any estimator can call, which would do the "correct" thing when it comes to `n_features_in_` and `feature_names_in_`.
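As a rough sketch of what such a helper might look like (purely hypothetical: `check_features_sketch` does not exist in this PR, and the attribute handling here is my own assumption about the "correct" behaviour):

```python
def check_features_sketch(estimator, X, *, reset):
    """Hypothetical helper: record n_features_in_ (and feature_names_in_
    when column names are available) on first fit, check on later calls."""
    # Prefer the shape when available, otherwise fall back to the
    # length of the first sample for lists of lists.
    n_features = X.shape[1] if hasattr(X, "shape") else len(X[0])
    names = list(X.columns) if hasattr(X, "columns") else None

    if reset:
        # First fit: record what we saw.
        estimator.n_features_in_ = n_features
        if names is not None:
            estimator.feature_names_in_ = names
    elif n_features != estimator.n_features_in_:
        # Later calls (predict, transform, ...): check consistency.
        raise ValueError(
            f"X has {n_features} features, but this estimator was fitted "
            f"expecting {estimator.n_features_in_} features."
        )


class DummyEstimator:
    """Bare object to attach the fitted attributes to in this sketch."""
    pass
```

Calling it with `reset=True` at fit time records `n_features_in_`; calling it with `reset=False` later raises a `ValueError` on a feature-count mismatch.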