Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[MRG] SimpleImputer: Handle string features where all values are missing #18860

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
Loading
from

Conversation

ssaamm
Copy link
Contributor

@ssaamm ssaamm commented Nov 17, 2020

Reference Issues/PRs

#17526 - This is relevant, as I don't think it was possible to run into this issue before.

What does this implement/fix? Explain your changes.

When dealing with a sparse string feature, it's not unlikely that a particular CV split has all missing values for said feature. In cases like this, SimpleImputer appears to break. I believe this is because _validate_input currently looks for a str value to determine dtype.

To me, it seems that if a user specifies a string for fill_value, that's another good indicator that dtype should be object.

Any other comments?

Thanks for your time and consideration!

@ssaamm
Copy link
Contributor Author

ssaamm commented Nov 18, 2020

Ah shoot, let me fix these linting problems.

Comment on lines +235 to +236
(isinstance(self.fill_value, str)
or any(isinstance(elem, str) for row in X for elem in row)):
Copy link
Member

@thomasjpfan thomasjpfan Nov 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using fill_value to determine the dtype of X and would cast X to an object dtype for input such as:

[[np.nan], [1], [np.nan]]

I am unsure if we want to do this. The current code is considering [[np.nan], [np.nan]] a numerical feature and raising an error if the fill_value is not numerical, which looks to be correct behavior.

If one using None as the missing value, then everything still works:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='constant', fill_value='UNKNOWN', missing_values=None)
X_2 = [[None], [None]]
imputer.fit_transform(X_2)

# array([['UNKNOWN'],
#        ['UNKNOWN']], dtype=object)

Related to: #17625

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @ssaamm !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.