[MRG] SimpleImputer: Handle string features where all values are missing #18860

ssaamm · Nov 17, 2020

Reference Issues/PRs

#17526 - This is relevant, as I don't think it was possible to run into this issue before.

What does this implement/fix? Explain your changes.

When dealing with a sparse string feature, it's not unlikely that a particular CV split has all missing values for said feature. In cases like this, SimpleImputer appears to break. I believe this is because _validate_input currently looks for a str value to determine dtype.

To me, it seems that if a user specifies a string for fill_value, that's another good indicator that dtype should be object.
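For readers following along, the two rules can be compared in a small pure-Python sketch. The helper names `current_rule` and `proposed_rule` are hypothetical and not part of scikit-learn; they only mirror the condition shown in the diff, returning `object` when the rule would force an object dtype and `None` when dtype inference is left to the array conversion.

```python
def current_rule(X):
    """Approximation of the existing check: object dtype only if X holds a str."""
    if any(isinstance(elem, str) for row in X for elem in row):
        return object
    return None


def proposed_rule(X, fill_value):
    """Approximation of this PR's check: a str fill_value also forces object."""
    if isinstance(fill_value, str) or any(
        isinstance(elem, str) for row in X for elem in row
    ):
        return object
    return None


nan = float("nan")
all_missing = [[nan], [nan]]  # a CV split where the string feature is all missing

print(current_rule(all_missing))              # None
print(proposed_rule(all_missing, "UNKNOWN"))  # <class 'object'>
```

Under the existing rule, an all-missing column contains no `str` to detect, so nothing marks the feature as a string feature.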

Any other comments?

Thanks for your time and consideration!

ssaamm · Nov 18, 2020

Ah shoot, let me fix these linting problems.

thomasjpfan · Nov 19, 2020

sklearn/impute/_base.py

+                (isinstance(self.fill_value, str)
+                 or any(isinstance(elem, str) for row in X for elem in row)):


This is using fill_value to determine the dtype of X and would cast X to an object dtype for input such as:

[[np.nan], [1], [np.nan]]

I am unsure if we want to do this. The current code is considering [[np.nan], [np.nan]] a numerical feature and raising an error if the fill_value is not numerical, which looks to be correct behavior.
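The behavior described above can be illustrated with a simplified stand-in. `check_fill_value` is a hypothetical helper, not the actual scikit-learn code; it assumes a feature counts as numerical when every element, including `nan`, is a real number.

```python
import numbers


def check_fill_value(X, fill_value):
    """Raise ValueError when X is all-numeric but fill_value is not a number."""
    is_numeric = all(
        isinstance(elem, numbers.Real) for row in X for elem in row
    )
    if is_numeric and not isinstance(fill_value, numbers.Real):
        raise ValueError(
            f"fill_value={fill_value!r} is invalid for numerical data"
        )


nan = float("nan")
mixed = [[nan], [1], [nan]]

check_fill_value(mixed, 0)  # accepted: numeric fill value for numeric data

try:
    check_fill_value(mixed, "UNKNOWN")
except ValueError as exc:
    print("rejected:", exc)
```

With the change in this PR, the string `fill_value` would instead switch the whole input to object dtype, so a mismatch like this one would no longer be reported.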

If one uses None as the missing value, then everything still works:

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='constant', fill_value='UNKNOWN', missing_values=None)
X_2 = [[None], [None]]
imputer.fit_transform(X_2)
# array([['UNKNOWN'],
#        ['UNKNOWN']], dtype=object)

Related to: #17625

thomasjpfan

Thank you for the PR, @ssaamm!

Allow imputation of string features with all nan values

acbc753

github-actions bot added the module:impute label Nov 17, 2020

Fix lint errors

774d3df

thomasjpfan reviewed Nov 19, 2020

Base automatically changed from master to main January 22, 2021 10:53

jschubnell mentioned this pull request May 11, 2025

SimpleImputer casts category into object when using "most_frequent" strategy #31350

Open
