[MRG] Correct handling of missing_values and NaN in SimpleImputer for object arrays (closes #19071) #19079

ftrojan · Dec 29, 2020

Reference Issues/PRs

Fixes #19071.
With the mean and median strategies, SimpleImputer accepts input arrays of dtype object. Consider an input array that holds numeric data except for the missing values. Object arrays give you a lot of flexibility in how missing values are encoded: you can use np.nan, None, a string constant, or anything else. Issue #19071 describes two scenarios:

1. The input object array contains more than one type of missing value, and only one of them is passed as the missing_values parameter of SimpleImputer. In this case the mean/median imputation should fail, because after missing_values is masked the array still contains non-numeric values. See SimpleImputer, missing_values and None #19071 for an example where it does not fail.
2. The input object array contains just one type of missing value, and it is passed as the missing_values parameter of SimpleImputer. In this case the mean/median imputation must not fail and should compute the mean/median from the remaining values of the input array. See SimpleImputer, missing_values and None #19071 for an example of this failing with missing_values=None.
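
To make the two scenarios concrete, here is a minimal NumPy-only sketch. The helpers `is_missing` and `leftover_non_numeric` are hypothetical names introduced for illustration, they are not part of scikit-learn or of this PR, and the string `"missing"` is just an assumed example of a second missing-value marker.

```python
import numpy as np


def is_missing(value, missing_values):
    """Return True if `value` matches the declared marker (hypothetical helper)."""
    # NaN never compares equal to itself, so it needs a dedicated check.
    if isinstance(missing_values, float) and np.isnan(missing_values):
        return isinstance(value, float) and np.isnan(value)
    if missing_values is None:
        return value is None
    return value == missing_values


def leftover_non_numeric(X, missing_values):
    """List the values that are neither the declared marker nor numeric."""
    leftovers = []
    for value in X.ravel():
        if is_missing(value, missing_values):
            continue
        if isinstance(value, (int, float, np.number)) and not isinstance(value, bool):
            continue
        leftovers.append(value)
    return leftovers


# Scenario 1: two kinds of missing marker, but only None is declared.
X_mixed = np.array([[1.0, None], [2.0, "missing"], [3.0, 4.0]], dtype=object)
# Scenario 2: a single kind of missing marker, and it is the declared one.
X_clean = np.array([[1.0, None], [2.0, 5.0], [3.0, 4.0]], dtype=object)

print(leftover_non_numeric(X_mixed, None))  # ['missing']
print(leftover_non_numeric(X_clean, None))  # []
```

A non-empty result corresponds to scenario 1, where imputation should fail; an empty result corresponds to scenario 2, where it should succeed.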

What does this implement/fix? Explain your changes.

In the _validate_input method, I keep the dtype as object.
In the _dense_fit method, I try to convert the object array to float64 in preparation for the mean/median calculation. If the unmasked values still contain non-numeric values (scenario 1), a ValueError is raised with the following error message:

Non-numeric values other than missing_values={missing_values}, showing {num_show}/{num_notmasked_nan}: {examples}

where the number of examples is limited to three.
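
The conversion step described above could look roughly like the following. This is a sketch under stated assumptions, not the code in this PR: `to_float_or_raise` and its `max_examples` parameter are hypothetical, the mask is assumed to be a boolean array that is True where an entry equals missing_values, and the counts in the message are simply the number of values shown and the number of offending values found.

```python
import numpy as np


def to_float_or_raise(X, mask, missing_values, max_examples=3):
    """Convert an object array to float64, skipping masked entries.

    Hypothetical helper for illustration only.
    """
    X_float = np.full(X.shape, np.nan, dtype=np.float64)
    bad = []
    for idx in np.ndindex(X.shape):
        if mask[idx]:
            continue  # masked entries are left as NaN
        try:
            X_float[idx] = float(X[idx])
        except (TypeError, ValueError):
            bad.append(X[idx])  # e.g. None or a non-numeric string
    if bad:
        examples = bad[:max_examples]
        raise ValueError(
            f"Non-numeric values other than missing_values={missing_values}, "
            f"showing {len(examples)}/{len(bad)}: {examples}"
        )
    return X_float


# Scenario 2: None is the only marker and it is masked, so conversion succeeds.
X = np.array([[1.0, None], [3.0, None], [5.0, 4.0]], dtype=object)
mask = np.array([[v is None for v in row] for row in X])
X_float = to_float_or_raise(X, mask, missing_values=None)
col_means = np.nanmean(X_float, axis=0)  # column means over unmasked values

# Scenario 1: an undeclared string marker remains, so a ValueError is raised.
X_bad = np.array([[1.0, None], [2.0, "n/a"]], dtype=object)
mask_bad = np.array([[v is None for v in row] for row in X_bad])
try:
    to_float_or_raise(X_bad, mask_bad, missing_values=None)
except ValueError as exc:
    error_text = str(exc)
```

Note that `float()` also accepts numeric strings such as `"3"`, so this sketch treats them as numeric; whether the actual implementation does the same is not stated in the description.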

Any other comments?

A comprehensive set of unit tests is added, which covers both scenarios.

kyrajeep

Thank you for the clear explanation/comment. It makes sense. I appreciate that all the checks passed.

cmarmo · Dec 13, 2022

Hi @ftrojan I'm sorry your pull request got lost.
If you are still interested in working on this do you mind fixing conflicts and synchronizing with main?
Thank you so much for your patience.

minimum viable product

0653d9e

github-actions bot added the module:impute label Dec 29, 2020

flake8

a780e9b

ftrojan changed the title ~~Correct handling of missing_values and NaN in SimpleImputer for object arrays (closes #19071)~~ [MRG] Correct handling of missing_values and NaN in SimpleImputer for object arrays (closes #19071) Dec 29, 2020

kyrajeep reviewed Jan 4, 2021

cmarmo added the Waiting for Reviewer label Jan 11, 2021

Base automatically changed from master to main January 22, 2021 10:53

cmarmo added the Bug label Mar 25, 2021

cmarmo removed the Waiting for Reviewer label Dec 13, 2022

cmarmo added the Stalled and help wanted labels Dec 24, 2022
