[MRG] Correct handling of missing_values and NaN in SimpleImputer for object arrays (closes #19071) #19079


Open: wants to merge 2 commits into base: main


@ftrojan ftrojan commented Dec 29, 2020

Reference Issues/PRs

Fixes #19071.
With the mean and median strategies, SimpleImputer accepts input arrays of dtype object. Consider an input array that holds numeric data except for its missing values. Object arrays give you a lot of flexibility in how missing values are encoded: np.nan, None, a string constant, or anything else. Issue #19071 describes two scenarios:

  1. The input object array contains more than one kind of missing value, and only one of them is passed as the missing_values parameter of SimpleImputer. In this case mean/median imputation should fail, because after masking missing_values the array still contains non-numeric values. See issue #19071 ("SimpleImputer, missing_values and None") for an example where it does not fail.
  2. The input object array contains exactly one kind of missing value, and that value is passed as the missing_values parameter of SimpleImputer. In this case mean/median imputation must not fail and should compute the mean/median from the remaining values of the input array. See issue #19071 ("SimpleImputer, missing_values and None") for an example where it fails with missing_values=None.
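
To make the two scenarios concrete, here is a minimal NumPy-only sketch. It does not call SimpleImputer. The helpers `is_missing` and `leftover_non_numeric` are hypothetical names introduced for illustration, and treating an unmasked NaN or None as "non-numeric" is an assumption based on the description above.

```python
import math
import numbers

import numpy as np


def is_missing(value, missing_values):
    """Return True if `value` matches the configured missing-value marker."""
    # NaN never compares equal to itself, so it needs a dedicated check.
    if isinstance(missing_values, float) and math.isnan(missing_values):
        return isinstance(value, float) and math.isnan(value)
    if missing_values is None:
        return value is None
    return value == missing_values


def leftover_non_numeric(column, missing_values):
    """List entries that are neither the missing marker nor a usable number."""
    leftovers = []
    for value in column:
        if is_missing(value, missing_values):
            continue  # masked, so ignored by the mean/median
        is_number = isinstance(value, numbers.Real) and not isinstance(value, bool)
        if is_number and not math.isnan(value):
            continue  # ordinary numeric entry
        leftovers.append(value)
    return leftovers


# Scenario 1: two kinds of missing value, only np.nan is declared.
scenario_1 = np.array([1.0, None, np.nan, 3.0], dtype=object)
# Scenario 2: a single kind of missing value, and it is the declared one.
scenario_2 = np.array([1.0, None, 3.0], dtype=object)

print(leftover_non_numeric(scenario_1, np.nan))  # None is left over: should fail
print(leftover_non_numeric(scenario_2, None))    # nothing left over: should succeed
```

In scenario 1 the leftover None is what should trigger the failure; in scenario 2 the remaining values 1.0 and 3.0 are all that the statistic needs.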

What does this implement/fix? Explain your changes.

In the _validate_input method, I keep the dtype as object.
In the _dense_fit method, I try to convert the object array to float64 in preparation for the mean/median calculation. If the unmasked values still contain non-numeric entries (scenario 1), a ValueError is raised with the following informative error message:

Non-numeric values other than missing_values={missing_values}, showing {num_show}/{num_notmasked_nan}: {examples}

where the number of examples is limited to three.
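
The conversion step described above could look roughly like the following sketch. This is not the code from the PR: `convert_object_array` and its parameters are hypothetical, the element-wise loop is chosen for readability rather than speed, and only the error message template is taken from the description.

```python
import numpy as np


def convert_object_array(values, mask, missing_values, max_examples=3):
    """Convert an object array to float64, skipping masked entries.

    Hypothetical helper: raises ValueError if any unmasked entry is not a
    usable number, reporting at most `max_examples` offending values.
    """
    values = np.asarray(values, dtype=object)
    mask = np.asarray(mask, dtype=bool)
    out = np.full(values.shape, np.nan, dtype=np.float64)
    bad = []
    for idx in np.ndindex(values.shape):
        if mask[idx]:
            continue  # masked entries stay NaN and are excluded from statistics
        value = values[idx]
        try:
            converted = float(value)
        except (TypeError, ValueError):
            converted = np.nan
        if np.isnan(converted):
            bad.append(value)
        else:
            out[idx] = converted
    if bad:
        examples = bad[:max_examples]
        raise ValueError(
            f"Non-numeric values other than missing_values={missing_values}, "
            f"showing {len(examples)}/{len(bad)}: {examples}"
        )
    return out


# Scenario 2: None is the only missing marker and it is masked.
X_ok = np.array([1.0, None, 3.0], dtype=object)
mask_ok = np.array([v is None for v in X_ok])
converted_ok = convert_object_array(X_ok, mask_ok, missing_values=None)
column_mean = np.nanmean(converted_ok)

# Scenario 1: unmasked strings remain after masking None.
X_bad = np.array([1.0, None, "a", "b", "c", "d"], dtype=object)
mask_bad = np.array([v is None for v in X_bad])
try:
    convert_object_array(X_bad, mask_bad, missing_values=None)
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Note that `float()` also accepts numeric strings such as "1.5", so in this sketch they count as numeric; whether the PR treats them the same way is not stated in the description.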

Any other comments?

A comprehensive set of unit tests is added, which covers both scenarios.

@ftrojan ftrojan changed the title from "Correct handling of missing_values and NaN in SimpleImputer for object arrays (closes #19071)" to "[MRG] Correct handling of missing_values and NaN in SimpleImputer for object arrays (closes #19071)" Dec 29, 2020

@kyrajeep kyrajeep left a comment


Thank you for the clear explanation/comment. It makes sense. I appreciate that all the checks passed.

Base automatically changed from master to main January 22, 2021 10:53
@cmarmo cmarmo added the Bug label Mar 25, 2021
@cmarmo cmarmo (Contributor) commented Dec 13, 2022

Hi @ftrojan I'm sorry your pull request got lost.
If you are still interested in working on this do you mind fixing conflicts and synchronizing with main?
Thank you so much for your patience.

Development

Successfully merging this pull request may close these issues.

SimpleImputer, missing_values and None