[MRG] Correct handling of missing_values and NaN in SimpleImputer for object arrays (closes #19071) #19079
Reference Issues/PRs
Fixes #19071.
With the mean/median strategies, `SimpleImputer` accepts input arrays whose dtype is `object`. Imagine an input array that holds numeric data except for its missing values. In object arrays you have a lot of flexibility in how to encode missing values: you can use `np.nan`, `None`, a string constant or anything else. Two scenarios are described in issue #19071:

1. The input array contains non-numeric values other than the `missing_values` parameter of the `SimpleImputer`. In this case the mean/median imputation should fail, because non-numeric values remain in the input array after `missing_values` is masked. See SimpleImputer, missing_values and None #19071 for an example where it does not fail.
2. The only non-numeric values in the input array are those matching the `missing_values` parameter of the `SimpleImputer`. In this case the mean/median imputation must not fail and should calculate the mean/median from the remaining values of the input array. See SimpleImputer, missing_values and None #19071 for an example of this failing with `missing_values=None`.

What does this implement/fix? Explain your changes.
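As a concrete picture of the two scenarios from #19071, here is a minimal NumPy-only sketch. It does not call scikit-learn, and the helpers `is_missing` and `non_missing` are hypothetical names used for illustration, not functions from the library or from this PR.

```python
import numpy as np


def is_missing(value, missing_values):
    """Hypothetical helper: does `value` encode a missing entry?"""
    if isinstance(missing_values, float) and np.isnan(missing_values):
        # NaN never compares equal to itself, so it needs an explicit check.
        return isinstance(value, float) and bool(np.isnan(value))
    if missing_values is None:
        return value is None
    return value is not None and value == missing_values


def non_missing(X, missing_values):
    """Return the entries of an object array that are not masked as missing."""
    X = np.asarray(X, dtype=object).ravel()
    mask = np.array([is_missing(v, missing_values) for v in X], dtype=bool)
    return X[~mask]


# Scenario 1: "N/A" is not the declared missing value, so it survives masking
# and a mean/median over the remaining values is not well defined.
scenario_1 = non_missing([1.0, None, 3.0, "N/A"], missing_values=None)

# Scenario 2: None is the only non-numeric value, so what remains is numeric
# and can be converted to float64 before taking the mean.
scenario_2 = non_missing([1.0, None, 3.0], missing_values=None)
scenario_2_mean = scenario_2.astype(np.float64).mean()
```

In scenario 1 the remaining values still include the string `"N/A"`, which is why imputation should fail there. In scenario 2 the remaining values are `1.0` and `3.0`.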
In the `_validate_input` method, I keep the dtype as `object`. In the `_dense_fit` method, I try to convert the object array to float64 to prepare for the mean/median calculation. If the non-masked values still contain non-numeric values (scenario 1), a `ValueError` is raised with an informative error message, in which the number of examples is limited to three.
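The conversion step can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the helper name `to_float_or_raise`, the `max_examples` parameter and the wording of the error message are all hypothetical. Only the overall idea (try the float64 conversion, and on failure report at most three offending values) comes from the description above.

```python
import numpy as np


def _converts_to_float(value):
    """Return True if float(value) succeeds."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def to_float_or_raise(values, max_examples=3):
    """Hypothetical helper: convert non-masked object values to float64."""
    values = np.asarray(values, dtype=object).ravel()
    try:
        return values.astype(np.float64)
    except (TypeError, ValueError):
        # Collect the values that block the conversion and show only a few.
        offenders = [v for v in values if not _converts_to_float(v)]
        examples = ", ".join(repr(v) for v in offenders[:max_examples])
        # Illustrative wording only; the PR's actual message is not shown here.
        raise ValueError(
            f"Cannot compute mean/median on non-numeric data, e.g.: {examples}"
        ) from None


converted = to_float_or_raise([1.0, 3.0])

try:
    to_float_or_raise([1.0, "a", None, "b", "c"])
except ValueError as exc:
    message = str(exc)
```

Note that `astype(np.float64)` also accepts strings that parse as numbers, such as `"3.5"`, so this sketch treats them as numeric. Whether the PR does the same is not stated in the description.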
Any other comments?
A comprehensive set of unit tests is added, which covers both scenarios.