SimpleImputer with the rule strategies median-1/median+1 #25642

Open
@gykovacs

Description


Describe the workflow you want to enable

I have run into an issue with SimpleImputer. For a feature of, say, integer type, it is perfectly reasonable to impute missing values with the median. However, when the number of non-missing records is even, there is a decent chance that the median falls between two integers, following the well-known rule (Sorted[N/2-1] + Sorted[N/2])/2. Technically, this kind of imputation breaks the domain of the feature: it used to be integer-valued, but now it contains .5 values, which can behave unexpectedly in further processing.

Long story short, when a sequence like 4, 3, ?, 2, 4, 5, 1 is imputed with 3.5, it is no longer an integer sequence.
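A minimal reproduction of the behaviour described above, assuming the missing entry is encoded as np.nan in a single float column (the variable names are only illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The sequence 4, 3, ?, 2, 4, 5, 1 with the missing entry encoded as NaN.
X = np.array([4, 3, np.nan, 2, 4, 5, 1], dtype=float).reshape(-1, 1)

imputer = SimpleImputer(strategy="median")
X_imputed = imputer.fit_transform(X)

# The learned fill value is the median of the six observed values
# 1, 2, 3, 4, 4, 5, i.e. (3 + 4) / 2.
fill_value = imputer.statistics_[0]
print(fill_value)          # 3.5
print(X_imputed.ravel())   # the former NaN slot now holds 3.5
```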

Describe your proposed solution

My recommendation is to introduce something like an "adjusted median", which would ensure that the imputed value belongs to the domain of the feature. Concretely, I suggest picking Sorted[N/2-1] or Sorted[N/2], whichever has the higher number of occurrences in the data. If the counts are equal, take the smaller value.

Basically the "most_frequent" strategy applied to Sorted[N/2-1] and Sorted[N/2] only.
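A rough sketch of this rule in plain Python, to make the proposal concrete. The function name adjusted_median is hypothetical and not part of scikit-learn, and treating None and NaN as missing is an assumption of this sketch:

```python
import math
from collections import Counter


def adjusted_median(values):
    """Hypothetical helper: a median that is always an observed value.

    Missing entries (None or NaN, an assumption of this sketch) are ignored.
    """
    observed = sorted(
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    )
    if not observed:
        raise ValueError("no observed values to compute a median from")

    n = len(observed)
    if n % 2 == 1:
        # Odd count: the ordinary median is already an observed value.
        return observed[n // 2]

    # Even count: choose between the two middle values.
    lower, upper = observed[n // 2 - 1], observed[n // 2]
    counts = Counter(observed)
    # Pick the more frequent one; on a tie, take the smaller value.
    return upper if counts[upper] > counts[lower] else lower


result = adjusted_median([4, 3, None, 2, 4, 5, 1])
print(result)  # 4, because 4 occurs twice and 3 only once
```

For the example sequence the two middle values are 3 and 4, so the rule returns 4 instead of 3.5. With a tie, such as 1, 2, 3, 4, it falls back to the smaller middle value, 2.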

Describe alternatives you've considered, if relevant

Alternative solutions and strategy names could work as well. The underlying issue in the problem described above is that the median calculation is tied to its strict mathematical definition. np.percentile, like the percentile functions in R, offers more flexibility: what happens in SimpleImputer with strategy="median" is effectively that the 50th percentile is taken with linear interpolation, whereas np.percentile could also compute it with nearest interpolation. I think offering this control would improve the flexibility of the imputer with very little effort.
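To illustrate the difference between the two interpolation modes, here is a small comparison on the observed values of the example sequence. It assumes NumPy 1.22 or later, where the keyword is called method (older releases used interpolation instead):

```python
import numpy as np

# Observed (non-missing) values of the example sequence.
observed = np.array([4, 3, 2, 4, 5, 1])

# Default linear interpolation: the ordinary median, 3.5.
linear = np.percentile(observed, 50)

# Nearest interpolation: always returns one of the observed values.
# Which of the two middle values (3 or 4) is returned when the position
# falls exactly halfway depends on NumPy's rounding rule.
nearest = np.percentile(observed, 50, method="nearest")

print(linear, nearest)
```

Note that this only guarantees a value from the data; unlike the frequency-based rule proposed above, it does not prefer the more common of the two middle values.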

Additional context

No response
