Description
So far, we have been using our `preprocessing.Imputer` to take care of missing values before fitting the `RandomForest`. Ref: this example.

Proposal: It would be beneficial to have the missing values taken care of natively by all the tree-based classifiers / regressors...

Have a `missing_values` parameter which will be either `None` (to raise an error when we run into MVs) or an int/NaN value that will act as the placeholder for missing values.
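
A rough sketch of what the proposed constructor argument could look like (hypothetical; no such parameter exists on the estimators today):

```python
# Hypothetical signature sketch for the proposed `missing_values`
# parameter -- NOT the current scikit-learn API.
class RandomForestClassifier:
    def __init__(self, n_estimators=10, missing_values=None):
        # None       -> raise an error when missing values are encountered
        # int or NaN -> treat that value as the missing-value placeholder
        #               and handle it natively while building the trees
        self.n_estimators = n_estimators
        self.missing_values = missing_values
```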
Different ways in which missing values can be handled (I might have naively added duplicates):
1. (-1-1) Add an optional `imputation` parameter, where we can either:
    - specify the strategy `'mean'`, `'median'`, `'most_frequent'` (or `missing_value`?) and let the clf construct the `Imputer` on the fly...
    - pass a built `Imputer` object (like we do for `scoring` or `cv`).

    This is the simplest approach. Variants are 5 and 6. Note that we can already do this using a pipeline of an imputer followed by the random forest, as in the sketch below.
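
    A minimal sketch of that existing workaround, using `preprocessing.Imputer` (renamed `SimpleImputer` in later scikit-learn versions):

    ```python
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Imputer  # SimpleImputer in later versions
    from sklearn.ensemble import RandomForestClassifier

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
    y = np.array([0, 0, 1, 1])

    # Replace NaNs with the column mean, then fit the forest on completed data.
    clf = make_pipeline(
        Imputer(missing_values='NaN', strategy='mean'),
        RandomForestClassifier(n_estimators=10, random_state=0),
    )
    clf.fit(X, y)
    ```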
2. (-1+1) Ignore the missing values at the time of generating the splits.
3. (+1+1) Find the best split by sending the missing-valued samples to either side and choosing the direction that brings about the maximum reduction in entropy (impurity). This is Gilles' suggestion. It is conceptually the same as the "separate-class method", where the missing values are considered a separate categorical value. Ding and Simonoff's paper considers this to be the best method across different situations. A sketch of the idea is below.
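
    A toy sketch of how the direction could be chosen for one candidate split (illustrative only; `gini` stands in for whatever impurity criterion the tree uses):

    ```python
    import numpy as np

    def gini(y):
        """Gini impurity of an integer label array."""
        if len(y) == 0:
            return 0.0
        p = np.bincount(y) / len(y)
        return 1.0 - np.sum(p ** 2)

    def weighted_impurity(y_left, y_right):
        n = len(y_left) + len(y_right)
        return (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n

    def best_missing_direction(x, y, threshold):
        """Send NaN samples left or right, whichever side yields the
        lower weighted impurity for this candidate split."""
        miss = np.isnan(x)
        left = ~miss
        left[~miss] = x[~miss] <= threshold  # observed values split as usual
        right = ~miss & ~left
        go_left = weighted_impurity(y[left | miss], y[right])   # NaNs go left
        go_right = weighted_impurity(y[left], y[right | miss])  # NaNs go right
        return 'left' if go_left <= go_right else 'right'

    x = np.array([1.0, 2.0, 3.0, np.nan, np.nan])
    y = np.array([0, 0, 1, 1, 1])
    print(best_missing_direction(x, y, threshold=2.5))  # -> 'right'
    ```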
4. As done in `rpart`, we could use surrogate variables, where the strategy is basically to use the other features to decide the split if one feature goes missing... A simplified sketch is below.
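
    A much-simplified sketch of surrogate selection (my reading of the rpart idea, not its actual algorithm): pick the other feature/threshold whose split best agrees with the primary split, then route missing-valued samples with it.

    ```python
    import numpy as np

    def best_surrogate(X, primary_goes_left, skip_col):
        """Find the (feature, threshold) whose split best agrees with the
        primary split's left/right assignment; used to route samples whose
        primary feature is missing."""
        best_col, best_t, best_agree = None, None, 0.0
        for col in range(X.shape[1]):
            if col == skip_col:
                continue
            for t in np.unique(X[:, col])[:-1]:  # candidate thresholds
                agree = np.mean((X[:, col] <= t) == primary_goes_left)
                agree = max(agree, 1.0 - agree)  # surrogate may mirror the split
                if agree > best_agree:
                    best_col, best_t, best_agree = col, t, agree
        return best_col, best_t, best_agree

    rng = np.random.RandomState(0)
    X = rng.rand(20, 3)
    primary_goes_left = X[:, 0] <= 0.5  # the primary split, on feature 0
    print(best_surrogate(X, primary_goes_left, skip_col=0))
    ```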
5. Probabilistic method, where the missing values are sent to both children but are weighted in proportion to the number of non-missing samples in each split. Ref: Ding and Simonoff's paper. I think this goes something along the lines of this example:
    - X = [1, 2, 3, nan, nan, nan, nan], y = [0, 0, 1, 1, 1, 1, 0]
    - Split with available values: L --> [1, 2], R --> [3]
    - Weights for the last 4 missing-valued samples: L --> 2/3, R --> 1/3
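
    The same arithmetic as a tiny sketch (just the weight computation, nothing scikit-learn-specific):

    ```python
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, np.nan, np.nan, np.nan, np.nan])
    miss = np.isnan(x)
    threshold = 2.5

    # Split the observed samples as usual.
    n_left = np.sum(x[~miss] <= threshold)   # [1, 2] -> 2
    n_right = np.sum(x[~miss] > threshold)   # [3]    -> 1

    # Missing-valued samples go to BOTH children, with fractional weights
    # proportional to the observed counts on each side.
    w_left = n_left / (n_left + n_right)     # 2/3
    w_right = n_right / (n_left + n_right)   # 1/3
    print(w_left, w_right)
    ```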
6. Do imputation by considering it as a supervised learning problem in itself, as done in MissForest. Build a model using the available data --> predict the missing values using this built model. A one-column sketch is below.
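
    A minimal sketch of one MissForest-style round for a single column (assuming a numeric feature and a single pass; the real algorithm cycles over columns until convergence):

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(100, 3)
    X[rng.rand(100) < 0.2, 0] = np.nan  # punch holes in column 0

    miss = np.isnan(X[:, 0])
    # Fit a forest on the rows where column 0 is observed,
    # using the remaining columns as predictors...
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[~miss][:, 1:], X[~miss, 0])
    # ...then predict (impute) the missing entries of column 0.
    X[miss, 0] = rf.predict(X[miss][:, 1:])
    ```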
7. Impute the missing values using an inaccurate estimate (say, a median imputation strategy). Build an RF on the completed data, then update the missing values of each sample with a weighted mean based on proximity methods. Repeat this until convergence. (Refer Gilles' PhD Section 4.4.4.)
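
    The proximity matrix at the heart of this update can already be sketched with the existing `apply` method (the iterative re-imputation loop is omitted here):

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X, y = rng.rand(50, 4), rng.randint(0, 2, 50)
    rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

    # leaves[i, t] = index of the leaf that sample i reaches in tree t.
    leaves = rf.apply(X)
    # proximity[i, j] = fraction of trees in which samples i and j share
    # a leaf; these proximities weight the mean used to update imputations.
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=-1)
    ```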
8. Similar to 6, but a one-step method where the imputation is done using the median of the k-nearest neighbors. Refer to this airbnb blog.
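
    A toy sketch of median-of-k-nearest-neighbors imputation for a single missing entry (distances on the observed features; my assumptions, not necessarily the blog's exact recipe):

    ```python
    import numpy as np

    def knn_median_impute(X, i, col, k=3):
        """Fill X[i, col] with the median of that column over the k
        complete rows nearest to row i (distance on the other columns)."""
        other = [c for c in range(X.shape[1]) if c != col]
        candidates = np.where(~np.isnan(X).any(axis=1))[0]
        d = np.linalg.norm(X[candidates][:, other] - X[i, other], axis=1)
        nearest = candidates[np.argsort(d)[:k]]
        return np.median(X[nearest, col])

    X = np.array([[1.0, 10.0], [1.1, 12.0], [0.9, 11.0],
                  [5.0, 50.0], [1.05, np.nan]])
    X[4, 1] = knn_median_impute(X, i=4, col=1, k=3)
    print(X[4, 1])  # median of {10, 12, 11} -> 11.0
    ```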
9. Use ternary trees instead of binary trees, with one branch dedicated to missing values? (Refer Gilles' PhD Section 4.4.4.) This, I think, is conceptually similar to 4.
NOTE:
- 4, 7, 8 and 9 are computationally intensive.
- 5 is not easy to do with our current API.
- 3 and 6 seem promising. I will implement 3 and see if I can extend it to 6 later.
- Gilles' -1s were for 1 and 2 (the rest were added later).
- Ding and Simonoff's paper, which compares the various methods and their relative accuracy, is a good reference.
Taken from Ding and Simonoff's paper: the performance of the various missing-value methods.