Description
So far, we have been using our `preprocessing.Imputer` to take care of missing values before fitting the `RandomForest`. Ref: this example.

Proposal: It would be beneficial to have the missing values taken care of natively by all the tree-based classifiers / regressors...

Have a `missing_values` parameter which will be either `None` (to raise an error when we run into MVs) or an int/NaN value that will act as the placeholder for missing values.
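
A rough sketch of what the proposed constructor argument could look like (hypothetical; no such parameter exists on the estimators today):

```python
# Hypothetical signature sketch for the proposed `missing_values`
# parameter -- NOT the current scikit-learn API.
class RandomForestClassifier:
    def __init__(self, n_estimators=10, missing_values=None):
        # None       -> raise an error when missing values are encountered
        # int or NaN -> treat that value as the missing-value placeholder
        #               and handle it natively while building the trees
        self.n_estimators = n_estimators
        self.missing_values = missing_values
```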
Different ways in which missing values can be handled (I might have naively added duplicates):
1. (-1-1) Add an optional `imputation` parameter, where we can either:
    - specify the strategy `'mean'`, `'median'`, `'most_frequent'` (or `missing_value`?) and let the clf construct the `Imputer` on the fly...
    - pass a built `Imputer` object (like we do for `scoring` or `cv`).

    This is the simplest approach. Variants are 5 and 6. Note that we can already do this using a pipeline of an imputer followed by the random forest, as in the sketch below.
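
    A minimal sketch of that existing workaround, using `preprocessing.Imputer` (renamed `SimpleImputer` in later scikit-learn versions):

    ```python
    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import Imputer  # SimpleImputer in later versions
    from sklearn.ensemble import RandomForestClassifier

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
    y = np.array([0, 0, 1, 1])

    # Replace NaNs with the column mean, then fit the forest on completed data.
    clf = make_pipeline(
        Imputer(missing_values='NaN', strategy='mean'),
        RandomForestClassifier(n_estimators=10, random_state=0),
    )
    clf.fit(X, y)
    ```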
2. (-1+1) Ignore the missing values at the time of generating the splits.
3. (+1+1) Find the best split by sending the missing-valued samples to either side and choosing the direction that brings about the maximum reduction in entropy (impurity). This is Gilles' suggestion. It is conceptually the same as the "separate-class method", where the missing values are considered a separate categorical value. Ding and Simonoff's paper considers this to be the best method across different situations. A sketch of the idea is below.
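
    A toy sketch of how the direction could be chosen for one candidate split (illustrative only; `gini` stands in for whatever impurity criterion the tree uses):

    ```python
    import numpy as np

    def gini(y):
        """Gini impurity of an integer label array."""
        if len(y) == 0:
            return 0.0
        p = np.bincount(y) / len(y)
        return 1.0 - np.sum(p ** 2)

    def weighted_impurity(y_left, y_right):
        n = len(y_left) + len(y_right)
        return (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n

    def best_missing_direction(x, y, threshold):
        """Send NaN samples left or right, whichever side yields the
        lower weighted impurity for this candidate split."""
        miss = np.isnan(x)
        left = ~miss
        left[~miss] = x[~miss] <= threshold  # observed values split as usual
        right = ~miss & ~left
        go_left = weighted_impurity(y[left | miss], y[right])   # NaNs go left
        go_right = weighted_impurity(y[left], y[right | miss])  # NaNs go right
        return 'left' if go_left <= go_right else 'right'

    x = np.array([1.0, 2.0, 3.0, np.nan, np.nan])
    y = np.array([0, 0, 1, 1, 1])
    print(best_missing_direction(x, y, threshold=2.5))  # -> 'right'
    ```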
4. As done in `rpart`, we could use surrogate variables, where the strategy is basically to use the other features to decide the split if one feature goes missing... A simplified sketch is below.
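
    A much-simplified sketch of surrogate selection (my reading of the rpart idea, not its actual algorithm): pick the other feature/threshold whose split best agrees with the primary split, then route missing-valued samples with it.

    ```python
    import numpy as np

    def best_surrogate(X, primary_goes_left, skip_col):
        """Find the (feature, threshold) whose split best agrees with the
        primary split's left/right assignment; used to route samples whose
        primary feature is missing."""
        best_col, best_t, best_agree = None, None, 0.0
        for col in range(X.shape[1]):
            if col == skip_col:
                continue
            for t in np.unique(X[:, col])[:-1]:  # candidate thresholds
                agree = np.mean((X[:, col] <= t) == primary_goes_left)
                agree = max(agree, 1.0 - agree)  # surrogate may mirror the split
                if agree > best_agree:
                    best_col, best_t, best_agree = col, t, agree
        return best_col, best_t, best_agree

    rng = np.random.RandomState(0)
    X = rng.rand(20, 3)
    primary_goes_left = X[:, 0] <= 0.5  # the primary split, on feature 0
    print(best_surrogate(X, primary_goes_left, skip_col=0))
    ```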
5. Probabilistic method, where the missing values are sent to both children but are weighted in proportion to the number of non-missing samples in each split. Ref: Ding and Simonoff's paper. I think this goes something along the lines of this example:
    - X = [1, 2, 3, nan, nan, nan, nan], y = [0, 0, 1, 1, 1, 1, 0]
    - Split with available values: L --> [1, 2], R --> [3]
    - Weights for the last 4 missing-valued samples: L --> 2/3, R --> 1/3
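
    The same arithmetic as a tiny sketch (just the weight computation, nothing scikit-learn-specific):

    ```python
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, np.nan, np.nan, np.nan, np.nan])
    miss = np.isnan(x)
    threshold = 2.5

    # Split the observed samples as usual.
    n_left = np.sum(x[~miss] <= threshold)   # [1, 2] -> 2
    n_right = np.sum(x[~miss] > threshold)   # [3]    -> 1

    # Missing-valued samples go to BOTH children, with fractional weights
    # proportional to the observed counts on each side.
    w_left = n_left / (n_left + n_right)     # 2/3
    w_right = n_right / (n_left + n_right)   # 1/3
    print(w_left, w_right)
    ```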
6. Do imputation by considering it as a supervised learning problem in itself, as done in MissForest. Build a model using the available data --> predict the missing values using this built model. A one-column sketch is below.
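
    A minimal sketch of one MissForest-style round for a single column (assuming a numeric feature and a single pass; the real algorithm cycles over columns until convergence):

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.RandomState(0)
    X = rng.rand(100, 3)
    X[rng.rand(100) < 0.2, 0] = np.nan  # punch holes in column 0

    miss = np.isnan(X[:, 0])
    # Fit a forest on the rows where column 0 is observed,
    # using the remaining columns as predictors...
    rf = RandomForestRegressor(n_estimators=50, random_state=0)
    rf.fit(X[~miss][:, 1:], X[~miss, 0])
    # ...then predict (impute) the missing entries of column 0.
    X[miss, 0] = rf.predict(X[miss][:, 1:])
    ```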
7. Impute the missing values using an inaccurate estimate (say, a median imputation strategy). Build an RF on the completed data, then update the missing values of each sample with a weighted mean based on proximity methods. Repeat this until convergence. (Refer Gilles' PhD Section 4.4.4.)
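
    The proximity matrix at the heart of this update can already be sketched with the existing `apply` method (the iterative re-imputation loop is omitted here):

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.RandomState(0)
    X, y = rng.rand(50, 4), rng.randint(0, 2, 50)
    rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

    # leaves[i, t] = index of the leaf that sample i reaches in tree t.
    leaves = rf.apply(X)
    # proximity[i, j] = fraction of trees in which samples i and j share
    # a leaf; these proximities weight the mean used to update imputations.
    proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=-1)
    ```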
8. Similar to 6, but a one-step method where the imputation is done using the median of the k-nearest neighbors. Refer to this airbnb blog.
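
    A toy sketch of median-of-k-nearest-neighbors imputation for a single missing entry (distances on the observed features; my assumptions, not necessarily the blog's exact recipe):

    ```python
    import numpy as np

    def knn_median_impute(X, i, col, k=3):
        """Fill X[i, col] with the median of that column over the k
        complete rows nearest to row i (distance on the other columns)."""
        other = [c for c in range(X.shape[1]) if c != col]
        candidates = np.where(~np.isnan(X).any(axis=1))[0]
        d = np.linalg.norm(X[candidates][:, other] - X[i, other], axis=1)
        nearest = candidates[np.argsort(d)[:k]]
        return np.median(X[nearest, col])

    X = np.array([[1.0, 10.0], [1.1, 12.0], [0.9, 11.0],
                  [5.0, 50.0], [1.05, np.nan]])
    X[4, 1] = knn_median_impute(X, i=4, col=1, k=3)
    print(X[4, 1])  # median of {10, 12, 11} -> 11.0
    ```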
9. Use ternary trees instead of binary trees, with one branch dedicated to missing values? (Refer Gilles' PhD Section 4.4.4.) This, I think, is conceptually similar to 4.
NOTE:
- 4, 7, 8 and 9 are computationally intensive.
- 5 is not easy to do with our current API.
- 3 and 6 seem promising. I will implement 3 and see if I can extend it to 6 later.
- Gilles' -1s were for 1 and 2 (the rest were added later).
- Ding and Simonoff's paper, which compares the various methods and their relative accuracy, is a good reference.
Taken from Ding and Simonoff's paper: the performance of the various missing-value methods.