Description
Context
Together with @glemaitre and @GaelVaroquaux, we discussed documenting missing-values practices for prediction in scikit-learn, as part of my PhD work at Inria (discussion here).
Indeed, the current documentation gives no recommendations on this point. The understanding of missing values in the context of supervised learning has improved since the current documentation was written, so we now have more perspective on the theoretical and practical messages that would help users. We think it would be useful to restructure the documentation and examples to convey these messages.
Messages to convey
Main messages:
- Missing values in inference and in supervised learning are different problems with different tradeoffs. Define the terms and highlight the differences.
- Don't impute the training and test sets jointly: this leaks information across the split, and the resulting procedure cannot be used in production.
- Simpler learners need powerful imputation (e.g. conditional imputation with IterativeImputer). Define conditional imputation (theoretical arguments can be found in Le Morvan 2020).
- Conditional imputation is guaranteed to work only under "ignorable missingness" (the Missing At Random mechanism, to be defined). Otherwise, the missingness mask is needed (missingness is seldom ignorable: the data are often missing for a reason). The Wikipedia page on missing data can justify this.
- Powerful learners with simple imputation or no imputation work best (robustness to missingness mechanisms and flexibility), e.g. HistGradientBoosting (this comes from experience, including systematic benchmarks).
- For categorical features, impute missing values as a new category (imputing to an existing category destroys information important to the learner).
- The computational cost of imputation can quickly become large, and even intractable for the most costly methods (e.g. IterativeImputer, KNNImputer).
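The no-leakage message above can be illustrated with a minimal sketch (simulated data, hypothetical feature/target names): fitting the imputer inside a `Pipeline` ensures its statistics are learned on the training fold only, and gives a single object deployable in production.

```python
# Sketch: avoid data leakage by fitting the imputer on the training set only.
# The Pipeline handles this automatically: fit() learns the imputation
# statistics from X_train, and predict()/score() reuse them on X_test.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = (X[:, 0] > 0).astype(int)     # target computed before masking
X[rng.rand(200, 4) < 0.2] = np.nan  # introduce 20% missing values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(SimpleImputer(strategy="mean"), LogisticRegression())
model.fit(X_train, y_train)          # imputer statistics come from X_train only
score = model.score(X_test, y_test)  # test set imputed with training statistics
```

Imputing the concatenation of train and test before splitting would instead let test-set values influence the training-time statistics.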
Side messages:
- The optimal predictor on partially-observed values is not always "a good imputation followed by the optimal predictor on the fully-observed values" (Le Morvan et al. 2021). Missingness needs to be accounted for in some way.
- For multiple imputation, the training and test behaviors need to be separated (cf. the main message above on not imputing training and test sets jointly).
- As a consequence, ensemble methods such as bagging are a good practical solution for implementing multiple imputation (a single supervised learner applied to many imputations is likely severely suboptimal).
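A minimal sketch of this ensembling idea, on simulated data: each stochastic draw from IterativeImputer (`sample_posterior=True` with a different seed) yields one completed training set, one learner is fit per draw, and the predictions are averaged. This is an illustration of the principle, not a recommended production recipe.

```python
# Sketch: multiple imputation via an ensemble of (imputer, learner) pairs.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(300, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(300)
X[rng.rand(300, 3) < 0.2] = np.nan  # 20% missing values

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

predictions = []
for seed in range(5):  # 5 imputation draws -> 5 learners
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    model = Ridge()
    model.fit(imputer.fit_transform(X_train), y_train)
    # at test time, reuse the imputer fitted on the training set
    predictions.append(model.predict(imputer.transform(X_test)))

y_pred = np.mean(predictions, axis=0)  # ensemble average over imputations
```

Averaging the predictions keeps one learner per imputed dataset, whereas fitting a single learner on the stacked imputations would collapse the imputation uncertainty.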
Take-home messages:
- If little data is available: use conditional imputation and simple learners.
- If a lot of data is available (n > 1000): use HistGradientBoosting.
- Don't impute categorical variables; encode missingness as its own category.
Resources
- Wikipedia, "Missing data"
- Josse et al. 2019, "On the consistency of supervised learning with missing values"
- Le Morvan et al. 2020, "Linear predictor on linearly-generated data with missing values: non consistency and solutions"
- Le Morvan et al. 2021, "What's a good imputation to predict with missing values?"
- Perez-Lebel et al. 2022, "Benchmarking missing-values approaches for predictive models on health databases"
Suggest a potential alternative/fix
After discussing with @glemaitre and @GaelVaroquaux, the following changes were suggested.
Big picture
The goal is to give the recommendations above, and have simple examples that convey the right intuitions (even simple simulated data can be didactic by showing the basic mechanisms).
- Write a narrative doc page that gives the big picture messages listed above and some figures.
- Replace the current examples about imputation, which don't convey a clear message.
- An example generating simple figures that explain the difference between Missing At Random and Missing Not At Random (as in https://www.slideshare.net/GaelVaroquaux/dirty-data-science-machine-learning-on-noncurated-data/27). Didactic purpose: build the intuition that missing values may distort distributions. Keep it short.
- An example developing intuitions on the interplay between imputation and learning, adapted from http://dirtydata.science/python/gen_notes/01_missing_values.html (only the first two sections). Didactic purpose: show how the missingness mechanism combined with imputation modifies the link between X and y.
- Adapt the docstrings to give local recommendations:
a. IterativeImputer: document its time complexity (algorithmic scalability) and state that it is not a magic bullet in the face of structured missingness.
b. KNNImputer: warn about its poor computational scalability.
c. SimpleImputer: note that it does not work well with simple models.
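A minimal sketch of the proposed MAR-vs-MNAR example (simulated Gaussian data; the keep/drop rule is a hypothetical choice): when missingness depends on the value itself (MNAR), the observed distribution is distorted, whereas dropping values uniformly at random (MCAR) leaves it intact.

```python
# Sketch: missing values may distort distributions depending on the mechanism.
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(100_000)

# MCAR: values are dropped uniformly at random -> distribution preserved
x_mcar = x[rng.rand(x.size) > 0.5]

# MNAR: large values are more likely to be dropped -> observed mean biased
keep_prob = 1 - 1 / (1 + np.exp(-x))  # decreases with x
x_mnar = x[rng.rand(x.size) < keep_prob]

# The MCAR sample keeps the original mean; the MNAR sample is shifted down.
print(x.mean(), x_mcar.mean(), x_mnar.mean())
```

Plotting histograms of `x`, `x_mcar`, and `x_mnar` would give the figure proposed above.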
Proposed roadmap
(refers to the items above; details can be tracked in a project board)