[MRG] Native support for missing values in GBDTs #13911
Conversation
This comment is for documenting the technical changes made in this PR. Ping @ogrisel when you're back ;)
Hi @NicolasHug! Nice work. I have the following comments but otherwise it looks good. Furthermore, we should document in the API/estimator changes section of the what's new document that the internal data structure of the fitted model has changed to be able to add the native support for missing values. Alternatively we could override the
predictions_binned = predictor.predict_binned(
    X_binned, missing_values_bin_idx=bin_mapper.missing_values_bin_idx_)
assert np.all(predictions == -gradients)
assert np.all(predictions_binned == -gradients)
ogrisel
Aug 20, 2019
Member
While this is a good test, I wonder if it wouldn't be better to write something that looks equivalent from the public API only to make it easier to understand. E.g. something based on the following:
>>> from sklearn.experimental import enable_hist_gradient_boosting
>>> from sklearn.ensemble import HistGradientBoostingClassifier
>>> import numpy as np
>>> X = np.asarray([-np.inf, 0, 1, np.inf, np.nan]).reshape(-1, 1)
>>> X
array([[-inf],
[ 0.],
[ 1.],
[ inf],
[ nan]])
>>> y_isnan = np.isnan(X.ravel())
>>> y_isnan
array([False, False, False, False, True])
>>> y_isinf = X.ravel() == np.inf
>>> y_isinf
array([False, False, False, True, False])
>>> stump_clf = HistGradientBoostingClassifier(min_samples_leaf=1, max_iter=1, learning_rate=1., max_depth=2)
>>> stump_clf.fit(X, y_isinf).score(X, y_isinf)
1.0
>>> stump_clf.fit(X, y_isnan).score(X, y_isnan)
1.0
NicolasHug
Aug 20, 2019
Author
Member
The issue with the public API is that we can't test the predictions for X_binned which I think is important too.
I'll add your test as well though, it can't hurt ;)
# Make sure in particular that the +inf sample is mapped to the left child
# Note that lightgbm "fails" here and will assign the inf sample to the
# right child, even though it's a "split on nan" situation.
ogrisel
Aug 20, 2019
Member
Are you sure LightGBM fails in this case? Why would they have introduced AvoidInf() if not for this case?
NicolasHug
Aug 20, 2019
Author
Member
I opened microsoft/LightGBM#2277, looks like they just fixed it (I haven't checked again though)
Since this is still experimental, I would rather not worry about that. Adding sample weights and then categorical data would probably also cause the same issue.
I doubt that.
Thanks Olivier, I have addressed the comments and updated the what's new with a compound entry of all the changes so far. I labeled this as a MajorFeature, but feel free to change it, I'm not sure. Regarding the pickles: we don't even support pickling between major versions, so I'm not sure we should have a special case for these estimators.
Well it's cheap (and friendly) to warn the users in the changelog when that actually happens.
LGTM but some more comments / questions ;) Whether this is to be considered a major feature or not can be changed later (either at release time or at the next dev meeting).
sklearn/ensemble/_hist_gradient_boosting/tests/test_splitting.py
@adrinjalali Does this have your +1 as well? Just in case, maybe @thomasjpfan @jnothman @rth @glemaitre @qinhanmin2014 would want to give it a quick look before we merge?
I like the proposal. I don't have time soon to check for correctness, unfortunately.
At some point our docs should have a reference listing of the estimators that support missing values.
The examples in the docstrings still need fixing.
Indeed. This time I ran pytest locally prior to pushing :)
This looks good now. And I guess the three of us are in agreement. Merging; nitpicks can go in other PRs, I'd rather have this in, and there's time to fix the issues before the release.
Merged 4b6273b into scikit-learn:master
This PR implements support for missing values in histogram-based GBDTs.
By missing values, I mean NaNs. I haven't implemented anything related to sparse matrices (that's for another PR).
Here's a breakdown of the changes, which follows the LightGBM and XGBoost strategies. It's relatively simple:
When binning, missing values are assigned to a specific bin (the first one*). We ignore the missing values when computing the bin thresholds (i.e. we only use non-missing values, just like before).
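As an illustration of that binning step, here is a minimal NumPy sketch (not the actual `_BinMapper` implementation: the helper name, the number of bins and the quantile-based thresholds are assumptions made for the example, and the missing-values bin is taken to be the last one, per the edit at the end of this description):

```python
import numpy as np

def bin_one_feature(values, n_bins=6, missing_values_bin_idx=5):
    # Hypothetical helper, for illustration only.
    # Thresholds are computed from the non-missing values only...
    non_missing = values[~np.isnan(values)]
    thresholds = np.quantile(non_missing, np.linspace(0, 1, n_bins)[1:-1])
    # ... regular samples are mapped to bins 0 .. n_bins - 2 ...
    binned = np.searchsorted(thresholds, values, side="right").astype(np.uint8)
    # ... and NaNs get the dedicated bin.
    binned[np.isnan(values)] = missing_values_bin_idx
    return binned, thresholds

X_col = np.array([np.nan, -1.0, 0.0, 0.5, 2.0, np.nan, 3.0])
binned, _ = bin_one_feature(X_col)
print(binned)  # the two NaN samples end up in bin 5, the rest in bins 0-4
```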
When training, what changes is the strategy to find the best split: when considering a bin to split on, we compute the gain according to two scenarios: mapping the samples with missing values to the left child, or mapping them to the right child.
Concretely, this is done by scanning the bins from left to right (like we used to) and from right to left. This is what LightGBM does, and it's explained in the XGBoost paper (Algorithm 3).
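To make the two-scenario scan concrete, here is a heavily simplified sketch. The helper name and the gain formula (a plain sum-of-gradients variance proxy, with no hessians and no regularization) are assumptions for the example, so this is not the Cython splitter of this PR; also, where the PR does one left-to-right and one right-to-left scan, the sketch gets the same effect by running the left-to-right scan twice, once with the missing-value statistics counted in the left child and once in the right:

```python
import numpy as np

def best_split_with_missing(grad_per_bin, count_per_bin,
                            grad_missing, count_missing):
    # Toy child value: sum(gradients)^2 / n_samples, a common
    # variance-reduction proxy; the gain is children minus parent.
    def value(g, n):
        return g * g / n if n > 0 else 0.0

    total_g = grad_per_bin.sum() + grad_missing
    total_n = count_per_bin.sum() + count_missing
    best_gain, best_bin, best_missing_left = -np.inf, None, None

    for missing_go_left in (True, False):
        # Start the left child with the missing-value stats, or not.
        g_left = grad_missing if missing_go_left else 0.0
        n_left = count_missing if missing_go_left else 0
        for bin_idx in range(len(grad_per_bin) - 1):  # last bin is no split point
            g_left += grad_per_bin[bin_idx]
            n_left += count_per_bin[bin_idx]
            g_right, n_right = total_g - g_left, total_n - n_left
            gain = (value(g_left, n_left) + value(g_right, n_right)
                    - value(total_g, total_n))
            if gain > best_gain:
                best_gain, best_bin, best_missing_left = gain, bin_idx, missing_go_left

    return best_gain, best_bin, best_missing_left

# Per-bin sums of gradients and sample counts, plus the missing-value stats.
grad_per_bin = np.array([3.0, -1.0, -2.0, 4.0])
count_per_bin = np.array([10, 8, 9, 7])
print(best_split_with_missing(grad_per_bin, count_per_bin,
                              grad_missing=5.0, count_missing=3))
```

The best split records, along with the bin index, whether the missing values should go to the left or to the right child, and that flag is what the predictor uses later.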
At prediction time, samples with NaNs are mapped to the child (left or right) that was learned to give the best gain. Note that if there were no missing values during training, then the samples with missing values are mapped to whichever child has the most samples. That's what H2O does (I haven't checked the other libraries, but that seems to be the most sensible behavior).
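A minimal sketch of that prediction-time routing, with a toy tree stored as a list of dicts (key names such as `missing_go_to_left` mirror the idea but are not necessarily the attribute names of the actual predictor nodes):

```python
import numpy as np

def predict_one(x, nodes, node_idx=0):
    # Walk down the toy tree for a single sample.
    node = nodes[node_idx]
    if node["is_leaf"]:
        return node["value"]
    feature_value = x[node["feature_idx"]]
    if np.isnan(feature_value):
        # Missing value: follow the direction learned at training time.
        go_left = node["missing_go_to_left"]
    else:
        go_left = feature_value <= node["threshold"]
    return predict_one(x, nodes, node["left"] if go_left else node["right"])

# A stump that learned that missing values should go to the right child.
nodes = [
    {"is_leaf": False, "feature_idx": 0, "threshold": 0.5,
     "missing_go_to_left": False, "left": 1, "right": 2},
    {"is_leaf": True, "value": -1.0},
    {"is_leaf": True, "value": +1.0},
]
print(predict_one(np.array([0.2]), nodes))     # -1.0
print(predict_one(np.array([np.nan]), nodes))  # 1.0
```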
*LightGBM assigns missing values to the last bin. I haven't found a compelling reason to do so. We assign missing values to the first bin instead, which has the advantage of always being the first bin (whereas the index of the last bin may vary, and thus needs to be passed along as a parameter, which is annoying). We also assign missing values to the last bin now.
EDIT: see the list of technical changes at #13911 (comment)
I haven't found a compelling reason to do so. We assign missing values to the first bin instead, which has the advantage of always being the first bin (whereas the index of the last bin may vary, and thus needs to be passed along as a parameter which is annoying).We also assign missing values to the last bin nowEDIT: see list of technical changes at #13911 (comment)