[MRG+1] Faster Gradient Boosting Decision Trees with binned features #12807
Conversation
NicolasHug added some commits on Dec 17, 2018
NicolasHug added some commits on Dec 21, 2018
NicolasHug referenced this pull request on Jan 2, 2019: MNT simplify some tree code with memoryviews #12886 (Merged)
NicolasHug added some commits on Jan 3, 2019
NicolasHug added some commits on Mar 16, 2019
If that were true for all possible examples showing the benefits of this method, then we wouldn't be trying to merge this PR, would we? The mere fact that it's much faster for larger datasets while the performance doesn't degrade deserves a simple example in itself. That said, we do need to keep the examples fast. Is it feasible to have an example which runs fast, compares this implementation with the old one and/or other ensembles, and yet shows the speedup?
I could make an example reproducing the first benchmark? The thing is, it will be either slow or not super interesting, since the comparison is interesting precisely when the current implementation starts to be slow. I tried to come up with other examples that would be interesting, but I didn't get anything convincing so far. For example, I thought it'd be nice to illustrate the impact of the … Really, the only reason one would want to use this new implementation is that it's (much) faster.
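For reference, a minimal version of such a comparison could look like the sketch below. The dataset size and estimator settings are illustrative assumptions (not the example that would actually ship), and the import path follows the `sklearn.experimental` module introduced in this PR:

```python
# Hypothetical sketch of a "fast" comparison example; sizes and settings are
# placeholders chosen so it finishes quickly, not tuned benchmark values.
from time import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.experimental import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for Est in (GradientBoostingClassifier, HistGradientBoostingClassifier):
    est = Est(random_state=0)
    tic = time()
    est.fit(X_train, y_train)
    duration = time() - tic
    score = est.score(X_test, y_test)
    print(f"{Est.__name__}: {duration:.1f}s, test accuracy {score:.3f}")
```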
adrinjalali self-assigned this on Mar 21, 2019
adrinjalali referenced this pull request on Mar 21, 2019: Doesn't load comments if they're too many #1060 (Open)
@NicolasHug other than a potential example (which I'm not too attached to anyway), what else is left here? I'm trying to apply the changes to the old trees we talked about, and for that we'd better have this in first, since those changes make the gradient boosting models we have twice as slow (as @glemaitre benchmarked).
There is nothing left, as long as you're OK with the current limitations (non-implemented features, etc.) that are mentioned in the header. IMO it makes sense to implement these new features in other PRs since this one is already gigantic. One thing I didn't mention in the header (but I also believe it should be addressed in another PR) is that …
I think you addressed all my comments, if I'm not mistaken.
Just double-checked my mail notifications and yes, I have addressed everything so far :)
This was referenced Mar 23, 2019
Since you seem to be happy with it, would you mind marking this PR as approved, @adrinjalali? :) I'm hoping it will help with other reviews.
adrinjalali changed the title from [MRG] Faster Gradient Boosting Decision Trees with binned features to [MRG+1] Faster Gradient Boosting Decision Trees with binned features on Mar 25, 2019
adrinjalali approved these changes on Mar 25, 2019
Especially since we're putting this in experimental, I'm more than happy to have it in :)
    G_H_DTYPE_C [::1] ordered_hessians
    unsigned char hessians_are_constant

    def __init__(self, const X_BINNED_DTYPE_C [::1, :] X_binned, unsigned int
adrinjalali (Member), Mar 25, 2019:
it really is a nitpick, but the unsigned int at the end of the line w/o the variable name just looks odd :P
        int n_samples
        int feature_idx
        int i
        # need local views to avoid python interactions
adrinjalali (Member), Mar 25, 2019:
just a note that in the splitter.pyx case where I had them, if you remember, the local variables actually slowed things down. Not sure how it'd be here.
        with nogil:
            n_samples = sample_indices.shape[0]

            # Populate ordered_gradients and ordered_hessians. (Already done
            bin_2 = binned_feature[sample_indices[i + 2]]
            bin_3 = binned_feature[sample_indices[i + 3]]

            out[feature_idx, bin_0].sum_gradients += ordered_gradients[i]
adrinjalali (Member), Mar 25, 2019:
It's really bugging me that this part is copy/pasted four or so times. I wish we could have it in an inline function or something.
jeremiedbb (Contributor), Mar 25, 2019:
This is to optimize auto-vectorization. It helps gcc :)
@NicolasHug it would be interesting to try not to manually unroll the loop and compile with the -Ofast flag. If performance and numerical stability are preserved, I think it would be worth it, and the code would be simpler.
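For readers who haven't seen the pattern, the structure under discussion looks roughly like the plain-Python stand-in below (names are simplified; the real Cython kernel also accumulates hessians and per-bin counts). The loop body is repeated for four consecutive samples per iteration so the C compiler can auto-vectorize it:

```python
# Plain-Python sketch of the manually unrolled histogram update discussed above.
# `hist`, `binned_feature`, `sample_indices` and `ordered_gradients` are
# stand-in names, not the actual Cython variables.
def build_histogram_unrolled(binned_feature, sample_indices, ordered_gradients, hist):
    n = len(sample_indices)
    unrolled_upper = (n // 4) * 4

    for i in range(0, unrolled_upper, 4):
        bin_0 = binned_feature[sample_indices[i]]
        bin_1 = binned_feature[sample_indices[i + 1]]
        bin_2 = binned_feature[sample_indices[i + 2]]
        bin_3 = binned_feature[sample_indices[i + 3]]

        hist[bin_0] += ordered_gradients[i]
        hist[bin_1] += ordered_gradients[i + 1]
        hist[bin_2] += ordered_gradients[i + 2]
        hist[bin_3] += ordered_gradients[i + 3]

    # remainder loop for the last n % 4 samples
    for i in range(unrolled_upper, n):
        hist[binned_feature[sample_indices[i]]] += ordered_gradients[i]
```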
adrinjalali (Member), Mar 25, 2019:
-Ofast enables all -O3 optimizations, which are known to produce really buggy code. I'd rather not do that. It'll be fine to enable the specific optimizations which you have in mind though.
jeremiedbb (Contributor), Mar 25, 2019:
-Ofast enables way more than -O3 :)
In fact, -O3 is enabled by default when building sklearn. If you look at the build log, you'll see -O3 all over the place. This flag is safe. It should produce correct code w.r.t. floating point arithmetic.
-Ofast on the other hand enables unsafe math optimizations, e.g. it makes use of associativity rules.
However, in this file there's nothing to optimize this way so I think it should be safe.
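As a quick illustration of the associativity point (nothing sklearn-specific): floating-point addition is not associative, which is exactly the kind of reordering that -funsafe-math-optimizations allows the compiler to perform.

```python
# Float addition is not associative, so letting the compiler reassociate sums
# (as -ffast-math / -funsafe-math-optimizations does) can change results.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0
```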
NicolasHug (Author, Contributor), Mar 25, 2019:
I'm not sure how I feel about using -Ofast either. If we remove the manual unrolling and use -Ofast, but in the future decide not to use -Ofast anymore for whatever reason, the code will be unnecessarily slow.
I agree however that this unrolling is a bit unsettling when you first encounter it. I'll add a note at the beginning of the file to explain it.
jeremiedbb (Contributor), Mar 26, 2019:
I want to add that scipy enables the -ffast-math flag (which is enabled by -Ofast), which enables the -funsafe-math-optimizations flag. Since sklearn makes intensive use of scipy, we sort of already effectively do it in sklearn :)
        sibling = None
        parent = None

        # start and stop indices of the node in the splitter.partition
adrinjalali (Member), Mar 25, 2019:
It may help future developers if you explain here what you explained to us regarding these partitions and splits. It's very elegant; it'd just be easier if the reader sees that in the comments, with a better explanation maybe?
        if self.n_iter_no_change is not None and self.n_iter_no_change < 0:
            raise ValueError('n_iter_no_change={} must be '
                             'positive.'.format(self.n_iter_no_change))
        if (self.validation_fraction is not None and
adrinjalali (Member), Mar 25, 2019:
Then if validation_fraction is not really only a fraction, shouldn't it be called validation_size or something?
        # called from fit(), which only passes pre-binned data to
        # raw_predict() via the scorer_ attribute. predicting is faster on
        # pre-binned data.
        self._in_fit = True
adrinjalali (Member), Mar 25, 2019:
If the data is uint8 and max_bins is, let's say, 10, you'd still need to bin, wouldn't you?
        stopped when none of the last ``n_iter_no_change`` scores are better
        than the ``n_iter_no_change - 1``th-to-last one, up to some
        tolerance. If None or 0, no early-stopping is done.
    tol : float or None optional (default=1e-7)
adrinjalali (Member), Mar 25, 2019:
-    tol : float or None optional (default=1e-7)
+    tol : float or None, optional (default=1e-7)
    >>> from sklearn.experimental import HistGradientBoostingRegressor
    >>> X, y = load_boston(return_X_y=True)
    >>> est = HistGradientBoostingRegressor().fit(X, y)
    >>> est.score(X, y)
    This estimator is much faster than
    :class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
    for big datasets (n_samples >= 10 000). The input data `X` is pre-binned
adrinjalali (Member), Mar 25, 2019:
-    for big datasets (n_samples >= 10 000). The input data `X` is pre-binned
+    for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
NicolasHug referenced this pull request on Mar 25, 2019: Deprecate sklearn.extmath.log_logistic? #13509 (Closed)
I would just mention that since it uses OpenMP, this PR needs #13297 (which won't be merged as such, but vendored from loky) before we merge it.
NicolasHug added some commits on Mar 25, 2019
Looks like I can't reply in the comments either, so I'll do it here.
The thing is, all estimators so far use …
Yes, we bin regardless of the input type and regardless of max_bins. It's OK to re-bin data that was already binned, since the binning operation is idempotent.
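To make the idempotence argument concrete, here is a toy sketch (not the actual binning code in this PR; `toy_bin` is a made-up stand-in). Once values have been mapped to bin indices, the distinct values fit within `n_bins`, so re-binning maps every bin index back onto itself:

```python
import numpy as np

def toy_bin(values, n_bins=256):
    # Made-up binner, loosely mimicking quantile binning.
    distinct = np.unique(values).astype(np.float64)
    if len(distinct) <= n_bins:
        # one bin per distinct value, thresholds at midpoints
        thresholds = (distinct[:-1] + distinct[1:]) / 2
    else:
        percentiles = np.linspace(0, 100, n_bins + 1)[1:-1]
        thresholds = np.percentile(values, percentiles)
    return np.searchsorted(thresholds, values, side="right").astype(np.uint8)

rng = np.random.RandomState(0)
x = rng.normal(size=10_000)

binned = toy_bin(x)
rebinned = toy_bin(binned)
assert np.array_equal(binned, rebinned)  # re-binning binned data is a no-op
```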
Thank you very much for the reviews, Adrin!!!
As discussed in #12807 (comment), we agreed to make this a subsequent PR to avoid making this one even bigger. Note however that we're not using any CBLAS routine here, so over-subscription at the OpenMP level isn't an issue.
It's not about over-subscription, but about how you would pass the … Also, I think it's important to have the OpenMP helpers before any other integration of OpenMP, because they come with a test suite to make sure OpenMP works fine on all platforms.
Sure, I was mentioning over-subscription just for completeness. Like I said above, …


NicolasHug commented on Dec 17, 2018 (edited)
This PR proposes a new implementation for Gradient Boosting Decision Trees. This isn't meant to be a replacement of the current sklearn implementation but rather an addition.
This addresses the second bullet point from #8231.
This is a port from pygbm (with @ogrisel, in Numba), which itself uses lots of the optimizations from LightGBM.
Algorithm details and refs
The main differences with the current sklearn implementation are:
Notes to reviewers
This is going to be a lot of work to review, so please feel free to tell me if there's anything I can do / add that could ease reviewing.
Here's a list of things that probably need to be discussed at some point or that are worth pointing out.
The code is a port of pygbm (from numba to cython). I've ported all the tests as well. So a huge part of the code has already been carefully reviewed (or written) by @ogrisel. There are still a few non-trivial changes to pygbm's code, to accommodate the numba -> cython translation.
Like #11950, this PR uses OpenMP parallelism with Cython
The code is in `sklearn/ensemble._hist_gradient_boosting` and the estimators are exposed in `sklearn.experimental` (which is created here, as a result of a discussion during the Paris sprint).
Like in LightGBM, the targets y, gains, values, and sums of gradients / hessians are doubles, and the gradients and hessians arrays are floats to save space (14c7d47). `Y_DTYPE` and the associated C type for targets `y` is double and not float, because with float the numerical checks (test_loss.py) would not pass. Maybe at some point we'll want to also allow floats, since using doubles uses twice as much space (which is not negligible, see the attributes of the `Splitter` class).
I have only added a short note in the User Guide about the new estimators. I think that the gradient boosting section of the user guide could benefit from an in-depth rewriting. I'd be happy to do that, but in a later PR.
Currently the parallel code uses all possible threads. Do we want to expose `n_jobs` (OpenMP-wise, not joblib of course)?
The estimator names are currently `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor`.
API differences with the current implementation:
Happy to discuss these points of course. In general I tried to match the parameter names with those of the current GBDTs.
New features:
`validation_fraction` can also be an int to specify the absolute size of the validation set (not just a proportion).
Changed parameters and attributes:
The `n_estimators` parameter has been changed to `max_iter` because, unlike the current GBDT implementations, the underlying "predictors" aren't estimators. They are private and have no `fit` method. Also, in multiclass classification we build C * `max_iter` predictors.
The `estimators_` attribute has been removed for the same reason.
`train_score_` is of size `n_estimators + 1` instead of `n_estimators` because it contains the score of the 0th iteration (before the boosting process).
`oob_improvement_` is replaced by `validation_score_`, also with size `n_estimators + 1`.
Unsupported parameters and attributes:
`subsample` (doesn't really make sense here)
`criterion` (same)
`min_samples_split` is not supported, but `min_samples_leaf` is supported.
anything `sample_weights`-related
`min_impurity_decrease` is not supported (we have `min_gain_to_split` but it is not exposed in the public API)
`warm_start`
`max_features` (probably not needed)
`staged_decision_function`, `staged_predict_proba`, etc.
`init` estimator
`feature_importances_`
The `loss_` attribute is not exposed.
Future improvements, for later PRs (no specific order):
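To make the API changes above concrete, here is a hedged usage sketch (parameter values are arbitrary, and the import path is the `sklearn.experimental` location used by this PR; this is illustrative, not documentation):

```python
from sklearn.datasets import make_regression
from sklearn.experimental import HistGradientBoostingRegressor

X, y = make_regression(n_samples=10_000, random_state=0)

est = HistGradientBoostingRegressor(
    max_iter=100,             # replaces n_estimators
    validation_fraction=500,  # can be an int: absolute size of the validation set
    n_iter_no_change=5,       # enables early stopping
    tol=1e-7,
)
est.fit(X, y)

# train_score_ includes the score of the 0th iteration (before boosting),
# hence the "+ 1" in its size mentioned above.
print(len(est.train_score_), est.train_score_[-1])
```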
Benchmarks
Done on my laptop: Intel i5 7th gen, 4 cores, 8 GB RAM.
TLDR:
Comparison between proposed PR and current estimators:
On binary classification only; I don't think it's really needed to do more since the performance difference is striking. Note that for larger sample sizes the current estimators simply cannot run because of the sorting step that never terminates. I don't provide the benchmark code; it's exactly the same as that of `benchmarks/bench_fast_gradient_boosting.py`.

Comparison between proposed PR and LightGBM / XGBoost:
On the Higgs-Boson dataset:
python benchmarks/bench_hist_gradient_boosting_higgsboson.py --lightgbm --xgboost --subsample 5000000 --n-trees 50

Sklearn: done in 28.787s, ROC AUC: 0.7330, ACC: 0.7346
LightGBM: done in 27.595s, ROC AUC: 0.7333, ACC: 0.7349
XGBoost: done in 41.726s, ROC AUC: 0.7335, ACC: 0.7351
Entire log:
Regression task:
python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem regression --n-samples-max 5000000 --n-trees 50

Binary classification task:
python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem classification --n-classes 2 --n-samples-max 5000000 --n-trees 50

Multiclass classification task:
python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem classification --n-classes 3 --n-samples-max 5000000 --n-trees 50