[MRG+1] Faster Gradient Boosting Decision Trees with binned features #12807

Open · wants to merge 168 commits into base: master

Conversation

@NicolasHug (Contributor) commented Dec 17, 2018

This PR proposes a new implementation of Gradient Boosting Decision Trees. It isn't meant to replace the current sklearn implementation but rather to complement it.

This addresses the second bullet point from #8231.

This is a port of pygbm (written with @ogrisel, in Numba), which itself uses many of the optimizations from LightGBM.

Algorithm details and refs

  • The proposed algorithm roughly corresponds to the 'approximate' variant of the XGBoost paper, except that the data is binned at the very beginning of the training process, instead of at each node of the trees.
  • See also Algorithm 1 of the LightGBM paper. Section 2.1 is worth a read.
  • For a refresher or general background on GBDTs: The Elements of Statistical Learning. The XGBoost paper is also quite good.

The main differences with the current sklearn implementation are:

  • Before training, the data is binned into equally-spaced bins (up to 256), which considerably reduces the number of split points to consider. The other advantage is that the data becomes integer-valued, which is faster to handle than real-valued data. A minimal sketch of this binning step follows this list.
  • Newton's method is used instead of gradient descent
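To make the binning step concrete, here is a minimal numpy sketch of the idea (this is not the PR's BinMapper; the actual implementation may place bin edges differently and runs the mapping in parallel):

```python
# Minimal sketch only: map each real-valued feature to at most 256 uint8 bins,
# so split finding later works on small integers instead of floats.
import numpy as np

def bin_feature(values, n_bins=256):
    """Map one feature column to bin indices in [0, n_bins - 1]."""
    # Equally-spaced edges between the observed min and max (an assumption made
    # for this sketch; the real BinMapper may choose edges differently).
    edges = np.linspace(values.min(), values.max(), n_bins + 1)[1:-1]
    return np.searchsorted(edges, values).astype(np.uint8)

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
X_binned = np.column_stack([bin_feature(X[:, j]) for j in range(X.shape[1])])
print(X_binned.dtype, X_binned.min(), X_binned.max())  # uint8, 0, 255
```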

Notes to reviewers

This is going to be a lot of work to review, so please feel free to tell me if there's anything I can do / add that could ease reviewing.

Here's a list of things that probably need to be discussed at some point or that are worth pointing out.

  • The code is a port of pygbm (from numba to cython). I've ported all the tests as well, so a huge part of the code has already been carefully reviewed (or written) by @ogrisel. There are still a few non-trivial changes to pygbm's code to accommodate the numba -> cython translation.

  • Like #11950, this PR uses OpenMP parallelism with Cython

  • The code is in sklearn.ensemble._hist_gradient_boosting and the estimators are exposed in sklearn.experimental (which is created here, as the result of a discussion during the Paris sprint).

  • Y_DTYPE and the associated C type for the targets y are double, not float, because with float the numerical checks (test_loss.py) would not pass. Maybe at some point we'll also want to allow floats, since using doubles takes twice as much space (which is not negligible, see the attributes of the Splitter class). As in LightGBM, the targets y, gains, values, and sums of gradients / hessians are doubles, while the gradients and hessians arrays are floats to save space (14c7d47). A small sketch of this dtype trade-off follows this list.

  • I have only added a short note in the User Guide about the new estimators. I think the gradient boosting section of the user guide could benefit from an in-depth rewrite. I'd be happy to do that, but in a later PR.

  • Currently the parallel code uses all available threads. Do we want to expose n_jobs (OpenMP-wise, not joblib, of course)?

  • The estimator names are currently HistGradientBoostingClassifier and HistGradientBoostingRegressor.
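A small sketch of the dtype trade-off mentioned above (Y_DTYPE and G_H_DTYPE are the names used in the PR; the array size is only illustrative):

```python
# Illustration only: per-sample gradients / hessians are stored as float32 to
# halve memory, while targets, gains and accumulated sums stay in float64.
import numpy as np

Y_DTYPE = np.float64    # targets y, gains, values, sums of gradients / hessians
G_H_DTYPE = np.float32  # per-sample gradients and hessians

n_samples = 10_000_000
gradients = np.zeros(n_samples, dtype=G_H_DTYPE)
print(gradients.nbytes // 10**6, "MB")          # 40 MB instead of 80 MB
sum_gradients = gradients.sum(dtype=Y_DTYPE)    # accumulate in double precision
```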

API differences with current implementation:

Happy to discuss these points, of course. In general I tried to match the parameter names with those of the current GBDTs.

New features:
  • early stopping can be checked with an arbitrary scorer, not just with the loss (see the usage sketch after these lists)
  • validation_fraction can also be an int, to specify the absolute size of the validation set (not just a proportion)
Changed parameters and attributes:
  • the loss parameters have different names. I personally think that 'deviance' is just an obfuscating name for the logistic loss.
  • the n_estimators parameter has been renamed to max_iter because, unlike in the current GBDT implementations, the underlying "predictors" aren't estimators. They are private and have no fit method. Also, in multiclass classification we build C * max_iter trees (one tree per class at each iteration).
  • the estimators_ attribute has been removed for the same reason.
  • train_score_ is of size n_estimators + 1 instead of n_estimators because it contains the score of the 0th iteration (before the boosting process).
  • oob_improvement_ is replaced by validation_score_, also with size n_estimators + 1
Unsupported parameters and attributes:
  • subsample (doesn't really make sense here)
  • criterion (same)
  • min_samples_split is not supported, but min_samples_leaf is
  • anything sample_weight-related
  • min_impurity_decrease is not supported (we have min_gain_to_split but it is not exposed in the public API)
  • warm_start
  • max_features (probably not needed)
  • staged_decision_function, staged_predict_proba, etc.
  • init estimator
  • feature_importances_
  • the loss_ attribute is not exposed.
  • Only the least squares loss is supported for regression. No least absolute error, Huber, or quantile loss.
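A hedged usage sketch of the API described above; the parameter names (max_iter, scoring, validation_fraction, n_iter_no_change) are taken from this PR's description and docstrings and may still change, and the import path assumes this branch, where the estimators live in sklearn.experimental:

```python
# Sketch only: early stopping with an arbitrary scorer and an int-valued
# validation_fraction, as described above. Parameter names are assumptions
# based on this PR and may differ in the final API.
from sklearn.datasets import make_classification
from sklearn.experimental import HistGradientBoostingClassifier

X, y = make_classification(n_samples=20_000, random_state=0)

clf = HistGradientBoostingClassifier(
    max_iter=100,               # renamed from n_estimators
    scoring='roc_auc',          # early stopping checked with an arbitrary scorer
    validation_fraction=2_000,  # int: absolute size of the validation set
    n_iter_no_change=5,         # None or 0 would disable early stopping
)
clf.fit(X, y)
# train_score_ and validation_score_ both include the 0th (pre-boosting) entry.
print(len(clf.train_score_), len(clf.validation_score_))
```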

Future improvement, for later PRs (no specific order):

  • Implement support for categorical variables (what to do if there are more than 256 categories?)
  • Allow more than 256 bins (requires "dynamically" encoding bins as uint8 or uint32)
  • Implement handling of missing values
  • Implement fast PDPs
  • BinMapper does almost the same job as KBinsDiscretizer (but it's parallelized), so we could eventually integrate it.

Benchmarks

Done on my laptop: Intel i5 7th gen, 4 cores, 8 GB RAM.

TL;DR:

  • considerably faster than the current sklearn implementation
  • faster than XGBoost ('hist' method)
  • faster than CatBoost (not shown here because CatBoost is much slower than the others and would flatten the plots)
  • very close to LightGBM; prediction accuracy is comparable

Comparison between proposed PR and current estimators:

On binary classification only; I don't think more is really needed since the performance difference is striking. Note that for larger sample sizes the current estimators simply cannot run because the sorting step never terminates. I don't provide the benchmark code; it's exactly the same as benchmarks/bench_fast_gradient_boosting.py:
(figure: current_vs_fast benchmark plot)

Comparison between proposed PR and LightGBM / XGBoost:

  • On the Higgs-Boson dataset:
    python benchmarks/bench_hist_gradient_boosting_higgsboson.py --lightgbm --xgboost --subsample 5000000 --n-trees 50

    Sklearn: done in 28.787s, ROC AUC: 0.7330, ACC: 0.7346
    LightGBM: done in 27.595s, ROC AUC: 0.7333, ACC: 0.7349
    XGBoost: done in 41.726s, ROC AUC: 0.7335, ACC: 0.7351

    Entire log:

~/dev/sklearn(branch:gbm*) » python benchmarks/bench_hist_gradient_boosting_higgsboson.py --subsample 5000000 --n-trees 50 --lightgbm --xgboost                                    nico@cotier
Training set with 5000000 records with 28 features.
Fitting a sklearn model...
Binning 1.120 GB of data: 3.665 s
Fitting gradient boosted rounds:
[1/50] 1 tree, 31 leaves, max depth = 7, in 0.595s
[2/50] 1 tree, 31 leaves, max depth = 9, in 0.602s
[3/50] 1 tree, 31 leaves, max depth = 9, in 0.575s
[4/50] 1 tree, 31 leaves, max depth = 12, in 0.552s
[5/50] 1 tree, 31 leaves, max depth = 11, in 0.583s
[6/50] 1 tree, 31 leaves, max depth = 9, in 0.578s
[7/50] 1 tree, 31 leaves, max depth = 11, in 0.561s
[8/50] 1 tree, 31 leaves, max depth = 10, in 0.524s
[9/50] 1 tree, 31 leaves, max depth = 9, in 0.566s
[10/50] 1 tree, 31 leaves, max depth = 10, in 0.552s
[11/50] 1 tree, 31 leaves, max depth = 14, in 0.523s
[12/50] 1 tree, 31 leaves, max depth = 15, in 0.538s
[13/50] 1 tree, 31 leaves, max depth = 11, in 0.501s
[14/50] 1 tree, 31 leaves, max depth = 12, in 0.522s
[15/50] 1 tree, 31 leaves, max depth = 10, in 0.546s
[16/50] 1 tree, 31 leaves, max depth = 9, in 0.409s
[17/50] 1 tree, 31 leaves, max depth = 13, in 0.457s
[18/50] 1 tree, 31 leaves, max depth = 10, in 0.520s
[19/50] 1 tree, 31 leaves, max depth = 13, in 0.463s
[20/50] 1 tree, 31 leaves, max depth = 10, in 0.399s
[21/50] 1 tree, 31 leaves, max depth = 11, in 0.463s
[22/50] 1 tree, 31 leaves, max depth = 9, in 0.356s
[23/50] 1 tree, 31 leaves, max depth = 8, in 0.529s
[24/50] 1 tree, 31 leaves, max depth = 8, in 0.460s
[25/50] 1 tree, 31 leaves, max depth = 9, in 0.414s
[26/50] 1 tree, 31 leaves, max depth = 8, in 0.516s
[27/50] 1 tree, 31 leaves, max depth = 10, in 0.427s
[28/50] 1 tree, 31 leaves, max depth = 8, in 0.460s
[29/50] 1 tree, 31 leaves, max depth = 7, in 0.445s
[30/50] 1 tree, 31 leaves, max depth = 12, in 0.535s
[31/50] 1 tree, 31 leaves, max depth = 10, in 0.498s
[32/50] 1 tree, 31 leaves, max depth = 12, in 0.521s
[33/50] 1 tree, 31 leaves, max depth = 12, in 0.503s
[34/50] 1 tree, 31 leaves, max depth = 10, in 0.410s
[35/50] 1 tree, 31 leaves, max depth = 9, in 0.368s
[36/50] 1 tree, 31 leaves, max depth = 10, in 0.267s
[37/50] 1 tree, 31 leaves, max depth = 8, in 0.460s
[38/50] 1 tree, 31 leaves, max depth = 11, in 0.500s
[39/50] 1 tree, 31 leaves, max depth = 8, in 0.421s
[40/50] 1 tree, 31 leaves, max depth = 8, in 0.391s
[41/50] 1 tree, 31 leaves, max depth = 9, in 0.502s
[42/50] 1 tree, 31 leaves, max depth = 9, in 0.444s
[43/50] 1 tree, 31 leaves, max depth = 7, in 0.366s
[44/50] 1 tree, 31 leaves, max depth = 8, in 0.473s
[45/50] 1 tree, 31 leaves, max depth = 9, in 0.386s
[46/50] 1 tree, 31 leaves, max depth = 11, in 0.411s
[47/50] 1 tree, 31 leaves, max depth = 8, in 0.457s
[48/50] 1 tree, 31 leaves, max depth = 10, in 0.526s
[49/50] 1 tree, 31 leaves, max depth = 8, in 0.535s
[50/50] 1 tree, 31 leaves, max depth = 10, in 0.487s
Fit 50 trees in 28.738 s, (1550 total leaves)
Time spent finding best splits:  17.347s
Time spent applying splits:      2.356s
Time spent predicting:           1.428s
done in 28.787s, ROC AUC: 0.7330, ACC: 0.7346
Fitting a LightGBM model...
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=0.001, min_child_weight=0.001 will be ignored. Current value: min_sum_hessian_in_leaf=0.001
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=0.001, min_child_weight=0.001 will be ignored. Current value: min_sum_hessian_in_leaf=0.001
[LightGBM] [Warning] Starting from the 2.1.2 version, default value for the "boost_from_average" parameter in "binary" objective is true.
This may cause significantly different results comparing to the previous versions of LightGBM.
Try to set boost_from_average=false, if your old models produce bad results
[LightGBM] [Info] Number of positive: 2649426, number of negative: 2350574
[LightGBM] [Info] Total Bins 6143
[LightGBM] [Info] Number of data: 5000000, number of used features: 28
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.529885 -> initscore=0.119683
[LightGBM] [Info] Start training from score 0.119683
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 7
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 13
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 15
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 9
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 10
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 11
[LightGBM] [Debug] Trained a tree with leaves = 31 and max_depth = 8
done in 27.595s, ROC AUC: 0.7333, ACC: 0.7349
Fitting an XGBoost model...
[16:33:14] Tree method is selected to be 'hist', which uses a single updater grow_fast_histmaker.
[16:33:24] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=7
[16:33:25] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:26] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:26] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=8
[16:33:27] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:28] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:29] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:29] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=12
[16:33:30] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:31] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=12
[16:33:31] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=11
[16:33:32] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=13
[16:33:33] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:33] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:34] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:35] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:35] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=11
[16:33:36] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:36] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:37] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=12
[16:33:38] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:38] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:39] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:39] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:40] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=12
[16:33:41] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:41] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=8
[16:33:42] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:42] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:43] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=12
[16:33:44] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:44] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=8
[16:33:45] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=7
[16:33:45] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=11
[16:33:46] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=7
[16:33:47] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=7
[16:33:47] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:48] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:48] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:49] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:50] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:50] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:50] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:51] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=8
[16:33:52] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=10
[16:33:52] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=11
[16:33:53] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=8
[16:33:53] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=11
[16:33:54] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=9
[16:33:54] /workspace/src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 60 extra nodes, 0 pruned nodes, max_depth=12
done in 41.726s, ROC AUC: 0.7335, ACC: 0.7351

  • Regression task:
    python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem regression --n-samples-max 5000000 --n-trees 50
    (figure: regression benchmark plot)

  • Binary classification task:
    python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem classification --n-classes 2 --n-samples-max 5000000 --n-trees 50

(figure: binary_classif benchmark plot)

  • Multiclass classification task:
    python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem classification --n-classes 3 --n-samples-max 5000000 --n-trees 50

(figure: multiclass benchmark plot)

@adrinjalali (Member) commented Mar 17, 2019

> Regarding examples I'm not sure how useful it would really be, for now. Looking at the existing examples for the current GBDTs, they all rely on some non-implemented feature like plotting the validation loss at each iteration (requires staged_decision_function) or subsampling (not useful here).

If that were true for all possible examples showing the benefits of this method, then we wouldn't be trying to merge this PR, would we? The mere fact that it's much faster for larger datasets while the prediction performance doesn't degrade deserves a simple example in itself. That said, we do need to keep the examples fast. Is it feasible to have an example which runs fast, compares this implementation with the old one and/or other ensembles, and still shows the speedup?

@NicolasHug (Contributor, Author) commented Mar 17, 2019

> Is it feasible to have an example which runs fast, compares this implementation with the old one and/or other ensembles, and still shows the speedup?

I could make an example reproducing the first benchmark? The thing is, it will be either slow or not super interesting, since the comparison is interesting precisely when the current implementation starts to be slow.

I tried to come up with other examples that would be interesting, but I haven't found anything convincing so far.

For example, I thought it'd be nice to illustrate the impact of the max_bins parameter. But it turns out that it has a very small impact on computation time (the bottleneck of the algorithm is the histogram computation, which is dominated by the number of samples, not the number of bins), and also a fairly small impact on prediction accuracy, since for big datasets it doesn't make much of a difference whether you're using 100 or 200 bins.

Really the only reason one would want to use this new implementation is that it's (much) faster.

@adrinjalali (Member) commented Mar 23, 2019

@NicolasHug other than a potential example (which I'm not too attached to anyway), what else is left here?

I'm trying to apply the changes to the old trees that we talked about, and for that we'd better have this in first, since those changes make the gradient boosting models we have about twice as slow (as @glemaitre benchmarked).

@NicolasHug (Contributor, Author) commented Mar 23, 2019

> what else is left here?

There is nothing left, as long as you're OK with the current limitations (non-implemented features, etc.) that are mentioned in the header.

IMO it makes sense to implement these new features in other PRs since this one is already gigantic.

One thing I didn't mention in the header (but which I also believe should be addressed in another PR) is that n_jobs isn't supported here yet, i.e. OpenMP always uses as many threads as possible.

@adrinjalali (Member) commented Mar 23, 2019

I think you addressed all my comments, if I'm not mistaken.
So I'm more than okay with this. Can't wait to have it in and to have smaller PRs for all the improvements to documentation and implementation.

@NicolasHug (Contributor, Author) commented Mar 23, 2019

> I think you addressed all my comments, if I'm not mistaken.

Just double-checked my mail notifications and yes I have addressed everything so far :)

@NicolasHug (Contributor, Author) commented Mar 25, 2019

Since you seem to be happy with it, would you mind marking this PR as approved, @adrinjalali? :) I'm hoping it will help bring in other reviews.

@adrinjalali changed the title from "[MRG] Faster Gradient Boosting Decision Trees with binned features" to "[MRG+1] Faster Gradient Boosting Decision Trees with binned features" on Mar 25, 2019

@adrinjalali (Member) left a comment

Especially since we're putting this in experimental, I'm more than happy to have it in :)

G_H_DTYPE_C [::1] ordered_hessians
unsigned char hessians_are_constant

def __init__(self, const X_BINNED_DTYPE_C [::1, :] X_binned, unsigned int

@adrinjalali (Member) commented Mar 25, 2019

it really is a nitpick, but the unsigned int at the end of the line w/o the variable name just looks odd :P

int n_samples
int feature_idx
int i
# need local views to avoid python interactions

@adrinjalali (Member) commented Mar 25, 2019

just a note that in the splitter.pyx case where I had them, if you remember, the local variables actually slowed things down. Not sure how it'd be here.

with nogil:
n_samples = sample_indices.shape[0]

# Populate ordered_gradients and ordered_hessians. (Already done

@adrinjalali (Member) commented Mar 25, 2019

really nice!

bin_2 = binned_feature[sample_indices[i + 2]]
bin_3 = binned_feature[sample_indices[i + 3]]

out[feature_idx, bin_0].sum_gradients += ordered_gradients[i]

@adrinjalali (Member) commented Mar 25, 2019

It's really bugging me that there are 4 or so copies of this part. I wish we could have it in an inline function or something.

@jeremiedbb (Contributor) commented Mar 25, 2019

This is to optimize auto-vectorization. It helps gcc :)
@NicolasHug it would be interesting to try not manually unrolling the loop and compiling with the -Ofast flag. If performance and numerical stability are preserved, I think it would be worth it, and the code would be simpler.

@adrinjalali (Member) commented Mar 25, 2019

-Ofast enables all -O3 optimizations, which are known to produce really buggy code. I'd rather not do that. It'll be fine to enable the specific optimizations which you have in mind though.

@jeremiedbb (Contributor) commented Mar 25, 2019

-Ofast enables way more than -O3 :)
In fact, -O3 is enabled by default when building sklearn. If you look at the build log, you'll see -O3 all over the place. This flag is safe: it should produce correct code w.r.t. floating point arithmetic.

-Ofast, on the other hand, enables unsafe math optimizations, e.g. making use of associativity rules. However, in this file there's nothing to optimize that way, so I think it should be safe.

@NicolasHug (Author, Contributor) commented Mar 25, 2019

I'm not sure how I feel about using -Ofast either. If we re-roll this code (i.e. remove the manual unrolling) and rely on -Ofast, but later decide not to use -Ofast anymore for whatever reason, the code will be unnecessarily slow.

I agree however that this unrolling is a bit unsettling when you first encounter it. I'll add a note at the beginning of the file to explain it.

@jeremiedbb (Contributor) commented Mar 26, 2019

I want to add that scipy enables the -ffast-math flag (which is enabled by -Ofast), which in turn enables -funsafe-math-optimizations. Since sklearn makes intensive use of scipy, we sort of already do this in sklearn :)
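For readers following this thread, here is a plain-Python rendering of the unrolled histogram accumulation being discussed (illustrative only; the real code is Cython, also accumulates hessians and counts, and writes into a structured histogram array):

```python
# Sketch of the 4x manual unrolling pattern; in the Cython code this helps
# gcc auto-vectorize the accumulation loop.
import numpy as np

def build_histogram_unrolled(binned_feature, sample_indices, ordered_gradients,
                             n_bins=256):
    sum_gradients = np.zeros(n_bins)
    n = sample_indices.shape[0]
    unrolled_upper = (n // 4) * 4

    # Main loop: process 4 samples per iteration.
    for i in range(0, unrolled_upper, 4):
        bin_0 = binned_feature[sample_indices[i]]
        bin_1 = binned_feature[sample_indices[i + 1]]
        bin_2 = binned_feature[sample_indices[i + 2]]
        bin_3 = binned_feature[sample_indices[i + 3]]
        sum_gradients[bin_0] += ordered_gradients[i]
        sum_gradients[bin_1] += ordered_gradients[i + 1]
        sum_gradients[bin_2] += ordered_gradients[i + 2]
        sum_gradients[bin_3] += ordered_gradients[i + 3]

    # Remainder loop for the last n % 4 samples.
    for i in range(unrolled_upper, n):
        sum_gradients[binned_feature[sample_indices[i]]] += ordered_gradients[i]

    return sum_gradients
```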

sibling = None
parent = None

# start and stop indices of the node in the splitter.partition

@adrinjalali (Member) commented Mar 25, 2019

It may help future developers if you explain here what you explained to us regarding these partitions and splits. It's very elegant; it would just be easier if the reader saw that in the comments, with a better explanation maybe?

if self.n_iter_no_change is not None and self.n_iter_no_change < 0:
raise ValueError('n_iter_no_change={} must be '
'positive.'.format(self.n_iter_no_change))
if (self.validation_fraction is not None and

@adrinjalali (Member) commented Mar 25, 2019

Then if validation_fraction is not really only a fraction, shouldn't it be called validation_size or something?

# called from fit(), which only passes pre-binned data to
# raw_predict() via the scorer_ attribute. predicting is faster on
# pre-binned data.
self._in_fit = True

@adrinjalali (Member) commented Mar 25, 2019

If the data is uint8 and max_bins is, let's say, 10, you'd still need to bin, wouldn't you?

stopped when none of the last ``n_iter_no_change`` scores are better
than the ``n_iter_no_change - 1``th-to-last one, up to some
tolerance. If None or 0, no early-stopping is done.
tol : float or None optional (default=1e-7)

@adrinjalali (Member) commented Mar 25, 2019
Suggested change
tol : float or None optional (default=1e-7)
tol : float or None, optional (default=1e-7)
>>> from sklearn.experimental import HistGradientBoostingRegressor
>>> X, y = load_boston(return_X_y=True)
>>> est = HistGradientBoostingRegressor().fit(X, y)
>>> est.score(X, y)

@adrinjalali (Member) commented Mar 25, 2019

interesting, doesn't this need # doctest: +ELLIPSIS?

This estimator is much faster than
:class:`GradientBoostingClassifier<sklearn.ensemble.GradientBoostingClassifier>`
for big datasets (n_samples >= 10 000). The input data `X` is pre-binned

@adrinjalali (Member) commented Mar 25, 2019
Suggested change
for big datasets (n_samples >= 10 000). The input data `X` is pre-binned
for big datasets (n_samples >= 10 000). The input data ``X`` is pre-binned
@jeremiedbb (Contributor) commented Mar 25, 2019

I would just mention that since it uses OpenMP, this PR needs #13297 (which won't be merged as such, but vendored from loky) before we merge it.

@NicolasHug (Contributor, Author) commented Mar 25, 2019

Looks like I can't reply in the comments either, so I'll do it here.

> Then if validation_fraction is not really only a fraction, shouldn't it be called validation_size or something?

The thing is, all estimators so far use validation_fraction, so we would have inconsistent API names, which might be unwanted.

> If the data is uint8 and max_bins is, let's say, 10, you'd still need to bin, wouldn't you?

Yes, we bin regardless of the input type and regardless of max_bins. It's OK to re-bin data that was already binned, since the binning operation is idempotent (see the sketch below).
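A quick check of that idempotence claim, using equally-spaced bins as a stand-in for the actual BinMapper (illustrative only):

```python
# Binning already-binned data leaves it unchanged with this simple
# equally-spaced binner, illustrating the idempotence property claimed above.
import numpy as np

def bin_column(values, n_bins=256):
    edges = np.linspace(values.min(), values.max(), n_bins + 1)[1:-1]
    return np.searchsorted(edges, values).astype(np.uint8)

rng = np.random.RandomState(0)
x = rng.uniform(size=100_000)
once = bin_column(x)
twice = bin_column(once.astype(np.float64))
assert np.array_equal(once, twice)
```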

@NicolasHug (Contributor, Author) commented Mar 25, 2019

Thank you very much for the reviews Adrin!!!

@NicolasHug (Contributor, Author) commented Mar 25, 2019

> I would just mention that since it uses OpenMP, this PR needs #13297 (which won't be merged as such, but vendored from loky) before we merge it.

As discussed in #12807 (comment), we agreed to make this a subsequent PR to avoid making this one even bigger.

Note however that we're not using any CBLAS routine here, so over-subscription at the OpenMP level isn't an issue.

@jeremiedbb (Contributor) commented Mar 25, 2019

It's not about over-subscription, but about how you would pass the n_jobs parameter to the prange.

Also, I think it's important to have the OpenMP helpers before any other integration of OpenMP, because they come with a test suite to make sure OpenMP works fine on all platforms.

@NicolasHug (Contributor, Author) commented Mar 25, 2019

> It's not about over-subscription, but about how you would pass the n_jobs parameter to the prange.

Sure, I was mentioning over-subscription just for completeness. As I said above, n_jobs isn't supported in this PR yet.
