Balanced Random Forest #13227
@@ -109,10 +109,10 @@ set of classifiers is created by introducing randomness in the classifier
construction. The prediction of the ensemble is given as the averaged
prediction of the individual classifiers.

As other classifiers, forest classifiers have to be fitted with two arrays: a
sparse or dense array X of size ``[n_samples, n_features]`` holding the
training samples, and an array Y of size ``[n_samples]`` holding the target
values (class labels) for the training samples::

    >>> from sklearn.ensemble import RandomForestClassifier
    >>> X = [[0, 0], [1, 1]]
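The doctest above is cut off by the diff view. As a self-contained sketch of the same fit/predict cycle (illustrative toy data of our own, not the snippet from the docs; assumes scikit-learn is installed):

```python
# Minimal sketch of fitting a forest classifier; the two well-separated
# blobs make the predictions on nearby points unambiguous.
from sklearn.ensemble import RandomForestClassifier

X = [[i, i] for i in range(10)] + [[i + 100, i + 100] for i in range(10)]
Y = [0] * 10 + [1] * 10   # class labels, one per training sample

clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(X, Y)
print(clf.predict([[2, 2], [105, 105]]))  # -> [0 1]
```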
@@ -200,6 +200,9 @@ in bias::

Parameters
----------

Main parameters
...............

The main parameters to adjust when using these methods are ``n_estimators`` and
``max_features``. The former is the number of trees in the forest. The larger
the better, but also the longer it will take to compute. In addition, note that
@@ -223,10 +226,50 @@ or out-of-bag samples. This can be enabled by setting ``oob_score=True``.

.. note::

    The size of the model with the default parameters is :math:`O( M * N * log
    (N) )`, where :math:`M` is the number of trees and :math:`N` is the number
    of samples. In order to reduce the size of the model, you can change these
    parameters: ``min_samples_split``, ``max_leaf_nodes``, ``max_depth`` and
    ``min_samples_leaf``.
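As an illustrative sketch (not part of the diff) of how the parameters in the note above shrink the serialized model, assuming scikit-learn and NumPy are installed:

```python
# Compare the pickled size of a forest grown to full depth against one
# constrained with max_depth and min_samples_leaf.
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(2000, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# default parameters: trees are grown until the leaves are pure
full = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
# constrained trees: far fewer nodes per tree
small = RandomForestClassifier(
    n_estimators=20, max_depth=4, min_samples_leaf=20, random_state=0
).fit(X, y)

full_bytes = len(pickle.dumps(full))
small_bytes = len(pickle.dumps(small))
print(small_bytes < full_bytes)  # True: the constrained forest is smaller
```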
.. _balanced_bootstrap:
Learning from datasets with imbalanced classes
..............................................

In some datasets, the number of samples per class might vary tremendously
(e.g. 100 samples in a "majority" class for a single sample in a "minority"
class). Learning from such imbalanced datasets is challenging. The tree
splitting criteria (i.e. gini or entropy) are sensitive to class imbalance and
will naturally favor the classes with the most samples given during ``fit``.

The :class:`RandomForestClassifier` provides a parameter `class_weight` with
the option `"balanced_bootstrap"` to alleviate the bias induced by the class
imbalance. This strategy will create a bootstrap subsample for the "minority"
class and draw with replacement the same number of training instances from the
other classes. A balanced subsample is drawn for each tree of the ensemble, as
proposed in [CLB2004]_. This algorithm is also called balanced random forest.
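The balanced-bootstrap idea described above can be sketched in pure Python; this is an illustration of the sampling strategy, not scikit-learn's internal implementation:

```python
# Draw, with replacement, as many samples from every class as the minority
# class has, so each class contributes equally to the subsample.
import random
from collections import Counter

def balanced_bootstrap_indices(y, rng):
    """Indices of a bootstrap sample with equal counts per class."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_minority = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_minority))  # with replacement
    return sample

y = [0] * 100 + [1] * 5          # 100 "majority" vs 5 "minority" samples
sample = balanced_bootstrap_indices(y, random.Random(42))
print(Counter(y[i] for i in sample))  # each class contributes 5 draws
```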
`class_weight="balanced"` and `class_weight="balanced_subsample"` provide
alternative balancing strategies, which are less effective when the difference
between the class frequencies is large.
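For reference, the `class_weight="balanced"` alternative reweights classes as `n_samples / (n_classes * class_count)`, the heuristic documented by scikit-learn; a pure-Python sketch of that formula:

```python
# Rare classes receive proportionally larger weights under the "balanced"
# heuristic, instead of being resampled as with "balanced_bootstrap".
from collections import Counter

def balanced_class_weights(y):
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights[1])  # 5.0: the rare class gets a 9x larger weight than class 0
```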
.. note::

    Be aware that `sample_weight` will be taken into account when setting
    `class_weight="balanced_bootstrap"`. Thus, it is recommended not to
    manually balance the dataset using `sample_weight` while also using
    `class_weight="balanced_bootstrap"`.
.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_plot_learn_from_imbalanced_dataset.py`

.. topic:: References

    .. [CLB2004] C. Chen, A. Liaw, and L. Breiman, "Using random forest to
       learn imbalanced data." University of California, Berkeley 110.1-12,
       24, 2004.

Parallelization
---------------
@@ -48,9 +48,24 @@ Changelog

:mod:`sklearn.cluster`
......................

- |Fix| example fix in model XXX. :pr:`xxxx` or :issue:`xxxx` by
  :user:`name <user id>`
:mod:`sklearn.ensemble`
.......................

- |Enhancement| :class:`cluster.AgglomerativeClustering` has a faster and more
  memory efficient implementation of single linkage clustering.
  :pr:`11514` by :user:`Leland McInnes <lmcinnes>`.
- |Efficiency| add the option `class_weight="balanced_bootstrap"` in
  :class:`ensemble.RandomForestClassifier`. This option ensures that each
  tree is trained on a subsample with an equal number of instances from each
  class. This algorithm is known as balanced random forest.
  :pr:`13227` by :user:`Eric Potash <potash>`, :user:`Christos Aridas <chkoar>`
  and :user:`Guillaume Lemaitre <glemaitre>`.
- |Fix| :class:`cluster.KMeans` with ``algorithm="elkan"`` now converges with
  ``tol=0`` as with the default ``algorithm="full"``. :pr:`16075` by
  :user:`Erich Schubert <kno10>`.