Balanced Random Forest #13227

Closed (wants to merge 27 commits)

Commits (27)
5ab7141  Add Balanced Random Forest (Feb 22, 2019)
4fe83b9  Fix test (Feb 22, 2019)
e65e4da  fix docstring (Feb 22, 2019)
04617e9  pep8 (Feb 22, 2019)
b816efe  Merge remote-tracking branch 'origin/master' into pr/chkoar/13227 (glemaitre, Nov 19, 2019)
5009dbe  FIX remove part from merge (glemaitre, Nov 19, 2019)
696bade  PEP8 (glemaitre, Nov 19, 2019)
32d0f99  refactor (glemaitre, Nov 20, 2019)
260eb07  EXA add imbalanced learning example (glemaitre, Nov 20, 2019)
b629966  fix in example (glemaitre, Nov 20, 2019)
df381b2  DOC solve sphinx warning (glemaitre, Nov 20, 2019)
6cb99d9  Merge remote-tracking branch 'origin/master' into pr/chkoar/13227 (glemaitre, Nov 20, 2019)
3529c38  DOC add user guide description (glemaitre, Nov 20, 2019)
27943df  DOC add whats new (glemaitre, Nov 20, 2019)
8f15774  applied Nicolas review (glemaitre, Nov 20, 2019)
94b6485  Merge remote-tracking branch 'origin/master' into pr/chkoar/13227 (glemaitre, Nov 20, 2019)
c4a904a  iter (glemaitre, Nov 20, 2019)
d4ee297  add versionadded (glemaitre, Nov 20, 2019)
04aa670  do not use f string, this is not python 3.6 (glemaitre, Nov 20, 2019)
f5fdcaf  fix pandas version (glemaitre, Nov 20, 2019)
bb6b713  Merge remote-tracking branch 'origin/master' into pr/chkoar/13227 (glemaitre, Nov 21, 2019)
2e5777c  Merge remote-tracking branch 'origin/master' into pr/chkoar/13227 (glemaitre, Dec 2, 2019)
0160f18  improve example (glemaitre, Dec 2, 2019)
aa182fc  PEP8 (glemaitre, Dec 2, 2019)
94dedbc  set style of dataframe (glemaitre, Dec 2, 2019)
2c4ec1e  review adrin (glemaitre, Feb 20, 2020)
4bc938c  Merge remote-tracking branch 'origin/master' into pr/chkoar/13227 (glemaitre, Feb 20, 2020)
README.rst (1 addition, 1 deletion)

@@ -58,7 +58,7 @@ scikit-learn 0.23 and later require Python 3.6 or newer.
 Scikit-learn plotting capabilities (i.e., functions start with ``plot_``
 and classes end with "Display") require Matplotlib (>= 2.1.1). For running the
 examples Matplotlib >= 2.1.1 is required. A few examples require
-scikit-image >= 0.13, a few examples require pandas >= 0.18.0.
+scikit-image >= 0.13, a few examples require pandas >= 0.21.0.

 User installation
 ~~~~~~~~~~~~~~~~~
doc/install.rst (1 addition, 1 deletion)

@@ -134,7 +134,7 @@ it as ``scikit-learn[alldeps]``.
 Scikit-learn plotting capabilities (i.e., functions start with "plot\_"
 and classes end with "Display") require Matplotlib (>= 2.1.1). For running the
 examples Matplotlib >= 2.1.1 is required. A few examples require
-scikit-image >= 0.13, a few examples require pandas >= 0.18.0.
+scikit-image >= 0.13, a few examples require pandas >= 0.21.0.

 .. warning::
doc/modules/ensemble.rst (51 additions, 8 deletions)

@@ -109,10 +109,10 @@ set of classifiers is created by introducing randomness in the classifier
 construction. The prediction of the ensemble is given as the averaged
 prediction of the individual classifiers.

-As other classifiers, forest classifiers have to be fitted with two
-arrays: a sparse or dense array X of size ``[n_samples, n_features]`` holding the
-training samples, and an array Y of size ``[n_samples]`` holding the
-target values (class labels) for the training samples::
+As other classifiers, forest classifiers have to be fitted with two arrays: a
+sparse or dense array X of size ``[n_samples, n_features]`` holding the
+training samples, and an array Y of size ``[n_samples]`` holding the target
+values (class labels) for the training samples::

 >>> from sklearn.ensemble import RandomForestClassifier
 >>> X = [[0, 0], [1, 1]]

@@ -200,6 +200,9 @@ in bias::
Parameters
----------

Impactful parameters
....................

[Review comment, Member] maybe sth like "main parameters"? I think we shouldn't imply the other parameters are not "impactful".

The main parameters to adjust when using these methods are ``n_estimators`` and
``max_features``. The former is the number of trees in the forest. The larger
the better, but also the longer it will take to compute. In addition, note that
@@ -223,10 +226,50 @@ or out-of-bag samples. This can be enabled by setting ``oob_score=True``.
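A quick illustration of the out-of-bag estimate mentioned in the hunk above; this is a minimal sketch using the released scikit-learn API, not code from this PR:

>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y = make_classification(n_samples=500, random_state=0)
>>> clf = RandomForestClassifier(n_estimators=100, oob_score=True,
...                              random_state=0).fit(X, y)
>>> oob = clf.oob_score_  # accuracy estimated on the out-of-bag samples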

.. note::

-The size of the model with the default parameters is :math:`O( M * N * log (N) )`,
-where :math:`M` is the number of trees and :math:`N` is the number of samples.
-In order to reduce the size of the model, you can change these parameters:
-``min_samples_split``, ``max_leaf_nodes``, ``max_depth`` and ``min_samples_leaf``.
+The size of the model with the default parameters is :math:`O( M * N * log
+(N) )`, where :math:`M` is the number of trees and :math:`N` is the number
+of samples. In order to reduce the size of the model, you can change these
+parameters: ``min_samples_split``, ``max_leaf_nodes``, ``max_depth`` and
+``min_samples_leaf``.
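For instance, a minimal sketch (standard scikit-learn parameters, independent of this PR) showing that constraining tree growth shrinks the forest:

>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y = make_classification(n_samples=1000, random_state=0)
>>> big = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
>>> small = RandomForestClassifier(n_estimators=100, max_depth=5,
...                                min_samples_leaf=10, random_state=0).fit(X, y)
>>> n_big = sum(tree.tree_.node_count for tree in big.estimators_)
>>> n_small = sum(tree.tree_.node_count for tree in small.estimators_)
>>> n_small < n_big  # the constrained forest has far fewer nodes overall
True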

.. _balanced_bootstrap:

Learning from imbalanced datasets
.................................

In some datasets, the number of samples per class might vary tremendously
(e.g. 100 samples in a "majority" class for a single sample in a "minority"
class). Learning from such imbalanced datasets is challenging. The tree
criteria (i.e. gini or entropy) are sensitive to class imbalance and will
naturally favor the classes with the most samples given during ``fit``.

The :class:`RandomForestClassifier` provides a parameter `class_weight` with
the option `"balanced_bootstrap"` to alleviate the bias induced by the class
imbalance. This strategy will create a bootstrap subsample for the "minority"
class and draw with replacement the same number of training instances from the
other classes. Each tree of the ensemble is then fitted on its own balanced
subsample, as proposed in [CLB2004]_. This algorithm is also called balanced
random forest.
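A minimal sketch of how this would be used, assuming the `class_weight="balanced_bootstrap"` option proposed in this PR (it is not available in released scikit-learn; the toy imbalanced dataset below is only for illustration):

>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
...                            random_state=0)
>>> clf = RandomForestClassifier(n_estimators=100,
...                              class_weight="balanced_bootstrap",  # proposed option
...                              random_state=0)
>>> clf = clf.fit(X, y)  # each tree is fitted on a class-balanced bootstrap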

`class_weight="balanced"` and `class_weight="balanced_subsample"` provide
alternative balancing strategies which are not as efficient in case of large
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
alternative balancing strategies which are not as efficient in case of large
alternative balancing strategies which are not as efficient as `class_weight="balanced_bootstrap"` in case of large

It was hard to parse the first time I read the sentence, this may help

difference between the class frequencies.
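For comparison, a short sketch of the existing `class_weight="balanced_subsample"` option, which reweights classes on each bootstrap instead of resampling (standard scikit-learn API):

>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
...                            random_state=0)
>>> clf = RandomForestClassifier(n_estimators=100,
...                              class_weight="balanced_subsample",
...                              random_state=0).fit(X, y)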

.. note::
Be aware that `sample_weight` will be taken into account when setting
`class_weight="balanced_bootstrap"`. Thus, balancing the dataset manually
using `sample_weight` and using `class_weight="balanced_bootstrap"` at the
same time is not recommended.

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_plot_learn_from_imbalanced_dataset.py`

.. topic:: References

.. [CLB2004] C. Chen, A. Liaw, and L. Breiman, "Using random forest to learn
imbalanced data." University of California, Berkeley
110.1-12, 24, 2004.

Parallelization
---------------
doc/themes/scikit-learn-modern/static/css/theme.css (38 additions)

@@ -963,6 +963,44 @@ div.sphx-glr-thumbcontainer {
}
}

/* Pandas dataframe css */
/* Taken from: https://github.com/spatialaudio/nbsphinx/blob/fb3ba670fc1ba5f54d4c487573dbc1b4ecf7e9ff/src/nbsphinx.py#L587-L619 */
/* FIXME: to be removed when sphinx-gallery >= 5.0 is released */

table.dataframe {
border: none !important;
border-collapse: collapse;
border-spacing: 0;
border-color: transparent;
color: black;
font-size: 12px;
table-layout: fixed;
}
table.dataframe thead {
border-bottom: 1px solid black;
vertical-align: bottom;
}
table.dataframe tr,
table.dataframe th,
table.dataframe td {
text-align: right;
vertical-align: middle;
padding: 0.5em 0.5em;
line-height: normal;
white-space: normal;
max-width: none;
border: none;
}
table.dataframe th {
font-weight: bold;
}
table.dataframe tbody tr:nth-child(odd) {
background: #f5f5f5;
}
table.dataframe tbody tr:hover {
background: rgba(66, 165, 245, 0.2);
}

/* rellinks */

.sk-btn-rellink {
doc/whats_new/v0.23.rst (15 additions)

@@ -48,9 +48,24 @@ Changelog
:mod:`sklearn.cluster`
......................

- |Fix| example fix in model XXX. :pr:`xxxx` or :issue:`xxxx` by
:user:`name <user id>`


[Review comment, Member, on lines +51 to +54] I think this is some merge conflict resolution mishap.

:mod:`sklearn.ensemble`
.......................

- |Enhancement| :class:`cluster.AgglomerativeClustering` has a faster and more
memory efficient implementation of single linkage clustering.
:pr:`11514` by :user:`Leland McInnes <lmcinnes>`.

- |Efficiency| Add the option `class_weight="balanced_bootstrap"` in
:class:`ensemble.RandomForestClassifier`. This option will ensure that each
tree is trained on a subsample with an equal number of instances from each
class. This algorithm is known as balanced random forest.
:pr:`13227` by :user:`Eric Potash <potash>`, :user:`Christos Aridas <chkoar>`
and :user:`Guillaume Lemaitre <glemaitre>`.

- |Fix| :class:`cluster.KMeans` with ``algorithm="elkan"`` now converges with
``tol=0`` as with the default ``algorithm="full"``. :pr:`16075` by
:user:`Erich Schubert <kno10>`.