Balanced Random Forest #13227
@@ -109,10 +109,10 @@ set of classifiers is created by introducing randomness in the classifier
construction. The prediction of the ensemble is given as the averaged
prediction of the individual classifiers.

As other classifiers, forest classifiers have to be fitted with two arrays: a
sparse or dense array X of size ``[n_samples, n_features]`` holding the
training samples, and an array Y of size ``[n_samples]`` holding the target
values (class labels) for the training samples::

    >>> from sklearn.ensemble import RandomForestClassifier
    >>> X = [[0, 0], [1, 1]]
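The doctest above is cut off by the diff view. As a self-contained sketch of the same fit/predict cycle (illustrative toy data of our own, not the snippet from the docs; assumes scikit-learn is installed):

```python
# Minimal sketch of fitting a forest classifier; the two well-separated
# blobs make the predictions on nearby points unambiguous.
from sklearn.ensemble import RandomForestClassifier

X = [[i, i] for i in range(10)] + [[i + 100, i + 100] for i in range(10)]
Y = [0] * 10 + [1] * 10   # class labels, one per training sample

clf = RandomForestClassifier(n_estimators=25, random_state=0)
clf.fit(X, Y)
print(clf.predict([[2, 2], [105, 105]]))  # -> [0 1]
```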
@@ -200,6 +200,9 @@ in bias::

Parameters
----------

Main parameters
...............

The main parameters to adjust when using these methods are ``n_estimators`` and
``max_features``. The former is the number of trees in the forest. The larger
the better, but also the longer it will take to compute. In addition, note that
@@ -223,10 +226,50 @@ or out-of-bag samples. This can be enabled by setting ``oob_score=True``.

.. note::

    The size of the model with the default parameters is :math:`O( M * N * log
    (N) )`, where :math:`M` is the number of trees and :math:`N` is the number
    of samples. In order to reduce the size of the model, you can change these
    parameters: ``min_samples_split``, ``max_leaf_nodes``, ``max_depth`` and
    ``min_samples_leaf``.
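As an illustrative sketch (not part of the diff) of how the parameters in the note above shrink the serialized model, assuming scikit-learn and NumPy are installed:

```python
# Compare the pickled size of a forest grown to full depth against one
# constrained with max_depth and min_samples_leaf.
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(2000, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# default parameters: trees are grown until the leaves are pure
full = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
# constrained trees: far fewer nodes per tree
small = RandomForestClassifier(
    n_estimators=20, max_depth=4, min_samples_leaf=20, random_state=0
).fit(X, y)

full_bytes = len(pickle.dumps(full))
small_bytes = len(pickle.dumps(small))
print(small_bytes < full_bytes)  # True: the constrained forest is smaller
```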
.. _balanced_bootstrap:
Learning from datasets with imbalanced classes
..............................................

In some datasets, the number of samples per class might vary tremendously
(e.g. 100 samples in a "majority" class for a single sample in a "minority"
class). Learning from such imbalanced datasets is challenging. The tree
splitting criteria (i.e. gini or entropy) are sensitive to class imbalance and
will naturally favor the classes with the most samples given during ``fit``.

The :class:`RandomForestClassifier` provides a parameter `class_weight` with
the option `"balanced_bootstrap"` to alleviate the bias induced by the class
imbalance. This strategy will create a bootstrap subsample for the "minority"
class and draw with replacement the same number of training instances from the
other classes. A balanced subsample is drawn for each tree of the ensemble, as
proposed in [CLB2004]_. This algorithm is also called balanced random forest.
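The balanced-bootstrap idea described above can be sketched in pure Python; this is an illustration of the sampling strategy, not scikit-learn's internal implementation:

```python
# Draw, with replacement, as many samples from every class as the minority
# class has, so each class contributes equally to the subsample.
import random
from collections import Counter

def balanced_bootstrap_indices(y, rng):
    """Indices of a bootstrap sample with equal counts per class."""
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_minority = min(len(idx) for idx in by_class.values())
    sample = []
    for idx in by_class.values():
        sample.extend(rng.choices(idx, k=n_minority))  # with replacement
    return sample

y = [0] * 100 + [1] * 5          # 100 "majority" vs 5 "minority" samples
sample = balanced_bootstrap_indices(y, random.Random(42))
print(Counter(y[i] for i in sample))  # each class contributes 5 draws
```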
`class_weight="balanced"` and `class_weight="balanced_subsample"` provide
alternative balancing strategies, which are less effective when the difference
between the class frequencies is large.
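For reference, the `class_weight="balanced"` alternative reweights classes as `n_samples / (n_classes * class_count)`, the heuristic documented by scikit-learn; a pure-Python sketch of that formula:

```python
# Rare classes receive proportionally larger weights under the "balanced"
# heuristic, instead of being resampled as with "balanced_bootstrap".
from collections import Counter

def balanced_class_weights(y):
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights[1])  # 5.0: the rare class gets a 9x larger weight than class 0
```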
.. note::

    Be aware that `sample_weight` will be taken into account when setting
    `class_weight="balanced_bootstrap"`. Thus, it is recommended not to
    manually balance the dataset using `sample_weight` while also using
    `class_weight="balanced_bootstrap"`.
.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_plot_learn_from_imbalanced_dataset.py`

.. topic:: References

    .. [CLB2004] C. Chen, A. Liaw, and L. Breiman, "Using random forest to
       learn imbalanced data." University of California, Berkeley 110.1-12,
       24, 2004.

Parallelization
---------------
@@ -48,9 +48,24 @@ Changelog

:mod:`sklearn.cluster`
......................

- |Fix| example fix in model XXX. :pr:`xxxx` or :issue:`xxxx` by
  :user:`name <user id>`
:mod:`sklearn.ensemble`
.......................

- |Enhancement| :class:`cluster.AgglomerativeClustering` has a faster and more
  memory efficient implementation of single linkage clustering.
  :pr:`11514` by :user:`Leland McInnes <lmcinnes>`.
- |Efficiency| add the option `class_weight="balanced_bootstrap"` in
  :class:`ensemble.RandomForestClassifier`. This option ensures that each
  tree is trained on a subsample with an equal number of instances from each
  class. This algorithm is known as balanced random forest.
  :pr:`13227` by :user:`Eric Potash <potash>`, :user:`Christos Aridas <chkoar>`
  and :user:`Guillaume Lemaitre <glemaitre>`.
- |Fix| :class:`cluster.KMeans` with ``algorithm="elkan"`` now converges with
  ``tol=0`` as with the default ``algorithm="full"``. :pr:`16075` by
  :user:`Erich Schubert <kno10>`.