@@ -20,10 +20,24 @@ prediction.
Well calibrated classifiers are probabilistic classifiers for which the output
of the :term:`predict_proba` method can be directly interpreted as a confidence
level.
- For instance, a well calibrated (binary) classifier should classify the samples
- such that among the samples to which it gave a :term:`predict_proba` value
- close to 0.8,
- approximately 80% actually belong to the positive class.
+ For instance, a well calibrated (binary) classifier should classify the samples such
+ that among the samples to which it gave a :term:`predict_proba` value close to, say,
+ 0.8, approximately 80% actually belong to the positive class.
+
+ Before we show how to re-calibrate a classifier, we first need a way to detect how
+ well a classifier is calibrated.
+
+ .. note::
+     Strictly proper scoring rules for probabilistic predictions like
+     :func:`sklearn.metrics.brier_score_loss` and
+     :func:`sklearn.metrics.log_loss` assess calibration (reliability) and
+     discriminative power (resolution) of a model, as well as the randomness of the data
+     (uncertainty) at the same time. This follows from the well-known Brier score
+     decomposition of Murphy [1]_. As it is not clear which term dominates, the score is
+     of limited use for assessing calibration alone (unless one computes each term of
+     the decomposition). A lower Brier loss, for instance, does not necessarily
+     mean a better calibrated model; it could also mean a worse calibrated model with much
+     more discriminatory power, e.g. using many more features.
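+
+ As a rough, illustrative sketch (the synthetic dataset and the Gaussian naive Bayes
+ model below are assumptions made for this example, not recommendations of this guide),
+ one can inspect calibration directly with :func:`sklearn.calibration.calibration_curve`
+ (introduced in the next section) in addition to the Brier score::
+
+     import numpy as np
+     from sklearn.calibration import calibration_curve
+     from sklearn.datasets import make_classification
+     from sklearn.metrics import brier_score_loss
+     from sklearn.model_selection import train_test_split
+     from sklearn.naive_bayes import GaussianNB
+
+     X, y = make_classification(n_samples=10_000, random_state=0)
+     X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
+
+     proba = GaussianNB().fit(X_train, y_train).predict_proba(X_test)[:, 1]
+
+     # Single number mixing calibration, resolution and uncertainty.
+     print(brier_score_loss(y_test, proba))
+
+     # Reliability curve: observed frequency of the positive class per bin
+     # versus the mean predicted probability in that bin.
+     prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)
+     print(np.abs(prob_true - prob_pred))  # per-bin calibration gap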

.. _calibration_curve:
@@ -33,7 +47,7 @@ Calibration curves
Calibration curves, also referred to as *reliability diagrams* (Wilks 1995 [2]_),
compare how well the probabilistic predictions of a binary classifier are calibrated.
It plots the frequency of the positive label (to be more precise, an estimation of the
- *conditional event probability* :math:`P(Y=1|\text{predict\_proba})`) on the y-axis
+ *conditional event probability* :math:`P(Y=1|\text{predict_proba})`) on the y-axis
against the predicted probability :term:`predict_proba` of a model on the x-axis.
The tricky part is to get values for the y-axis.
In scikit-learn, this is accomplished by binning the predictions such that the x-axis
@@ -62,7 +76,7 @@ by showing the number of samples in each predicted probability bin.

:class:`LogisticRegression` returns well calibrated predictions by default as it has a
canonical link function for its loss, i.e. the logit-link for the :ref:`log_loss`.
- This leads to the so-called **balance property**, see [7]_ and
+ This leads to the so-called **balance property**, see [8]_ and
:ref:`Logistic_regression`.
In contrast to that, the other shown models return biased probabilities; with
different biases per model.
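
As a hedged illustration of the balance property (the synthetic data below is an
assumption for this sketch): for an unpenalized :class:`LogisticRegression` fit with an
intercept, the average of :term:`predict_proba` over the training data matches the
observed frequency of the positive class::

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=5_000, random_state=0)

    # penalty=None requires scikit-learn >= 1.2; the balance property holds
    # exactly only for the unpenalized fit with an intercept.
    clf = LogisticRegression(penalty=None).fit(X, y)

    print(clf.predict_proba(X)[:, 1].mean())  # average predicted probability
    print(y.mean())  # empirical positive rate, approximately equal up to solver tolerance
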
@@ -79,7 +93,7 @@ case in this dataset which contains 2 redundant features.
:class:`RandomForestClassifier` shows the opposite behavior: the histograms
show peaks at probabilities approximately 0.2 and 0.9, while probabilities
close to 0 or 1 are very rare. An explanation for this is given by
- Niculescu-Mizil and Caruana [1]_: "Methods such as bagging and random
+ Niculescu-Mizil and Caruana [3]_: "Methods such as bagging and random
forests that average predictions from a base set of models can have
difficulty making predictions near 0 and 1 because variance in the
underlying base models will bias predictions that should be near zero or one
@@ -99,7 +113,7 @@ to 0 or 1 typically.
.. currentmodule:: sklearn.svm

:class:`LinearSVC` (SVC) shows an even more sigmoid curve than the random forest, which
- is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [1]_), which
+ is typical for maximum-margin methods (compare Niculescu-Mizil and Caruana [3]_), which
focus on difficult to classify samples that are close to the decision boundary (the
support vectors).

@@ -167,29 +181,18 @@ fit the regressor. It is up to the user to
make sure that the data used for fitting the classifier is disjoint from the
data used for fitting the regressor.

- :func:`sklearn.metrics.brier_score_loss` may be used to assess how
- well a classifier is calibrated. However, this metric should be used with care
- because a lower Brier score does not always mean a better calibrated model.
- This is because the Brier score metric is a combination of calibration loss
- and refinement loss. Calibration loss is defined as the mean squared deviation
- from empirical probabilities derived from the slope of ROC segments.
- Refinement loss can be defined as the expected optimal loss as measured by the
- area under the optimal cost curve. As refinement loss can change
- independently from calibration loss, a lower Brier score does not necessarily
- mean a better calibrated model.
-
- :class:`CalibratedClassifierCV` supports the use of two 'calibration'
- regressors: 'sigmoid' and 'isotonic'.
+ :class:`CalibratedClassifierCV` supports the use of two regression techniques
+ for calibration via the `method` parameter: `"sigmoid"` and `"isotonic"`.
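+
+ A minimal usage sketch (the wrapped estimator and the synthetic data are illustrative
+ assumptions, not prescriptions of this guide)::
+
+     from sklearn.calibration import CalibratedClassifierCV
+     from sklearn.datasets import make_classification
+     from sklearn.model_selection import train_test_split
+     from sklearn.svm import LinearSVC
+
+     X, y = make_classification(n_samples=2_000, random_state=0)
+     X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
+
+     # Internal cross-validation keeps the data used to fit the calibration
+     # regressor disjoint from the data used to fit the classifier.
+     calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
+     calibrated.fit(X_train, y_train)
+     proba = calibrated.predict_proba(X_test)[:, 1]  # calibrated probabilities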
.. _sigmoid_regressor:

Sigmoid
^^^^^^^

- The sigmoid regressor is based on Platt's logistic model [3]_:
+ The sigmoid regressor, `method="sigmoid"`, is based on Platt's logistic model [4]_:

.. math::
-     p(y_i = 1 | f_i) = \frac{1}{1 + \exp(A f_i + B)}
+     p(y_i = 1 | f_i) = \frac{1}{1 + \exp(A f_i + B)} \,,

where :math:`y_i` is the true label of sample :math:`i` and :math:`f_i`
is the output of the un-calibrated classifier for sample :math:`i`. :math:`A`
@@ -200,10 +203,10 @@ The sigmoid method assumes the :ref:`calibration curve <calibration_curve>`
can be corrected by applying a sigmoid function to the raw predictions. This
assumption has been empirically justified in the case of :ref:`svm` with
common kernel functions on various benchmark datasets in section 2.1 of Platt
- 1999 [3]_ but does not necessarily hold in general. Additionally, the
+ 1999 [4]_ but does not necessarily hold in general. Additionally, the
logistic model works best if the calibration error is symmetrical, meaning
the classifier output for each binary class is normally distributed with
- the same variance [6]_. This can be a problem for highly imbalanced
+ the same variance [7]_. This can be a problem for highly imbalanced
classification problems, where outputs do not have equal variance.

In general this method is most effective for small sample sizes or when the
@@ -213,7 +216,7 @@ high and low outputs.
Isotonic
^^^^^^^^

- The 'isotonic' method fits a non-parametric isotonic regressor, which outputs
+ The `method="isotonic"` option fits a non-parametric isotonic regressor, which outputs
a step-wise non-decreasing function, see :mod:`sklearn.isotonic`. It minimizes:

.. math::
@@ -226,10 +229,20 @@ calibrated classifier for sample :math:`i` (i.e., the calibrated probability).
This method is more general when compared to 'sigmoid' as the only restriction
is that the mapping function is monotonically increasing. It is thus more
powerful as it can correct any monotonic distortion of the un-calibrated model.
- However, it is more prone to overfitting, especially on small datasets [5]_.
+ However, it is more prone to overfitting, especially on small datasets [6]_.

Overall, 'isotonic' will perform as well as or better than 'sigmoid' when
- there is enough data (greater than ~ 1000 samples) to avoid overfitting [1]_.
+ there is enough data (greater than ~ 1000 samples) to avoid overfitting [3]_.
+
+ .. note:: Impact on ranking metrics like AUC
+
+     It is generally expected that calibration does not affect ranking metrics such as
+     ROC-AUC. However, these metrics might differ after calibration when using
+     `method="isotonic"` since isotonic regression introduces ties in the predicted
+     probabilities. This can be seen as within the uncertainty of the model predictions.
+     If you strictly want to keep the ranking and thus the AUC scores, use
+     `method="sigmoid"`, which is a strictly monotonic transformation and thus keeps
+     the ranking.
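+
+ A self-contained sketch of the step-wise mapping behind this (the scores and labels
+ below are made up purely for illustration)::
+
+     import numpy as np
+     from sklearn.isotonic import IsotonicRegression
+
+     scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.65, 0.8, 0.9])
+     labels = np.array([0, 0, 1, 0, 1, 1, 1, 1])
+
+     iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
+     calibrated = iso.fit_transform(scores, labels)
+
+     # The fitted mapping is a non-decreasing step function, so several distinct
+     # scores can receive the same calibrated probability (ties), which is what
+     # may slightly change ranking metrics such as ROC-AUC.
+     print(calibrated)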

Multiclass support
^^^^^^^^^^^^^^^^^^
@@ -239,7 +252,7 @@ support 1-dimensional data (e.g., binary classification output) but are
extended for multiclass classification if the `base_estimator` supports
multiclass predictions. For multiclass predictions,
:class:`CalibratedClassifierCV` calibrates for
- each class separately in a :ref:`ovr_classification` fashion [4]_. When
+ each class separately in a :ref:`ovr_classification` fashion [5]_. When
predicting
probabilities, the calibrated probabilities for each class
are predicted separately. As those probabilities do not necessarily sum to
@@ -254,36 +267,42 @@ one, a postprocessing is performed to normalize them.

.. topic:: References:

- .. [1] `Predicting Good Probabilities with Supervised Learning
-    <https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf>`_,
-    A. Niculescu-Mizil & R. Caruana, ICML 2005
+ .. [1] Allan H. Murphy (1973).
+    :doi:`"A New Vector Partition of the Probability Score"
+    <10.1175/1520-0450(1973)012%3C0595:ANVPOT%3E2.0.CO;2>`
+    Journal of Applied Meteorology and Climatology

.. [2] `On the combination of forecast probabilities for
   consecutive precipitation periods.
   <https://journals.ametsoc.org/waf/article/5/4/640/40179>`_
   Wea. Forecasting, 5, 640–650, Wilks, D. S., 1990a

- .. [3] `Probabilistic Outputs for Support Vector Machines and Comparisons
+ .. [3] `Predicting Good Probabilities with Supervised Learning
+    <https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf>`_,
+    A. Niculescu-Mizil & R. Caruana, ICML 2005
+
+ .. [4] `Probabilistic Outputs for Support Vector Machines and Comparisons
   to Regularized Likelihood Methods.
   <https://www.cs.colorado.edu/~mozer/Teaching/syllabi/6622/papers/Platt1999.pdf>`_
   J. Platt, (1999)

- .. [4] `Transforming Classifier Scores into Accurate Multiclass
+ .. [5] `Transforming Classifier Scores into Accurate Multiclass
   Probability Estimates.
   <https://dl.acm.org/doi/pdf/10.1145/775047.775151>`_
   B. Zadrozny & C. Elkan, (KDD 2002)

- .. [5] `Predicting accurate probabilities with a ranking loss.
+ .. [6] `Predicting accurate probabilities with a ranking loss.
   <https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4180410/>`_
   Menon AK, Jiang XJ, Vembu S, Elkan C, Ohno-Machado L.
   Proc Int Conf Mach Learn. 2012;2012:703-710

- .. [6] `Beyond sigmoids: How to obtain well-calibrated probabilities from
+ .. [7] `Beyond sigmoids: How to obtain well-calibrated probabilities from
   binary classifiers with beta calibration
   <https://projecteuclid.org/euclid.ejs/1513306867>`_
   Kull, M., Silva Filho, T. M., & Flach, P. (2017).

- .. [7] Mario V. Wüthrich, Michael Merz (2023).
+ .. [8] Mario V. Wüthrich, Michael Merz (2023).
   :doi:`"Statistical Foundations of Actuarial Learning and its Applications"
   <10.1007/978-3-031-12409-9>`
   Springer Actuarial