Description
We receive a lot of feature requests to add new metrics for classification and regression tasks, each shedding light on different aspects of supervised learning, e.g. (incomplete list):
Issues
- Fmax score (or maximum of F1/Fbeta) #26026
- add RMSLE to `sklearn.metrics.SCORERS.keys()` #21686
- Mean Standardized Log Loss (MSLL) for uncertainty aware regression models #21665
- Precision @ Recall K || Recall @ Precision K #20266
- Metrics for prediction intervals #20162
- Add more D2 scores #20943
- Feature Request: function to calculate Expected Calibration Error (ECE) #18268
- Feature Request: bias regression metric #17854
- Should sklearn include the Equal Error Rate metric? #15247
- Regression metrics - which strategy ? #13482
- Add sklearn.metrics.cumulative_gain_curve and sklearn.metrics.lift_curve #10003
- Adding Fall-out, Miss rate, specificity as metrics #5516
PRs
Scikit-learn estimators for regression and classification give, with rare exceptions, point forecasts, i.e. a single number as the prediction for a single row of input data. Point forecasts usually aim at predicting a certain property/functional of the predictive distribution, such as the expectation `E[Y|X]` or the median `Median(Y|X)`. Such point forecasts are well understood and there is good theory on how to validate them, see [1].
Examples: `LinearRegression`, `Ridge`, `HistGradientBoostingRegressor(loss='squared_error')` and classifiers with a `predict_proba` like `LogisticRegression` estimate the expectation.
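For illustration only, a minimal sketch (assuming scikit-learn >= 1.1, which added `loss="quantile"` to `HistGradientBoostingRegressor`; data is made up) of how the loss chosen at fit time decides which functional of `Y|X` the point forecast targets:

```python
# Sketch: the fit-time loss determines the targeted functional of Y|X.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 1))
# Skewed noise so that the conditional mean and median differ.
y = X[:, 0] + rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# loss="squared_error" targets the expectation E[Y|X].
mean_model = HistGradientBoostingRegressor(loss="squared_error").fit(X, y)
# loss="quantile" with quantile=0.5 targets the median Median(Y|X).
median_model = HistGradientBoostingRegressor(loss="quantile", quantile=0.5).fit(X, y)

X_test = np.array([[0.5]])
print(mean_model.predict(X_test), median_model.predict(X_test))
```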
There are 2 main aspects of model validation/selection:
- Calibration: How well does a model take into account the available information in the form of features? Is there (conditional) bias? This can be assessed with identification functions, which in the case of the expectation amounts to checking for bias, `sum(y_predicted - y_observed)`, see [2, 4].
- Scoring: How good is the predictive performance of a model A compared to a model B? The aim is often to select the better model. This can be assessed by (consistent) scoring functions, like the (mean) squared error for the expectation, `sum((y_predicted - y_observed)^2)`, see [1, 4]. A rough sketch of both aspects follows below.
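For concreteness, a rough sketch of the two aspects using only existing NumPy/scikit-learn functions; all numbers are made up:

```python
# Calibration check (identification function) vs. scoring (consistent scoring function).
import numpy as np
from sklearn.metrics import mean_squared_error

y_observed = np.array([1.0, 2.0, 3.0, 4.0])
y_predicted_a = np.array([1.1, 2.1, 3.1, 4.1])  # predictions of model A
y_predicted_b = np.array([0.8, 2.3, 2.9, 4.0])  # predictions of model B

# Calibration for the expectation: check the bias.
# A well calibrated model has bias close to 0; the sign shows over-/undershooting.
bias_a = np.mean(y_predicted_a - y_observed)
bias_b = np.mean(y_predicted_b - y_observed)

# Scoring for the expectation: (mean) squared error.
# Smaller is better, suitable for comparing model A against model B.
score_a = mean_squared_error(y_observed, y_predicted_a)
score_b = mean_squared_error(y_observed, y_predicted_b)
print(bias_a, bias_b, score_a, score_b)
```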
A. How to structure `metrics` in a principled way to make it clear which metric assesses what?
- How to make the distinction between calibration and scoring more explicit?
- How to make it clear which functional is assessed, e.g. expectation or a quantile, or 2 different quantiles?
- What is in-scope, what is out-of-scope?
Note that (consistent) scoring functions assess calibration and resolution at the same time, see [3, 4]. They are a good fit for cross validation.
For example, #20162 proposes scoring functions and calibration functions for 2 quantiles, i.e. the prediction of 2 estimators at the same time (a 2-point forecast).
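As a concrete illustration (not taken from that proposal itself), the existing `mean_pinball_loss` (scikit-learn >= 1.0) is a consistent scoring function for a single quantile, so a 2-quantile forecast can at least be scored quantile by quantile today; the numbers below are made up:

```python
# Scoring a 2-quantile forecast (e.g. a 5%-95% interval) with the pinball loss.
import numpy as np
from sklearn.metrics import mean_pinball_loss

y_observed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_q05 = np.array([0.5, 1.4, 2.2, 3.1, 4.0])  # predicted 5% quantile
y_pred_q95 = np.array([1.6, 2.7, 3.9, 5.0, 6.1])  # predicted 95% quantile

score_lower = mean_pinball_loss(y_observed, y_pred_q05, alpha=0.05)
score_upper = mean_pinball_loss(y_observed, y_pred_q95, alpha=0.95)
print(score_lower, score_upper)
```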
B. How to assess calibration?
Currently, we have `calibration_curve` to assess (auto-)calibration for classifiers. It would be desirable to
- also assess regression tasks;
- also look at conditional calibration, i.e. aggregates grouped by some feature (either bins or categorical levels), as in the sketch after this list;
- handle different functionals, e.g. the expectation and quantiles.
- What is in-scope, what is out-of-scope?
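As a hypothetical sketch of the conditional-calibration point, this is roughly what grouping the bias of an expectation forecast by a binned feature could look like today with pandas; none of this is an existing sklearn API and the data is made up:

```python
# Conditional calibration for a regression task: per-bin bias should be ~0.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
feature = rng.uniform(0, 10, size=1000)
y_observed = rng.normal(loc=feature, scale=1.0)
y_predicted = feature + 0.1  # stand-in for a model's predictions

# Group the bias (identification function for the expectation) by a binned feature.
feature_bin = pd.cut(feature, bins=5)
per_bin_bias = pd.Series(y_predicted - y_observed).groupby(feature_bin, observed=True).mean()
print(per_bin_bias)
```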
Note that calibration scores, by nature, often do not comply with "larger is better" or "smaller is better" and are thus not suitable for cross-validation/grid search (for hyper-parameter tuning). An example is the bias proposed in #17854, also mentioned above, which is sign-sensitive (a larger absolute value means worse, and the sign indicates under- or overshooting of the prediction).
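A tiny, made-up numeric illustration of why neither direction of optimization selects the best model for such a sign-sensitive bias:

```python
# Neither "greater is better" nor "smaller is better" picks the model with
# bias closest to 0, which is the best calibrated one.
biases = {"model_a": -0.5, "model_b": 0.02, "model_c": 0.4}

print(max(biases, key=biases.get))                 # model_c (greater is better)
print(min(biases, key=biases.get))                 # model_a (smaller is better)
print(min(biases, key=lambda k: abs(biases[k])))   # model_b (closest to 0)
```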
Further examples: #11096 is for (auto-) calibration of classification only, #20162 proposes (among others) a metric for the calibration of 2 quantiles.
References
[1] Gneiting (2009) https://arxiv.org/abs/0912.0902
[2] Nolde & Ziegel (2016) https://arxiv.org/abs/1608.05498
[3] Pohle (2020) https://arxiv.org/abs/2005.01835
[4] Fissler, Lorentzen & Mayer (2022) https://arxiv.org/abs/2202.12780