RFC Principled metrics for scoring and calibration of supervised learning #21718

@lorentzenchr

We receive a lot of feature requests to add new metrics for classification and regression tasks, each shedding light on different aspects of supervised learning, e.g. (incomplete list)

Scikit-learn estimators for regression and classification give, with rare exceptions, point forecasts, i.e. a single number as the prediction for a single row of input data. Point forecasts usually aim at predicting a certain property/functional of the predictive distribution, such as the expectation E[Y|X] or the median Median(Y|X). Such point forecasts are well understood and there is good theory on how to validate them, see [1].

Examples: LinearRegression, Ridge, HistGradientBoostingRegressor(loss='squared_error'), and classifiers with predict_proba such as LogisticRegression estimate the expectation.
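
As a minimal sketch (assuming scikit-learn >= 1.1, which provides loss='quantile'), the same estimator class can target different functionals depending on its loss; the dataset here is only illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=1_000, n_features=5, noise=10.0, random_state=0)

# Targets the conditional expectation E[Y|X].
mean_model = HistGradientBoostingRegressor(loss="squared_error").fit(X, y)

# Targets the conditional median Median(Y|X), i.e. the 0.5-quantile.
median_model = HistGradientBoostingRegressor(loss="quantile", quantile=0.5).fit(X, y)

print(mean_model.predict(X[:3]))
print(median_model.predict(X[:3]))
```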

There are 2 main aspects of model validation/selection:

  1. Calibration: How well does a model take into account the available information in the form of features? Is there (conditional) bias?
    This can be assessed with identification functions, which in the case of the expectation amounts to checking the bias sum(y_predicted - y_observed), see [2, 4].
  2. Scoring: How good is the predictive performance of model A compared to model B? The aim is often to select the better model. This can be assessed with (consistent) scoring functions, such as the (mean) squared error for the expectation, sum((y_predicted - y_observed)^2), see [1, 4]. A minimal sketch of both aspects follows this list.
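
The sketch below illustrates the two aspects for the expectation functional with plain NumPy; the function names are made up for illustration and are not an existing scikit-learn API:

```python
import numpy as np

def mean_bias(y_observed, y_predicted):
    # Identification function for the expectation: mean of (prediction - observation).
    # Values near 0 indicate calibration; the sign tells whether the model
    # over- or under-predicts on average.
    return np.mean(np.asarray(y_predicted) - np.asarray(y_observed))

def squared_error_score(y_observed, y_predicted):
    # Consistent scoring function for the expectation: mean squared error.
    # Smaller is better; suitable for model comparison and selection.
    return np.mean((np.asarray(y_predicted) - np.asarray(y_observed)) ** 2)

y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 1.9, 3.3, 4.1])
print(mean_bias(y_obs, y_pred))            # calibration: ~0 means no bias
print(squared_error_score(y_obs, y_pred))  # scoring: smaller is better
```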

A. How to structure metrics in a principled way to make it clear which metric assesses what?

  1. How to make the distinction between calibration and scoring more explicit?
  2. How to make it clear which functional is assessed, e.g. expectation or a quantile, or 2 different quantiles?
  3. What is in-scope, what is out-of-scope?

Note that (consistent) scoring functions assess calibration and resolution at the same time, see [3, 4]. They are a good fit for cross-validation.
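
For example, the pinball loss, a consistent scoring function for quantiles, can already be plugged into cross-validation with existing scikit-learn pieces (assuming scikit-learn >= 1.1; the dataset and model are only illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.metrics import make_scorer, mean_pinball_loss
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1_000, n_features=5, noise=10.0, random_state=0)

# Estimator and scorer both target the 0.95-quantile.
model = HistGradientBoostingRegressor(loss="quantile", quantile=0.95)
scorer = make_scorer(mean_pinball_loss, alpha=0.95, greater_is_better=False)

# Scores are negated losses (greater_is_better=False), so closer to 0 is better.
print(cross_val_score(model, X, y, cv=5, scoring=scorer))
```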

For example, #20162 proposes scoring functions and calibration functions for 2 quantiles, i.e. the prediction of 2 estimators at the same time (a 2-point forecast).

B. How to assess calibration?

Currently, we have calibration_curve to assess (auto-)calibration for classifiers. It would be desirable to

  1. also assess regression tasks;
  2. also look at conditional calibration, i.e. aggregates grouped by some feature (either bins or categorical levels);
  3. handle different functionals, e.g. the expectation and quantiles.
  4. What is in-scope, what is out-of-scope?

Note that calibration scores, by nature, often do not comply with "larger is better" or "smaller is better" and are thus not suitable for cross-validation/grid search (for hyper-parameter tuning). An example is the bias proposed in #17854, also mentioned above, which is sign-sensitive (a larger absolute value means worse, and the sign indicates under- or over-prediction).
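
To make points 2 and 3 above more concrete, here is a minimal sketch of conditional calibration for the expectation: the sign-sensitive bias aggregated per bin of one feature. No such helper exists in scikit-learn today; the function name and usage below are made up for illustration:

```python
import numpy as np
import pandas as pd

def bias_by_feature_bin(y_observed, y_predicted, feature, n_bins=5):
    # Mean bias (prediction - observation) per quantile-based bin of `feature`.
    # A well-calibrated model has bias close to 0 in every bin; the sign of a
    # deviation tells whether the model over- or under-predicts in that bin.
    bins = pd.qcut(np.asarray(feature), q=n_bins, duplicates="drop")
    df = pd.DataFrame({
        "bias": np.asarray(y_predicted) - np.asarray(y_observed),
        "bin": bins,
    })
    return df.groupby("bin", observed=True)["bias"].mean()

# Hypothetical usage with a fitted regressor `model` and held-out data:
# print(bias_by_feature_bin(y_test, model.predict(X_test), X_test[:, 0]))
```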

Further examples: #11096 is about (auto-)calibration for classification only; #20162 proposes, among other things, a metric for the calibration of 2 quantiles.

References

[1] Gneiting (2009) https://arxiv.org/abs/0912.0902
[2] Nolde & Ziegel (2016) https://arxiv.org/abs/1608.05498
[3] Pohle (2020) https://arxiv.org/abs/2005.01835
[4] Fissler, Lorentzen & Mayer (2022) https://arxiv.org/abs/2202.12780
