Description
We receive a lot of feature requests to add new metrics for classification and regression tasks, each shedding light on different aspects of supervised learning, e.g. (incomplete list):
Issues
- Fmax score (or maximum of F1/Fbeta) #26026
- add RMSLE to `sklearn.metrics.SCORERS.keys()` #21686
- Mean Standardized Log Loss (MSLL) for uncertainty aware regression models #21665
- Precision @ Recall K || Recall @ Precision K #20266
- Metrics for prediction intervals #20162
- Add more D2 scores #20943
- Feature Request: function to calculate Expected Calibration Error (ECE) #18268
- Feature Request: bias regression metric #17854
- Should sklearn include the Equal Error Rate metric? #15247
- Regression metrics - which strategy ? #13482
- Add sklearn.metrics.cumulative_gain_curve and sklearn.metrics.lift_curve #10003
- Adding Fall-out, Miss rate, specificity as metrics #5516
PRs
Scikit-learn estimators for regression and classification give, with rare exceptions, point forecasts, i.e. a single number as the prediction for a single row of input data. Point forecasts usually aim at predicting a certain property/functional of the predictive distribution, such as the expectation `E[Y|X]` or the median `Median(Y|X)`. Such point forecasts are well understood and there is good theory on how to validate them, see [1].
Examples: `LinearRegression`, `Ridge`, `HistGradientBoostingRegressor(loss='squared_error')` and classifiers with a `predict_proba` like `LogisticRegression` estimate the expectation.
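For illustration only, a minimal sketch (assuming scikit-learn >= 1.1, which added `loss="quantile"` to `HistGradientBoostingRegressor`; data is made up) of how the loss chosen at fit time decides which functional of `Y|X` the point forecast targets:

```python
# Sketch: the fit-time loss determines the targeted functional of Y|X.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 1))
# Skewed noise so that the conditional mean and median differ.
y = X[:, 0] + rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# loss="squared_error" targets the expectation E[Y|X].
mean_model = HistGradientBoostingRegressor(loss="squared_error").fit(X, y)
# loss="quantile" with quantile=0.5 targets the median Median(Y|X).
median_model = HistGradientBoostingRegressor(loss="quantile", quantile=0.5).fit(X, y)

X_test = np.array([[0.5]])
print(mean_model.predict(X_test), median_model.predict(X_test))
```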
There are 2 main aspects of model validation/selection:
- Calibration: How well does a model take into account the available information in the form of features? Is there (conditional) bias? This can be assessed with identification functions, which in the case of the expectation amounts to checking for bias, `sum(y_predicted - y_observed)`, see [2, 4].
- Scoring: How good is the predictive performance of a model A compared to a model B? The aim is often to select the better model. This can be assessed by (consistent) scoring functions, like the (mean) squared error for the expectation, `sum((y_predicted - y_observed)^2)`, see [1, 4]. A rough sketch of both aspects follows below.
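For concreteness, a rough sketch of the two aspects using only existing NumPy/scikit-learn functions; all numbers are made up:

```python
# Calibration check (identification function) vs. scoring (consistent scoring function).
import numpy as np
from sklearn.metrics import mean_squared_error

y_observed = np.array([1.0, 2.0, 3.0, 4.0])
y_predicted_a = np.array([1.1, 2.1, 3.1, 4.1])  # predictions of model A
y_predicted_b = np.array([0.8, 2.3, 2.9, 4.0])  # predictions of model B

# Calibration for the expectation: check the bias.
# A well calibrated model has bias close to 0; the sign shows over-/undershooting.
bias_a = np.mean(y_predicted_a - y_observed)
bias_b = np.mean(y_predicted_b - y_observed)

# Scoring for the expectation: (mean) squared error.
# Smaller is better, suitable for comparing model A against model B.
score_a = mean_squared_error(y_observed, y_predicted_a)
score_b = mean_squared_error(y_observed, y_predicted_b)
print(bias_a, bias_b, score_a, score_b)
```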
A. How to structure `metrics` in a principled way to make it clear which metric assesses what?
- How to make the distinction between calibration and scoring more explicit?
- How to make it clear which functional is assessed, e.g. expectation or a quantile, or 2 different quantiles?
- What is in-scope, what is out-of-scope?
Note that (consistent) scoring functions assess calibration and resolution at the same time, see [3, 4]. They are a good fit for cross validation.
For example, #20162 proposes scoring functions and calibration functions for 2 quantiles, i.e. the prediction of 2 estimators at the same time (a 2-point forecast).
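As a concrete illustration (not taken from that proposal itself), the existing `mean_pinball_loss` (scikit-learn >= 1.0) is a consistent scoring function for a single quantile, so a 2-quantile forecast can at least be scored quantile by quantile today; the numbers below are made up:

```python
# Scoring a 2-quantile forecast (e.g. a 5%-95% interval) with the pinball loss.
import numpy as np
from sklearn.metrics import mean_pinball_loss

y_observed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_q05 = np.array([0.5, 1.4, 2.2, 3.1, 4.0])  # predicted 5% quantile
y_pred_q95 = np.array([1.6, 2.7, 3.9, 5.0, 6.1])  # predicted 95% quantile

score_lower = mean_pinball_loss(y_observed, y_pred_q05, alpha=0.05)
score_upper = mean_pinball_loss(y_observed, y_pred_q95, alpha=0.95)
print(score_lower, score_upper)
```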
B. How to assess calibration?
Currently, we have `calibration_curve` to assess (auto-)calibration for classifiers. It would be desirable to
- also assess regression tasks;
- also look at conditional calibration, i.e. aggregates grouped by some feature (either bins or categorical levels), as in the sketch after this list;
- handle different functionals, e.g. the expectation and quantiles.
- What is in-scope, what is out-of-scope?
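As a hypothetical sketch of the conditional-calibration point, this is roughly what grouping the bias of an expectation forecast by a binned feature could look like today with pandas; none of this is an existing sklearn API and the data is made up:

```python
# Conditional calibration for a regression task: per-bin bias should be ~0.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
feature = rng.uniform(0, 10, size=1000)
y_observed = rng.normal(loc=feature, scale=1.0)
y_predicted = feature + 0.1  # stand-in for a model's predictions

# Group the bias (identification function for the expectation) by a binned feature.
feature_bin = pd.cut(feature, bins=5)
per_bin_bias = pd.Series(y_predicted - y_observed).groupby(feature_bin, observed=True).mean()
print(per_bin_bias)
```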
Note that calibration scores, by nature, often do not comply with "larger is better" or "smaller is better" and are thus not suitable for cross-validation/grid search (for hyper-parameter tuning). An example is the bias proposed in #17854, also mentioned above, which is sign-sensitive (a larger absolute value means worse, and the sign indicates under- or overshooting of the prediction).
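A tiny, made-up numeric illustration of why neither direction of optimization selects the best model for such a sign-sensitive bias:

```python
# Neither "greater is better" nor "smaller is better" picks the model with
# bias closest to 0, which is the best calibrated one.
biases = {"model_a": -0.5, "model_b": 0.02, "model_c": 0.4}

print(max(biases, key=biases.get))                 # model_c (greater is better)
print(min(biases, key=biases.get))                 # model_a (smaller is better)
print(min(biases, key=lambda k: abs(biases[k])))   # model_b (closest to 0)
```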
Further examples: #11096 is for (auto-) calibration of classification only, #20162 proposes (among others) a metric for the calibration of 2 quantiles.
References
[1] Gneiting (2009) https://arxiv.org/abs/0912.0902
[2] Nolde & Ziegel (2016) https://arxiv.org/abs/1608.05498
[3] Pohle (2020) https://arxiv.org/abs/2005.01835
[4] Fissler, Lorentzen & Mayer (2022) https://arxiv.org/abs/2202.12780