Description
I recently came across #12895 (with PR #13467) and the older #6457, which revived an old topic that I would like to share.
In our team, we needed to provide model performance metrics for regression models. This is a slightly different goal from using metrics for grid search or model selection: the metric is not only used to "select the best model" but to give users feedback about "how good a model is".
For regression models I introduced three categories of metrics that turned out to be quite intuitive (see the sketch after this list):
- Absolute performance (L2 RMSE, L1 MAE): these metrics can all be interpreted as an "average prediction error" ("average" in the broad sense here) expressed in the unit of the prediction target (e.g. "average error of 12 kWh").
- Relative performance (L2 CVRMSE, L1 CVMAE, and per-point relative metrics such as MAPE or MARE, MARES, MAREL...): these metrics can all be interpreted as an "average relative prediction error" expressed as a percentage of the target (e.g. "average error of 10%").
- Comparison to a dummy model (L2 RRSE, L1 RAE): these metrics can all be interpreted as a ratio between the performance of the model at hand and the performance of a dummy, constant model (always predicting the average). These need to be inverted to be intuitive, e.g. "20% -> 5 times better than a dummy model".
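For concreteness, here is a minimal sketch of the relative and comparison-to-dummy categories on top of NumPy. The function names (`cv_rmse`, `cv_mae`, `rrse`, `rae`) are just illustrative, they are not part of the current sklearn API, and the exact definitions (e.g. normalising by the mean of the target) are one possible convention among several:

```python
import numpy as np


def cv_rmse(y_true, y_pred):
    """Relative performance (L2): RMSE expressed as a fraction of the mean target."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.mean(y_true)


def cv_mae(y_true, y_pred):
    """Relative performance (L1): MAE expressed as a fraction of the mean target."""
    return np.mean(np.abs(y_true - y_pred)) / np.mean(y_true)


def rrse(y_true, y_pred):
    """Comparison to a dummy model (L2): root relative squared error,
    i.e. model RMSE relative to the RMSE of a constant mean predictor."""
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((y_true - np.mean(y_true)) ** 2)
    return np.sqrt(num / den)


def rae(y_true, y_pred):
    """Comparison to a dummy model (L1): relative absolute error,
    i.e. model MAE relative to the MAE of a constant mean predictor."""
    num = np.sum(np.abs(y_true - y_pred))
    den = np.sum(np.abs(y_true - np.mean(y_true)))
    return num / den
```

With these definitions, a CVRMSE of 0.10 reads as "average error of about 10% of the target", and an RRSE of 0.20 reads as "about 5 times better than the dummy model predicting the mean".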
Of course these categories are "applicative". They all make sense from a user's point of view; however, as far as model selection is concerned, only two make sense (MAE and RMSE). Not even R², because R² = 1 - RRSE², so it is not a performance metric but a comparison-to-dummy metric (but I don't want to open that debate here, so please refrain from objecting to that one :) ).
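To make the R² / RRSE relationship concrete, here is a quick numerical check (the data is made up; `r2_score` is the existing sklearn function, `rrse` is the hypothetical helper sketched above):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.2])
y_pred = np.array([2.8, 5.4, 2.9, 6.4, 4.0])

# Root relative squared error w.r.t. the constant mean predictor.
rrse = np.sqrt(np.sum((y_true - y_pred) ** 2)
               / np.sum((y_true - y_true.mean()) ** 2))

print(r2_score(y_true, y_pred))  # R² from sklearn
print(1 - rrse ** 2)             # same value: R² = 1 - RRSE²
```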
Anyway, my question for the core sklearn team is: shall I propose a pull request with all these metrics? I'm ready to go, since we've already done it in our private repo, aligned with sklearn's regression.py file. So it is rather a matter of deciding whether this is a good idea. And if so, introducing categories might be needed to help users understand them better.
An alternative might be to create a small independent project containing all these metrics, leaving only mean_absolute_error (L1) and mean_squared_error (L2) in sklearn.
Any thoughts on this?