Description
Describe the workflow you want to enable
I'd like to have a feature importance method native to linear models (without L1 penalty) that is calculated on the training set:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(with_importance=True)  # with_importance is the proposed, not yet existing, option
clf.fit(X, y)
clf.feature_importances_  # or some nice plot thereof
Describe your proposed solution
New proposal
Evaluate whether LMG (Lindeman, Merenda and Gold, see [1, 2]) is applicable and feasible for L2-penalized regression and for GLMs. Otherwise, consider the other measures from [1, 2].
In short, LMG is a Shapley value decomposition of R2 by the features (a brute-force sketch is given after the references below).
References:
- [1] The R package relaimpo and its accompanying JSS paper: U. Grömping (2006). Relative Importance for Linear Regression in R: The Package relaimpo
- [2] U. Grömping (2016). Variable importance in regression models
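To make this concrete, here is a brute-force sketch of LMG for plain OLS (my own illustration, not an existing scikit-learn API; the function name lmg_importance is made up). It averages the marginal gain in R2 from adding each feature over all feature orderings; an actual implementation for penalized models or GLMs would need a suitable goodness-of-fit measure and a far cheaper computation.

# Brute-force LMG: Shapley decomposition of R^2 over feature orderings.
# Exponential cost, so only meant for a handful of features.
from itertools import permutations
import numpy as np
from sklearn.linear_model import LinearRegression

def lmg_importance(X, y):
    n_features = X.shape[1]

    def r2(subset):
        # R^2 of an OLS fit restricted to the given feature subset (0 for the empty set).
        if not subset:
            return 0.0
        Xs = X[:, subset]
        return LinearRegression().fit(Xs, y).score(Xs, y)

    importances = np.zeros(n_features)
    perms = list(permutations(range(n_features)))
    for perm in perms:
        seen = []
        for j in perm:
            # Marginal gain in R^2 from adding feature j after the features preceding it.
            importances[j] += r2(seen + [j]) - r2(seen)
            seen.append(j)
    return importances / len(perms)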
Original proposal
Compute the t-statistic of the coefficients,
t[j] = coef[j] / std(coef[j]),
and use its absolute value |t[j]| as a measure of (in-sample) importance. For GLMs like logistic regression, see Section 5.3 of https://arxiv.org/pdf/1509.09169.pdf for a formula for Var[coef].
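For the unpenalized case, a minimal sketch of this t-statistic importance could look as follows (assuming an OLS fit with intercept and the classical covariance formula Var[coef] = sigma^2 (X'X)^{-1} on centered features; the penalized/GLM variance formula from the reference above is not implemented here):

# t-statistic importance sketch for plain OLS (in-sample, on the training set).
import numpy as np
from sklearn.linear_model import LinearRegression

def t_statistic_importance(X, y):
    n_samples, n_features = X.shape
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    # Unbiased noise variance estimate; the extra -1 accounts for the intercept.
    sigma2 = residuals @ residuals / (n_samples - n_features - 1)
    Xc = X - X.mean(axis=0)  # center features to match the fitted intercept
    cov_coef = sigma2 * np.linalg.pinv(Xc.T @ Xc)
    t = model.coef_ / np.sqrt(np.diag(cov_coef))
    return np.abs(t)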
Describe alternatives you've considered, if relevant
Any general importance measure (permutation importance, SHAP values, ...) also works.
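For reference, permutation importance is already available in scikit-learn and can be computed on the training set for a fitted linear model (the dataset here is only for illustration):

# Model-agnostic alternative that works today: permutation importance.
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)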
Additional context
Given the great and legitimate need for interpretability, I would favor having a native importance measure for linear models. Random Forests have their own native feature_importances_ with the warning:
impurity-based feature importances can be misleading for high cardinality features (many unique values).
We could add a similar warning for collinear features like
feature importances can be misleading for collinear or high-dimensional features.
I guess, in the end, this is true for all feature importance measures, even for SHAP (see also our multicollinear example).
Prior discussions like #16802, #6773 and #13048 focused on p-values, which seem out of scope for scikit-learn for different reasons. I hope we can circumvent those reasons by focusing on feature importance only and not considering p-values.