FEA Add variable importance to linear models #21170

Open
@lorentzenchr

Description


Describe the workflow you want to enable

I'd like to have a feature importance method native to linear models (without an L1 penalty) that is calculated on the training set:

```python
from sklearn.linear_model import LogisticRegression

# `with_importance` is the proposed option, not an existing parameter
clf = LogisticRegression(with_importance=True)
clf.fit(X, y)
clf.feature_importances_  # or some nice plot thereof
```

Describe your proposed solution

New proposal

Evaluate whether the LMG measure (Lindeman, Merenda and Gold; see [1, 2]) is applicable and computationally feasible for L2-penalized regression and for GLMs. Otherwise, consider the other measures surveyed in [1, 2].

In short, LMG is the Shapley value decomposition of R² across the features: each feature's importance is its average contribution to R² over all orderings of the features.
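To make the idea concrete, here is a minimal brute-force sketch of LMG. It is exponential in the number of features, so it is for illustration only; `lmg_importance` and `r2_of_subset` are hypothetical helpers, not an existing scikit-learn API:

```python
from itertools import permutations

import numpy as np
from sklearn.linear_model import LinearRegression


def r2_of_subset(X, y, subset):
    """R^2 of an OLS fit using only the features in `subset`."""
    if not subset:
        return 0.0
    Xs = X[:, list(subset)]
    return LinearRegression().fit(Xs, y).score(Xs, y)


def lmg_importance(X, y):
    """Shapley decomposition of R^2: for each feature, the average gain
    in R^2 when it is added, taken over all feature orderings."""
    p = X.shape[1]
    importances = np.zeros(p)
    orderings = list(permutations(range(p)))
    for order in orderings:
        included = []
        r2_prev = 0.0
        for j in order:
            included.append(j)
            r2_curr = r2_of_subset(X, y, included)
            importances[j] += r2_curr - r2_prev
            r2_prev = r2_curr
    return importances / len(orderings)
```

By construction, the importances sum to the R² of the full model, which is what makes this an attractive decomposition.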

References:

[1] Grömping, U. (2007). Estimators of Relative Importance in Linear Regression Based on Variance Decomposition. The American Statistician, 61(2), 139-147.
[2] Grömping, U. (2015). Variable importance in regression models. WIREs Computational Statistics, 7(2), 137-152.

Original proposal

Compute the t-statistic of the coefficients

t[j] = coef[j] / std(coef[j])

and use the absolute value, i.e. |t|, as a measure of (in-sample) importance. For GLMs like logistic regression, see Section 5.3 of https://arxiv.org/pdf/1509.09169.pdf for a formula for Var[coef].
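For the unpenalized OLS case, a minimal sketch of this computation, using the classical estimate Var[coef] = σ² (XᵀX)⁻¹ with σ² the residual variance (`t_statistics` is a hypothetical helper, not scikit-learn API):

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def t_statistics(X, y):
    """t[j] = coef[j] / std(coef[j]) for plain OLS."""
    n, p = X.shape
    model = LinearRegression().fit(X, y)
    X1 = np.hstack([np.ones((n, 1)), X])  # design matrix with intercept column
    residuals = y - model.predict(X)
    sigma2 = residuals @ residuals / (n - p - 1)  # unbiased residual variance
    cov = sigma2 * np.linalg.inv(X1.T @ X1)  # Var[(intercept, coef_1, ..., coef_p)]
    se = np.sqrt(np.diag(cov))[1:]  # standard errors, intercept entry dropped
    return model.coef_ / se


# |t| as the (in-sample) importance measure:
# importance = np.abs(t_statistics(X, y))
```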

Describe alternatives you've considered, if relevant

Any general importance measure (permutation importance, SHAP values, ...) would also work; see the sketch below.
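For instance, permutation importance is already available in scikit-learn today via `sklearn.inspection.permutation_importance`. Computed here on the training set to match the in-sample framing above, though a held-out set is generally recommended:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # mean importance per feature over the repeats
```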

Additional context

Given the great and legitimate need for interpretability, I would favor having a native importance measure for linear models. Random forests have their own native `feature_importances_` with the warning

> impurity-based feature importances can be misleading for high cardinality features (many unique values).

We could add a similar warning for collinear features, like

> feature importances can be misleading for collinear or high-dimensional features.

I guess, in the end, this is true for all feature importance measures, even for SHAP (see also our multicollinear example).

Prior discussions like #16802, #6773 and #13048 focused on p-values, which seem out of scope for scikit-learn for various reasons. I hope we can circumvent those reasons by focusing on feature importance only and not considering p-values.
