changing ARDRegression feature selection method #30951
Conversation
Overall I'm suspicious that this is something we'd like to do. The diff doesn't look straightforward, it introduces new constructor args, and it can all be avoided with a scaling step in the pipeline.
So I'm more of a -0.5 on this, but I defer to others to see how they feel.
if self.standardize:
    # Standardize X
    self.scaler_ = StandardScaler()
    X = self.scaler_.fit_transform(X)
I tend to prefer having scaling done in a pipeline before this step rather than here. I'm not sure if we've changed our opinion on that, though (cc @lorentzenchr @ogrisel).
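For reference, a minimal sketch of that pipeline-based alternative using the released estimator and its existing threshold_lambda parameter (names and values are illustrative):

```python
# Sketch of the pipeline alternative: scale features before ARDRegression
# instead of adding a `standardize` argument to the estimator itself.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ARDRegression

model = make_pipeline(StandardScaler(), ARDRegression(threshold_lambda=10000.0))
# model.fit(X_train, y_train) then behaves like ARDRegression on standardized X.
```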
threshold_lambda : float, default=10 000
    Threshold for removing (pruning) weights with high precision from
    the computation.

min_significance : float, default=0.5
    Minimum statistical significance (|beta|/sigma) required to keep a
    feature. The default of 0.5 provides a reasonable balance between
    feature selection and model accuracy. This replaces the
    threshold_lambda parameter for more interpretable feature pruning.

standardize : bool, default=True
    Whether to standardize features before fitting. Recommended for
    consistent feature selection behavior regardless of feature scales.
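For illustration only, a sketch of how the proposed arguments would be used; min_significance and standardize exist only in this PR, not in any released scikit-learn:

```python
# Hypothetical usage of the arguments proposed in this PR (not part of any
# released scikit-learn): keep only features whose significance
# |beta| / sigma = |beta| * sqrt(lambda) reaches 0.5, after internal scaling.
from sklearn.linear_model import ARDRegression

ard = ARDRegression(min_significance=0.5, standardize=True)
# ard.fit(X, y)
# Coefficients of features that fail the significance test are set to zero,
# as with the current threshold_lambda-based pruning.
```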
This needs a lot more careful consideration. We don't simply remove constructor arguments.
Yes, I realized since then that the new solution wouldn't be backward compatible this way.
Issues with the current implementation:

When we scale a feature by a factor k, under the lambda-threshold approach:
- the coefficient β becomes β/k
- lambda becomes k²·λ
- if λ < threshold, then k²·λ may exceed the threshold purely due to scaling

Conclusion: pruning depends on the feature scale.

Proposed approach based on significance:
- significance = |β|·√λ = |β|/σ
- when scaled: (|β|/k)·√(k²·λ) = |β|·√λ

Conclusion: invariant to feature scaling (a numeric sketch follows below).
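A rough numeric sketch of this argument (illustrative data and values, not from the PR), using the released estimator with pruning effectively disabled so the raw lambda_ values can be compared:

```python
# Rescaling one feature by k shrinks its coefficient by ~k and inflates its
# lambda_ by ~k**2, while |coef_| * sqrt(lambda_) stays roughly unchanged.
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = X @ np.array([1.0, 0.5, 0.0]) + 0.1 * rng.randn(200)

k = 1000.0
X_rescaled = X.copy()
X_rescaled[:, 0] *= k  # feature 0 now lives on a 1000x larger scale

for name, data in [("original", X), ("rescaled", X_rescaled)]:
    # threshold_lambda is set very high here only to disable pruning; with the
    # default of 10 000 the rescaled feature 0 would be pruned purely because
    # of its scale.
    ard = ARDRegression(threshold_lambda=1e12).fit(data, y)
    significance = np.abs(ard.coef_) * np.sqrt(ard.lambda_)
    print(name, "lambda_ =", ard.lambda_, "significance =", significance)
```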
Advantages of significance-based pruning:
- Statistical interpretation: a significance of 2.0 corresponds to ~95% confidence that the coefficient is non-zero, a standard statistical threshold.
- Consistency with theory: in statistics we typically don't discard variables just because they have high precision; we discard them when we can't distinguish their effect from zero.
- Better feature selection (see the sketch after this list):
  - keeps small but certain effects (|β| small, λ large, |β|·√λ > threshold)
  - removes large uncertain effects (|β| large, λ small, |β|·√λ < threshold)
- Preservation of the Bayesian framework: we still use the same Bayesian update equations for λ and α; only the pruning criterion changes.
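To make the two criteria concrete, a sketch comparing the pruning masks on made-up posterior means and precisions (array values and names are illustrative, not the PR's actual code):

```python
# Feature 0: small but certain effect; feature 1: large but uncertain effect;
# feature 2: ordinary effect kept by both rules.
import numpy as np

coef = np.array([0.02, 5.0, 1.0])      # posterior means |beta|
lambda_ = np.array([1e5, 1e-3, 1.0])   # weight precisions

# Current rule: keep a feature while its precision stays below the threshold.
keep_lambda = lambda_ < 1e4
# -> [False, True, True]: drops the small-but-certain feature 0.

# Proposed rule: keep a feature while |beta| / sigma = |beta| * sqrt(lambda)
# stays above a minimum significance.
keep_significant = np.abs(coef) * np.sqrt(lambda_) >= 0.5
# -> [True, False, True]: keeps feature 0, drops the uncertain feature 1.
```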
Potential concerns:
- Original purpose of threshold_lambda: the original ARD papers describe pruning based on lambda to promote sparsity, but they don't explicitly consider scale dependence.
- Computation: both approaches have identical computational cost.
- Sparse solutions: significance-based pruning might retain more features than lambda thresholding in some cases, but those are features we can be confident are relevant.