Description
Describe the workflow you want to enable
I would like to select features by thresholding their mean value (i.e., the mean across samples), similar to how VarianceThreshold selects features by thresholding their variance across samples.
Describe your proposed solution
Two possible options:
- Implement a new Estimator in sklearn.feature_selection, similar to VarianceThreshold. Example: https://github.com/hermidalc/sklearn-extensions/blob/f9296d0f3ed5d71b7f07779b47d8cf71bbcfa51b/feature_selection/_average_threshold.py#L7-L96
- Add a mode='threshold' option to GenericUnivariateSelect (https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.GenericUnivariateSelect.html)
  - This would allow the user to pass in more flexible score_funcs as well
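To make the first option concrete, here is a minimal sketch of what such a selector might look like. The class name MeanThreshold, its threshold parameter, and the means_ attribute are hypothetical names chosen for illustration; this is not the linked implementation and not an existing scikit-learn API. It assumes dense numeric input and, mirroring VarianceThreshold, keeps features whose statistic is strictly greater than the threshold.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.feature_selection import SelectorMixin
from sklearn.utils.validation import check_is_fitted


class MeanThreshold(SelectorMixin, BaseEstimator):
    """Hypothetical selector: keep features whose mean across samples
    is strictly greater than `threshold` (dense input assumed)."""

    def __init__(self, threshold=0.0):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.n_features_in_ = X.shape[1]
        # Per-feature mean, computed across samples (rows).
        self.means_ = X.mean(axis=0)
        return self

    def _get_support_mask(self):
        # SelectorMixin builds get_support() and transform() on top of this mask.
        check_is_fitted(self, "means_")
        return self.means_ > self.threshold


# Toy data: the three columns have means 1/3, 2 and 10.
X = np.array([[0.0, 2.0, 10.0],
              [0.0, 4.0, 10.0],
              [1.0, 0.0, 10.0]])
selector = MeanThreshold(threshold=1.0).fit(X)
X_selected = selector.transform(X)  # keeps the last two columns
```

Because the selector only needs to provide a boolean mask, the same pattern would extend to other per-feature statistics, which is roughly what the second option would generalize through score_func.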
Describe alternatives you've considered, if relevant
Another alternative, although this seems counter to how these functions are designed:
SelectFromModel(DummyRegressor(strategy='mean'), importance_getter='constant_', threshold=min_mean_value)
Additional context
Setting a MeanThreshold would be useful when working with non-negative features, such as pixel intensity in images. For example, we might want to exclude pixels that are regularly saturated in our dataset, as they may be less informative.
Specifically, in my research field of neuroscience (single-neuron recordings), our "features" are the (non-negative) action potential counts for each neuron. We often exclude neurons with very low firing rates to minimize discretization error. Here are a few examples of neuroscience papers that set a MeanThreshold per neuron (i.e., feature):
- https://www.cell.com/neuron/pdfExtended/S0896-6273(17)30592-5
Units with mean firing rates less than 1.5 Hz were excluded from the analysis.
- https://www.nature.com/articles/s41586-021-04042-9
Neurons with firing rates less than 0.5 Hz were excluded.
- https://www.nature.com/articles/s41598-018-37227-w
Firing data were Gaussian smoothed and binned in 0.1 s periods and bins with firing rates less than 0.1 Hz (no spikes) were excluded.
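As a rough illustration of the kind of exclusion these papers describe, the sketch below applies a firing-rate cutoff with plain NumPy. The spike counts, the trial duration, and all variable names are made up for the example; only the 0.5 Hz cutoff is taken from one of the quotes above.

```python
import numpy as np

# Made-up spike counts: rows are trials (samples), columns are neurons (features).
spike_counts = np.array([
    [0, 3, 12],
    [1, 5, 9],
    [0, 4, 15],
    [0, 2, 10],
])
trial_duration_s = 2.0  # assumed duration of each trial, in seconds
min_rate_hz = 0.5       # neurons firing below this mean rate are excluded

# Mean firing rate per neuron: mean count across trials divided by trial duration.
mean_rate_hz = spike_counts.mean(axis=0) / trial_duration_s

# Keep neurons at or above the cutoff, i.e. exclude those with rates below it.
keep = mean_rate_hz >= min_rate_hz
filtered_counts = spike_counts[:, keep]
```

A MeanThreshold-style selector would let this step sit inside a Pipeline, so the cutoff is learned from the training split only.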