Description
This can wait until after the release.
A discussion happened in the GLM PR #14300 about what properties we would like `sample_weight` to have.
Current Versions
First, a short side comment about 3 ways sample weights ($s_i$) are currently used in loss functions with regularized generalized linear models in scikit-learn (as far as I understand):
- Version 1a: $L_{1a}(\omega) = \sum_i s_i \cdot l(x_i, \omega) + \alpha \lVert \omega\rVert$. For instance: `Ridge` (also `LogisticRegression`, where `C = 1/α`).
- Version 2a: $L_{2a}(\omega) = \frac{1}{n_{\text{samples}}}\sum_i s_i \cdot l(x_i, \omega) + \alpha \lVert \omega\rVert$. For instance: `SGDClassifier`? (maybe `Lasso`, `ElasticNet` once they are added?)
- Version 2b: $L_{2b}(\omega) = \frac{1}{\sum_i s_i}\sum_i s_i \cdot l(x_i, \omega) + \alpha \lVert \omega\rVert$. For instance, currently proposed in the GLM PR for `PoissonRegressor` etc. (edit: meanwhile implemented this way)
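To make the three versions concrete, here is a minimal sketch (plain NumPy, not scikit-learn code) of the three objectives, written for a squared error loss and an L2 penalty; the function names are just for illustration.

```python
import numpy as np

def loss_1a(w, X, y, s, alpha):
    # Version 1a: plain weighted sum of losses + alpha * penalty
    return np.sum(s * (y - X @ w) ** 2) + alpha * np.dot(w, w)

def loss_2a(w, X, y, s, alpha):
    # Version 2a: weighted sum divided by n_samples + alpha * penalty
    return np.sum(s * (y - X @ w) ** 2) / X.shape[0] + alpha * np.dot(w, w)

def loss_2b(w, X, y, s, alpha):
    # Version 2b: weighted *average* of losses (divide by the sum of weights)
    # + alpha * penalty
    return np.sum(s * (y - X @ w) ** 2) / np.sum(s) + alpha * np.dot(w, w)
```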
Properties
For sample weights it's useful to think in terms of invariance properties, as they can be directly expressed in common tests. For instance,
- checking that zero sample weight is equivalent to ignoring samples, in "add common test that zero sample weight means samples are ignored" #15015 (replaced by "Common check for sample weight invariance with removed samples" #17176), helped discover a number of issues. At first glance all of the above formulations should verify this, but it is actually verified only by $L_{1a}$ and $L_{2b}$ (the $1/n_{\text{samples}}$ normalization in $L_{2a}$ changes when samples are removed).
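As a hedged illustration of what that common test checks, here is a minimal sketch using `Ridge` (which follows the $L_{1a}$ formulation above) on dense data; the dataset and tolerances are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=5, noise=1.0, random_state=0)
mask = np.arange(50) < 40            # keep the first 40 samples
sw = mask.astype(float)              # weight 1 for kept samples, 0 for the rest

# zero weight on some samples ...
coef_zero_weight = Ridge(alpha=1.0).fit(X, y, sample_weight=sw).coef_
# ... should match dropping those samples entirely
coef_removed = Ridge(alpha=1.0).fit(X[mask], y[mask]).coef_

np.testing.assert_allclose(coef_zero_weight, coef_removed, rtol=1e-6)
```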
Similarly, paraphrasing #14300 (comment), other properties we might want to enforce are:
- Multiplying some sample weight by $N$ is equivalent to repeating the corresponding samples $N$ times. This is verified only by $L_{1a}$ and $L_{2b}$. Example: for $L_{2a}$, setting all weights to 2 is equivalent to having 2x more samples only if $\alpha$ is also replaced by $\alpha / 2$.
- Finally, that scaling all sample weights has no effect. This is verified only by $L_{2b}$. For both $L_{1a}$ and $L_{2a}$, multiplying all sample weights by $k$ is equivalent to setting $\alpha \to \alpha / k$ (see the `Ridge` sketch after this list). This one is more controversial.
  - Against enforcing this: there are arguments for keeping a meaning for business metrics (e.g. "RFC Semantic of sample_weight in regression metrics" #15651).
  - In favor: we don't want a coupling between using sample weights and regularization.

  Example: say one has a model without sample weights and wants to see whether applying sample weights (imbalanced dataset, sample uncertainty, etc.) improves it. Without this property it's difficult to conclude: is the evaluation metric better with sample weights because of the weights themselves, or simply because we now have a better regularized model? One has to consider these two factors simultaneously.
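Here is a sketch illustrating the last two properties on a concrete $L_{1a}$-style estimator (`Ridge`, assuming it follows that formulation as stated above): integer weights behave like repeated samples, and scaling all weights is not a no-op but matches rescaling $\alpha$. This is only meant as an illustration of the current behavior as I understand it; the data and constants are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=5, noise=1.0, random_state=0)
sw = np.ones(50)

# (1) sample_weight = 2 everywhere vs. physically repeating every sample twice
coef_w2 = Ridge(alpha=1.0).fit(X, y, sample_weight=2 * sw).coef_
coef_rep = Ridge(alpha=1.0).fit(np.vstack([X, X]), np.concatenate([y, y])).coef_
np.testing.assert_allclose(coef_w2, coef_rep, rtol=1e-6)

# (2) scaling all weights by k matches dividing alpha by k, not the original fit
k = 10.0
coef_scaled_w = Ridge(alpha=1.0).fit(X, y, sample_weight=k * sw).coef_
coef_scaled_a = Ridge(alpha=1.0 / k).fit(X, y, sample_weight=sw).coef_
coef_original = Ridge(alpha=1.0).fit(X, y, sample_weight=sw).coef_
np.testing.assert_allclose(coef_scaled_w, coef_scaled_a, rtol=1e-6)
# scaling the weights is *not* a no-op for an L_1a-style objective
assert not np.allclose(coef_scaled_w, coef_original)
```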
Whether we want/need consistency between the use of sample weights in metrics and in estimators is another question. I'm not convinced we do, since in most cases estimators don't care about the global scaling of the loss function, and these formulations are equivalent up to a scaling of the regularization parameter. So maybe using the $L_{1a}$-equivalent expression in metrics could be fine.
In any case, we need to decide the behavior we want. This is a blocker for:
- Poisson, Gamma and Tweedie regression: "Minimal Generalized linear models implementation (L2 + lbfgs)" #14300
- adding sample weights in `ElasticNet` and `Lasso`: "[MRG] Sample weights for ElasticNet" #15436
- other tests for sample weight consistency in linear models by @lorentzenchr in "TST Add tests for LinearRegression that sample weights act consistently" #15554
Note: `Ridge` actually seems to have different sample weight behavior for dense and sparse input, as reported in #15438.
@agramfort's opinion on this can be found in #15651 (comment) (if I understood correctly).
Please correct me if I missed something (this could also use a more in-depth review of how this is done in other libraries).