Description
This can wait until after the release.
A discussion happened in the GLM PR #14300 about what properties we would like `sample_weight` to have.
Current Versions
First, a short side comment about 3 ways sample weights ($s_i$) are currently used in loss functions with regularized generalized linear models in scikit-learn (as far as I understand):
- Version 1a: $L_{1a}(\omega) = \sum_i s_i \cdot l(x_i, \omega) + \alpha \lVert \omega\rVert$. For instance: `Ridge` (also `LogisticRegression`, where `C = 1/α`).
- Version 2a: $L_{2a}(\omega) = \frac{1}{n_{\text{samples}}}\sum_i s_i \cdot l(x_i, \omega) + \alpha \lVert \omega\rVert$. For instance: `SGDClassifier`? (maybe `Lasso`, `ElasticNet` once they are added?)
- Version 2b: $L_{2b}(\omega) = \frac{1}{\sum_i s_i}\sum_i s_i \cdot l(x_i, \omega) + \alpha \lVert \omega\rVert$. For instance, currently proposed in the GLM PR for `PoissonRegressor` etc. (edit: meanwhile implemented this way)
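To make the three versions concrete, here is a minimal sketch (plain NumPy, not scikit-learn code) of the three objectives, written for a squared error loss and an L2 penalty; the function names are just for illustration.

```python
import numpy as np

def loss_1a(w, X, y, s, alpha):
    # Version 1a: plain weighted sum of losses + alpha * penalty
    return np.sum(s * (y - X @ w) ** 2) + alpha * np.dot(w, w)

def loss_2a(w, X, y, s, alpha):
    # Version 2a: weighted sum divided by n_samples + alpha * penalty
    return np.sum(s * (y - X @ w) ** 2) / X.shape[0] + alpha * np.dot(w, w)

def loss_2b(w, X, y, s, alpha):
    # Version 2b: weighted *average* of losses (divide by the sum of weights)
    # + alpha * penalty
    return np.sum(s * (y - X @ w) ** 2) / np.sum(s) + alpha * np.dot(w, w)
```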
Properties
For sample weights it's useful to think in terms of invariance properties, as they can be directly expressed in common tests. For instance,
- checking that zero sample weight is equivalent to ignoring samples, in "add common test that zero sample weight means samples are ignored" #15015 (replaced by "Common check for sample weight invariance with removed samples" #17176), helped discover a number of issues. At first glance all of the above formulations should verify this, but it is actually verified only by $L_{1a}$ and $L_{2b}$ (the $1/n_{\text{samples}}$ normalization in $L_{2a}$ changes when samples are removed).
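As a hedged illustration of what that common test checks, here is a minimal sketch using `Ridge` (which follows the $L_{1a}$ formulation above) on dense data; the dataset and tolerances are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=5, noise=1.0, random_state=0)
mask = np.arange(50) < 40            # keep the first 40 samples
sw = mask.astype(float)              # weight 1 for kept samples, 0 for the rest

# zero weight on some samples ...
coef_zero_weight = Ridge(alpha=1.0).fit(X, y, sample_weight=sw).coef_
# ... should match dropping those samples entirely
coef_removed = Ridge(alpha=1.0).fit(X[mask], y[mask]).coef_

np.testing.assert_allclose(coef_zero_weight, coef_removed, rtol=1e-6)
```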
Similarly, paraphrasing #14300 (comment), other properties we might want to enforce are:
- Multiplying some sample weight by $N$ is equivalent to repeating the corresponding samples $N$ times. This is verified only by $L_{1a}$ and $L_{2b}$. Example: for $L_{2a}$, setting all weights to 2 is equivalent to having 2x more samples only if $\alpha$ is also replaced by $\alpha / 2$.
- Finally, that scaling all sample weights has no effect. This is verified only by $L_{2b}$. For both $L_{1a}$ and $L_{2a}$, multiplying all sample weights by $k$ is equivalent to setting $\alpha \to \alpha / k$ (see the `Ridge` sketch after this list). This one is more controversial.
  - Against enforcing this: there are arguments for keeping a meaning for business metrics (e.g. "RFC Semantic of sample_weight in regression metrics" #15651).
  - In favor: we don't want a coupling between using sample weights and regularization.

  Example: say one has a model without sample weights and wants to see whether applying sample weights (imbalanced dataset, sample uncertainty, etc.) improves it. Without this property it's difficult to conclude: is the evaluation metric better with sample weights because of the weights themselves, or simply because we now have a better regularized model? One has to consider these two factors simultaneously.
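Here is a sketch illustrating the last two properties on a concrete $L_{1a}$-style estimator (`Ridge`, assuming it follows that formulation as stated above): integer weights behave like repeated samples, and scaling all weights is not a no-op but matches rescaling $\alpha$. This is only meant as an illustration of the current behavior as I understand it; the data and constants are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=50, n_features=5, noise=1.0, random_state=0)
sw = np.ones(50)

# (1) sample_weight = 2 everywhere vs. physically repeating every sample twice
coef_w2 = Ridge(alpha=1.0).fit(X, y, sample_weight=2 * sw).coef_
coef_rep = Ridge(alpha=1.0).fit(np.vstack([X, X]), np.concatenate([y, y])).coef_
np.testing.assert_allclose(coef_w2, coef_rep, rtol=1e-6)

# (2) scaling all weights by k matches dividing alpha by k, not the original fit
k = 10.0
coef_scaled_w = Ridge(alpha=1.0).fit(X, y, sample_weight=k * sw).coef_
coef_scaled_a = Ridge(alpha=1.0 / k).fit(X, y, sample_weight=sw).coef_
coef_original = Ridge(alpha=1.0).fit(X, y, sample_weight=sw).coef_
np.testing.assert_allclose(coef_scaled_w, coef_scaled_a, rtol=1e-6)
# scaling the weights is *not* a no-op for an L_1a-style objective
assert not np.allclose(coef_scaled_w, coef_original)
```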
Whether we want/need consistency between the use of sample weights in metrics and in estimators is another question. I'm not convinced we do, since in most cases estimators don't care about the global scaling of the loss function, and these formulations are equivalent up to a scaling of the regularization parameter. So maybe using the $L_{1a}$-equivalent expression in metrics could be fine.
In any case, we need to decide the behavior we want. This is a blocker for:
- Poisson, Gamma and Tweedie regression: "Minimal Generalized linear models implementation (L2 + lbfgs)" #14300
- adding sample weights in `ElasticNet` and `Lasso`: "[MRG] Sample weights for ElasticNet" #15436
- other tests for sample weight consistency in linear models by @lorentzenchr in "TST Add tests for LinearRegression that sample weights act consistently" #15554
Note: `Ridge` actually seems to have different sample weight behavior for dense and sparse input, as reported in #15438.
@agramfort's opinion on this can be found in #15651 (comment) (if I understood correctly).
Please correct me if I missed something (this could also use a more in-depth review of how this is done in other libraries).