FIX Correct GaussianMixture.weights_ normalization #24119
Conversation
Do we know if we can actually attribute any errors to the old method? While the new one is safer and more aligned with some recent changes, I'm just wondering if there's a viable non-regression test to include in this PR.
To me, we should choose a normalizer term such that, after normalization, GaussianMixture.weights_ sums to one. Regarding a viable non-regression test, I think it's fair to come up with a minimal reproducible example (or wait for the issue's author to know if they already have one in response to #24085 (comment)) showing a faulty behaviour and then turn it into a test this PR would fix.
Would it be helpful if I develop a minimal reproducible example? My plan was to do something similar to the gist linked in #23032.
@kasmith11: Yes, definitely: it would be helpful!
@jjerphan I've created a minimal reproducible example that prints the weights. However, I'm having trouble discerning how I can attribute any errors to the older method, as outlined by @Micky774. Would I need to add an additional gist of the weights after changing how self.weights_ is calculated?
The goal is to have some script that demonstrates where the old method went wrong. For example, even just finding a reasonable use case where the weights sum to something other than 1 would be sufficient. In my opinion, this is a really simple change made out of an abundance of caution and well motivated abstractly, so I think it's worth including with or without a demonstrable failure. We still need input from core maintainers, though.
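A minimal sketch of the kind of non-regression check discussed above: fit a GaussianMixture and assert that the fitted weights sum to 1. The dataset and parameters are placeholders, not taken from the linked gist.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Placeholder data: three well-separated blobs.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

gm = GaussianMixture(n_components=3, init_params="random", random_state=0).fit(X)

# With the normalization proposed in this PR, the mixing weights should sum
# to 1 (up to floating-point error) regardless of the initialization method.
assert np.isclose(gm.weights_.sum(), 1.0)
print(gm.weights_, gm.weights_.sum())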
Should we merge this? It seems like there is consensus that, even without an example of where the old behaviour is incorrect, this change is desirable. In particular, if other parts of this estimator normalise the weights to sum to 1, it makes sense to me to do this everywhere. I can't construct an example off the top of my head either. However, in particle physics the question of how to normalise weights comes up a bit. I think there the thing to get right is that the normalisation of different groups of samples (e.g. for different classes) is consistent with each other. Otherwise you end up giving one group a much higher weight relative to the other. But I can't make an example right now :-/
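A small numerical illustration of that last point, with made-up numbers: normalizing two groups of sample weights separately changes their relative contribution, whereas a single common normalizer preserves it.

import numpy as np

# Group A: many low-weight samples (total 80); group B: few high-weight samples (total 20).
w_a = np.full(160, 0.5)   # sums to 80
w_b = np.full(4, 5.0)     # sums to 20

# Normalizing each group separately makes each group contribute 1.0,
# destroying the original 80/20 balance between them.
sep_a, sep_b = w_a / w_a.sum(), w_b / w_b.sum()
print(sep_a.sum(), sep_b.sum())       # 1.0 1.0

# Normalizing with one common denominator preserves the 80/20 balance.
total = w_a.sum() + w_b.sum()
joint_a, joint_b = w_a / total, w_b / total
print(joint_a.sum(), joint_b.sum())   # 0.8 0.2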
I am fine with advancing this PR. I wonder if the code on main generates incorrect behavior when
For clarification, this still does normalize to one, since at this stage it is expected that weights are almost exactly |
self.weights_ = weights if self.weights_init is None else self.weights_init
self.means_ = means if self.means_init is None else self.means_init
self.weights_ /= self.weights_.sum()
Can weights be negative? If so, I think this might be more appropriate:
- self.weights_ /= self.weights_.sum()
+ self.weights_ /= np.abs(self.weights_).sum()
cc @jeremiedbb, with whom we discussed normalization with negative weights.
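For concreteness, a tiny sketch of how the two normalizers would differ if a weight were (hypothetically) negative; as discussed below, this is not a situation GaussianMixture is currently expected to produce.

import numpy as np

w = np.array([0.7, 0.5, -0.2])

print(w / w.sum())           # divides by 1.0: [0.7, 0.5, -0.2], sums to 1
print(w / np.abs(w).sum())   # divides by 1.4: signs preserved, but the sum is ~0.71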
IIRC self.weights_ should be non-negative at this point. Not completely sure. I will try to take a look later, time permitting (though this isn't too high on my list).
Side comment: does anyone know of widespread applications that use negative sample weights?
The only people I know who have negative sample weights are (some) researchers in high-energy particle physics. From my memory, the interpretation of what those samples with negative weights mean is tricky. The consensus I remember is that a group of samples should always have a net positive weight (but this isn't always true), but that you shouldn't take the absolute value of an individual sample's weight. The reason for not using abs(weight) is that a possible interpretation of a negative sample weight is that you are trying to "subtract" that sample from your final group of samples. (I think the "deep physics" reason for how you end up with them is that quantum mechanics has constructive and destructive interference terms, and depending on how your sample generator is written you end up with samples with a negative weight that are meant to represent negative interference, or something like that.)
Basically, are negative weights something worth spending time on or best to punt on it till there are use-cases?
Basically, are negative weights something worth spending time on or best to punt on it till there are use-cases?
AFAIK not really, but I wonder what prompted @jjerphan and @jeremiedbb's conversations on normalizing negative weights.
What @betatim described prompted those discussions; I think @betatim definitely is more knowledgeable than me on this subject.
I think use cases relying on negative weights are relatively rare, and if the relevant normalization is complex, it might not be something we want to maintain.
I think it is good to assume that weights are always positive and use that to reduce complexity/additional computation.
As long as we avoid re-normalizing user-provided weights_init (see my review below), with the current capabilities of mixture classes like GMM I don't see how it's possible to get negative weights. Thus I hope this isn't something we have to consider one way or another.
And I would really push to not re-normalize user-provided weights_init, and especially to not make rigid decisions on how to re-normalize negative vs. positive weights, etc. Otherwise we take away the ability to precisely define an initial state for the EM algorithm.
The negative weights come back all the time ;). The last report was the following: #18934
If we really only want positive weights, we should use only_non_negative=True in _check_sample_weight. I assume that we should come up with a PR allowing negative weights for people who really want them, but without any guarantee that the normalization makes any sense. If an algorithm was written to take negative weights into account, then we can always make sure that the normalization is properly done as per the literature.
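A sketch of that validation, assuming the private helper sklearn.utils.validation._check_sample_weight and its only_non_negative flag (available in recent scikit-learn versions):

import numpy as np
from sklearn.utils.validation import _check_sample_weight

X = np.random.RandomState(0).rand(5, 2)
sw = np.array([1.0, 2.0, -1.0, 0.5, 0.5])

try:
    _check_sample_weight(sw, X, only_non_negative=True)
except ValueError as exc:
    # Negative entries are rejected when only_non_negative=True.
    print(exc)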
I don't think the normalization step should re-normalize user-provided weights. We expect the user to do this and they may have normalization preferences. So I think the normalization should move back up and only apply to the initializer, not to pre-initialized user input.
However if the user provides responsibilities (which is currently not possible but hopefully added in PR #24812), then I guess some normalization can be expected similar to passing in 'kmeans' or whatever.
self.weights_ = weights if self.weights_init is None else self.weights_init
self.means_ = means if self.means_init is None else self.means_init
self.weights_ /= self.weights_.sum()
The normalization should not be done after the user-input initialization! We expect normalized weights to be given in weights_init. Its previous location looks correct to me.
self.weights_ /= self.weights_.sum()
@@ -715,10 +715,10 @@ def _initialize(self, X, resp):
weights, means, covariances = _estimate_gaussian_parameters(
    X, resp, self.reg_covar, self.covariance_type
)
weights /= n_samples
I suggest that, as with the other comment, the weights normalization is done here. We expect normalized weights to be input in weights_init.
- weights /= n_samples
+ weights /= weights.sum()
As for negative weights: by moving the normalization step here (only applying normalization when we run the _estimate_gaussian_parameters step, but not re-normalizing what might be carefully normalized user weights), we largely avoid this problem, as the current inputs shouldn't really produce negative weights as far as I am aware.
However, PR #24812, which makes GMM initialization more robust (allowing users to pass in their own responsibilities matrix, or to use a custom callable such as a pre-fitted K-means), will make this matter. So it's worth considering the issue here.
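As a sanity check of the suggestion above, a sketch under the assumption that the responsibility matrix resp has rows summing to 1 (as with the current 'kmeans' and 'random' initializations): the per-component weights computed from resp then sum to n_samples, so dividing by n_samples and dividing by weights.sum() coincide; with an unnormalized resp they would not.

import numpy as np

rng = np.random.RandomState(0)
n_samples, n_components = 200, 3

# Responsibility matrix with rows normalized to sum to 1.
resp = rng.rand(n_samples, n_components)
resp /= resp.sum(axis=1, keepdims=True)

# Per-component weights, analogous to nk in the M-step.
weights = resp.sum(axis=0)

print(np.isclose(weights.sum(), n_samples))                       # True
print(np.allclose(weights / n_samples, weights / weights.sum()))  # True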
I agree with @emirkmo regarding the location of the normalization. I would not expect the estimator to renormalize user-defined weights.
As mentioned in the issue, I find it OK to delegate the responsibility of the normalization to the user. However, since each subsequent M-step will renormalize using the sum of the weights, I intuitively don't understand how a normalization that does not use this approach would be helpful.
If this is indeed hurting convergence, we should either raise a warning (if we want to allow unnormalized weights) or an error (if we don't).
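Illustrative only: one way the "warn or raise" options could look for unnormalized, user-provided weights; the helper name and tolerance below are hypothetical and not scikit-learn API.

import warnings
import numpy as np

def check_weights_init(weights, strict=False, tol=1e-6):
    # Hypothetical helper: warn (or raise) when weights_init does not sum to 1.
    total = float(np.sum(weights))
    if not np.isclose(total, 1.0, atol=tol):
        msg = f"weights_init sums to {total}, not 1"
        if strict:
            raise ValueError(msg)
        warnings.warn(msg)
    return weights

check_weights_init(np.array([0.3, 0.3, 0.3]))                 # warns: sums to 0.9
# check_weights_init(np.array([0.3, 0.3, 0.3]), strict=True)  # raises ValueError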
Reference Issues/PRs
Fixes #24085
What does this implement/fix? Explain your changes.
Weights that were normalized using n_samples are now normalized using weights_.sum().
Any other comments?