Gaussian Mixture - weighted implementation #17130
base: main
Conversation
- EM updated with sample weight
- Log prob raised to the power of the sample weight: instead of (p(x; theta)) ** w, it is Gaussian(x; theta) ** w
- Consistent with https://arxiv.org/pdf/1509.01509.pdf
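For context, a minimal NumPy sketch of how per-sample weights typically enter the Gaussian M-step under this interpretation (responsibilities scaled by the weights); the function and variable names are illustrative and are not the PR's actual code:

import numpy as np

def weighted_means(X, resp, sample_weight):
    # Scale each sample's responsibilities by its weight, as if the
    # sample had been observed `sample_weight` times.
    wresp = resp * sample_weight[:, np.newaxis]        # (n_samples, n_components)
    nk = wresp.sum(axis=0) + 10 * np.finfo(float).eps  # effective counts per component
    means = wresp.T @ X / nk[:, np.newaxis]            # weighted component means
    return nk, means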
Are you looking for feedback at this point? At a glance this is looking quite nice.
- in test_weighted_vs_repeated
Hi @jnothman, I think quick feedback on the test test_weighted_vs_repeated in mixture/tests/test_gaussian_mixture.py would be welcome.
(We have over time proposed generic testing for sample weight invariances, but it hasn't really happened outside of evaluation metrics.)
Some comments!
                     reg_covar=0, random_state=rng,
                     covariance_type=covar_type)

g1 = GaussianMixture(n_components=n_components, n_init=20,
Might be a bit neater to just use sklearn.base.clone rather than repeat the construction.
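For example, a sketch of that suggestion (n_components, rng and covar_type come from the surrounding test):

from sklearn.base import clone
from sklearn.mixture import GaussianMixture

g = GaussianMixture(n_components=n_components, n_init=20, reg_covar=0,
                    random_state=rng, covariance_type=covar_type)
g1 = clone(g)  # fresh, unfitted estimator with identical hyper-parameters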
g_weighted = g.fit(X, sample_weight=sample_weight)
g_repeated = g1.fit(X_repeat)

assert_allclose(np.sort(g_weighted.weights_),
Is it possible to use the stricter assert_allclose(g_weighted.weights_[arg_idx1], g_repeated.weights_[arg_idx2])?
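For illustration, arg_idx1 and arg_idx2 could be obtained by sorting on the mixture weights before the element-wise comparison; this is a sketch under that assumption (g_weighted and g_repeated come from the test), not necessarily how the PR computes them:

import numpy as np
from numpy.testing import assert_allclose

# Align the components of the two fitted models by their mixture weights,
# assuming the weights are distinct enough to give a stable ordering.
arg_idx1 = np.argsort(g_weighted.weights_)
arg_idx2 = np.argsort(g_repeated.weights_)
assert_allclose(g_weighted.weights_[arg_idx1],
                g_repeated.weights_[arg_idx2], rtol=1e-5)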
n_features = rand_data.n_features
n_components = rand_data.n_components

for covar_type in COVARIANCE_TYPE:
Using pytest.mark.parametrize might simplify your test a little.
I reproduced this structure from other tests (for instance test_gaussian_mixture_fit_best_params). I will apply @pytest.mark.parametrize to the other tests as well.
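For reference, a minimal sketch of what the parametrized test signature could look like (assuming COVARIANCE_TYPE is the list used in the existing loop; the body is elided):

import pytest

COVARIANCE_TYPE = ['full', 'tied', 'diag', 'spherical']

@pytest.mark.parametrize("covar_type", COVARIANCE_TYPE)
def test_weighted_vs_repeated(covar_type):
    # pytest runs one test case per covariance type, replacing the
    # explicit `for covar_type in COVARIANCE_TYPE:` loop.
    ...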
assert_allclose(g_weighted.means_[arg_idx1],
                g_repeated.means_[arg_idx2])

if covar_type == 'full':
I can't say we'd always go to this level of detail in ensuring models are identical. It would be less repetitive if you defined the transformation once, then applied it to both precisions_ attributes.
def uniform_precisions(precisions):
    if covar_type == "tied":
        ...
    ...
    return np.trace(precisions, axis1=1, axis2=2).argsort()
But I also wonder whether we should just more approximately evaluate that the two models have identical likelihood functions applied to held-out data.
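A rough sketch of that alternative check, assuming a held-out sample X_test drawn from the same generator (the tolerance is illustrative):

from numpy.testing import assert_allclose

# Compare the per-sample log-likelihoods that the two fitted models
# assign to held-out data instead of matching parameters component-wise.
assert_allclose(g_weighted.score_samples(X_test),
                g_repeated.score_samples(X_test), rtol=1e-4)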
ecov = EmpiricalCovariance()
ecov.covariance_ = prec_weighted[h]
# the accuracy depends on the number of samples and the randomness (rng)
assert_allclose(ecov.error_norm(prec_repeated[k]), 0, atol=0.001)
Are we not able to compare precisions directly once indexed by the appropriate arg_idx? I.e. can't we just assert_allclose(g_weighted.precisions_[arg_idx1], g_repeated.precisions_[arg_idx2]), except where tied? I don't know this code well, admittedly.
Here, I used a conservative approach, as the Frobenius error norm was used in other tests such as test_gaussian_mixture_fit. I changed it to assert_allclose(g_weighted.precisions_[arg_idx1], g_repeated.precisions_[arg_idx2]) and it failed for some tests.
- Add clone to avoid repeating GMM instance
- Change the way to assert weights_
- Add parametrize
- Add uniform_precisions function
Thanks @jnothman for the comments, I tried to take them into consideration as much as possible.
Hey @jnothman, I implemented the majority of your comments. Should I wait for another review, or should I try to reach out to some GMM developers?
Hi @dupuisIRT, I'm sorry your pull request got lost. If you are still interested in finalizing this work, do you mind synchronizing with upstream? Please note that in the meanwhile the branch…
Sure, I'll synchronize with upstream.
PR created to work on #15647
Implementation of a weighted GMM version. If a sample has a weight w, it means the sample is observed w times.
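A minimal usage sketch of the intended semantics with the proposed sample_weight parameter (integer weights, so the analogy with repetition is exact; the equivalence is up to a permutation of the components and convergence effects):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.standard_normal((200, 2))
w = rng.randint(1, 4, size=200)  # integer weights: sample i counts w[i] times

# With the proposed API, fitting with weights is meant to behave like
# fitting on the dataset with each row repeated w[i] times.
gmm_weighted = GaussianMixture(n_components=2, random_state=0).fit(X, sample_weight=w)
gmm_repeated = GaussianMixture(n_components=2, random_state=0).fit(np.repeat(X, w, axis=0))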
Based on the two papers:
Replicating the structure of KMeans, which accepts a sample_weight argument in fit.
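For comparison, the sketch below shows the existing KMeans pattern that the PR mirrors for GaussianMixture.fit:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
w = np.array([1., 1., 5., 5.])  # per-sample weights
# KMeans.fit has accepted sample_weight since scikit-learn 0.20.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X, sample_weight=w)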
Two main tests are added:
For the moment, the work is still in progress, in particular regarding: