Gaussian Mixture - weighted implementation #17130
base: main
Conversation
- EM updated with sample weight
- Log prob raised to the power of the sample weight: instead of (p(x; theta)) ** w, it is Gaussian(x; theta) ** w
- Consistent with https://arxiv.org/pdf/1509.01509.pdf
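For context, a minimal NumPy sketch of how per-sample weights typically enter the Gaussian M-step under this interpretation (responsibilities scaled by the weights); the function and variable names are illustrative and are not the PR's actual code:

import numpy as np

def weighted_means(X, resp, sample_weight):
    # Scale each sample's responsibilities by its weight, as if the
    # sample had been observed `sample_weight` times.
    wresp = resp * sample_weight[:, np.newaxis]        # (n_samples, n_components)
    nk = wresp.sum(axis=0) + 10 * np.finfo(float).eps  # effective counts per component
    means = wresp.T @ X / nk[:, np.newaxis]            # weighted component means
    return nk, means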
Are you looking for feedback at this point? At a glance this is looking quite nice.
- in test_weighted_vs_repeated
Hi @jnothman, I think quick feedback on the test test_weighted_vs_repeated in mixture/tests/test_gaussian_mixture.py would be welcome.
(We have over time proposed generic testing for sample weight invariances, but it hasn't really happened outside of evaluation metrics.)
Some comments!
                     reg_covar=0, random_state=rng,
                     covariance_type=covar_type)

g1 = GaussianMixture(n_components=n_components, n_init=20,
Might be a bit neater to just use sklearn.base.clone rather than repeat the construction.
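For example, a sketch of that suggestion (n_components, rng and covar_type come from the surrounding test):

from sklearn.base import clone
from sklearn.mixture import GaussianMixture

g = GaussianMixture(n_components=n_components, n_init=20, reg_covar=0,
                    random_state=rng, covariance_type=covar_type)
g1 = clone(g)  # fresh, unfitted estimator with identical hyper-parameters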
g_weighted = g.fit(X, sample_weight=sample_weight)
g_repeated = g1.fit(X_repeat)

assert_allclose(np.sort(g_weighted.weights_),
Is it possible to use the stricter assert_allclose(g_weighted.weights_[arg_idx1], g_repeated.weights_[arg_idx2])?
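For illustration, arg_idx1 and arg_idx2 could be obtained by sorting on the mixture weights before the element-wise comparison; this is a sketch under that assumption (g_weighted and g_repeated come from the test), not necessarily how the PR computes them:

import numpy as np
from numpy.testing import assert_allclose

# Align the components of the two fitted models by their mixture weights,
# assuming the weights are distinct enough to give a stable ordering.
arg_idx1 = np.argsort(g_weighted.weights_)
arg_idx2 = np.argsort(g_repeated.weights_)
assert_allclose(g_weighted.weights_[arg_idx1],
                g_repeated.weights_[arg_idx2], rtol=1e-5)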
n_features = rand_data.n_features
n_components = rand_data.n_components

for covar_type in COVARIANCE_TYPE:
Using pytest.mark.parametrize might simplify your test a little.
I reproduced this structure from other tests (for instance test_gaussian_mixture_fit_best_params). I will apply @pytest.mark.parametrize to the other tests as well.
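For reference, a minimal sketch of what the parametrized test signature could look like (assuming COVARIANCE_TYPE is the list used in the existing loop; the body is elided):

import pytest

COVARIANCE_TYPE = ['full', 'tied', 'diag', 'spherical']

@pytest.mark.parametrize("covar_type", COVARIANCE_TYPE)
def test_weighted_vs_repeated(covar_type):
    # pytest runs one test case per covariance type, replacing the
    # explicit `for covar_type in COVARIANCE_TYPE:` loop.
    ...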
assert_allclose(g_weighted.means_[arg_idx1],
                g_repeated.means_[arg_idx2])

if covar_type == 'full':
I can't say we'd always go to this level of detail in ensuring models are identical. It would be less repetitive if you defined the transformation once, then applied it to both precisions_ attributes.
def uniform_precisions(precisions):
    if covar_type == "tied":
        ...
    ...
    return np.trace(precisions, axis1=1, axis2=2).argsort()
But I also wonder whether we should just more approximately evaluate that the two models have identical likelihood functions applied to held-out data.
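A rough sketch of that alternative check, assuming a held-out sample X_test drawn from the same generator (the tolerance is illustrative):

from numpy.testing import assert_allclose

# Compare the per-sample log-likelihoods that the two fitted models
# assign to held-out data instead of matching parameters component-wise.
assert_allclose(g_weighted.score_samples(X_test),
                g_repeated.score_samples(X_test), rtol=1e-4)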
ecov = EmpiricalCovariance()
ecov.covariance_ = prec_weighted[h]
# the accuracy depends on the number of samples and the randomness (rng)
assert_allclose(ecov.error_norm(prec_repeated[k]), 0, atol=0.001)
Are we not able to compare precisions directly once indexed by the appropriate arg_idx? I.e. can't we just assert_allclose(g_weighted.precisions_[arg_idx1], g_repeated.precisions_[arg_idx2]), except where tied? I don't know this code well, admittedly.
Here, I used a conservative approach, as the Frobenius error norm was used in other tests such as test_gaussian_mixture_fit. I changed it to assert_allclose(g_weighted.precisions_[arg_idx1], g_repeated.precisions_[arg_idx2]) and it failed for some tests.
- Add clone to avoid repeating GMM instance
- Change the way to assert weights_
- Add parametrize
- Add uniform_precisions function
Thanks @jnothman for the comments, I tried to take them into consideration as much as possible.
Hey @jnothman, I implemented the majority of your comments. Should I wait for another review, or should I try to reach out to some GMM developers?
Hi @dupuisIRT, I'm sorry your pull request got lost. If you are still interested in finalizing this work, do you mind synchronizing with upstream? Please note that in the meanwhile the branch…
Sure, I'll synchronize with upstream.
PR created to work on #15647
Implementation of a weighted GMM version. If a sample has a weight w, it means the sample is observed w times.
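A minimal usage sketch of the intended semantics with the proposed sample_weight parameter (integer weights, so the analogy with repetition is exact; the equivalence is up to a permutation of the components and convergence effects):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.standard_normal((200, 2))
w = rng.randint(1, 4, size=200)  # integer weights: sample i counts w[i] times

# With the proposed API, fitting with weights is meant to behave like
# fitting on the dataset with each row repeated w[i] times.
gmm_weighted = GaussianMixture(n_components=2, random_state=0).fit(X, sample_weight=w)
gmm_repeated = GaussianMixture(n_components=2, random_state=0).fit(np.repeat(X, w, axis=0))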
Based on the two papers:
Replicating the structure of KMeans, which accepts a sample_weight argument in fit.
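For comparison, the sketch below shows the existing KMeans pattern that the PR mirrors for GaussianMixture.fit:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
w = np.array([1., 1., 5., 5.])  # per-sample weights
# KMeans.fit has accepted sample_weight since scikit-learn 0.20.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X, sample_weight=w)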
Two main tests are added:
For the moment, the work is still in progress, in particular regarding: