Solve Gaussian Mixture Model sample/kmeans++-based initialisation merge conflicts #11101 #20408

alceballosa · Jun 26, 2021

Reference Issues/PRs

Fixes conflicts in #11101.

That PR had been stale for about 3 months, so we introduced changes to account for other PRs merged in the meantime that affect the mixture model, such as #17937 and #20030.

What does this implement/fix? Explain your changes.

PR #11101 was not merged because of a few conflicts and because kmeans_plusplus wasn't public at the time. We resolved the conflicts and modified the new functions to reference the public kmeans_plusplus introduced in #17937.
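For readers unfamiliar with k-means++ seeding, the idea can be sketched with the standard library alone. This is an illustration, not scikit-learn's kmeans_plusplus: the names `kmeanspp_seed` and `_sq_dist` and the `seed` argument are hypothetical. Like the seeding used in this PR, the sketch returns the indices of the chosen data points along with the centers.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, n_clusters, seed=0):
    """Pick n_clusters distinct data points as seeds; return (centers, indices).

    Hypothetical helper for illustration only. Assumes the data contain at
    least n_clusters distinct points.
    """
    rng = random.Random(seed)
    # First center: a uniformly random data point.
    indices = [rng.randrange(len(points))]
    # Squared distance from every point to its nearest chosen center.
    closest = [_sq_dist(p, points[indices[0]]) for p in points]
    while len(indices) < n_clusters:
        if sum(closest) == 0:
            raise ValueError("fewer distinct points than n_clusters")
        # Next center: sampled with probability proportional to D(x)^2, so
        # points far from every existing center are favoured. Points already
        # chosen have weight 0 and cannot be drawn again.
        idx = rng.choices(range(len(points)), weights=closest, k=1)[0]
        indices.append(idx)
        closest = [
            min(d, _sq_dist(p, points[idx])) for p, d in zip(points, closest)
        ]
    return [points[i] for i in indices], indices
```

In a mixture model, the returned points would serve as the initial component means.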

One key component of Gordon Walsh's implementation of GMM with sample and kmeans++-based initialization is the ability to fit with 0 iterations, so that the initialization itself can be tested. However, changes introduced in #20030 caused a 0-iteration fit to reach a line that should only run after 1 or more iterations. For this reason, we introduced a conditional that skips that line when the max_iter parameter is 0.

Furthermore, a warning was issued whenever the GMM was fitted with 0 iterations, since the model cannot have converged at that point. However, choosing max_iter=0 will usually be intentional, so we believe the user shouldn't be warned about non-convergence in such cases.
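The control flow described in the two paragraphs above can be sketched as a toy EM driver. This is not the code in sklearn/mixture/_base.py: `run_em` and the `em_step` callback are hypothetical names, and the convergence test is simplified to the change in a scalar lower bound.

```python
import warnings


def run_em(init_params, em_step, max_iter, tol=1e-3):
    """Toy EM driver; returns (params, n_iter, converged).

    `em_step` is a hypothetical callable mapping params to
    (new_params, lower_bound).
    """
    params = init_params
    converged = False
    n_iter = 0
    if max_iter > 0:
        # Only enter the iteration logic when at least one step is requested.
        lower_bound = float("-inf")
        for n_iter in range(1, max_iter + 1):
            params, new_bound = em_step(params)
            change = new_bound - lower_bound
            lower_bound = new_bound
            if abs(change) < tol:
                converged = True
                break
    # With max_iter == 0 the initial parameters are returned untouched, and
    # no warning is raised because skipping EM was the caller's choice.
    if not converged and max_iter > 0:
        warnings.warn(
            "EM did not converge; try raising max_iter or tol.",
            RuntimeWarning,
        )
    return params, n_iter, converged
```

Calling `run_em(params, step, max_iter=0)` hands back `params` unchanged with `n_iter == 0`, which is what makes it possible to inspect an initialization on its own.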

Finally, we modified the message shown for invalid max_iter values to state that only negative numbers are disallowed. Previously the message referred to numbers smaller than 1, even though 0 is a valid value for max_iter.
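A minimal sketch of such a check follows. The function name `check_max_iter` and the message wording are hypothetical and do not reproduce scikit-learn's actual validation code or error text; the point is only that 0 passes and negative values are rejected.

```python
import numbers


def check_max_iter(max_iter):
    """Accept any non-negative integer, including 0 (hypothetical helper)."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(max_iter, bool) or not isinstance(max_iter, numbers.Integral):
        raise TypeError(
            f"max_iter must be an integer, got {type(max_iter).__name__}."
        )
    if max_iter < 0:
        # The message names the real constraint (>= 0), not ">= 1".
        raise ValueError(f"max_iter == {max_iter}, must be >= 0.")
    return max_iter
```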

Any other comments?

#DataUmbrella sprint
This PR was developed by @ariosramirez and myself.
cc: @amueller @g-walsh

alceballosa · Mar 19, 2022

I'm wrapping up the suggestions made by @jeremiedbb. However, there seems to be a linting issue (check: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=39801&view=logs&j=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&t=fc67071d-c3d4-58b8-d38e-cafc0d3c731a) related to black.

I'm using black 21.6b0 as per the Contributing documentation and running it as black -t py39 <filename>.py. According to the documentation the command should simply be black <filename>.py, but with Python 3.9.10 (installed by conda when setting python=3.9) black seems to think that the target is 3.10:

Error: Invalid value for '-t' / '--target-version': 'py310' is not one of 'py27', 'py33', 'py34', 'py35', 'py36', 'py37', 'py38', 'py39'.

Should I downgrade to Python 3.9.0 or something? @cmarmo @jjerphan

Thanks!

jeremiedbb · Mar 19, 2022

We switched to black==22.1.0. It's in the dev version of the contributing guide. We are trying to make the stable version of the contributing guide point to the dev version.

jeremiedbb

Here are some hints to fix the CI failures

sklearn/mixture/_bayesian_mixture.py

sklearn/mixture/_gaussian_mixture.py

alceballosa · Mar 20, 2022

Thanks once again @jeremiedbb. It seems that the only remaining issues are related to the pytest.raises messages, but since I didn't make the commit that removed the messages I'm not sure how to proceed. I suppose it will be ok to just put them back as they were before Andres' commit, but I noticed you mentioned check_scalar and I can't find any reference to it.

sklearn/mixture/_base.py

reshamas · Mar 20, 2022

@alceballosa
This might be a helpful reference for the check_scalar function, with example PRs:
#21927

jeremiedbb

I think one of the tests is still testing things we can't guarantee. I propose a few changes to improve the testing of the different inits.

sklearn/mixture/tests/test_gaussian_mixture.py

alceballosa · Mar 22, 2022

@jeremiedbb unrelated to your latest suggestions, I noticed that across this implementation we used 'kmeans' and 'k-means++' instead of 'kmeans' and 'kmeans++'. Do you think we should standardize that? I might be nitpicking but I can see some people trying to change from 'kmeans' to the ++ version and getting an error by not putting the dash.

jeremiedbb · Mar 22, 2022

'k-means++' is already the name of an init of KMeans. I think it's important to keep the same name for both.

> I can see some people trying to change from 'kmeans' to the ++ version and getting an error by not putting the dash.

They'll still get an informative error message giving all the possibilities, so I think it's fine.
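To illustrate the kind of message meant here, a small sketch follows. The helper `check_init_params` and the message wording are hypothetical, and the list of option names is my understanding of what the merged version accepts, not something stated in this thread.

```python
# Assumed option names; treat this tuple as an assumption, not a quotation.
VALID_INITS = ("kmeans", "k-means++", "random", "random_from_data")


def check_init_params(name):
    """Return name if it is a known init, else raise with all options listed."""
    if name not in VALID_INITS:
        raise ValueError(
            f"Unknown initialization method {name!r}; "
            f"expected one of {VALID_INITS}."
        )
    return name
```

Someone who types 'kmeans++' without the dash therefore sees 'k-means++' among the listed options and can correct the spelling right away.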

jeremiedbb · Apr 6, 2022

@alceballosa I solved the conflicts, fixed the test and added the test that we discussed.

jeremiedbb

LGTM. Thanks @alceballosa and @ariosramirez !

alceballosa · Apr 6, 2022

Thanks to you for wrapping this up with those last commits @jeremiedbb ! :)

reshamas · Apr 6, 2022

Congrats @alceballosa @ariosramirez for continuing work on this after the LATAM sprint!

Thanks to all the maintainers too for this!
@thomasjpfan We can update the sprint list too :)
cc: @amueller

…#20408) Co-authored-by: Gordon Walsh <gordon.p.walsh@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Andres David Rios Ramirez <ariosramirez.data@gmail.com> Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>

g-walsh added 30 commits December 5, 2018 10:22

Add data samples as initialisation. test to see if it is useful

3f74e32

move the WIP test script

b411f72

Adjustment for PEP

2e95883

Add kmeans++ initialisation from kmeans module.

c09af7f

Track indices in kmeans++ and edit kmeans and gmm implementation.

1e97c8b

remove test file

47263e8

replace original .gitignore

d92ef40

Update for Flake8

a190afd

Add test for new initialisations in gmm. Check for consistent results.

f1fc384

Update docstrings for base classes to include new initialisations.

82ea0a8

PEP8 changes and whitespace error in one docstring.

f25d558

Remove 'for' from description as it was highlighted in documentation.

97f35d1

Update Documentation to include choice of initialization for gmm

e8d052e

Change marker type for old matplotlib

9025634

Fix inevitable PEP8 issues.

1397b40

Fix inevitable PEP8 issues.

946506f

Remove camel case function names

a69bccf

Add relative time taken to initialize to example and update docs

966deb6

Add relative time taken to initialize to example and update docs

d5e7e6f

Update for ci errors. PEP and variable names

b2d87d0

Add iteration steps to plot.

72b9607

remove irrelevant comments and fix other review issues

ad870aa

Max line length fix

56bb7d0

Add missing import

beea072

Include iterations in documentation plot

6ef8536

Alterations from review.

2312784

- Documentation fixes - docstring addition - remove assert_true

Allow gmm to take n_iter=0 for testing of initializations. Change plo…

798dd6b

…t to include this. Now no longer need to import private _kmeans

Clear up plot_gmm_init to remove confusing gen_gmm.

8630d82

Rename a function and add comments.

Resolve upstream conflict

b349473

Merge remote-tracking branch 'upstream/master' into gmm_initialisatio…

0f88e8c

…n_improvement

Reformat with black 22.1.0, changed rtol to standard value

2ca5f2b

jeremiedbb reviewed Mar 20, 2022

sklearn/mixture/_bayesian_mixture.py

sklearn/mixture/_gaussian_mixture.py

sklearn/mixture/_gaussian_mixture.py

alceballosa added 2 commits March 20, 2022 12:36

Init params test assumes GMM can converge to the same solution, which…

a3abd4b

… is not the case. Decided to remove this test.

Fixed CI failures caused by docstrings

5b009dc

jeremiedbb reviewed Mar 20, 2022

sklearn/mixture/_base.py

Matched error messages from check_scalar

7967b50

jeremiedbb reviewed Mar 22, 2022

sklearn/mixture/tests/test_gaussian_mixture.py

sklearn/mixture/tests/test_gaussian_mixture.py

sklearn/mixture/tests/test_gaussian_mixture.py

jeremiedbb added 3 commits April 6, 2022 09:56

Merge remote-tracking branch 'upstream/main' into pr/alceballosa/20408

8b3b58e

solve conflicts, fix test, add test

7464787

cln

cf71a9a

jeremiedbb reviewed Apr 6, 2022

jeremiedbb approved these changes Apr 6, 2022

jeremiedbb added 3 commits April 6, 2022 10:33

cln what's new

5e1eecf

re cln what's new

7ba7524

same

84c0f5e

jeremiedbb merged commit 3540b00 into scikit-learn:main Apr 6, 2022

This was referenced Apr 6, 2022

[MRG] Fix Random initialisation of GMM should consider data magnitude #10850 #11101

Closed

GaussianMixture model each attributes scaling issue #14398

Closed

reshamas mentioned this pull request Apr 6, 2022

Highlight PR #20408 from LATAM (June 2021) sprint scikit-learn/communication#15

Closed

reshamas removed the Waiting for Reviewer label Apr 6, 2022
