-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
Solve Gaussian Mixture Model sample/kmeans++-based initialisation merge conflicts #11101 #20408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Solve Gaussian Mixture Model sample/kmeans++-based initialisation merge conflicts #11101 #20408
Conversation
- Documentation fixes - docstring addition - remove assert_true
…t to include this. Now no longer need to import private _kmeans
Rename a function and add comments.
I'm wrapping up the suggestions made by @jeremiedbb. However, there seems to be a linting issue (check: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=39801&view=logs&j=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&t=fc67071d-c3d4-58b8-d38e-cafc0d3c731a) related to black. I'm using black 21.6b0 as per the Contributing documentation and running the black command as
Should I downgrade to Python 3.9.0 or something? @cmarmo @jjerphan Thanks! |
We switched to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are some hints to fix the CI failures
… is not the case. Decided to remove this test.
Thanks once again @jeremiedbb. It seems that the only remaining issues are related to the |
@alceballosa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one the test is still testing stuff we can't guarantee. I propose a few changes to improve the testing of the different inits
@jeremiedbb unrelated to your latest suggestions, I noticed that across this implementation we used 'kmeans' and 'k-means++' instead of 'kmeans' and 'kmeans++'. Do you think we should standardize that? I might be nitpicking but I can see some people trying to change from 'kmeans' to the ++ version and getting an error by not putting the dash. |
'k-means++' is already the name of an init of KMeans. I think it's important to keep the same name for both.
They'll still get an informative error message giving all the possibilities, so I think's fine. |
@alceballosa I solved the conflicts, fixed the test and added the test that we discussed about. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks @alceballosa and @ariosramirez !
Thanks to you for wrapping this up with those last commits @jeremiedbb ! :) |
Congrats @alceballosa @ariosramirez for continuing work on this after the LATAM sprint! Thanks to all the maintainers too for this! |
…#20408) Co-authored-by: Gordon Walsh <gordon.p.walsh@gmail.com> Co-authored-by: Julien Jerphanion <git@jjerphan.xyz> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Andres David Rios Ramirez <ariosramirez.data@gmail.com> Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com> Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Reference Issues/PRs
Fixes conflicts in #11101.
Resolves #10850 and #14398.
The PR was stale for about 3 months so we introduced changes to account for other PRs involving the mixture model such as #17937 and #20030.
What does this implement/fix? Explain your changes.
PR #11101 was not merged due to a few conflicts and the fact that the kmeans_plusplus module wasn't public at the moment. We resolved the conflicts, so we modified the new functions so they referenced the public module introduced in #17937.
One key component of Mr. Gordon Walsh's implementation of GMM with sample and kmeans++-based initialization was to use a 0-iterations initialization. However, changes introduced in #20030 caused the 0-iterations instances to attempt to reach a line that should only be accessed with 1 or more iterations. For this reason, we introduced a conditional that avoids trying to do so when the max_iter parameter is 0.
Furthermore, there is a warning issued whenever the GMM was initialized with 0 iterations, since there is no convergence at that point. However, choosing to use max_iter=0 will usually be intentional, so we believe the user shouldn't be warned about non-convergence in such cases.
Finally, we modified the max_iter < 0 warning to reflect the fact that only negative numbers are not to be used, since previously the message would reference numbers smaller than 1 even if 0 was ok to use as the max_iter parameter.
Any other comments?
#DataUmbrella sprint
This PR was developed by @ariosramirez and myself.
cc: @amueller @g-walsh