Solve Gaussian Mixture Model sample/kmeans++-based initialisation merge conflicts #11101 #20408


Merged

Conversation

@alceballosa (Contributor) commented Jun 26, 2021

Reference Issues/PRs

Fixes conflicts in #11101.

Resolves #10850 and #14398.

The PR had been stale for about 3 months, so we introduced changes to account for other PRs involving the mixture module, such as #17937 and #20030.

What does this implement/fix? Explain your changes.

PR #11101 was not merged because of a few conflicts and because the kmeans_plusplus module wasn't public at the time. We resolved the conflicts and modified the new functions to reference the public module introduced in #17937.
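Since #17937 made kmeans_plusplus public in sklearn.cluster, the seeding step no longer needs the private _kmeans module. A minimal sketch (illustrative, not the PR's exact code) of seeding one-hot GMM responsibilities from k-means++ centers:

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus  # public since scikit-learn 0.24 (#17937)

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
n_components = 3

# Pick well-spread seed points with the public k-means++ routine.
centers, indices = kmeans_plusplus(X, n_clusters=n_components, random_state=0)

# One-hot responsibilities: assign each sample to its nearest seed center.
sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
resp = np.zeros((X.shape[0], n_components))
resp[np.arange(X.shape[0]), sq_dist.argmin(axis=1)] = 1.0
```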

One key component of Gordon Walsh's implementation of GMM with sample-based and kmeans++-based initialization was a 0-iteration initialization. However, changes introduced in #20030 caused 0-iteration instances to attempt to reach a line that should only be executed after 1 or more iterations. For this reason, we introduced a conditional that skips that line when the max_iter parameter is 0.
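The guard can be sketched as follows (hypothetical names in pure Python, not scikit-learn's actual code): with max_iter=0 the fit is initialization-only, so any value produced inside the EM loop must not be read afterwards.

```python
def em_fit(max_iter):
    """Sketch: skip loop-dependent code when max_iter == 0 (cf. #20030)."""
    lower_bound = float("-inf")
    for n_iter in range(1, max_iter + 1):
        lower_bound = -1.0 / n_iter  # stand-in for the real E-step/M-step update
    if max_iter > 0:
        # Safe: the loop body ran at least once, so lower_bound is meaningful.
        return lower_bound
    return None  # initialization-only fit: nothing to report
```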

Furthermore, a warning was issued whenever the GMM was initialized with 0 iterations, since there is no convergence in that case. However, choosing max_iter=0 will usually be intentional, so we believe the user shouldn't be warned about non-convergence in such cases.

Finally, we modified the max_iter < 0 warning to reflect that only negative values are disallowed; previously the message referenced values smaller than 1 even though 0 is a valid value for the max_iter parameter.
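The two behavioral changes described above can be sketched like this (function names are illustrative, not the PR's code): only negative max_iter values are rejected, and no non-convergence warning is raised for an intentional max_iter=0 fit.

```python
import warnings

def check_max_iter(max_iter):
    # Reject only negative values; max_iter == 0 is a valid, intentional choice.
    if max_iter < 0:
        raise ValueError(
            f"Invalid value for 'max_iter': {max_iter}. It must be non-negative."
        )

def warn_if_not_converged(max_iter, converged):
    # Warn about non-convergence only when at least one iteration was requested.
    if max_iter > 0 and not converged:
        warnings.warn("Initialization did not converge.", RuntimeWarning)
```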

Any other comments?

#DataUmbrella sprint
This PR was developed by @ariosramirez and myself.
cc: @amueller @g-walsh

- Documentation fixes
- docstring addition
- remove assert_true
- …t to include this.
- Now no longer need to import private _kmeans
- Rename a function and add comments.
@alceballosa (Contributor, Author)

I'm wrapping up the suggestions made by @jeremiedbb. However, there seems to be a linting issue (check: https://dev.azure.com/scikit-learn/scikit-learn/_build/results?buildId=39801&view=logs&j=32e2e1bb-a28f-5b18-6cfc-3f01273f5609&t=fc67071d-c3d4-58b8-d38e-cafc0d3c731a) related to black.

I'm using black 21.6b0 as per the Contributing documentation and running the black command as black -t py39 <filename>.py. According to the documentation it should simply be black <filename>.py; however, with Python 3.9.10 (installed by conda when setting python=3.9), black seems to think that the target is 3.10:

Error: Invalid value for '-t' / '--target-version': 'py310' is not one of 'py27', 'py33', 'py34', 'py35', 'py36', 'py37', 'py38', 'py39'.

Should I downgrade to Python 3.9.0 or something? @cmarmo @jjerphan

Thanks!

@jeremiedbb (Member)

We switched to black==22.1.0. It's in the dev version of the contributing guide. We are working on making the contributing guide in the stable version of the docs point to the dev version.

@jeremiedbb (Member) left a review comment

Here are some hints to fix the CI failures

Resolved review threads: sklearn/mixture/_bayesian_mixture.py, sklearn/mixture/_gaussian_mixture.py (×2)
@alceballosa (Contributor, Author)

Thanks once again @jeremiedbb. It seems that the only remaining issues are related to the pytest.raises messages, but since I didn't author the commit that removed the messages, I'm not sure how to proceed. I suppose it would be fine to put them back as they were before Andres' commit, but I noticed you mentioned check_scalar and I can't find any reference to it.

Resolved review thread: sklearn/mixture/_base.py
@reshamas (Member)

@alceballosa
This might be a helpful reference for the check_scalar function, with example PRs:
#21927
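For reference, sklearn.utils.check_scalar validates a scalar's type and range with a consistent error message; here is a small usage sketch (applying it to max_iter with a lower bound of 0 is an assumption for illustration, not necessarily what the PR did):

```python
import numbers
from sklearn.utils import check_scalar

# Passes: 10 is an integer and >= 0.
check_scalar(10, "max_iter", target_type=numbers.Integral, min_val=0)

# Fails: negative values are rejected with a message naming the parameter.
try:
    check_scalar(-1, "max_iter", target_type=numbers.Integral, min_val=0)
except ValueError as exc:
    msg = str(exc)
```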

@jeremiedbb (Member) left a review comment

I think one of the tests is still testing things we can't guarantee. I propose a few changes to improve the testing of the different inits.

Resolved review threads: sklearn/mixture/tests/test_gaussian_mixture.py (×3)
@alceballosa (Contributor, Author)

@jeremiedbb unrelated to your latest suggestions, I noticed that across this implementation we used 'kmeans' and 'k-means++' instead of 'kmeans' and 'kmeans++'. Do you think we should standardize that? I might be nitpicking but I can see some people trying to change from 'kmeans' to the ++ version and getting an error by not putting the dash.

@jeremiedbb (Member)

'k-means++' is already the name of an init of KMeans. I think it's important to keep the same name for both.

I can see some people trying to change from 'kmeans' to the ++ version and getting an error by not putting the dash.

They'll still get an informative error message listing all the possibilities, so I think it's fine.

@jeremiedbb (Member)

@alceballosa I resolved the conflicts, fixed the test, and added the test that we discussed.

@jeremiedbb (Member) left a review comment

LGTM. Thanks @alceballosa and @ariosramirez !

@alceballosa (Contributor, Author)

Thanks to you for wrapping this up with those last commits @jeremiedbb ! :)

@reshamas (Member) commented Apr 6, 2022

Congrats @alceballosa @ariosramirez for continuing work on this after the LATAM sprint!

Thanks to all the maintainers too for this!
@thomasjpfan We can update the sprint list too :)
cc: @amueller

jjerphan added a commit to jjerphan/scikit-learn that referenced this pull request Apr 29, 2022
…#20408)

Co-authored-by: Gordon Walsh <gordon.p.walsh@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Andres David Rios Ramirez <ariosramirez.data@gmail.com>
Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Random initialisation of GMM should consider data magnitude
9 participants