Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

ENH Make GMM initialization more robust and easily user customizeable #24812

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: main
Choose a base branch
Loading
from

Conversation

emirkmo
Copy link
Contributor

@emirkmo emirkmo commented Nov 2, 2022

Reference Issues/PRs

Fixes #23195. Fixes #24811.

What does this implement/fix? Explain your changes.

Docs are expanded to describe the new initialization. Tests are added. See the linked issues for more details.

Any other comments?

It has been difficult to robustly initialize GMM since the initialization uses hardcoded parameters and the user really only has control over the type of algorithm to use, but not on any specific initialization, or with more robust user-chosen parameters (which are needed when data is sparse or highly structured, for example). Although it was always possible to hack together private functions from the _gaussian_mixture.py to properly do this while using the _base.py class for reference, and using for example clusters.KMeans to do a fit first to use for initialization, this was difficult in practice.

It required calculating means, weights, and precisions (after first reverse-engineering and calculating responsibilities), which are not trivial, nor explained. Since all of these are calculated from responsibilities initialized from the init_params inside the GMM class anyway, this PR now adds a way to simple pass in the responsibilities instead, and adds a helper function to calculate them from KMeans, kmeans_plusplus, or custom initializers, in a generic way.

Also removed some unreachable code and increased test coverage, added tests to show that these changes produce the same behavior as the old initializer, harder to test, code, and added some tests which should act as regression tests or edge case tests. Finally, I am not sure what the best place for this get_responsibilities helper function is, and whether it should be exposed in the __init__.py for the module.

All suggestions welcome, as this is my first code contribution to scikit-learn. Black, Flake8, and MyPy were run successful on my local machine, and the docs compiled.

Also remove unused code line in fit_predict. Not sure why this was left in from a previous refactor but it's not accessed.
get_responsibilities takes in array of incides or labels and gives the responsibilities. This logic was previously buried in the _initialize_parameters of the base class separately for each init_params. Now it will be easier for users to impute own responsibilities and provide a callable that returns responsibilities.
add function to check responsibilities to `_base` to be used for validating callable that returns responsibilities.
Modify GMM to allow initializing a GMM based on user input responsibilities, with new `responsibilities_init` and use the new function to validate its input. 
Minor bugfixes to logic pre-testing.
According to tests, dtype should not be e.g. np.int32, but should be inexact numpy array. Removed explicit dtype to pass tests.
Also making sure we don't have future regressions using new feature by checking against old (manual) version of the responsibility creator.
Add test for random_state initialization.
Add tests for _check_responsibilities and get_responsibilities.
Previously, it would trigger during check_parameters which runs also on every fit operation, meaning that responsibilities could not be of different size to X. Also added a data permutation test to avoid regressions.
@emirkmo emirkmo changed the title Make GMM initialization more robust and easily user customizeable ENH Make GMM initialization more robust and easily user customizeable Nov 2, 2022
Copy link
Contributor

@cmarmo cmarmo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emirkmo for your pull request.
Here some suggestions to get rid of the sphinx and docstring errors.
Once everything is green you will have more chances to be reviewed.
Thanks for your patience.

sklearn/mixture/_base.py Outdated Show resolved Hide resolved
sklearn/mixture/_base.py Outdated Show resolved Hide resolved
sklearn/mixture/_gaussian_mixture.py Outdated Show resolved Hide resolved
@emirkmo
Copy link
Contributor Author

emirkmo commented Nov 9, 2022

@cmarmo Thanks for the review! CI is passing now.

@cmarmo
Copy link
Contributor

cmarmo commented Nov 14, 2022

Thanks @emirkmo for fixing the CI. I'm afraid I'm probably not the right person for a review here.
I checked a bit deeper and I am under the impression that the two issues you are willing to solve could benefit of two separate pull requests:

  • the first adding the 'callable' custom initialization
  • the second allowing to pass user-defined 'responsibilites'.

In my opinion this will speed up a bit the review process, but please be patient as 1.2 is in preparation and that means older PRs should probably go in first.
Thanks for your understanding!

@emirkmo
Copy link
Contributor Author

emirkmo commented Jan 11, 2023

Thanks @emirkmo for fixing the CI. I'm afraid I'm probably not the right person for a review here. I checked a bit deeper and I am under the impression that the two issues you are willing to solve could benefit of two separate pull requests:

  • the first adding the 'callable' custom initialization
  • the second allowing to pass user-defined 'responsibilites'.

In my opinion this will speed up a bit the review process, but please be patient as 1.2 is in preparation and that means older PRs should probably go in first. Thanks for your understanding!

@cmarmo Sorry that I have not responded. The issue is that both are actually the same thing. The responsibility matrix is the tunnel that everything goes through. The callable is just a way of creating it. Currently, this step is hard coded into the init methods passed via a string! So it is not testable or generalizable to a callable either.

In fact, a unified API for passing in responsibilities is really the first step. Whether these are user defined or the API is used for the current string based init methods.. The callable part could come in later. It's just trivial to add once the responsibility API is done.

Do you still suggest that I split it?

The reverse of allowing a callable but not user defined responsibilities is very difficult, see below. It would be difficult and suboptimal to separate the two, especially the callable part by itself. Technically it is possible. But I would be adding 95% of the "allowing to pass user-defined 'responsibilites'.", but then working hard to not expose this to the user.

Technically, it is possible to only add the callable, but like I said it would require some bad code practices and engineering to check/validate the responsibilities created and passed in by the callable while not exposing this to the user. It also would be very opaque how to define a proper callable. Currently we do not need to tell the user how to compute responsibilities, we have that hardcoded into the init method, but if we add the option to pass a callable that creates a responsibilities matrix, we would need to define and show how to create it, otherwise I don't see how to document the callable option:

Callable must create "normalized responsibilities". How and in what form? Well the form the other built in methods use, but each has its own version that is hard coded into the GMM init function, which may not work for your method. Good luck!

@cmarmo
Copy link
Contributor

cmarmo commented Jan 21, 2023

Hi @emirkmo , thanks for your explanations: as I said I'm not the right person for a review here, so if you think this is the right direction for this PR , I'm afraid the only thing to do next is wait for a reviewer... hope they will come ASAP.

@cmarmo cmarmo removed their request for review January 21, 2023 05:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.