ENH Make GMM initialization more robust and easily user customizeable #24812

emirkmo · Nov 2, 2022

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Allows passing a custom callable to initialize the GaussianMixture class (and its parent BaseMixture class), as discussed in
"Add more robust Kmeans initializer for mixture.GaussianMixture" #23195 (see also the comment on #23195). Adds the appropriate tests and input validation.
Additionally, allows passing user-defined responsibilities to initialize a GaussianMixture exactly ("Allow directly passing in responsibilities to GMM" #24811), via the new responsibilities_init parameter. The public get_responsibilities function is now used in the initialization of the BaseMixture class and is covered by tests.

Docs are expanded to describe the new initialization. Tests are added. See the linked issues for more details.

Any other comments?

It has been difficult to robustly initialize GMM, since the initialization uses hardcoded parameters and the user only controls which algorithm is used, not any specific initialization or more robust user-chosen parameters (which are needed when data is sparse or highly structured, for example). It was always possible to hack together private functions from _gaussian_mixture.py to do this properly, using the _base.py class for reference and, for example, fitting cluster.KMeans first to use for initialization, but this was difficult in practice.

It required calculating means, weights, and precisions (after first reverse-engineering and calculating responsibilities), which is neither trivial nor explained. Since all of these are calculated from responsibilities initialized from the init_params inside the GMM class anyway, this PR adds a way to simply pass in the responsibilities instead, and adds a helper function to calculate them from KMeans, kmeans_plusplus, or custom initializers in a generic way.
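To make the step described above concrete, here is a minimal NumPy sketch of how weights, means, and full-covariance precisions can be derived from a responsibility matrix. The function name `params_from_responsibilities` and the `reg_covar` default are assumptions for illustration; this is not the code from this PR or from scikit-learn's private helpers.

```python
import numpy as np


def params_from_responsibilities(X, resp, reg_covar=1e-6):
    """Hypothetical helper: estimate full-covariance Gaussian parameters.

    X has shape (n_samples, n_features); resp has shape
    (n_samples, n_components) and holds per-sample responsibilities.
    """
    X = np.asarray(X, dtype=float)
    resp = np.asarray(resp, dtype=float)
    n_components = resp.shape[1]
    n_features = X.shape[1]

    # Effective number of samples per component; the small epsilon guards
    # against division by zero if a component received no responsibility.
    nk = resp.sum(axis=0) + 10 * np.finfo(float).eps

    weights = nk / nk.sum()
    means = (resp.T @ X) / nk[:, np.newaxis]

    covariances = np.empty((n_components, n_features, n_features))
    for k in range(n_components):
        diff = X - means[k]
        # Responsibility-weighted scatter matrix of component k.
        covariances[k] = (resp[:, k] * diff.T) @ diff / nk[k]
        # Regularize the diagonal so the matrix stays invertible.
        covariances[k] += reg_covar * np.eye(n_features)

    precisions = np.linalg.inv(covariances)
    return weights, means, precisions
```

The returned arrays have the shapes that the existing `weights_init`, `means_init`, and `precisions_init` parameters of `GaussianMixture` expect for `covariance_type="full"`, which is the manual route this PR aims to make unnecessary.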

Also removed some unreachable code and increased test coverage, added tests to show that these changes produce the same behavior as the old, harder-to-test initializer code, and added some tests which should act as regression or edge-case tests. Finally, I am not sure what the best place for this get_responsibilities helper function is, and whether it should be exposed in the __init__.py for the module.

All suggestions welcome, as this is my first code contribution to scikit-learn. Black, Flake8, and MyPy ran successfully on my local machine, and the docs compiled.

Also remove unused code line in fit_predict. Not sure why this was left in from a previous refactor but it's not accessed.

get_responsibilities takes an array of indices or labels and returns the responsibilities. This logic was previously buried in the _initialize_parameters method of the base class, separately for each init_params option. Now it will be easier for users to supply their own responsibilities and to provide a callable that returns responsibilities.
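As an illustration of the conversion described above, the sketch below turns hard labels, or the indices of samples chosen as centers, into a responsibility matrix. The function names are hypothetical and the actual signature of get_responsibilities in this PR may differ; the second function assumes an initializer that returns one sample index per component, as `kmeans_plusplus` does.

```python
import numpy as np


def labels_to_responsibilities(labels, n_components):
    """Hypothetical helper: one-hot encode per-sample cluster labels."""
    labels = np.asarray(labels)
    if labels.ndim != 1 or labels.size == 0:
        raise ValueError("labels must be a non-empty 1D array")
    if labels.min() < 0 or labels.max() >= n_components:
        raise ValueError("labels must lie in [0, n_components)")
    resp = np.zeros((labels.shape[0], n_components), dtype=float)
    # Each sample is fully assigned to its labelled component.
    resp[np.arange(labels.shape[0]), labels] = 1.0
    return resp


def center_indices_to_responsibilities(indices, n_samples):
    """Hypothetical helper: mark the samples chosen as initial centers.

    indices[k] is the row of X used as the center of component k, so only
    those rows carry any responsibility.
    """
    indices = np.asarray(indices)
    n_components = indices.shape[0]
    resp = np.zeros((n_samples, n_components), dtype=float)
    resp[indices, np.arange(n_components)] = 1.0
    return resp
```

With labels, every row sums to one. With center indices, only the chosen rows are non-zero, so each component's mean starts at exactly one data point.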

Add a function to `_base` that checks responsibilities, to be used for validating a callable that returns responsibilities. Modify GMM to allow initialization from user-supplied responsibilities via the new `responsibilities_init` parameter, and use the new function to validate its input. Minor bugfixes to logic pre-testing.
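The exact checks performed by the PR's validation function are not spelled out in this thread. As a rough sketch only, a validator of this kind might look like the following; the name `check_responsibilities_sketch` and every rule in it are assumptions, not the PR's `_check_responsibilities`.

```python
import numpy as np


def check_responsibilities_sketch(resp, n_samples, n_components):
    """Hypothetical validator for a user-supplied responsibility matrix."""
    resp = np.asarray(resp)
    if resp.shape != (n_samples, n_components):
        raise ValueError(
            f"responsibilities must have shape ({n_samples}, {n_components}), "
            f"got {resp.shape}"
        )
    # Assumption: integer arrays are rejected rather than silently cast.
    if not np.issubdtype(resp.dtype, np.inexact):
        raise ValueError("responsibilities must have a floating-point dtype")
    if not np.all(np.isfinite(resp)):
        raise ValueError("responsibilities must be finite")
    if np.any(resp < 0):
        raise ValueError("responsibilities must be non-negative")
    return resp
```

Whether rows should also be required to sum to one is left open here, because a center-index style initialization leaves most rows at zero.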

According to tests, dtype should not be e.g. np.int32, but should be inexact numpy array. Removed explicit dtype to pass tests.

Also making sure we don't have future regressions using new feature by checking against old (manual) version of the responsibility creator. Add test for random_state initialization. Add tests for _check_responsibilities and get_responsibilities.

Previously, the responsibilities_init check would trigger during check_parameters, which also runs on every fit operation, meaning that responsibilities could not be of a different size to X. Also added a data permutation test to avoid regressions.

cmarmo

Thanks @emirkmo for your pull request.
Here are some suggestions to get rid of the sphinx and docstring errors.
Once everything is green you will have more chances to be reviewed.
Thanks for your patience.

sklearn/mixture/_base.py

sklearn/mixture/_gaussian_mixture.py

Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>

emirkmo · Nov 9, 2022

@cmarmo Thanks for the review! CI is passing now.

cmarmo · Nov 14, 2022

Thanks @emirkmo for fixing the CI. I'm afraid I'm probably not the right person for a review here.
I checked a bit deeper and I am under the impression that the two issues you are willing to solve could benefit from two separate pull requests:

the first adding the 'callable' custom initialization
the second allowing to pass user-defined 'responsibilities'.

In my opinion this will speed up a bit the review process, but please be patient as 1.2 is in preparation and that means older PRs should probably go in first.
Thanks for your understanding!

emirkmo · Jan 11, 2023

> Thanks @emirkmo for fixing the CI. I'm afraid I'm probably not the right person for a review here. I checked a bit deeper and I am under the impression that the two issues you are willing to solve could benefit from two separate pull requests:
>
> the first adding the 'callable' custom initialization
>
> the second allowing to pass user-defined 'responsibilities'.
>
> In my opinion this will speed up a bit the review process, but please be patient as 1.2 is in preparation and that means older PRs should probably go in first. Thanks for your understanding!

@cmarmo Sorry that I have not responded. The issue is that both are actually the same thing. The responsibility matrix is the tunnel that everything goes through. The callable is just a way of creating it. Currently, this step is hard coded into the init methods passed via a string! So it is not testable or generalizable to a callable either.

In fact, a unified API for passing in responsibilities is really the first step, whether these are user-defined or the API is used for the current string-based init methods. The callable part could come in later. It's just trivial to add once the responsibility API is done.

Do you still suggest that I split it?

The reverse of allowing a callable but not user defined responsibilities is very difficult, see below.

It would be difficult and suboptimal to separate the two, especially the callable part by itself. Technically it is possible, but I would be adding 95% of the "allowing to pass user-defined 'responsibilities'" part and then working hard to not expose it to the user.

Technically, it is possible to only add the callable, but like I said it would require some bad code practices and engineering to check/validate the responsibilities created and passed in by the callable while not exposing this to the user. It also would be very opaque how to define a proper callable. Currently we do not need to tell the user how to compute responsibilities, we have that hardcoded into the init method, but if we add the option to pass a callable that creates a responsibilities matrix, we would need to define and show how to create it, otherwise I don't see how to document the callable option:

Callable must create "normalized responsibilities". How and in what form? Well the form the other built in methods use, but each has its own version that is hard coded into the GMM init function, which may not work for your method. Good luck!
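To illustrate what such a callable could look like, here is a sketch that builds responsibilities from a KMeans fit with user-chosen parameters. The function name and its `(X, n_components, random_state)` signature are assumptions for illustration and are not the interface defined by this PR. Because the callable option does not exist in released scikit-learn, the example feeds the result through the existing `means_init` parameter instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture


def kmeans_responsibilities(X, n_components, random_state=None):
    """Hypothetical initializer: hard responsibilities from a KMeans fit."""
    # n_init=10 is a user-chosen setting, the kind of control the thread
    # says is missing from the string-based init methods.
    labels = KMeans(
        n_clusters=n_components, n_init=10, random_state=random_state
    ).fit(X).labels_
    resp = np.zeros((X.shape[0], n_components))
    resp[np.arange(X.shape[0]), labels] = 1.0
    return resp


# Three well-separated synthetic blobs, purely for demonstration.
rng = np.random.RandomState(0)
X = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (-5.0, 0.0, 5.0)]
)

resp = kmeans_responsibilities(X, n_components=3, random_state=0)

# Turn the responsibilities into initial means and use the existing API.
means_init = (resp.T @ X) / resp.sum(axis=0)[:, np.newaxis]
gmm = GaussianMixture(n_components=3, means_init=means_init, random_state=0).fit(X)
```

Any function that returns an array of shape (n_samples, n_components) could stand in for `kmeans_responsibilities`, which is why documenting the expected form of the matrix matters for the callable option.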

cmarmo · Jan 21, 2023

Hi @emirkmo, thanks for your explanations: as I said I'm not the right person for a review here, so if you think this is the right direction for this PR, I'm afraid the only thing to do next is wait for a reviewer... hope they will come ASAP.

emirkmo added 13 commits November 2, 2022 11:04

add callable as init_params argument and update docstring

3d32f33

Add user-level function to get responsibilities.

43a515c

Also remove unused code line in fit_predict. Not sure why this was left in from a previous refactor but it's not accessed.

rename to responsibilities_init to pass doctests

6079e21

change dtype of responsibilities to be inexact

c701c61

According to tests, dtype should not be e.g. np.int32, but should be inexact numpy array. Removed explicit dtype to pass tests.

Remove extra random_state in callable

fd0c22b

Fix formatting & docstring.

c4d10e5

format with black

a033408

flake8 code style fixes

bc791e2

Move responsibilities_init to only trigger on init, add tests,

df7e4c3

Previously, it would trigger during check_parameters which runs also on every fit operation, meaning that responsibilities could not be of different size to X. Also added a data permutation test to avoid regressions.

add new functionality to docs (mixture.rst)

1c593d1

github-actions bot added the module:mixture label Nov 2, 2022

emirkmo added 2 commits November 2, 2022 21:07

update changelog & add get_responsibilities to mixture init.

7b6561c

Merge branch 'main' into gmm_callable_init

3984f95

emirkmo changed the title ~~Make GMM initialization more robust and easily user customizeable~~ ENH Make GMM initialization more robust and easily user customizeable Nov 2, 2022

cmarmo reviewed Nov 7, 2022


emirkmo and others added 3 commits November 8, 2022 22:13

Apply suggestions from code review

8abb766

Co-authored-by: Chiara Marmo <cmarmo@users.noreply.github.com>

Merge branch 'main' into gmm_callable_init

3673a9e

fix failing doctest

bda7fa2

emirkmo mentioned this pull request Nov 13, 2022

FIX Correct GaussianMixture.weights_ normalization #24119

Open

emirkmo requested a review from cmarmo November 14, 2022 09:45

emirkmo mentioned this pull request Nov 14, 2022

Weights are being normalized using number of samples as opposed to sum in GaussianMixture #24085

Closed

betatim added the Waiting for Reviewer label Nov 17, 2022

cmarmo removed their request for review January 21, 2023 05:10
