DictionaryLearning: Fix several issues in the dict update #19198


Status: Merged — 16 commits merged into scikit-learn:main on Apr 15, 2021

Conversation

@jeremiedbb (Member) commented on Jan 18, 2021:

  • Fixes #4866 ("Block coordinate descent for dictionary update has a non optimal step"): a minor bug in the dict update when used by MiniBatchDictionaryLearning. I added a test which currently fails on master.

  • _update_dict seems to do clever things to update the residuals incrementally, but I found that it's actually much faster (~10x to 20x) to write the function more naively and compute the objective function from scratch at the end. The impact on the runtime of the whole dict learning is negligible since the bottleneck is the sparse coding, and I find this version much more readable (and, judging by the related issues, I'm not the only one).

  • When an atom is not used, the current strategy is to generate a new one from a normal distribution, but it's very likely that such an atom will still not be used. A discussion with @tomMoral led us to think that sampling a new atom from the data may be a better strategy. Below is a plot of the objective function for both strategies. (A rough sketch combining this strategy with the naive update loop from the previous point is shown after this list.)

[Plot dl_loss: objective function over iterations for the two unused-atom strategies]

  • More small adjustments; I explain those in dedicated comments.
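For illustration only, here is a rough, self-contained sketch of how the naive block coordinate descent update combined with the data-resampling strategy could be written. This is not the actual scikit-learn `_update_dict`: the function name, the 1e-6 "unused atom" threshold and the 0.01 noise scale are assumptions made for this sketch.

```python
import numpy as np


def naive_dict_update(dictionary, Y, code, rng, unused_tol=1e-6):
    """Sketch of a naive block coordinate descent dictionary update.

    Shapes follow the convention discussed in this PR:
    dictionary is (n_components, n_features), Y is (n_samples, n_features)
    and code is (n_samples, n_components).
    """
    n_components, _ = dictionary.shape
    n_samples, _ = Y.shape

    # Gram-like matrices used by the block updates.
    A = code.T @ code   # (n_components, n_components)
    B = code.T @ Y      # (n_components, n_features)

    for k in range(n_components):
        if A[k, k] > unused_tol:
            # Block update of atom k.
            dictionary[k] += (B[k] - A[k] @ dictionary) / A[k, k]
        else:
            # Unused atom: resample it from the data and add a little noise
            # so that the next sparse coding problem is not ill conditioned.
            newd = Y[rng.choice(n_samples)]
            noise = 0.01 * newd.std() * rng.standard_normal(newd.shape)
            dictionary[k] = newd + noise
        # Project onto the constraint set ||d_k|| <= 1.
        dictionary[k] /= max(np.linalg.norm(dictionary[k]), 1)

    # Objective (data-fit part) computed from scratch at the end.
    residuals = Y - code @ dictionary
    return dictionary, 0.5 * np.sum(residuals ** 2)


# Toy usage on random data.
rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 16))
code = rng.standard_normal((200, 8))
D = rng.standard_normal((8, 16))
D, obj = naive_dict_update(D, Y, code, rng)
```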

[Resolved review thread: sklearn/decomposition/_dict_learning.py]

Review thread on the following lines of sklearn/decomposition/_dict_learning.py:

newd = Y[random_state.choice(n_samples)]

# add small noise to avoid making the sparse coding ill conditioned
@jeremiedbb (Member, Author):

Sampling from the data without adding noise triggered some warnings in the sparse coding at some point, about an ill-conditioned problem. Adding a small amount of noise fixed it. Do you have an opinion about that strategy, @tomMoral or @agramfort?

@ogrisel (Member):

Interesting trick. Out of curiosity, which specific operation would trigger the warning? By default, sparse coding uses our coordinate descent solver, no? I would have assumed that it would not raise any warning on ill-conditioned problems. But maybe I am wrong?

@tomMoral (Contributor):

Yes, it is usually a good idea to add some noise. Otherwise, there is a chance that you sample an atom that is only used to encode one variable. Adding some noise makes it somewhat more robust.

Note that instead of sampling from the data, you could also sample from the residuals. These approaches are particularly appealing because they are related to greedy algorithms that have some convergence guarantees.
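For illustration, a minimal sketch of the residual-sampling variant mentioned above; the helper name, normalization, and toy data are hypothetical and not part of this PR:

```python
import numpy as np


def resample_atom_from_residuals(dictionary, Y, code, k, rng):
    """Hypothetical variant: replace unused atom k with a normalized row of
    the current residual matrix instead of a raw data sample."""
    residuals = Y - code @ dictionary          # (n_samples, n_features)
    new_atom = residuals[rng.choice(residuals.shape[0])].copy()
    norm = np.linalg.norm(new_atom)
    if norm > 0:
        new_atom /= norm
    dictionary[k] = new_atom
    return dictionary


# Toy usage on random data.
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 10))
D = rng.standard_normal((4, 10))
C = rng.standard_normal((50, 4))
D = resample_atom_from_residuals(D, Y, C, k=2, rng=rng)
```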

@jeremiedbb (Member, Author):

@ogrisel The warning came from here

Member:

@jeremiedbb what do you think of @tomMoral's comment above? The current PR is already a big net improvement, so maybe it's not necessary. But maybe it would converge even faster, or to an even better value of the objective function.

Member:

But if it makes the code more complex to organize, maybe this could be explored in a later PR to keep this PR simple to review and quick to merge.

@jeremiedbb (Member, Author):

I was really satisfied with the strategy I implemented, in the sense that all atoms are now used after only a few iterations.

Sampling from the residuals should give very good results too, but it requires keeping track of the residuals. The previous implementation did keep track of the residuals, but it was much slower and very hard to follow.

[Resolved review thread: sklearn/decomposition/_dict_learning.py]

@ogrisel (Member) commented on Jan 20, 2021:

> When an atom is not used, the current strategy is to generate a new one from a normal distribution, but it's very likely that such an atom will still not be used. A discussion with @tomMoral led us to think that sampling a new atom from the data may be a better strategy. Below is a plot of the objective function for both strategies.

This is quite an impressive impact.

Could you please re-run the convergence plots with 10 lines for 10 random starts for each version of the code to see if the run above was not just luck (and to quantify the variability induced by the random init)?

@ogrisel (Member) left a review:

LGTM overall. Nice improvements both in terms of convergence quality and code readability; just a few comments. I like the fact that the dictionary is now (n_components, n_features) everywhere.

Please do not forget to add an entry to the what's new for 1.0.

[Resolved review threads: sklearn/decomposition/_dict_learning.py (outdated)]

[Resolved review threads: sklearn/decomposition/_dict_learning.py]
@jeremiedbb (Member, Author):

> Could you please re-run the convergence plots with 10 lines for 10 random starts for each version of the code to see if the run above was not just luck (and to quantify the variability induced by the random init)?

See below. These are not random starts, since the init is deterministic in dict learning, but different splits of the data. I added 2 more strategies: set the atom to all zeros, and leave the atom unchanged (these seem to be the only 2 strategies in SPAMS).

[Plot loss_mode: objective function across data splits for the four strategies]

The proposed strategy is consistently better. The other 3 seem to be roughly the same (which makes sense, since with those strategies unused atoms remain unused for the whole run).
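For readers who want to reproduce this kind of comparison, below is a hedged sketch of how the final objective value of a fitted MiniBatchDictionaryLearning could be evaluated on one split; the random data, hyperparameters and helper name are arbitrary choices for the sketch, and the transform settings are only set to roughly mirror the training-time sparse coding.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning


def dict_learning_objective(X, dictionary, code, alpha):
    # Standard objective: 0.5 * ||X - code @ dictionary||_F^2 + alpha * ||code||_1
    residual = X - code @ dictionary
    return 0.5 * np.sum(residual ** 2) + alpha * np.sum(np.abs(code))


rng = np.random.default_rng(0)
X = rng.standard_normal((500, 30))

est = MiniBatchDictionaryLearning(
    n_components=15, alpha=0.1, random_state=0,
    transform_algorithm="lasso_lars", transform_alpha=0.1,
)
code = est.fit_transform(X)
print(dict_learning_objective(X, est.components_, code, alpha=0.1))
```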

@ogrisel (Member) commented on Jan 20, 2021:

Thank you very much. This is very convincing.

Base automatically changed from master to main on January 22, 2021 at 10:53.
@ogrisel (Member) commented on Feb 2, 2021:

I confirm +1 for merging. Maybe a quick second review by @arthurmensch @jakirkham @agramfort or @GaelVaroquaux?

@jjerphan (Member) left a review:

LGTM, thank you @jeremiedbb for those improvements!

I just added minor suggestions.

[Resolved review thread: sklearn/decomposition/_dict_learning.py (outdated)]
@@ -575,6 +578,31 @@ def test_sparse_coder_n_features_in():
assert sc.n_features_in_ == d.shape[1]


def test_update_dict():
Member:

Suggested change:

-def test_update_dict():
+def test_update_dict():
+    """Non-regression test for issue #4866."""

@jeremiedbb (Member, Author):

We don't use docstrings for the tests, only comments. I added that as a comment.

Member:

Actually, I think @thomasjpfan mentioned elsewhere that we are transitioning to docstrings even in tests now that we no longer use nose (I think that is the reason).

No strong opinion for this PR.

Member:

> we are transitioning to docstrings even in tests now that we no longer use nose (I think that is the reason).

Exactly. Since we don't make a strong statement about it, this still depends on who is reviewing the PR :)

@glemaitre (Member) left a review:

I only have small nitpicks, for the future me who will read the code :)

[Resolved review threads: sklearn/decomposition/_dict_learning.py]
@@ -807,7 +794,7 @@ def dict_learning_online(X, n_components=2, *, alpha=1, n_iter=100,
     else:
         X_train = X

-    dictionary = check_array(dictionary.T, order='F', dtype=np.float64,
+    dictionary = check_array(dictionary, order='F', dtype=np.float64,
Member:

It might be sufficient to call np.asfortranarray instead of doing a full check_array.
Maybe also add the same comment explaining why Fortran order is used, as in the previous function.

@jeremiedbb (Member, Author):

The initial dictionary can be provided by the user, so it's better to use check_array. We should call check_array in dict_learning as well, but I'd rather do that in a separate PR where I refactor some parts (e.g. #18975).
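As a small illustration of the point above (arbitrary toy values, not code from this PR): np.asfortranarray only changes the memory layout, while check_array additionally validates a possibly user-provided input before enforcing the Fortran order.

```python
import numpy as np
from sklearn.utils import check_array

user_dict = [[1.0, 2.0], [3.0, 4.0]]   # e.g. a user-provided initial dictionary

# Only converts to a Fortran-ordered ndarray; no dtype/finiteness validation.
d1 = np.asfortranarray(user_dict)

# Also validates the input (2D, finite values, requested dtype) before
# enforcing Fortran order -- safer for user-provided arrays.
d2 = check_array(user_dict, order='F', dtype=np.float64)

assert d1.flags['F_CONTIGUOUS'] and d2.flags['F_CONTIGUOUS']
```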

Member:

OK, makes sense then.


@ogrisel merged commit 2c5ea4e into scikit-learn:main on Apr 15, 2021.
@ogrisel (Member) commented on Apr 15, 2021:

Thanks for the fix @jeremiedbb and the reviews @glemaitre @jjerphan and @tomMoral!

@tomMoral (Contributor):

Super cool to see this one land in scikit-learn! :)
Thx a lot @jeremiedbb ! 🚀

@agramfort (Member):

thx heaps @jeremiedbb 🎉 🍻

thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request on Apr 19, 2021 (truncated commit message: "…rn#19198"):
Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
@glemaitre mentioned this pull request on Apr 22, 2021.
Successfully merging this pull request may close these issues.

Block coordinate descent for dictionary update has a non optimal step (#4866)