DictionaryLearning: Fix several issues in the dict update #19198
Conversation
    newd = Y[random_state.choice(n_samples)]
    # add small noise to avoid making the sparse coding ill conditioned
Sampling from the data without adding noise triggered warnings in the sparse coding at some point about an ill conditioned problem. Adding small noise fixed it. Do you have an opinion on that strategy @tomMoral or @agramfort ?
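The strategy under discussion could be sketched as follows. This is a minimal illustration, not the actual scikit-learn implementation; the helper name and the noise scale are assumptions:

```python
import numpy as np

def resample_dead_atom(Y, random_state, noise_level=0.01):
    """Replace an unused atom with a random data sample plus small noise.

    Hypothetical sketch of the strategy discussed above: sampling a data
    point directly can make the sparse coding ill conditioned, so a small
    Gaussian perturbation is added before renormalizing.
    """
    n_samples = Y.shape[0]
    newd = Y[random_state.choice(n_samples)].copy()
    # Small noise keeps the new atom from being an exact copy of a data point.
    newd += noise_level * np.linalg.norm(newd) * random_state.standard_normal(newd.shape)
    # Keep atoms on the unit sphere, as is conventional in dictionary learning.
    return newd / max(np.linalg.norm(newd), 1e-12)
```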
Interesting trick. Out of curiosity, which specific operation would trigger the warning? By default sparse coding is using our coordinate descent solver, no? I would have assumed that this would not raise any warning on ill conditioned problems. But maybe I am wrong?
Yes, it is usually a good idea to add some noise. Otherwise, there is a chance that you sample an atom that is only used to encode one variable. Adding some noise makes it somewhat more robust.
Note that instead of sampling the data, you could also sample from the residuals. These approaches are particularly appealing because they are related to greedy algorithms that come with some convergence guarantees.
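The residual-sampling alternative could look like this. A hypothetical sketch only; the function name and the proportional-sampling choice are assumptions, not code from this PR:

```python
import numpy as np

def resample_from_residuals(X, dictionary, code, random_state):
    """Pick a replacement atom from the reconstruction residuals.

    Hypothetical sketch of the alternative suggested above: samples that
    are poorly reconstructed are natural candidates for a new atom, which
    echoes greedy atom-replacement schemes such as the one used in K-SVD.
    """
    residuals = X - code @ dictionary            # (n_samples, n_features)
    norms = np.linalg.norm(residuals, axis=1)
    # Sample proportionally to the residual norm (argmax would be the
    # fully greedy variant).
    probs = norms / norms.sum()
    idx = random_state.choice(len(norms), p=probs)
    newd = residuals[idx]
    return newd / max(np.linalg.norm(newd), 1e-12)
```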
@ogrisel The warning came from here:

    if diag < 1e-7:
@jeremiedbb what do you think of @tomMoral's comment above? The current PR is already a big net improvement so maybe it's not necessary. But maybe it would converge even faster, or to an even better value of the objective function.
But if it makes the code more complex to organize, maybe this could be explored in a later PR to keep this PR simple to review and quick to merge.
I was really satisfied with the strategy I implemented, in the sense that all atoms are now used after only a few iterations.
Sampling from the residuals should give very good results too, but it requires keeping track of the residuals. The previous implementation did keep track of the residuals, but it was much slower and very hard to follow.
This is quite an impressive impact. Could you please re-run the convergence plots with 10 lines for 10 random starts for each version of the code, to check that the run above was not just luck (and to quantify the variability induced by the random init)?
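Such a variability study could be sketched as below, assuming a data matrix `X`; the helper name and the objective formula (the standard sparse-coding objective) are illustrative, not part of this PR's benchmarks:

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

def objective_over_seeds(X, n_seeds=10, **params):
    """Fit the estimator with several seeds and return the final objectives.

    Sketch for quantifying the variability induced by the random parts of
    the algorithm: plot one convergence line per seed, or compare the
    spread of the final objective values between code versions.
    """
    objectives = []
    for seed in range(n_seeds):
        est = MiniBatchDictionaryLearning(random_state=seed, **params).fit(X)
        code = est.transform(X)
        recon = code @ est.components_
        # Standard dict-learning objective: 0.5 * ||X - UV||^2 + alpha * ||U||_1
        obj = 0.5 * np.linalg.norm(X - recon) ** 2 + est.alpha * np.abs(code).sum()
        objectives.append(obj)
    return np.array(objectives)
```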
LGTM overall. Nice improvements both in terms of convergence quality and code readability. Just a few comments. I like the fact that dictionary is now (n_components, n_features) everywhere.
Please do not forget to add an entry to the what's new for 1.0.
See below. It's not random starts, since the init is deterministic in dict learning, but different splits of the data. I added 2 strategies: set the atom to all zeros, and leave the atom unchanged (these seem to be the only 2 strategies in spams). The proposed strategy is consistently better. The other 3 seem to be roughly the same (which makes sense, since with these strategies unused atoms remain unused for the whole run).
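The strategies being compared could be summarized in one hypothetical helper; names and the noise scale are illustrative, not code from this PR or from spams:

```python
import numpy as np

def handle_unused_atom(atom, Y, random_state, strategy="resample"):
    """Sketch of the strategies compared above for an unused atom.

    'resample' mirrors the strategy adopted in the PR (sample a data point
    plus small noise); 'zero' and 'unchanged' mirror the two alternatives
    reported from spams, under which a dead atom stays dead.
    """
    if strategy == "resample":
        newd = Y[random_state.choice(Y.shape[0])].copy()
        newd += 0.01 * np.linalg.norm(newd) * random_state.standard_normal(newd.shape)
        return newd / max(np.linalg.norm(newd), 1e-12)
    if strategy == "zero":
        return np.zeros_like(atom)
    # 'unchanged': leave the atom as-is; it remains unused for the whole run.
    return atom
```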
Thank you very much. This is very convincing.
I confirm +1 for merging. Maybe a quick second review by @arthurmensch @jakirkham @agramfort or @GaelVaroquaux?
LGTM, thank you @jeremiedbb for those improvements!
I just added a few minor suggestions.
    @@ -575,6 +578,31 @@ def test_sparse_coder_n_features_in():
        assert sc.n_features_in_ == d.shape[1]


    def test_update_dict():
Suggested change:

    def test_update_dict():
        """Non-regression test for issue #4866."""
We don't use docstrings for the tests, only comments. I added that as a comment.
Actually, I think @thomasjpfan mentioned elsewhere that we are transitioning to docstrings even in tests, now that we no longer use nose (I think this is the reason).
No strong opinion for this PR.
we were transitioning to docstring even in tests now that we no longer use nose (I think this is the reason).
Exactly. Since we don't make a strong statement, this still depends on who is reviewing the PR :)
I only have small nitpicks for my future self reading the code :)
    @@ -807,7 +794,7 @@ def dict_learning_online(X, n_components=2, *, alpha=1, n_iter=100,
        else:
            X_train = X

    -   dictionary = check_array(dictionary.T, order='F', dtype=np.float64,
    +   dictionary = check_array(dictionary, order='F', dtype=np.float64,
It might be sufficient to call asfortranarray instead of doing a full check_array.
Maybe also add the same comment explaining why Fortran order is used, as in the previous function.
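The trade-off behind this suggestion can be shown with a small comparison; both calls below use real APIs, and the example array is illustrative:

```python
import numpy as np
from sklearn.utils import check_array

a = np.arange(6, dtype=np.float64).reshape(2, 3)

# check_array validates dtype and finiteness and converts to Fortran order,
# which is appropriate for arrays that may come from the user ...
d1 = check_array(a, order='F', dtype=np.float64)

# ... while np.asfortranarray only ensures the memory layout, which is
# cheaper but skips validation -- fine for trusted internal arrays.
d2 = np.asfortranarray(a, dtype=np.float64)

assert d1.flags['F_CONTIGUOUS'] and d2.flags['F_CONTIGUOUS']
assert np.array_equal(d1, d2)
```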
the initial dictionary can be provided by the user so it's better to use check_array. We should call check_array in dict_learning as well, but I'd rather do that in a separate PR where I refactor some parts (e.g. #18975).
OK, makes sense then.
Thanks for the fix @jeremiedbb and the reviews @glemaitre @jjerphan and @tomMoral!
Super cool to see this one land in scikit-learn! :)
thx heaps @jeremiedbb 🎉 🍻
…rn#19198) Co-authored-by: Olivier Grisel <olivier.grisel@gmail.com>
Fixes #4866 (Block coordinate descent for dictionary update has a non optimal step): minor bug in the dict update when used by MiniBatchDictionaryLearning. I added a test which currently fails on master.

_update_dict seems to do smart things to update the residuals incrementally, but I found that it's actually much faster (~10x to 20x) to write the function more naively and compute the objective function from scratch at the end. The impact on the time for the whole dict learning is negligible since the bottleneck is the sparse coding, and I find this version much more readable (judging by the related issues, I'm not the only one).

When an atom is not used, the current strategy is to generate a new one from a normal distribution. But it's very likely that it will still not be used. A discussion with @tomMoral led us to think that sampling a new atom from the data may be a better strategy. Below is a plot of the objective function for both strategies.
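The naive block coordinate descent described here could be sketched as follows. This is a simplified illustration of the approach, not the actual scikit-learn `_update_dict`; the function name, thresholds, and the dead-atom resampling details are assumptions:

```python
import numpy as np

def update_dict_naive(dictionary, X, code, random_state):
    """Naive block coordinate descent for the dictionary update.

    Sketch following the PR description: each atom is updated in turn from
    the full residual, recomputed once up front instead of maintained by
    clever incremental bookkeeping. Shapes: dictionary is
    (n_components, n_features), code is (n_samples, n_components).
    """
    n_components = dictionary.shape[0]
    R = X - code @ dictionary                      # full residual, from scratch
    for k in range(n_components):
        R += np.outer(code[:, k], dictionary[k])   # add back atom k's contribution
        ck_norm2 = code[:, k] @ code[:, k]
        if ck_norm2 < 1e-10:
            # Unused atom: resample from the data plus small noise
            # (the strategy adopted in this PR).
            newd = X[random_state.choice(X.shape[0])].copy()
            newd += 0.01 * np.linalg.norm(newd) * random_state.standard_normal(newd.shape)
            dictionary[k] = newd / max(np.linalg.norm(newd), 1e-12)
        else:
            # Least-squares update of atom k given the other atoms fixed.
            dictionary[k] = code[:, k] @ R / ck_norm2
            norm = np.linalg.norm(dictionary[k])
            if norm > 1:                           # project onto the unit ball
                dictionary[k] /= norm
        R -= np.outer(code[:, k], dictionary[k])   # subtract updated contribution
    return dictionary
```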