WIP: Adding Passive Aggressive learning rates #1259
Conversation
Hi Rob, could you please update the documentation to mention those learning rates (both in the docstrings and the narrative documentation of the SGD module)? Don't forget to update the reference section to add a link to the crammer06a paper. Also, have you tried to use those in practice? What do they bring? What was your motivation to implement them?
eta = 1.0 / sqnorm(x_data_ptr, x_ind_ptr, xnnz)
eta = min(alpha/loss.dloss(p,y), eta)
elif learning_rate == PA2:
    eta = 1.0 / (sqnorm(x_data_ptr, x_ind_ptr, xnnz) + 1.0/2*alpha)
pep8: 1.0 / 2 * alpha
which should even be simplified as 0.5 * alpha
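For reference, the three PA step sizes from the crammer06a paper can be sketched as a standalone helper. This is a hypothetical illustration, not the PR's Cython code; `loss` stands for the hinge loss suffered on the current example, `sq_norm` for its squared norm, and `C` plays the role of the aggressiveness parameter:

```python
def pa_step_size(loss, sq_norm, C, variant="pa"):
    """Step size tau from Crammer et al. (2006) -- sketch only.

    loss    -- hinge loss suffered on the current example
    sq_norm -- squared norm ||x||^2 of the current example
    C       -- aggressiveness parameter (used by PA-I / PA-II only)
    """
    if variant == "pa":        # vanilla PA: fully correct the example
        return loss / sq_norm
    elif variant == "pa1":     # PA-I: clip the step size at C
        return min(C, loss / sq_norm)
    elif variant == "pa2":     # PA-II: smooth the denominator with 1/(2C)
        return loss / (sq_norm + 1.0 / (2.0 * C))
    raise ValueError("unknown variant: %r" % variant)
```

Note that as C grows, both PA-I and PA-II reduce to the vanilla PA step, which is the crux of the later discussion about dropping the plain "PA" variant.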
Let me add the documentation. My motivation is completeness: this is frequently the baseline other SGD classifiers try to beat, and partly that passive-aggressive algorithms seem to give smoother updates of the weights. What would be a good test to show how well this performs?
Thanks, please also extend the existing unittests for the new learning rates.
I don't know, you tell me :) Such a demonstration would ideally be done as a new example, or by extending an existing example with one of the default datasets. Maybe you could plot the online regret (cumulated number of misclassifications).
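The regret plot being suggested can be sketched with a hand-rolled perceptron-style online learner on toy data (purely illustrative, not the PR's code or API):

```python
import numpy as np

def online_regret(X, y, lr=1.0):
    """Cumulated misclassification count of a simple online
    perceptron-style learner (illustration only)."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    regret = []
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:  # misclassified (or on the boundary)
            mistakes += 1
            w += lr * y_i * x_i        # perceptron update
        regret.append(mistakes)
    return np.asarray(regret)

# two well-separated Gaussian blobs as toy data
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 3, rng.randn(50, 2) - 3])
y = np.array([1] * 50 + [-1] * 50)
perm = rng.permutation(100)
regret = online_regret(X[perm], y[perm])
```

Plotting `regret` against the example index gives the regret curve; on separable data it should flatten out quickly once the learner has converged.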
It's a bit of a drag that we can't access the regret information from SGD. As a meta-level discussion, do you get the sense that SGD has been a bit monolithic in structure, which will make it harder over time to add new SGD learners?
@@ -358,6 +358,9 @@ user via ``eta0`` and ``power_t``, resp.

For a constant learning rate use ``learning_rate='constant'`` and use ``eta0``
to specify the learning rate.

For passive-aggressive learning rate use ``learning_rate='pa'``, ``learning_rate='pa1'``
or ``learning_rate='pa2'``.
You should explain what a passive-aggressive learning rate means in practice here, and also give the mathematical formulation of those losses.
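The requested formulation, as given in the crammer06a paper, could be sketched as follows (here $\ell_t$ denotes the hinge loss suffered on example $x_t$ with label $y_t$, and $C$ the aggressiveness parameter):

```latex
% Passive-aggressive update: w_{t+1} = w_t + \tau_t y_t x_t
% with the step size \tau_t depending on the variant:
\tau_t = \frac{\ell_t}{\|x_t\|^2}
    \quad \text{(PA)}
\tau_t = \min\left\{C,\; \frac{\ell_t}{\|x_t\|^2}\right\}
    \quad \text{(PA-I)}
\tau_t = \frac{\ell_t}{\|x_t\|^2 + \frac{1}{2C}}
    \quad \text{(PA-II)}
```

As $C \to \infty$, both PA-I and PA-II recover the vanilla PA step.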
The monolithic design of the current implementation is primarily motivated by speed constraints. Every bit of flexibility we add in the main SGD loop has a cost. However, we will add a callback-based monitoring / checkpointing API in the scikit at some point.
Also, in writing this explanation I am wondering if I should remove "PA", since PA-1 and PA-2 reduce to it when alpha = 0.
+1 for removing it.
Thanks for the explanation, this is much clearer now. Do you plan to do the online regret plot? Also, more importantly, this PR needs to add unit tests (extending the existing SGD tests).
I am considering doing a test-error test on 5, 10, 25, 50, and 100 percent of the training data. I am worried that doing a partial_fit on small chunks won't work well enough.
The regular
Also, where are the
Next to the implementation, in: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/tests/test_sgd.py
Note there is also:
I think the idea is nice but I have two concerns.
For those reasons, we may want to expose the passive-aggressive algorithm only in its own estimators (PassiveAggressiveClassifier and PassiveAggressiveRegressor) while keeping the Cython code as it is now.
@mblondel Giving it its own estimators is fine with me, especially as I am going to be pushing AROW, AdaGrad, and averaged SGD/Perceptron soon.
In the scikit-learn project:

The C parameter is used in the following models: http://scikit-learn.org/dev/modules/svm.html#mathematical-formulation

The alpha parameter is used as a constructor parameter, for instance in these models: http://scikit-learn.org/dev/modules/linear_model.html#lasso
@@ -358,6 +358,23 @@ user via ``eta0`` and ``power_t``, resp.

For a constant learning rate use ``learning_rate='constant'`` and use ``eta0``
to specify the learning rate.

For passive-aggressive algorithms, there is no learning rate, instead
step-size is taken as large as would guarantee the example would have
been correctly classified. In practice, this only works on seperable
separable
As the
I think
If we create
You can pass both
Inheriting from
You may want to ask people's opinion on the mailing-list for those. Averaged SGD/Perceptron seems fine to me, but the other two may be too recent (only 30 to 40 citations).
PA-2 respectively. Setting ``C`` to 0 gives the vanilla passive-
aggressive algorithm.

..math::
This does not display right in the docs. I believe that there is a missing white space between '..' and 'math'.
I have made a bunch of minor comments. I cannot comment on the example yet, as it doesn't run for me. A few other minor remarks:
A couple of remarks:

Now if this is too much for you to fix @zaxtax then I can take over from here. However, I have very little spare bandwidth, so don't be surprised if I cannot find the time to fix the remaining issues of this PR for a couple of months :) Still, I think this is a very important contribution: in the couple of tests I did when checking this PR, PA models were always on par with or beating the existing SGD models.
Hum alright, then calling the regularization parameter
On Sat, Oct 27, 2012 at 10:09:28AM -0700, Olivier Grisel wrote:

If that's indeed the case, then I would push for not using 'C', and using 'alpha'.
@ogrisel I am committed to fixing all the issues in this commit.
@ogrisel It would indeed be nice if you could check that. I would personally be OK with keeping C in any case, as it is a weight on the loss rather than on the norm (like alpha), and bigger means less regularized, like in LinearSVC.
Apparently, SGDRegressor and Lasso (coordinate descent) have the same issue. They both use alpha but the scale is different: SGD minimizes 1 / (2n) \sum (w . x_i - y_i)^2 + alpha * penalty while CD minimizes 1 / 2 \sum (w . x_i - y_i)^2 + alpha * penalty.
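The factor-of-n discrepancy between the two objectives described above can be checked numerically. The toy data and variable names below are illustrative only, not from the PR:

```python
import numpy as np

# Toy regression data, just to compare the two objective scalings
rng = np.random.RandomState(0)
n, d = 50, 3
X = rng.randn(n, d)
y = rng.randn(n)
w = rng.randn(d)
alpha = 0.1
penalty = np.abs(w).sum()  # l1 penalty term, as in Lasso

# SGD-style objective: 1 / (2n) * sum((w . x_i - y_i)^2) + alpha * penalty
sgd_obj = ((X.dot(w) - y) ** 2).sum() / (2. * n) + alpha * penalty
# CD-style objective:  1 / 2 * sum((w . x_i - y_i)^2) + alpha * penalty
cd_obj = ((X.dot(w) - y) ** 2).sum() / 2. + alpha * penalty

# The data-fit terms differ by a factor of n, so matching solutions
# requires rescaling alpha by n between the two parameterizations.
ratio = (cd_obj - alpha * penalty) / (sgd_obj - alpha * penalty)
```

So an `alpha` tuned for one parameterization needs to be multiplied or divided by `n` to be comparable in the other.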
I tried using the digits dataset, and LinearSVC & PA do not seem to converge to the same solutions in general.

For L2-SVC and PA-II the equivalent parameterization seems to be different: judging from the snippet below, I need to set the PA-II `C` to the inverse of the SVM `C`. So this might be a bug in the PA-II regularization.

Note that the digits dataset might not be noisy enough to have the regularization play an important role. Maybe we should find a better dataset to test those equivalences. If you are interested:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.svm import LinearSVC
from sklearn.datasets import load_digits
from sklearn.cross_validation import train_test_split

digits = load_digits()
```

Then for L2 / PA-II:

```python
rng = np.random.RandomState(None)
seed = rng.randint(1000)
C = 0.0000001
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=seed)
linear_svm = LinearSVC(C=C, loss='l2').fit(X_train, y_train)
linear_pa = PassiveAggressiveClassifier(C=1. / C, loss='pa2',
                                        random_state=seed,
                                        n_iter=5).fit(X_train, y_train)
print linear_svm.score(X_test, y_test), linear_pa.score(X_test, y_test)
```

and for L1 / PA-I:

```python
rng = np.random.RandomState(None)
seed = rng.randint(1000)
C = 0.0000001
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, random_state=seed)
linear_svm = LinearSVC(C=C, loss='l1').fit(X_train, y_train)
linear_pa = PassiveAggressiveClassifier(C=C / X_train.shape[0], loss='pa1',
                                        random_state=seed,
                                        n_iter=5).fit(X_train, y_train)
print linear_svm.score(X_test, y_test), linear_pa.score(X_test, y_test)
```
@mblondel Will do.
Merged. Turns out the implementation was still incorrect (along with all the issues at the API level)...
Err, did you fix it yet?
Or rather: are you planning to work on it?
Everything is good now (except that I spent 5 hours on this PR T_T)
As often. It is my experience that quite often I pick up a PR that looks almost finished, and it turns out there is still a lot of work left.
Wow, thanks a lot for putting all this work into merging the PR! This is a great feature to have.
@GaelVaroquaux how about Travis?
I am not sure that the
Also, AFAIK the narrative documentation lacks the mathematical formulation of the objective functions being optimized by this family of models. Next time @mblondel, it would be better to open a new PR from your branch to master, to allow others to follow and review the merge work.
Do we have the time budget necessary? If so, it would be great.
Yes, it's often a lot of simple things.
It would be great to set up Travis for scikit-learn, but we need to find a way not to install scipy from source as part of the Travis build: it means finding a way to configure the Travis VM to use a binary package for numpy / scipy (e.g. the Ubuntu package if the Travis VM is an Ubuntu distribution).
My intuition is that it's OK (it downscales the step size) but you're right that there's no mathematical grounding for it. I'm OK with removing it, but we need to remove sample_weight too then. BTW, the paper has a section on cost-sensitive learning, but it's focused on multiclass classification.
I removed it on purpose because I thought it didn't bring anything. You can add it back if you want.
I should have, but I have other stuff I want to do this week. At least now it is robustly tested (I implemented a correctness test like I did for the Perceptron). Let's do the final polishing in master.
On 05.11.2012 16:33, Olivier Grisel wrote:
+0 for removing stuff that has no theoretical grounding nor empirical validation that it works as expected.
Yes, also please include the equivalence with
@amueller thanks, next time I hope to have enough of the work done ahead of time to not burden you guys.
I have added Passive Aggressive learning rates as described in
http://jmlr.csail.mit.edu/papers/volume7/crammer06a/crammer06a.pdf
I have tried to make as few changes as possible, so these rates should integrate well with everything else.