Description
I did some follow-up work on the potential problems with `Ridge(normalize=False)` discussed in #19426 and #17444 in the following gist:
https://gist.github.com/ogrisel/cf7ac8bca6725ec02d4cc8d03ce096a7

This gist compares the results of a minimal implementation of ridge with L-BFGS by @agramfort against scikit-learn's `Ridge(normalize=False, solver="cholesky")`, `Ridge(normalize=False, solver="lsqr")`, `Ridge(normalize=False, solver="svd")` and other solvers.
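
For reference, a minimal L-BFGS-based ridge solver in this spirit (a sketch, not the gist's exact code; it ignores the intercept and minimizes the same objective as `Ridge`, i.e. `||y - Xw||^2 + alpha * ||w||^2`) could look like this:

```python
# Minimal ridge solver based on L-BFGS (a sketch, not the gist's exact code).
# It ignores the intercept, so compare against Ridge(fit_intercept=False).
import numpy as np
from scipy.optimize import minimize


def ridge_lbfgs(X, y, alpha=1.0):
    def loss_grad(w):
        residual = X @ w - y
        loss = residual @ residual + alpha * (w @ w)
        grad = 2 * (X.T @ residual + alpha * w)
        return loss, grad

    w0 = np.zeros(X.shape[1])
    result = minimize(
        loss_grad,
        w0,
        jac=True,  # loss_grad returns (objective, gradient)
        method="L-BFGS-B",
        options={"ftol": 1e-16, "gtol": 1e-10, "maxiter": 10_000},
    )
    return result.x
```

Comparing `ridge_lbfgs(X, y, alpha)` against `Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_` is roughly the kind of check the gist performs.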
The main conclusion is that those models seem to converge to the same solution, provided we set a low value of `tol` and are not too strict on the comparison tolerance: see how I defined `atol` in the gist. I cannot get convergence to machine precision (only around 1e-6/1e-7 in float64), but maybe this is expected. I am not so sure: I tried to tweak `pgtol` on top of `factr` but it did not help.
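
For illustration, the kind of comparison I mean looks like this (the `atol` definition here is only indicative, not the exact one from the gist):

```python
# Illustration only: compare two solvers' coefficients with an absolute
# tolerance scaled to the coefficient magnitude (the gist's actual atol
# definition may differ). Agreement is ~1e-6 relative to the coefficient
# scale, not machine precision.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=100, random_state=0)

coef_svd = Ridge(alpha=1.0, solver="svd").fit(X, y).coef_
coef_lsqr = Ridge(alpha=1.0, solver="lsqr", tol=1e-12).fit(X, y).coef_

atol = 1e-6 * np.max(np.abs(coef_svd))
np.testing.assert_allclose(coef_lsqr, coef_svd, rtol=0, atol=atol)
```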
On large-ish problems, L-BFGS can be significantly faster than our always-accurate SVD solver (but not always: it depends on the regularization), yet never faster than the `lsqr` solver (based on `scipy.sparse.linalg.lsqr`). The `cholesky` solver (our default for dense data) is sometimes faster and sometimes slower than `lsqr` and L-BFGS (depending on the regularization), but always faster than SVD (at least with high regularization). The `sag`/`saga` solvers are not necessarily as fast as they could be: I noticed that they do not benefit from multi-threading on a multicore machine, so that might be the reason. I have never seen them run faster than `lsqr`. `sag`/`saga` can be significantly faster than L-BFGS, despite using fewer cores, on problems with n_samples=5e3 / n_features=1e3 and not too much regularization (alpha=1e-6).
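
A rough way to get a feel for these timings (a quick sketch on random, well-conditioned data; single run, so only indicative, not a rigorous benchmark):

```python
# Quick timing sketch for the problem size mentioned above; random
# well-conditioned data, single run, so only indicative.
from time import perf_counter

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
n_samples, n_features = 5_000, 1_000
X = rng.randn(n_samples, n_features)
y = rng.randn(n_samples)

for solver in ["cholesky", "lsqr", "svd", "sag", "saga"]:
    tic = perf_counter()
    Ridge(alpha=1e-6, solver=solver, tol=1e-8, max_iter=10_000).fit(X, y)
    print(f"{solver}: {perf_counter() - tic:.3f} s")
```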
There are still problems with `normalize=True` as reported in #17444, but maybe we don't care so much because we are going to deprecate this option: see #17772 by @maikia, which I plan to review soon.
Proposed plan of action:
- proceed with the deprecation of `normalize=True` in #17772 (DEP Deprecate 'normalize' in ridge models) and therefore put #17444 (Ridge(normalize=True) fails check_sample_weights_invariance) on hold (or treat it as less of a priority). Actually no, see #19616 ([MRG] FIX sample_weight invariance for linear models).
- improve the ridge tests in scikit-learn by explicitly evaluating the minimized objective function for different solvers, maybe using a utility such as the `check_min` of my gist (see the sketch after this list). Tight tests (but without checking the minimum explicitly): #22910 (TST tight and clean tests for Ridge).
- maybe add the `ridge_bfgs` function as a reference to compare against in our test suite? Since it does not seem to always converge to a very high quality optimum, I am not so sure of the value.
- maybe we could conduct a more systematic evaluation on datasets with different shapes, conditioning and regularization to select a better default than "cholesky" for dense data? "cholesky" is reputed not to be numerically stable, although I did not observe problems in my gist, maybe because of the regularization. `lsqr` seems to be very promising, both in terms of speed and accuracy of the solution (according to my `check_min` thingy).
- I am not sure that our default value of `tol=1e-3` is safe. It is probably very solver dependent, but I doubt that it is a good idea for the fast `lsqr` solver, for instance (to be benchmarked if we do the above evaluation). See #24465 (ENH change Ridge tol to 1e-4).
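
To make the `check_min` idea from the list above concrete, here is a sketch (not the gist's exact implementation): evaluate the ridge objective at the fitted coefficients and assert that it is not meaningfully higher than at a trusted reference solution.

```python
# Sketch of a check_min-style utility (not the gist's exact implementation):
# a solver's coefficients should reach (nearly) as low an objective value as
# a trusted reference, here the SVD solver. Intercepts are ignored, so fit
# with fit_intercept=False.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge


def ridge_objective(X, y, coef, alpha):
    """Ridge objective ||y - X @ coef||^2 + alpha * ||coef||^2 (no intercept)."""
    residual = X @ coef - y
    return residual @ residual + alpha * (coef @ coef)


def check_min(X, y, alpha, coef, coef_ref, rtol=1e-6):
    """Assert that coef reaches (nearly) as low an objective value as coef_ref."""
    obj = ridge_objective(X, y, coef, alpha)
    obj_ref = ridge_objective(X, y, coef_ref, alpha)
    assert obj <= obj_ref * (1 + rtol), f"{obj:.12e} > {obj_ref:.12e}"


X, y = make_regression(n_samples=500, n_features=100, random_state=0)
coef_ref = Ridge(alpha=1.0, solver="svd", fit_intercept=False).fit(X, y).coef_
for solver in ["cholesky", "lsqr", "sparse_cg"]:
    coef = Ridge(alpha=1.0, solver=solver, fit_intercept=False, tol=1e-12).fit(X, y).coef_
    check_min(X, y, 1.0, coef, coef_ref)
```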
Other ideas:
- maybe we should consider adding l-bfgs as an option? That would require doing more evaluation on sparse data / ill-conditioned data with various values of alpha. I am not sure of the benefit compared to `lsqr` or `cholesky` though: they seem to always be more accurate than l-bfgs and faster (especially `lsqr`). And l-bfgs can be very slow when alpha is small.
- and maybe add a QR-based solver that is supposed to be more numerically stable (see the sketch after this list)? Or is this what LAPACK is doing internally when we use the `lsqr` solver?
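
For reference, one way a QR-based ridge solve could work is to treat ridge as an ordinary least-squares problem on the augmented system `[X; sqrt(alpha) * I] w ~= [y; 0]` and solve it through a QR factorization, without ever forming `X.T @ X`. A sketch assuming dense data and no intercept:

```python
# Sketch of a QR-based ridge solver (dense data, no intercept): ridge is the
# ordinary least-squares solution of the augmented system
#     [X; sqrt(alpha) * I] @ w ~= [y; 0]
# solved through a QR factorization, which avoids forming X.T @ X and its
# squared condition number.
import numpy as np
from scipy.linalg import solve_triangular


def ridge_qr(X, y, alpha):
    n_features = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(alpha) * np.eye(n_features)])
    y_aug = np.concatenate([y, np.zeros(n_features)])
    Q, R = np.linalg.qr(X_aug)  # reduced QR: X_aug = Q @ R, R upper triangular
    return solve_triangular(R, Q.T @ y_aug)
```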