Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

ENH Allows disabling refitting of CV estimators #30463

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main
Choose a base branch
Loading
from

Conversation

AhmedThahir
Copy link

@AhmedThahir AhmedThahir commented Dec 11, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

What

  • Allows disable refitting of cross-validation estimators (such as LassoCV, RidgeCV) on the full training set after finding the best hyperparameters.
  • User may use a keyword argument refit to toggle this behavior.

Why
User does not want to waste resources on refitting when user only wants one/more of the following, which do not involve refitting:

  • optimal hyperparameter
  • cv_results_ for the different hyperparameters
  • best_score_ of all the hyperparameters

This is especially impactful for large datasets.

How

  • Added a refit argument which disables refitting if refit=False.
  • By default, refit=True is set to prevent breaking existing usage

Any other comments?

This is my first (hopefully first of many) PR for scikit-learn. If you have any feedback on my implementation/PR documentation/etc, feel free to share - I'd really appreciate it.

Thanks to @paulAdrienMarie for all the support!

Copy link

github-actions bot commented Dec 11, 2024

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here


black

black detected issues. Please run black . locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is black=24.3.0.


--- /home/runner/work/scikit-learn/scikit-learn/sklearn/linear_model/tests/test_ridge.py	2025-02-11 18:37:51.534575+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/linear_model/tests/test_ridge.py	2025-02-11 18:38:09.269688+00:00
@@ -629,25 +629,27 @@
     ridge = Ridge(alpha=penalties[:-1])
     err_msg = "Number of targets and number of penalties do not correspond: 4 != 5"
     with pytest.raises(ValueError, match=err_msg):
         ridge.fit(X, y)
 
+
 def test_ridgecv_regression_refit():
     rng = np.random.RandomState(0)
     alphas = (0.1, 1.0, 10.0)
 
     for n_samples, n_features in ((6, 5), (5, 10)):
         y = rng.randn(n_samples)
         X = rng.randn(n_samples, n_features)
-        
+
         refit_true = RidgeCV(alphas=alphas, refit=True)
         refit_true.fit(X, y)
 
         refit_false = RidgeCV(alphas=alphas, refit=False)
         refit_false.fit(X, y)
-        
+
         assert_array_almost_equal(refit_true.best_score_, refit_false.best_score_)
+
 
 def test_ridgecv_regression_refit():
     rng = np.random.RandomState(0)
     alphas = (0.1, 1.0, 10.0)
 
@@ -657,12 +659,13 @@
         refit_true = RidgeClassifierCV(alphas=alphas, refit=True)
         refit_true.fit(X, y)
 
         refit_false = RidgeClassifierCV(alphas=alphas, refit=False)
         refit_false.fit(X, y)
-        
+
         assert_array_almost_equal(refit_true.best_score_, refit_false.best_score_)
+
 
 @pytest.mark.parametrize("n_col", [(), (1,), (3,)])
 @pytest.mark.parametrize("csr_container", CSR_CONTAINERS)
 def test_X_CenterStackOp(n_col, csr_container):
     rng = np.random.RandomState(0)
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/linear_model/tests/test_ridge.py

Oh no! 💥 💔 💥
1 file would be reformatted, 924 files would be left unchanged.

ruff

ruff detected issues. Please run ruff check --fix --output-format=full . locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is ruff=0.5.1.


sklearn/linear_model/tests/test_ridge.py:641:1: W293 [*] Blank line contains whitespace
    |
639 |         y = rng.randn(n_samples)
640 |         X = rng.randn(n_samples, n_features)
641 |         
    | ^^^^^^^^ W293
642 |         refit_true = RidgeCV(alphas=alphas, refit=True)
643 |         refit_true.fit(X, y)
    |
    = help: Remove whitespace from blank line

sklearn/linear_model/tests/test_ridge.py:647:1: W293 [*] Blank line contains whitespace
    |
645 |         refit_false = RidgeCV(alphas=alphas, refit=False)
646 |         refit_false.fit(X, y)
647 |         
    | ^^^^^^^^ W293
648 |         assert_array_almost_equal(refit_true.best_score_, refit_false.best_score_)
    |
    = help: Remove whitespace from blank line

sklearn/linear_model/tests/test_ridge.py:650:5: F811 Redefinition of unused `test_ridgecv_regression_refit` from line 634
    |
648 |         assert_array_almost_equal(refit_true.best_score_, refit_false.best_score_)
649 | 
650 | def test_ridgecv_regression_refit():
    |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ F811
651 |     rng = np.random.RandomState(0)
652 |     alphas = (0.1, 1.0, 10.0)
    |
    = help: Remove definition: `test_ridgecv_regression_refit`

sklearn/linear_model/tests/test_ridge.py:662:1: W293 [*] Blank line contains whitespace
    |
660 |         refit_false = RidgeClassifierCV(alphas=alphas, refit=False)
661 |         refit_false.fit(X, y)
662 |         
    | ^^^^^^^^ W293
663 |         assert_array_almost_equal(refit_true.best_score_, refit_false.best_score_)
    |
    = help: Remove whitespace from blank line

Found 4 errors.
[*] 3 fixable with the `--fix` option.

mypy

mypy detected issues. Please fix them locally and push the changes. Here you can see the detected issues. Note that the installed mypy version is mypy=1.9.0.


sklearn/externals/_arff.py:782: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
sklearn/utils/_testing.py:1317: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
sklearn/utils/_testing.py:1318: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
sklearn/utils/_testing.py:1319: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
sklearn/tree/_classes.py:1991: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
sklearn/utils/tests/test_tags.py:76: note: By default the bodies of untyped functions are not checked, consider using --check-untyped-defs  [annotation-unchecked]
sklearn/linear_model/tests/test_ridge.py:650: error: Name "test_ridgecv_regression_refit" already defined on line 634  [no-redef]
Found 1 error in 1 file (checked 559 source files)

Generated for commit: aa386ef. Link to the linter CI: here

@AhmedThahir
Copy link
Author

I think I'm done from my side with the code changes and documentation.

Code changes for ElasticNetCV and MultitaskElasticNetCV will be done by @paulAdrienMarie

paulAdrienMarie added a commit to paulAdrienMarie/scikit-learn that referenced this pull request Dec 11, 2024
paulAdrienMarie added a commit to paulAdrienMarie/scikit-learn that referenced this pull request Dec 11, 2024
@AhmedThahir
Copy link
Author

AhmedThahir commented Dec 11, 2024

@kayo09, please do not make such reviews - it confuses us.

Only @paulAdrienMarie and I are assigned to work on this PR.

Edit: @kayo09, could you try to remove the review.

Copy link
Author

@AhmedThahir AhmedThahir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@paulAdrienMarie The refit parameter was supposed to appear after cv parameter. This is not a programming issue, but a conflict with the sklearn docs automation, as I followed the scheme followed by GridSearchCV and accordingly made the documentation in the same order, ie refit after cv.

Just sharing this as feedback so that it may be useful for you in the future - not as criticism.

This was the error message from which I understood:

AssertionError: Docstring Error:
In function: sklearn.linear_model._coordinate_descent.ElasticNetCV.__init__
There's a parameter name mismatch in function docstring w.r.t. function signature, at index 8 diff: 'cv' != 'refit'
Full diff:
['l1_ratio',
'eps',
'n_alphas',
'alphas',
'fit_intercept',
'precompute',
'max_iter',
'tol',
+  'refit',
'cv',
'copy_X',
-  'refit',
'verbose',
'n_jobs',
'positive',
'random_state',
'selection']
In function: sklearn.linear_model._coordinate_descent.MultiTaskElasticNetCV.__init__
There's a parameter name mismatch in function docstring w.r.t. function signature, at index 7 diff: 'cv' != 'refit'
Full diff:
['l1_ratio',
'eps',
'n_alphas',
'alphas',
'fit_intercept',
'max_iter',
'tol',
+  'refit',
'cv',
'copy_X',
-  'refit',
'verbose',
'n_jobs',
'random_state',
'selection']

@AhmedThahir AhmedThahir changed the title Allows disabling refitting of CV estimators ENH Allows disabling refitting of CV estimators Dec 12, 2024
@AhmedThahir
Copy link
Author

Hi @paulAdrienMarie

Could you help with fixing the code coverage issue?

What would be the best way to go about it?

@paulAdrienMarie
Copy link

Hi @AhmedThahir

I think we need to add unitests for the feature we added to the models using CrossValidation. I can take a look at it !

@AhmedThahir
Copy link
Author

AhmedThahir commented Jan 19, 2025

Hi @paulAdrienMarie, were you able to able to make any progress?

@jeremiedbb and @alifa98, could you pls support us? I've not been able to understand how to make appropriate tests.

Copy link
Contributor

@OmarManzoor OmarManzoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @AhmedThahir

For the tests maybe you could try fitting once with refit=True and once with refit=False and check that the attributes of interest like the alphas are similar. You could do this for all the CV estimators that you are modified to include this new behavior.

Once that is taken care of we might also need a ChangeLog entry.

sklearn/linear_model/_omp.py Outdated Show resolved Hide resolved
Co-authored-by: Omar Salman <omar.salman2007@gmail.com>
@AhmedThahir
Copy link
Author

AhmedThahir commented Feb 11, 2025

@OmarManzoor, thanks for the response :)

Thanks for the PR @AhmedThahir

:)

For the tests maybe you could try fitting once with refit=True and once with refit=False and check that the attributes of interest like the alphas are similar. You could do this for all the CV estimators that you are modified to include this new behavior.

Noted, can you check if this is a good idea to compare the best_score_: aa386ef. If yes, I'll replicate the same for other estimators as well.

Once that is taken care of we might also need a ChangeLog entry.

Okay, will contact you once that's done

@AhmedThahir
Copy link
Author

Hi there, can anyone support with this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.