DOC improve plot_grid_search_refit_callable.py and add links #30990
Conversation
These are the kinds of plots which I really think we should have much easier ways to do, either as Displays in sklearn, or in skore, not sure. I've changed the plots a bit, and I think they're better/more informative of what's happening. cc @glemaitre
Thanks for giving this example some love. Here is a first pass of suggestions related to the use of stratification when estimating the standard deviation of cross-validation score. I plan to do a second pass tomorrow.
```python
# We use GridSearchCV with our custom `best_low_complexity` function as the refit
# parameter. This function will select the model with the fewest PCA components that
# still performs within one standard deviation of the best model.

grid = GridSearchCV(
    pipe,
    cv=10,
```
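As a standalone sketch of what such a one-standard-deviation refit callable might look like (the real example's implementation may differ; the toy `cv_results` dict below is made up for illustration):

```python
import numpy as np


def best_low_complexity(cv_results):
    """Index of the model with the fewest PCA components whose mean test
    score is within one standard deviation of the best mean test score."""
    mean = np.asarray(cv_results["mean_test_score"])
    std = np.asarray(cv_results["std_test_score"])
    best = np.argmax(mean)
    threshold = mean[best] - std[best]
    # All candidates whose mean score clears the one-std threshold.
    candidates = np.flatnonzero(mean >= threshold)
    n_components = np.asarray(
        cv_results["param_reduce_dim__n_components"], dtype=int
    )
    # Among the candidates, keep the simplest model.
    return candidates[np.argmin(n_components[candidates])]


# Toy cv_results: the 8-component model is within one std of the best.
toy = {
    "mean_test_score": [0.80, 0.90, 0.91],
    "std_test_score": [0.02, 0.02, 0.02],
    "param_reduce_dim__n_components": [4, 8, 16],
}
print(best_low_complexity(toy))  # picks index 1 (8 components)
```

Such a callable is then passed as `refit=best_low_complexity` to `GridSearchCV`, which refits the estimator at the returned index.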
Let's recommend users to use non-stratified CV for such a use case. Using stratification can make the standard deviation of the validation scores degenerate on imbalanced data. Here the dataset is balanced, so stratification should have no impact. However, since this example might be copy-pasted to be reused on imbalanced data, I think it's safer to advise a less brittle way to estimate epistemic uncertainty.
Suggested change:

```diff
-    cv=10,
+    # Use a non-stratified CV strategy to make sure that the inter-fold
+    # standard deviation of the test scores is informative.
+    cv=ShuffleSplit(n_splits=10, random_state=0),
```
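To illustrate the point about stratification, here is a standalone sketch (the dataset parameters are made up, not taken from the example): on imbalanced data, stratified folds all share the same class ratio, which can shrink the fold-to-fold spread of scores and understate uncertainty.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, StratifiedKFold, cross_val_score

# Imbalanced toy problem (hypothetical parameters, for illustration only).
X, y = make_classification(n_samples=600, weights=[0.95, 0.05], random_state=0)
clf = LogisticRegression(max_iter=1000)

strat_scores = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))
shuffle_scores = cross_val_score(clf, X, y, cv=ShuffleSplit(n_splits=10, random_state=0))

# With stratification every fold sees the same class ratio, so the
# inter-fold standard deviation can collapse on imbalanced data.
print(f"stratified std: {strat_scores.std():.4f}")
print(f"shuffled   std: {shuffle_scores.std():.4f}")
```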
BTW, using more iterations yields smoother curves that look better and should also lead to a more stable selection of the best number of PCA components: `cv=ShuffleSplit(n_splits=30, test_size=0.1, random_state=42)`. But it makes the example run a bit slower.
yeah 30 is too slow for the CI I'd say.
```python
# We create a pipeline with two steps:
# 1. Dimensionality reduction using PCA
# 2. Classification using LinearSVC
```
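A minimal sketch of such a two-step pipeline (the step names and the use of `load_digits` are my assumptions, not necessarily the example's exact code):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

# Hypothetical step names; grid-search parameters would then be addressed
# as e.g. "reduce_dim__n_components".
pipe = Pipeline(
    [
        ("reduce_dim", PCA(random_state=42)),
        ("classify", LinearSVC(dual=False, random_state=42)),
    ]
)
pipe.fit(X, y)
print(f"training accuracy: {pipe.score(X, y):.3f}")
```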
I think I would always recommend using `LogisticRegression` over linear SVC nowadays. Those models have similar ROC-AUC capabilities, but only LR can output interpretable confidence scores with `predict_proba` (and be evaluated with a proper scoring rule such as Brier score or log loss). Furthermore, liblinear's `sample_weight` support seems to be broken in subtle ways that might be difficult to fix, so I would rather stop implicitly recommending this model in our examples.
I tried the example with `LogisticRegression` and the results are very similar, but it's required to pass a larger `max_iter` value to avoid warnings (e.g. `max_iter=1000`).
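A sketch of the suggested swap (step names and `n_components` are hypothetical), using `max_iter=1000` to avoid convergence warnings and `predict_proba` with log loss for evaluation:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter=1000 avoids lbfgs convergence warnings, as noted above.
pipe = Pipeline(
    [
        ("reduce_dim", PCA(n_components=32, random_state=42)),
        ("classify", LogisticRegression(max_iter=1000)),
    ]
)
pipe.fit(X_train, y_train)

# Unlike LinearSVC, LogisticRegression exposes predict_proba, so the model
# can be scored with a proper scoring rule such as log loss.
proba = pipe.predict_proba(X_test)
print(f"log loss: {log_loss(y_test, proba):.3f}")
```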
At the end of the day, I'm under the impression that we are doing a validation curve here. Right now, the …
@glemaitre That's a very different plot though, here we're simply plotting the measured metric(s). I do agree that …
LGTM. Just wondering if one could make more use of `results_df` and shorten the code.
Since the grid search results can include combinations of more than one hyperparameter at once, I am not sure how that would work. I agree, let's keep this discussion for a follow-up issue to avoid side-tracking the review of this example.
Some more feedback. LGTM overall.
```python
# selection of the "best" model is desired.

# Adjust layout and display the figure
plt.tight_layout()
plt.show()
```
I think we don't need that last cell now that the example has been converted to a notebook-style example. Also, passing `constrained_layout=True` to the `plt.subplots` call above is likely a better solution to fix overlapping label and axis issues in general.
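A standalone sketch of that suggestion (the data and panel contents are made up, just to show the layout mechanism):

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# constrained_layout=True asks matplotlib to manage spacing itself,
# which makes a trailing plt.tight_layout() call unnecessary.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), constrained_layout=True)

n_components = np.array([1, 2, 4, 8, 16, 32, 64])
scores = np.log(n_components) / np.log(64)  # made-up scores, illustration only
ax1.plot(n_components, scores, marker="o")
ax1.set_xlabel("n_components")
ax1.set_ylabel("mean test score")

ax2.bar(["best", "1-std rule"], [0.92, 0.91])  # made-up values
ax2.set_ylabel("accuracy")

buf = io.BytesIO()
fig.savefig(buf, format="png")
```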
Need to keep the `plt.show()` to actually show the plot when running the example.
I think with 'new' sphinx-gallery (>0.5.0) you don't need it for the plot to show, but it can be useful to avoid the text output. You could also use `_ = plt.tight_layout()` to avoid the text output.
When running locally, that doesn't show the plot.
Also, I would not be opposed to collapsing the 2 subplots into one that displays everything at once but using a bigger figure size.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Are we happy with the new plot?
Merging since there are no more unresolved comments.
Towards #30621
This adds links to the example, as well as improving the example itself.
I wonder if the plots can be done more easily, either with polars or matplotlib, or other libs. I'm not really a plotting person. Maybe @lucyleeow or @MarcoGorelli would have an idea (this uses `polars`). cc @StefanieSenger
It also makes the docstrings for `refit` more consistent with one another (which is an interesting case for comparing docstring efforts @lucyleeow).