Description
TLDR: Meta-issue for new contributors to add links to the examples in helpful places in the rest of the docs.
This meta-issue is a good place to start with your first contributions to scikit-learn.
This issue builds on top of #26927 and is introduced for easier maintainability. The goal is exactly the same as in the old issue.
Here, we make the examples more discoverable by adding links to them in relevant sections of the API documentation and the User Guide:
- the API documentation is made from the docstrings of public classes and functions, which can be found in the sklearn folder of the project
- the User Guide can be found in the doc/modules folder of the project
Together with the examples (which are in the examples folder of the project), these files get rendered into HTML when the documentation is built and are then displayed on the scikit-learn website.
Important: We estimate that only 70% of the examples in this list will ultimately be referenced. This means part of the task is deciding which examples deserve being referenced and we are aware that this is not a trivial decision, especially for new contributors. We encourage you to share your reasoning, and a team member will make the final call. We hope this isn’t too frustrating, but please know that evaluating an example is not just an exercise for new contributors; it’s a meaningful and valuable contribution to the project, even (and especially) if the example you worked on doesn’t end up being linked.
Workflow
We recommend this workflow for you:
- have pre-commit installed in your environment as described in point 10 of How to contribute in the development guide (this will re-format your contribution to the standards used in scikit-learn and will spare you a lot of confusion when you are a beginner)
- pick an example to work on
  - make sure your example of interest has not recently been claimed by someone else by looking through the discussion of this issue (you will have to load hidden items in this discussion). Hint: if somebody claimed an example several weeks ago and then never started it, you can take it. You can also take over tasks marked as stalled.
  - search the repo for other links to your example and check if the example is already linked in relevant parts of the docs
    - how to search the repo: a) find the file name of your example in the examples folder (it starts with plot_...); b) use the full-text search of your IDE to look for where that name appears
    - you can ignore the "Gallery examples" on the website, as they are auto-generated; only look for real links in the repo
  - comment on the issue to claim an example (you don't need to wait for a team member's approval before starting to work)
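The repo search described in this step can also be scripted. The following is a minimal sketch and not part of the scikit-learn tooling: the helper name find_example_references, the restriction to .py and .rst files under sklearn and doc, and the skipped folder names (assumed to hold generated output) are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: these folders only contain generated output, not real links.
SKIPPED_DIRS = {"auto_examples", "_build"}


def find_example_references(repo_root, example_filename, folders=("sklearn", "doc")):
    """Return (relative path, line number, line) for lines naming the example.

    Hypothetical helper: it does a plain substring search for the file
    name, which is also the last part of the sphx_glr_... reference label.
    """
    repo_root = Path(repo_root)
    hits = []
    for folder in folders:
        base = repo_root / folder
        if not base.is_dir():
            continue
        for path in sorted(base.rglob("*")):
            if not path.is_file() or path.suffix not in {".py", ".rst"}:
                continue
            if any(part in SKIPPED_DIRS for part in path.relative_to(base).parts):
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
            for number, line in enumerate(text.splitlines(), start=1):
                if example_filename in line:
                    hits.append((path.relative_to(repo_root), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Run from the root of a local scikit-learn checkout.
    for path, number, line in find_example_references(".", "plot_ols.py"):
        print(f"{path}:{number}: {line}")
```

An empty result suggests the example is not linked anywhere yet; any hits still need to be read in context to judge whether they sit in a relevant place.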
- find suitable spots in either the API documentation or the User Guide (or both) where users would be happy to find your example linked
  - read through your example and understand where it makes its most useful statements
  - how to find a good spot (careful: we are extremely picky here)
    - if the example demonstrates a certain real-world use case: find where in the User Guide the same use case is treated or could be treated
    - if the example shows how to use a certain param: the param description in the API documentation might be a good spot to put the link
    - if the example compares different techniques: this strongly suggests mentioning it in the more theoretical parts of the User Guide
    - not all the examples listed here need to be referenced: a link to an example that simply shows how to use some estimator doesn't add enough value
    - if you find an example that doesn't add enough value to be linked: please leave a comment here; this kind of contribution is highly appreciated
    - not a good spot: the See Also section, which is (theoretically) reserved for links to other API functionalities, not examples
- add links
  - an example with the path examples/developing_estimators/sklearn_is_fitted.py would be referenced like this: :ref:`sphx_glr_auto_examples_developing_estimators_sklearn_is_fitted.py`
  - see this example PR, which shows how to add a link to the User Guide: DOC add link to sklearn_is_fitted example in check_is_fitted #26926
  - we aim not to use the .. rubric:: Examples section to put the example if possible, but to integrate it into the text; be aware that if you add a link like this :ref:`title <link>`, you can change its title so that the example's title gets replaced by your chosen title and the link fits more naturally into the sentence
  - please avoid adding your link to a list of other examples, since we strive to add the links in the most relevant places
  - please avoid adding a new .. rubric:: Examples section
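The mapping from an example's path to its reference label can be sketched in a few lines. This is an illustration that generalizes from the single path and label pair given in this step; the function names example_ref_label and example_ref and the link title used below are hypothetical, and the resulting label should always be confirmed by building the docs.

```python
from pathlib import PurePosixPath


def example_ref_label(example_path):
    """Build the reference label for a file in the examples folder.

    Assumed pattern, taken from the sklearn_is_fitted.py case: drop the
    leading "examples" folder, join the remaining parts with underscores
    and prefix the result with "sphx_glr_auto_examples_".
    """
    parts = PurePosixPath(example_path).parts
    if not parts or parts[0] != "examples":
        raise ValueError(f"expected a path inside examples/, got {example_path!r}")
    return "sphx_glr_auto_examples_" + "_".join(parts[1:])


def example_ref(example_path, title=None):
    """Return the :ref: role, optionally with a custom link title."""
    label = example_ref_label(example_path)
    if title is None:
        return f":ref:`{label}`"
    return f":ref:`{title} <{label}>`"


path = "examples/developing_estimators/sklearn_is_fitted.py"
print(example_ref(path))
# The title below is a made-up phrase that would be woven into a sentence.
print(example_ref(path, title="this example on custom fitted checks"))
```

The second form is the one that lets the link read as part of a sentence instead of showing the example's own title.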
- test build the documentation before opening your PR
  - have a look at the Documentation part of the Development Guide to learn how to build the documentation locally
  - check that your changes are displayed as desired by opening the test build in your browser
- open PR
  - use a PR title like DOC add links to <name of example> (starting with DOC)
  - do not refer to this issue in the title of the PR; instead:
    - refer to this issue in the Reference Issues/PRs section of your PR using "Towards #30621" (do not use "Closes #..." or "Fixes #...")
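The title and reference conventions of this step can be written down as a small self-check. This is only a sketch: check_pr_text is a hypothetical helper, not something the project or its CI provides, and it encodes nothing beyond the rules listed in this step.

```python
import re

ISSUE_NUMBER = 30621  # the meta-issue described here


def check_pr_text(title, body):
    """Return a list of problems with a PR title and description.

    Hypothetical helper that only encodes the conventions of this step.
    """
    problems = []
    if not title.startswith("DOC "):
        problems.append("title should start with DOC")
    if f"#{ISSUE_NUMBER}" in title:
        problems.append("title should not refer to the meta-issue")
    if re.search(rf"\b(closes|fixes)\s+#{ISSUE_NUMBER}\b", body, flags=re.IGNORECASE):
        problems.append('use "Towards", not "Closes" or "Fixes"')
    if f"Towards #{ISSUE_NUMBER}" not in body:
        problems.append(f'description should contain "Towards #{ISSUE_NUMBER}"')
    return problems


# Made-up title and description that follow the conventions.
print(check_pr_text("DOC add links to plot_ols_3d example", "Towards #30621"))
```

"Towards" is used because this issue tracks many examples and must stay open after any single PR is merged.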
- check the CI
  - after the CI tests have finished (~90 minutes) you can find one that says "Check the rendered docs here!". There, you can look at how the CI has built the documentation for the changed files to check that everything looks right. You will see something like auto_examples/path_to_example, [dev], [stable], where the first link is your branch's version, the second is the main dev branch and the third is the last released scikit-learn version that is used for the stable documentation on the website.
  - if the CI shows any failure, you should take action by investigating and proposing solutions; as a rule of thumb, you will find the most useful information in the CI output if you click the upper links first; in any case you need to click through several layers until you see actual test results with more information (and until it looks similar to running pytest, ruff or doctest locally)
  - if the CI shows linting issues, check that you have installed and activated pre-commit properly, and fix the issue with the action the CI proposes (for instance adding or deleting an empty line)
  - if you are lost and don't know what to do with a CI failure, look through other PRs from this issue; most things have already happened to others
  - sometimes, HTTP request errors such as 404 or 405 show up in the CI, in which case you should push an empty commit (git commit --allow-empty -m "empty commit to re-trigger CI")
- wait for reviews and be ready to adjust your contribution later on
Expectation management for new contributors
How long will it take until you open your first PR?
- 8-16 hours if you have never contributed to any project and have only basic or no understanding of the workflow yet
- 2-8 hours if you know the workflow and are just new to scikit-learn (more to the shorter end if you know about linting and sphinx and are able to address CI outputs)
- 1-2 hours for your 2nd, 3rd, ... PR on the same issue for everyone
How long will it take us to merge your PR?
- we strive for a scikit-learn member to look at your PR within a few days and suggest changes depending on technical quality of the PR and an assessment of added value to the user
- we strive for a maintainer to evaluate your PR within a few weeks; they might also suggest changes before approving and merging
- the whole process takes several weeks on average and can take up to months, depending on the availability of maintainers and on how many review cycles are necessary
ToDo
Here's a list of all the remaining examples:
- examples/applications:
- plot_model_complexity_influence.py #no references need to be added: DOC: Added link to example model complexity influence #30814
- plot_out_of_core_classification.py DOC Add link for prediction latency plot for classification benchmark #30462 (stalled)
- plot_prediction_latency.py DOC Add link for prediction latency plot for classification benchmark #30462 (stalled)
- plot_topics_extraction_with_nmf_lda.py
- examples/bicluster:
- plot_bicluster_newsgroups.py
- plot_spectral_coclustering.py Added Document refernce to spectral coclustering #29606 (stalled)
- examples/calibration:
- plot_compare_calibration.py
- examples/classification:
- plot_classifier_comparison.py
- plot_digits_classification.py #no references need to be added
- examples/cluster:
- plot_agglomerative_clustering_metrics.py DOC add link to cluster_plot_agglomerative_clustering example in Aggl… #30867
- plot_cluster_comparison.py DOC Linked examples for clustering algorithms in their docstrings (#26927) #30127
- plot_coin_ward_segmentation.py DOC: added link to cluster_plot_coin_ward_segmentation example in feature_extraction.grid_to_graph #30916
- plot_dict_face_patches.py #no references need to be added
- plot_digits_agglomeration.py DOC: add link to example plot_digits_agglomeration.py #30979
- plot_digits_linkage.py
- plot_face_compress.py
- plot_inductive_clustering.py DOC add link plot_inductive_clustering #30182
- plot_segmentation_toy.py #no references need to be added Added documentation reference to plot_segmentation_toy.py to spectral.py #30978
- plot_ward_structured_vs_unstructured.py DOC add link to cluster_plot_ward_structured_vs_unstructured in _aggl… #30861
- examples/covariance:
- plot_mahalanobis_distances.py
- plot_robust_vs_empirical_covariance.py
- plot_sparse_cov.py DOC Add link to plot_sparse_cov example #31278
- examples/decomposition:
- plot_ica_blind_source_separation.py #no references need to be added: DOC Added references to plot_ica_blind_source_separation & plot_ica_vs_pca.py #30786
- plot_ica_vs_pca.py #no references need to be added: DOC Added references to plot_ica_blind_source_separation & plot_ica_vs_pca.py #30786
- plot_image_denoising.py Added example plot_image_denoising.py in User Guide #30864
- plot_sparse_coding.py
- plot_varimax_fa.py
- examples/ensemble:
- plot_bias_variance.py Fix #30621 Added reference to plot_bias_variance.py in learning_curve.rst #30845
- plot_ensemble_oob.py
- plot_feature_transformation.py
- plot_forest_hist_grad_boosting_comparison.py
- plot_forest_importances_faces.py
- plot_forest_importances.py #no references need to be added
- plot_forest_iris.py #no references need to be added
- plot_gradient_boosting_categorical.py DOC added links to plot_gradient_boosting_regularization.py and plot_gradient_boosting_categorical.py #30749
- plot_gradient_boosting_oob.py DOC added links to plot_gradient_boosting_regularization.py and plot_gradient_boosting_categorical.py #30749
- plot_gradient_boosting_regularization.py DOC added links to plot_gradient_boosting_regularization.py and plot_gradient_boosting_categorical.py #30749
- plot_monotonic_constraints.py
- plot_random_forest_regression_multioutput.py
- plot_stack_predictors.py DOC add example plot_stack_predictors.py for Stacked Generalization in ensemble.rst #30747
- plot_voting_decision_regions.py #no references need to be added DOC add link to plot_voting_probas.py for Voting Classifier in ensemble.rst #30847
- plot_voting_probas.py DOC add link to plot_voting_probas.py for Voting Classifier in ensemble.rst #30847
- examples/feature_selection:
- plot_feature_selection.py #no references need to be added Added a reference link to plot_feature_selection.py #31000
- plot_f_test_vs_mi.py #no references need to be added
- plot_rfe_with_cross_validation.py
- plot_select_from_model_diabetes.py
- examples/gaussian_process:
- plot_gpc_iris.py DOC Add missing links to Gaussian Process Classification #30605
- plot_gpc_isoprobability.py DOC Add missing links to Gaussian Process Classification #30605
- plot_gpc.py DOC Add missing links to Gaussian Process Classification #30605
- plot_gpc_xor.py DOC Add missing links to Gaussian Process Classification #30605
- plot_gpr_co2.py
- plot_gpr_noisy.py
- plot_gpr_noisy_targets.py DOC add link to plot_gpr_noisy_targets example in _gpr.py #30850
- plot_gpr_on_structured_data.py DOC add link to plot_gpr_on_structured_data example in gaussian_process #31150
- plot_gpr_prior_posterior.py
- examples/inspection:
- plot_causal_interpretation.py DOC Inspection Examples links in User Guide #30752
- plot_linear_model_coefficient_interpretation.py
- plot_permutation_importance_multicollinear.py
- plot_permutation_importance.py
- examples/linear_model:
- plot_ard.py
- plot_huber_vs_ridge.py
- plot_iris_logistic.py
- plot_lasso_and_elasticnet.py DOC Add link to plot_lasso_and_elasticnet.py example in linear model #30587
- plot_lasso_coordinate_descent_path.py
- plot_lasso_dense_vs_sparse_data.py
- plot_lasso_lars_ic.py
- plot_lasso_lars.py
- plot_lasso_model_selection.py
- plot_logistic_l1_l2_sparsity.py
- plot_logistic_multinomial.py
- plot_logistic_path.py
- plot_logistic.py DOC: Add missing link to plot_logistic.py in Logistic Regression documentation #30942
- plot_multi_task_lasso_support.py
- plot_nnls.py DOC: Add link to plot_nnls example #31280
- plot_ols_3d.py
- plot_ols.py #no references need to be added
- plot_ols_ridge_variance.py DOC add plot_ols_ridge_variance example to the doc #30683
- plot_omp.py
- plot_poisson_regression_non_normal_loss.py
- plot_polynomial_interpolation.py
- plot_quantile_regression.py
- plot_ridge_coeffs.py
- plot_ridge_path.py
- plot_robust_fit.py
- plot_sgd_comparison.py
- plot_sgd_iris.py
- plot_sgd_separating_hyperplane.py
- plot_sgd_weighted_samples.py
- plot_sparse_logistic_regression_20newsgroups.py
- plot_sparse_logistic_regression_mnist.py
- plot_theilsen.py
- plot_tweedie_regression_insurance_claims.py
- examples/manifold:
- plot_lle_digits.py
- plot_manifold_sphere.py DOC Added an example reference for plot_manifold_sphere.py #30959
- plot_swissroll.py
- plot_t_sne_perplexity.py
- examples/miscellaneous:
- plot_anomaly_comparison.py
- plot_display_object_visualization.py
- plot_estimator_representation.py
- plot_johnson_lindenstrauss_bound.py
- plot_kernel_approximation.py
- plot_metadata_routing.py
- plot_multilabel.py
- plot_multioutput_face_completion.py #no references need to be added
- plot_outlier_detection_bench.py
- plot_partial_dependence_visualization_api.py
- plot_pipeline_display.py
- plot_roc_curve_visualization_api.py
- plot_set_output.py
- examples/mixture:
- plot_concentration_prior.py
- plot_gmm_covariances.py doc: add link to the plot_gmm_covariances example #31249
- plot_gmm_init.py
- plot_gmm_pdf.py DOC Add link to plot_gmm_pdf.py in GaussianMixture examples #31230
- plot_gmm.py #no references need to be added: Add Doc for GMM Example #30841
- plot_gmm_selection.py Add Doc for GMM Example #30841
- plot_gmm_sin.py #no references need to be added: Add Doc for GMM Example #30841
- examples/model_selection:
- plot_confusion_matrix.py DOC add link to plot_confusion_matrix example in confusion_matrix.py #30949
- plot_cv_predict.py
- plot_det.py #no references need to be added
- plot_grid_search_digits.py
- plot_grid_search_refit_callable.py
- plot_grid_search_stats.py DOC added reference to plot_grid_search_stats.py #30965
- plot_grid_search_text_feature_extraction.py Added link to plot_grid_search_text_feature_extraction.py under TfidfVectorizer #30974
- plot_likelihood_ratios.py
- plot_multi_metric_evaluation.py
- plot_permutation_tests_for_classification.py
- plot_precision_recall.py #no reference needs to be added
- plot_randomized_search.py
- plot_roc_crossval.py
- plot_roc.py
- plot_successive_halving_heatmap.py
- plot_successive_halving_iterations.py
- plot_train_error_vs_test_error.py
- plot_underfitting_overfitting.py #no references need to be added
- plot_validation_curve.py #had been merged with another example in DOC merge example presenting the concept of validation curve #29936
- examples/neighbors:
- plot_digits_kde_sampling.py
- plot_kde_1d.py
- plot_lof_novelty_detection.py
- plot_lof_outlier_detection.py
- plot_nca_classification.py #no references need to be added DOC Add missing links to Neighborhood Components Analysis #30849
- plot_nca_dim_reduction.py #no references need to be added DOC Add missing links to Neighborhood Components Analysis #30849
- plot_nca_illustration.py #no references need to be added DOC Add missing links to Neighborhood Components Analysis #30849
- plot_species_kde.py
- examples/semi_supervised:
- plot_label_propagation_digits_active_learning.py #no references need to be added DOC improve headings in LabelSpreading examples #30553
- plot_label_propagation_digits.py #no references need to be added DOC improve headings in LabelSpreading examples #30553
- plot_label_propagation_structure.py #no references need to be added DOC improve headings in LabelSpreading examples #30553
- plot_self_training_varying_threshold.py
- plot_semi_supervised_newsgroups.py DOC add link to plot_semi_supervised_newsgroups.py example in semi_supervised.rst #30882
- plot_semi_supervised_versus_svm_iris.py
- examples/svm:
- plot_custom_kernel.py
- plot_iris_svc.py
- plot_linearsvc_support_vectors.py
- plot_oneclass.py
- plot_rbf_parameters.py
- plot_separating_hyperplane.py DOC Merge plot_svm_margin.py and plot_separating_hyperplane.py into plot_svm_hyperplane_margin.py #31045
- plot_separating_hyperplane_unbalanced.py
- plot_svm_anova.py
- plot_svm_margin.py DOC added link to example plot_svm_margin.py #26969 (stalled) DOC added link to example plot_svm_margin.py #30975 (maybe remove the example) DOC Merge plot_svm_margin.py and plot_separating_hyperplane.py into plot_svm_hyperplane_margin.py #31045 is on merging this example with plot_separating_hyperplane.py
- plot_weighted_samples.py DOC Added references to plot_weighted_samples example in SVM documentation #30676
- examples/tree:
- plot_iris_dtc.py #no references need to be added DOC add link to plot_iris_dtc example in DecisionTreeClassifier documentation #30650
- plot_tree_regression_multioutput.py #was merged with another example in DOC Add link to plot_tree_regression.py example #26962
- plot_unveil_tree_structure.py #no references need to be added
What comes next?
- after working a bit here, you might want to further explore contributing to scikit-learn
- we have Improve tests by using global_random_seed fixture to make them less seed-sensitive #22827 and Fix broken links in the documentation #25024, which are both also suitable for beginners but might move forward a little more slowly than this issue
- we are looking for people who are willing to do some intense work to improve or merge some examples; these will be PRs that will be intensely discussed and thoroughly reviewed and will probably take several months; if this sounds good to you, please open an issue with a suggestion and maintainers will evaluate your idea
- this could look like DOC rework the example presenting the regularization path of Lasso, Lasso-LARS, and Elastic Net #29963 and DOC merging the examples related to OPTICS, DBSCAN, and HDBSCAN #29962
- we also have an open issue to discuss examples that can be removed: RFC remove some of our examples #27151
- if you are more senior professionally, you can look through the issues with the help wanted label or with the moderate label, or you can take over stalled PRs; this kind of contribution needs to be discussed with maintainers, and I would recommend seeking their approval first and not investing too much work before you get a go-ahead