MAINT fragmenting the changelog of 1.6 #30081

Merged
merged 6 commits into from
Oct 17, 2024

Conversation

glemaitre
Member

@glemaitre glemaitre commented Oct 16, 2024

Follow-up of #30046

Fragmenting the changelog of 1.6 for each entry.

A follow-up to this PR is to activate the proper GitHub actions to check the fragments on PRs.
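
As a rough illustration of what such a check could verify, the sketch below validates that a fragment path follows a `<PR number>.<type>.rst` naming scheme under `doc/whats_new/upcoming_changes/<section>/`. The function name `check_fragment_path` and the set of allowed types are assumptions made for illustration (the types mirror the changelog tags plus a catch-all `other`); this is not the actual GitHub action.

```python
import re
from pathlib import PurePosixPath

# Assumed fragment types, mirroring the changelog tags plus a catch-all "other".
ALLOWED_TYPES = {
    "major-feature", "feature", "efficiency", "enhancement", "fix", "api", "other",
}
ROOT = PurePosixPath("doc/whats_new/upcoming_changes")
NAME_PATTERN = re.compile(r"^(\d+)\.([a-z-]+)\.rst$")


def check_fragment_path(path):
    """Return a list of problems found for one fragment path (empty if fine)."""
    path = PurePosixPath(path)
    problems = []
    # A fragment is expected exactly one folder below the root folder.
    if path.parent.parent != ROOT:
        problems.append(f"expected {ROOT}/<section>/<file>, got {path}")
    match = NAME_PATTERN.match(path.name)
    if match is None:
        problems.append(f"file name {path.name!r} is not <PR number>.<type>.rst")
    elif match.group(2) not in ALLOWED_TYPES:
        problems.append(f"unknown fragment type {match.group(2)!r}")
    return problems
```

A CI job could call this on every added file and fail when the returned list is non-empty.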

For reference, the current fragmentation will generate the following RST file:

Version 1.6.0 (2024-10-17)
==========================

Changes impacting many modules
------------------------------

- |Enhancement| `__sklearn_tags__` was introduced for setting tags in estimators.
  More details in :ref:`estimator_tags`.
  By :user:`Thomas Fan <thomasjpfan>` and :user:`Adrin Jalali <adrinjalali>` in :pr:`29677`

- |API| :func:`utils.validation.validate_data` is introduced and replaces previously
  private `base.BaseEstimator._validate_data` method. This is intended for third party
  estimator developers, who should use this function in most cases instead of
  :func:`utils.validation.check_array` and :func:`utils.validation.check_X_y`.
  By :user:`Adrin Jalali <adrinjalali>` in :pr:`29696`

Support for Array API
---------------------

Additional estimators and functions have been updated to include support for all
`Array API <https://data-apis.org/array-api/latest/>`_ compliant inputs.

See :ref:`array_api` for more details.

- |Feature| :class:`model_selection.GridSearchCV`,
  :class:`model_selection.RandomizedSearchCV`,
  :class:`model_selection.HalvingGridSearchCV` and
  :class:`model_selection.HalvingRandomSearchCV` now support Array API
  compatible inputs when their base estimators do.
  By :user:`Tim Head <betatim>` and :user:`Olivier Grisel <ogrisel>` in :pr:`27096`

- |Feature| :class:`preprocessing.LabelEncoder` now supports Array API compatible inputs.
  By :user:`Omar Salman <OmarManzoor>` in :pr:`27381`

- |Feature| :func:`sklearn.metrics.mean_absolute_error` by :user:`Edoardo Abati <EdAbati>` in :pr:`27736`

- |Feature| :func:`sklearn.metrics.mean_tweedie_deviance` by :user:`Thomas Li <lithomas1>` in :pr:`28106`

- |Feature| :func:`sklearn.metrics.pairwise.cosine_similarity` by :user:`Edoardo Abati <EdAbati>` in :pr:`29014`

- |Feature| :func:`sklearn.metrics.pairwise.paired_cosine_distances` by :user:`Edoardo Abati <EdAbati>` in :pr:`29112`

- |Feature| :func:`sklearn.metrics.cluster.entropy` by :user:`Yaroslav Korobko <Tialo>` in :pr:`29141`

- |Feature| :func:`sklearn.metrics.mean_squared_error` by :user:`Yaroslav Korobko <Tialo>` in :pr:`29142`

- |Feature| :func:`sklearn.metrics.mean_absolute_error` by :user:`Tialo <Tialo>` and
  :user:`Loïc Estève <lesteve>` in :pr:`29143`

- |Feature| :func:`sklearn.metrics.pairwise.additive_chi2_kernel` by
  :user:`Yaroslav Korobko <Tialo>` in :pr:`29144`

- |Feature| :func:`sklearn.metrics.d2_tweedie_score` by :user:`Emily Chen <EmilyXinyi>` in :pr:`29207`

- |Feature| :func:`sklearn.metrics.max_error` by :user:`Edoardo Abati <EdAbati>` in :pr:`29212`

- |Feature| :func:`sklearn.metrics.mean_poisson_deviance` by :user:`Emily Chen <EmilyXinyi>` in :pr:`29227`

- |Feature| :func:`sklearn.metrics.mean_gamma_deviance` by :user:`Emily Chen <EmilyXinyi>` in :pr:`29239`

- |Feature| :func:`sklearn.metrics.pairwise.cosine_distances` by :user:`Emily Chen <EmilyXinyi>` in :pr:`29265`

- |Feature| :func:`sklearn.metrics.pairwise.chi2_kernel` by :user:`Yaroslav Korobko <Tialo>` in :pr:`29267`

- |Feature| :func:`sklearn.metrics.mean_absolute_percentage_error` by
  :user:`Emily Chen <EmilyXinyi>` in :pr:`29300`

- |Feature| :func:`sklearn.metrics.pairwise.paired_euclidean_distances` by :user:`Emily Chen <EmilyXinyi>` in :pr:`29389`

- |Feature| :func:`sklearn.metrics.pairwise.euclidean_distances` and
  :func:`sklearn.metrics.pairwise.rbf_kernel` by :user:`Omar Salman <OmarManzoor>` in :pr:`29433`

- |Feature| :func:`sklearn.metrics.pairwise.linear_kernel`,
  :func:`sklearn.metrics.pairwise.sigmoid_kernel`, and
  :func:`sklearn.metrics.pairwise.polynomial_kernel` by
  :user:`Omar Salman <OmarManzoor>` in :pr:`29475`

- |Feature| :func:`sklearn.metrics.mean_squared_log_error` and
  :func:`sklearn.metrics.root_mean_squared_log_error`
  by :user:`Virgil Chan <virchan>` in :pr:`29709`

- |Feature| :class:`preprocessing.MinMaxScaler` with `clip=True`.
  By :user:`Shreekant Nandiyawar <Shree7676>` in :pr:`29751`

- Support for the soon to be deprecated `cupy.array_api` module has been
  removed in favor of directly supporting the top level `cupy` module, possibly
  via the `array_api_compat.cupy` compatibility wrapper.
  By :user:`Olivier Grisel <ogrisel>` in :pr:`29639`

Metadata routing
----------------

Refer to the :ref:`Metadata Routing User Guide <metadata_routing>` for
more details.

- |Feature| :class:`semi_supervised.SelfTrainingClassifier`
  now supports metadata routing. The fit method now accepts ``**fit_params``
  which are passed to the underlying estimators via their `fit` methods.
  In addition, the `predict`, `predict_proba`, `predict_log_proba`, `score`
  and `decision_function` methods also accept ``**params`` which are
  passed to the underlying estimators via their respective methods.
  By :user:`Adam Li <adam2392>` in :pr:`28494`

- |Feature| :class:`ensemble.StackingClassifier` and
  :class:`ensemble.StackingRegressor` now support metadata routing and pass
  ``**fit_params`` to the underlying estimators via their `fit` methods.
  By :user:`Stefanie Senger <StefanieSenger>` in :pr:`28701`

- |Feature| :func:`model_selection.learning_curve` now supports metadata routing for the
  `fit` method of its estimator and for its underlying CV splitter and scorer.
  By :user:`Stefanie Senger <StefanieSenger>` in :pr:`28975`

- |Feature| :class:`compose.TransformedTargetRegressor` now supports metadata
  routing in its `fit` and `predict` methods and routes the corresponding
  params to the underlying regressor.
  By :user:`Omar Salman <OmarManzoor>` in :pr:`29136`

- |Feature| :class:`feature_selection.SequentialFeatureSelector` now supports
  metadata routing in its `fit` method and passes the corresponding params to
  the :func:`model_selection.cross_val_score` function.
  By :user:`Omar Salman <OmarManzoor>` in :pr:`29260`

- |Feature| :func:`model_selection.permutation_test_score` now supports metadata routing
  for the `fit` method of its estimator and for its underlying CV splitter and scorer.
  By :user:`Adam Li <adam2392>` in :pr:`29266`

- |Feature| :class:`feature_selection.RFE` and :class:`feature_selection.RFECV`
  now support metadata routing.
  By :user:`Omar Salman <OmarManzoor>` in :pr:`29312`

- |Feature| :func:`model_selection.validation_curve` now supports metadata routing for
  the `fit` method of its estimator and for its underlying CV splitter and scorer.
  By :user:`Stefanie Senger <StefanieSenger>` in :pr:`29329`

- |Fix| Metadata is routed correctly to grouped CV splitters via
  :class:`linear_model.RidgeCV` and :class:`linear_model.RidgeClassifierCV` and
  `UnsetMetadataPassedError` is fixed for :class:`linear_model.RidgeClassifierCV` with
  default scoring.
  By :user:`Stefanie Senger <StefanieSenger>` in :pr:`29634`

Dropping official support for PyPy
----------------------------------

Due to limited maintainer resources and small number of users, official PyPy
support has been dropped. Some parts of scikit-learn may still work but PyPy is
not tested anymore in the scikit-learn Continuous Integration.
By :user:`Loïc Estève <lesteve>` in :pr:`29128`

Dropping support for building with setuptools
---------------------------------------------

From scikit-learn 1.6 onwards, support for building with setuptools has been
removed. Meson is the only supported way to build scikit-learn, see
:ref:`Building from source <install_bleeding_edge>` for more details.
By :user:`Loïc Estève <lesteve>` in :pr:`29400`

:mod:`sklearn.base`
-------------------

- |Enhancement| Added a function :func:`base.is_clusterer` which determines whether a given
  estimator is of category clusterer.
  By :user:`Christian Veenhuis <ChVeen>` in :pr:`28936`

:mod:`sklearn.cluster`
----------------------

- |API| The `copy` parameter of :class:`cluster.Birch` was deprecated in 1.6 and will be
  removed in 1.8. It has no effect as the estimator does not perform in-place operations
  on the input data.
  By :user:`Yao Xiao <Charlie-XIAO>` in :pr:`29124`

:mod:`sklearn.compose`
----------------------

- |Enhancement| :class:`sklearn.compose.ColumnTransformer` `verbose_feature_names_out`
  now accepts string format or callable to generate feature names.
  By :user:`Marc Bresson <MarcBresson>` in :pr:`28934`

:mod:`sklearn.covariance`
-------------------------

- |Efficiency| :class:`covariance.MinCovDet` fitting is now slightly faster.
  By :user:`Antony Lee <anntzer>` in :pr:`29835`

:mod:`sklearn.cross_decomposition`
----------------------------------

- |Fix| :class:`cross_decomposition.PLSRegression` properly raises an error when
  `n_components` is larger than `n_samples`.
  By :user:`Thomas Fan <thomasjpfan>` in :pr:`29710`

:mod:`sklearn.datasets`
-----------------------

- |Feature| :func:`datasets.fetch_file` allows downloading arbitrary data-file
  from the web. It handles local caching, integrity checks with SHA256 digests
  and automatic retries in case of HTTP errors.
  By :user:`Olivier Grisel <ogrisel>` in :pr:`29354`
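
The integrity check mentioned in this entry can be pictured with the standard library alone. The helper below, `sha256_matches`, is a hypothetical name and not part of scikit-learn; it only sketches how a downloaded file might be compared against an expected SHA256 digest, and says nothing about how `fetch_file` is actually implemented.

```python
import hashlib
from pathlib import Path


def sha256_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the file at `path` has the expected SHA256 hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        # Read in chunks so large downloads do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()
```

A caching layer could call such a helper before reusing a local copy and download the file again when it returns `False`.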

:mod:`sklearn.discriminant_analysis`
------------------------------------

- |Fix| :class:`discriminant_analysis.QuadraticDiscriminantAnalysis`
  will now cause `LinAlgWarning` in case of collinear variables. These errors
  can be silenced using the `reg_param` attribute.
  By :user:`Alihan Zihna <azihna>` in :pr:`19731`

:mod:`sklearn.ensemble`
-----------------------

- |Feature| :class:`ensemble.ExtraTreesClassifier` and
  :class:`ensemble.ExtraTreesRegressor` now support missing-values in the data matrix
  `X`. Missing-values are handled by randomly moving all of the samples to the left, or
  right child node as the tree is traversed.
  By :user:`Adam Li <adam2392>` in :pr:`28268`

- |Efficiency| Small runtime improvement of fitting
  :class:`ensemble.HistGradientBoostingClassifier` and
  :class:`ensemble.HistGradientBoostingRegressor` by parallelizing the initial search
  for bin thresholds.
  By :user:`Christian Lorentzen <lorentzenchr>` in :pr:`28064`

- |Efficiency| :class:`ensemble.IsolationForest` now runs parallel jobs
  during :term:`predict` offering a speedup of up to 2-4x on sample sizes
  larger than 2000 using `joblib`.
  By :user:`Adam Li <adam2392>` and :user:`Sérgio Pereira <sergiormpereira>` in :pr:`28622`

- |Enhancement| The verbosity of :class:`ensemble.HistGradientBoostingClassifier`
  and :class:`ensemble.HistGradientBoostingRegressor` got a more granular control. Now,
  `verbose = 1` prints only summary messages, `verbose >= 2` prints the full
  information as before.
  By :user:`Christian Lorentzen <lorentzenchr>` in :pr:`28179`

- |API| The parameter `algorithm` of :class:`ensemble.AdaBoostClassifier` is deprecated
  and will be removed in 1.8.
  By :user:`Jérémie du Boisberranger <jeremiedbb>` in :pr:`29997`

:mod:`sklearn.feature_extraction`
---------------------------------

- |Fix| :class:`feature_extraction.text.TfidfVectorizer` now correctly preserves the
  `dtype` of `idf_` based on the input data.
  By :user:`Guillaume Lemaitre <glemaitre>` in :pr:`30022`

:mod:`sklearn.impute`
---------------------

- |Fix| :class:`impute.KNNImputer` excludes samples with nan distances when
  computing the mean value for uniform weights.
  By :user:`Xuefeng Xu <xuefeng-xu>` in :pr:`29135`

- |Fix| Fixed :class:`impute.IterativeImputer` to make sure that it does not skip
  the iterative process when `keep_empty_features` is set to `True`.
  By :user:`Arif Qodari <arifqodari>` in :pr:`29779`

:mod:`sklearn.linear_model`
---------------------------

- |Fix| :class:`linear_model.LogisticRegressionCV` corrects sample weight handling
  for the calculation of test scores.
  By :user:`Shruti Nath <snath-xoc>` in :pr:`29419`

- |Fix| :class:`linear_model.LassoCV` and :class:`linear_model.ElasticNetCV` now
  take sample weights into account to define the search grid for the internally tuned
  `alpha` hyper-parameter.
  By :user:`John Hopfensperger <s-banach>` and :user:`Shruti Nath <snath-xoc>` in :pr:`29442`

- |Fix| :class:`linear_model.LogisticRegression`, :class:`linear_model.PoissonRegressor`,
  :class:`linear_model.GammaRegressor`, :class:`linear_model.TweedieRegressor`
  now take sample weights into account to decide when to fall back to `solver='lbfgs'`
  whenever `solver='newton-cholesky'` becomes numerically unstable.
  By :user:`Antoine Baker <antoinebaker>` in :pr:`29818`

- |Fix| :class:`linear_model.RidgeCV` now properly uses predictions on the same scale as
  the target seen during `fit`. These predictions are stored in `cv_results_` when
  `scoring != None`. Previously, the predictions were rescaled by the square root of the
  sample weights and offset by the mean of the target, leading to an incorrect estimate
  of the score.
  By :user:`Guillaume Lemaitre <glemaitre>`,
  :user:`Jérôme Dockes <jeromedockes>` and
  :user:`Hanmin Qin <qinhanmin2014>` in :pr:`29842`

- |Fix| :class:`linear_model.RidgeCV` now properly supports custom multioutput scorers
  by letting the scorer manage the multioutput averaging. Previously, the predictions
  and true targets were both squeezed to a 1D array before computing the error.
  By :user:`Guillaume Lemaitre <glemaitre>` in :pr:`29884`

- |API| Deprecates `copy_X` in :class:`linear_model.TheilSenRegressor` as the parameter
  has no effect. `copy_X` will be removed in 1.8.
  By :user:`Adam Li <adam2392>` in :pr:`29105`

:mod:`sklearn.manifold`
-----------------------

- |Efficiency| :func:`manifold.locally_linear_embedding` and
  :class:`manifold.LocallyLinearEmbedding` now allocate more efficiently the memory of
  sparse matrices in the Hessian, Modified and LTSA methods.
  By :user:`Giorgio Angelotti <giorgioangel>` in :pr:`28096`

:mod:`sklearn.metrics`
----------------------

- |Efficiency| :func:`sklearn.metrics.classification_report` is now faster by caching
  classification labels.
  By :user:`Adrin Jalali <adrinjalali>` in :pr:`29738`

- |Enhancement| :func:`sklearn.metrics.check_scoring` now accepts `raise_exc` to specify
  whether to raise an exception if a subset of the scorers in multimetric scoring fails
  or to return an error code.
  By :user:`Stefanie Senger <StefanieSenger>` in :pr:`28992`

- |Enhancement| Adds `zero_division` to :func:`cohen_kappa_score`. When there is a
  division by zero, the metric is undefined and this value is returned.
  By :user:`Marc Torrellas Socastro <marctorsoc>` and
  :user:`Stefanie Senger <StefanieSenger>` in :pr:`29210`

- |Fix| :func:`metrics.roc_auc_score` will now correctly return `np.nan` and
  warn user if only one class is present in the labels.
  By :user:`Gleb Levitski <glevv>` in :pr:`27412`

- |Fix| The functions :func:`metrics.mean_squared_log_error` and
  :func:`metrics.root_mean_squared_log_error` now check whether the inputs are within
  the correct domain for the function :math:`y=\log(1+x)`, rather than
  :math:`y=\log(x)`. The functions :func:`metrics.mean_absolute_error`,
  :func:`metrics.mean_absolute_percentage_error`, :func:`metrics.mean_squared_error`
  and :func:`metrics.root_mean_squared_error` now explicitly check whether a scalar
  will be returned when `multioutput=uniform_average`.
  By :user:`Virgil Chan <virchan>` in :pr:`29709`

- |API| The `force_all_finite` parameter of functions
  :func:`metrics.pairwise.check_pairwise_arrays` and :func:`metrics.pairwise_distances`
  is renamed into `ensure_all_finite`. `force_all_finite` will be removed in 1.8.
  By :user:`Jérémie du Boisberranger <jeremiedbb>` in :pr:`29404`

- |API| `scoring="neg_max_error"` should be used instead of `scoring="max_error"`
  which is now deprecated.
  By :user:`Farid "Freddie" Taba <artificialfintelligence>` in :pr:`29462`

- |API| The default value of the `response_method` parameter of
  :func:`metrics.make_scorer` will change from `None` to `"predict"` and `None` will be
  removed in 1.8. In the meantime, `None` is equivalent to `"predict"`.
  By :user:`Jérémie du Boisberranger <jeremiedbb>` in :pr:`30001`
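
To make the domain remark in the `mean_squared_log_error` entry above concrete, here is a plain-Python sketch of the metric. The name `msle_sketch` is hypothetical and the code is not scikit-learn's implementation; it only shows that `log(1 + x)` accepts any `x > -1` (including zero), whereas `log(x)` would require `x > 0`.

```python
import math


def msle_sketch(y_true, y_pred):
    """Plain-Python MSLE: mean of (log(1 + y_true) - log(1 + y_pred)) ** 2."""
    for value in (*y_true, *y_pred):
        # log(1 + x) is defined for x > -1, whereas log(x) would require x > 0.
        if value <= -1:
            raise ValueError(f"{value} is outside the domain of log(1 + x)")
    squared_errors = [
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return sum(squared_errors) / len(squared_errors)
```

With this domain, targets equal to zero are valid inputs, which would not be the case if the check were based on `log(x)`.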

:mod:`sklearn.model_selection`
------------------------------

- |Enhancement| Add the parameter `prefit` to
  :class:`model_selection.FixedThresholdClassifier` allowing the use of a pre-fitted
  estimator without re-fitting it.
  By :user:`Guillaume Lemaitre <glemaitre>` in :pr:`29067`

- |Fix| Improve error message when :func:`model_selection.RepeatedStratifiedKFold.split`
  is called without a `y` argument.
  By :user:`Anurag Varma <Anurag-Varma>` in :pr:`29402`

:mod:`sklearn.neighbors`
------------------------

- |Fix| :class:`neighbors.LocalOutlierFactor` raises a warning in the `fit` method
  when duplicate values in the training data lead to inaccurate outlier detection.
  By :user:`Henrique Caroço <HenriqueProj>` in :pr:`28773`

:mod:`sklearn.neural_network`
-----------------------------

- |Fix| :class:`neural_network.MLPRegressor` no longer crashes when the model
  diverges and `early_stopping` is enabled.
  By :user:`Marc Bresson <MarcBresson>` in :pr:`29773`

:mod:`sklearn.preprocessing`
----------------------------

- |Enhancement| Added `warn` option to `handle_unknown` parameter in
  :class:`preprocessing.OneHotEncoder`.
  By :user:`Gleb Levitski <glevv>` in :pr:`28637`

- |Enhancement| The HTML representation of :class:`preprocessing.FunctionTransformer`
  will show the function name in the label.
  By :user:`Yao Xiao <Charlie-XIAO>` in :pr:`29158`

- |Fix| :class:`preprocessing.PowerTransformer` now uses `scipy.special.inv_boxcox`
  to output `nan` if the input of BoxCox's inverse is invalid.
  By :user:`Xuefeng Xu <xuefeng-xu>` in :pr:`27875`
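
For readers unfamiliar with the inverse Box-Cox transform, the sketch below shows where an invalid input comes from. The forward transform is `y = (x**lmbda - 1) / lmbda` for `x > 0` (or `log(x)` when `lmbda == 0`), so the inverse is `(lmbda * y + 1) ** (1 / lmbda)`. The function `inv_boxcox_sketch` is a hypothetical scalar helper written for illustration, not the SciPy routine.

```python
import math


def inv_boxcox_sketch(y, lmbda):
    """Invert the Box-Cox transform; return nan where the inverse is undefined."""
    if lmbda == 0:
        return math.exp(y)
    base = lmbda * y + 1
    if base < 0:
        # A negative base cannot come from a valid Box-Cox output,
        # because the forward transform requires x > 0.
        return math.nan
    if base == 0:
        # Boundary case: limit of base ** (1 / lmbda) as base -> 0+.
        return 0.0 if lmbda > 0 else math.inf
    return base ** (1 / lmbda)
```

Returning `nan` rather than raising lets invalid entries propagate through array computations without interrupting them.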

:mod:`sklearn.semi_supervised`
------------------------------

- |API| :class:`semi_supervised.SelfTrainingClassifier`
  deprecated the `base_estimator` parameter in favor of `estimator`.
  By :user:`Adam Li <adam2392>` in :pr:`28494`

:mod:`sklearn.tree`
-------------------

- |Feature| :class:`tree.ExtraTreeClassifier` and :class:`tree.ExtraTreeRegressor` now
  support missing-values in the data matrix ``X``. Missing-values are handled by
  randomly moving all of the samples to the left, or right child node as the tree is
  traversed.
  By :user:`Adam Li <adam2392>` in :pr:`27966`

:mod:`sklearn.utils`
--------------------

- |Enhancement| :func:`utils.validation.check_array` now accepts `ensure_non_negative`
  to check for negative values in the passed array, until now only available through
  calling :func:`utils.validation.check_non_negative`.
  By :user:`Tamara Atanasoska <tamaraatanasoska>` in :pr:`29540`

- |Enhancement| :func:`utils.validation.check_is_fitted` now passes on stateless
  estimators. An estimator can indicate it's stateless by setting the `requires_fit`
  tag. See :ref:`estimator_tags` for more information.
  By :user:`Adrin Jalali <adrinjalali>` in :pr:`29880`

- |Fix| :func:`utils.estimator_checks.parametrize_with_checks` and
  :func:`utils.estimator_checks.check_estimator` now support estimators that
  have `set_output` called on them.
  By :user:`Adrin Jalali <adrinjalali>` in :pr:`29869`

- |API| The `force_all_finite` parameter of functions :func:`utils.check_array`,
  :func:`utils.check_X_y`, :func:`utils.as_float_array` is renamed into
  `ensure_all_finite`. `force_all_finite` will be removed in 1.8.
  By :user:`Jérémie du Boisberranger <jeremiedbb>` in :pr:`29404`

- |API| :func:`utils.estimator_checks.check_sample_weights_invariance` is replaced by
  :func:`utils.estimator_checks.check_sample_weight_equivalence` which uses
  integer (including zero) weights.
  By :user:`Antoine Baker <antoinebaker>` in :pr:`29818`
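
The idea behind a weight-equivalence check with integer weights can be illustrated without scikit-learn: an integer weight should behave like repeating the sample that many times, and a zero weight like removing it. The toy statistic below (a weighted mean, with the hypothetical helper `weighted_mean`) is only a sketch of that principle, not the code of the estimator check.

```python
def weighted_mean(values, weights):
    """Weighted mean; a zero weight removes the sample from the estimate."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


values = [1.0, 4.0, 10.0]
weights = [2, 0, 1]  # integer weights, including a zero

# Equivalent unweighted data: repeat each value `weight` times.
repeated = [v for v, w in zip(values, weights) for _ in range(w)]
unweighted = sum(repeated) / len(repeated)

print(weighted_mean(values, weights), unweighted)
```

Both numbers agree, which is the property such a check looks for in a fitted estimator.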

@glemaitre glemaitre marked this pull request as draft October 16, 2024 21:18

github-actions bot commented Oct 16, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 8548118. Link to the linter CI: here

@glemaitre glemaitre marked this pull request as ready for review October 16, 2024 22:36
Member

@adrinjalali adrinjalali left a comment

I'm trusting you've already made sure nothing's removed from the changelog. Can't really easily check for that here. I'll let @lesteve have a look too and merge.

@glemaitre
Member Author

glemaitre commented Oct 17, 2024 via email

@lesteve
Member

lesteve commented Oct 17, 2024

Turns out I had a hacky script almost ready to generate the fragments. At least this allowed me to double-check that no entries were missing. There are some tiny differences due to different choices (or my script being too dumb) but I think that's fine.

Here is my hacky script in case it can be used later down the line:

"""
Caveats right now (2024-10-16):
- custom sections like "dropping setuptools support" are not seen (because there
  are no bullet points inside these sections, easy to do by hand, there are 2 of them)
- Array API had custom grouping (by class, function, other), so needs to be
  looked at closer. Needs content editing. Also everything is categorized as
  other, do we want different tags (maybe feature)?
- 2 entries have two PRs listed: 29677 (other PR is 22606) and 29143 (other PR is
  27736), the content probably needs to be tweaked, there is one "By" and/or "and" too many
"""

import re
import textwrap
import warnings
from pathlib import Path

from docutils import nodes
from docutils.core import publish_doctree
from docutils.utils import Reporter


def parse_rst(rst_content):
    doctree = publish_doctree(
        # settings_overrides is used to avoid showing parsing issues we don't
        # care about, for example for sphinx constructs
        rst_content,
        settings_overrides={"report_level": Reporter.SEVERE_LEVEL + 10},
    )
    return extract_sections_and_bullets(doctree)


def extract_sections_and_bullets(node, level=1):
    result = []

    if isinstance(node, nodes.section):
        title = node.children[0].rawsource
        result.append({"type": "section", "level": level, "title": title})
        for child in node.children[1:]:
            result.extend(extract_sections_and_bullets(child, level + 1))

    elif isinstance(node, nodes.bullet_list):
        for item in node.children:
            bullet_text = item.children[0].rawsource
            result.append({"type": "bullet", "text": bullet_text})

    elif isinstance(node, nodes.Element):
        for child in node.children:
            result.extend(extract_sections_and_bullets(child, level))

    return result


def section_to_folder(section):
    if section.startswith(":mod:"):
        return re.sub(r":mod:`(.+)`", r"\1", section)

    section_mapping = {
        "Changes impacting many modules": "many-modules",
        "Support for Array API": "array-api",
        "Metadata Routing": "metadata-routing",
    }
    return section_mapping[section]


def get_pr_number(content):
    matches = re.findall(pr_pattern, content)
    if len(matches) > 1:
        warnings.warn(f"More than one PR {matches} in content {content}")

    return matches[-1]


def get_fragment_type(content):
    m = re.match(tag_pattern, content)

    if m is None:
        return "other"

    return tag_to_type[m.group(1)]


def get_fragment_path(section, content):
    pr_number = get_pr_number(content)
    fragment_type = get_fragment_type(content)
    subfolder = section_to_folder(section)
    root_folder = Path("doc/whats_new/upcoming_changes")
    return root_folder / subfolder / f"{pr_number}.{fragment_type}.rst"


def get_fragment_content(content):
    # need to strip spaces hence + r"\s*" in two lines below
    content = re.sub(tag_pattern + r"\s*", "", content)
    content = re.sub(pr_pattern + r"\s*", "", content)
    # Some people use shorthands rather than :user: ...
    user_pattern = r"by(\s+)(:user:`[^`]+`|`[\w ]+`_)"
    content = re.sub(user_pattern, r"By\1\2", content)

    # Need to indent and add the bullet point
    content = textwrap.indent(content, " " * 2)
    return f"-{content[1:]}"


changelog = Path("~/dev/scikit-learn/doc/whats_new/v1.6.rst").expanduser()
# Remove includes which can cause errors and are not necessary for our purposes
content = re.sub(r"\.\. include.+", "", changelog.read_text())
parsed_content = parse_rst(content)

current_section = None
content_by_section = {}
for item in parsed_content:
    if item["type"] == "section":
        current_section = item["title"]
    elif item["type"] == "bullet":
        content_by_section.setdefault(current_section, []).append(item["text"])

pr_pattern = r":pr:`(\d+)`"

tag_to_type = {
    "MajorFeature": "major-feature",
    "Feature": "feature",
    "Efficiency": "efficiency",
    "Enhancement": "enhancement",
    "Fix": "fix",
    "API": "api",
}

joined_tags = "|".join(tag_to_type)
tag_pattern = rf"\|({joined_tags})\|"


for section, content_list in content_by_section.items():
    for content in content_list:
        path = get_fragment_path(section, content)
        path.parent.mkdir(exist_ok=True)
        path.write_text(get_fragment_content(content))
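
To see what the script produces for a single entry, the self-contained sketch below applies the same regular expressions to one sample bullet. The sample text is an assumed example of the pre-fragment layout (tag first, then the text, then the PR reference followed by a lowercase "by"); the variable names `entry`, `body` and `file_name` are only for illustration.

```python
import re
import textwrap

pr_pattern = r":pr:`(\d+)`"
tag_to_type = {
    "MajorFeature": "major-feature",
    "Feature": "feature",
    "Efficiency": "efficiency",
    "Enhancement": "enhancement",
    "Fix": "fix",
    "API": "api",
}
tag_pattern = rf"\|({'|'.join(tag_to_type)})\|"

# Assumed example of an entry as it may have looked before fragmentation.
entry = (
    "|Enhancement| Added a function :func:`base.is_clusterer` which determines\n"
    "whether a given estimator is of category clusterer.\n"
    ":pr:`28936` by :user:`Christian Veenhuis <ChVeen>`"
)

# The file name is built from the last PR number and the mapped tag.
pr_number = re.findall(pr_pattern, entry)[-1]
fragment_type = tag_to_type[re.match(tag_pattern, entry).group(1)]
file_name = f"{pr_number}.{fragment_type}.rst"

# The body drops the tag and the PR reference, capitalizes "by",
# then indents the text and adds the bullet point.
body = re.sub(tag_pattern + r"\s*", "", entry)
body = re.sub(pr_pattern + r"\s*", "", body)
body = re.sub(r"by(\s+)(:user:`[^`]+`)", r"By\1\2", body)
body = "-" + textwrap.indent(body, "  ")[1:]

print(file_name)
print(body)
```

The tag and the PR number end up encoded in the file name only, so the fragment body carries just the description and the author credit.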

@lesteve
Member

lesteve commented Oct 17, 2024

OK let's merge this one!

I guess there are going to be conflicts in PRs that update v1.6.rst, but hopefully https://github.com/scikit-learn/scikit-learn/blob/main/doc/whats_new/upcoming_changes/README.md is helpful enough to guide people towards creating a fragment file. If not, there is probably room for improving the README.md.

@lesteve lesteve merged commit 62d7f96 into scikit-learn:main Oct 17, 2024
30 checks passed
@adrinjalali
Member

now the PAIN starts 😁

@lucyleeow
Member

What do you think about adding a link to https://github.com/scikit-learn/scikit-learn/blob/main/doc/whats_new/upcoming_changes/README.md in doc/whats_new/v1.6.rst?

People who have to fix v1.6.rst merge conflicts will hopefully see it, and people opening the file to add an entry will see it too.

@adrinjalali
Member

I was thinking the same. Even having a bunch of the important content of the readme in the 1.6 file to make it easier than having to open yet another link.

@lucyleeow
Member

I'm okay with that, but it would mean updating 2 places if we needed to change things.

@adrinjalali
Member

It might be worth it. I personally end up spending more time than I'd like every time I add a changelog entry.
