Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

FIX Draw indices using sample_weight in Bagging #31414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
Loading
from

Conversation

antoinebaker
Copy link
Contributor

@antoinebaker antoinebaker commented May 22, 2025

Part of #16298 and alternative to #31165.

What does this implement/fix? Explain your changes.

In Bagging estimators, sample_weight is now used to draw the samples and no longer forwarded to the underlying estimators. Bagging estimators now pass the statistical repeated/weighted equivalence test when bootstrap=True (the default, ie draw with replacement).

Compared to #31165, it better decouples two different usages of sample_weight:

  • sample_weight in bagging_estimator.fit are used as probabilities to draw the indices/rows
  • sample_weight in base_estimator.fit are used to represent the indices (more memory efficient than indexing), this is possible only if base_estimator.fit supports sample_weight (through metadata routing or natively).

#31165 introduced a new sampling_strategy argument to choose indexing/weighting for row sampling, but it would be better to do this in a dedicated follow up PR.

cc @ogrisel @GaetandeCast

Copy link

github-actions bot commented May 22, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: bc4e499. Link to the linter CI: here

@antoinebaker
Copy link
Contributor Author

antoinebaker commented May 22, 2025

BaggingRegressor(estimator=Ridge(), max_samples=100) now passes the statistical repeated/weighted equivalence test

image

Idem for BaggingClassifier(estimator=LogisticRegression(), max_samples=100)) and varying max_samples.

@antoinebaker
Copy link
Contributor Author

However it fails (as expected) for bootstrap=False (draw without replacement), for example BaggingRegressor(estimator=Ridge(), bootstrap=False, max_samples=10)

image

@ogrisel
Copy link
Member

ogrisel commented May 23, 2025

However it fails (as expected) for bootstrap=False (draw without replacement).

Could you please document this known limitation, both in the docstring of the __init__ method for the bootstrap parameter and in the docstring of the fit method for the sample_weight parameter?

Something like: "Note that the expected frequency semantics for the sample_weight parameter are only fulfilled when sampling with replacement bootstrap=True".

Maybe we should raise a warning when calling BaggingClassifier(bootstrap=False, max_samples=0.5).fit(X, y, sample_weight=sample_weight) with sample_weight is not None. The warning is already implemented and tested: https://github.com/scikit-learn/scikit-learn/pull/31414/files#diff-b7c01e77fe68ded1e41868f4a7e142190f935261624d4abdb299913ef944cbbbR676-R682.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a pass of review. Could you please add a non-regression test using a small dataset with specifically engineered weights? For instance, you could have a dataset with 100 datapoints, with 98 data points with a null weight, 1 data point, with a weight of 1 and 1 with a weight of 2:

X = np.arange(100).reshape(-1, 1)
y = (X < 99).astype(np.int32)
sample_weight = np.zeros(shape=X.shape[0])
sample_weight[0] = 1
sample_weight[-1] = 2

Then you could fit a BaggingRegressor and a BaggleClassifier with a fake test estimator that just records the values passed as X, y and sample_weight as fitted attribute to be able to write assertions in the test.

Ideally this test should pass both with metadata routing enabled and disabled.

sklearn/ensemble/_bagging.py Outdated Show resolved Hide resolved
sklearn/ensemble/_bagging.py Outdated Show resolved Hide resolved
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@@ -0,0 +1,5 @@
- :class:`ensemble.BaggingClassfier`, :class:`ensemble.BaggingRegressor`
and :class:`ensemble.IsolationForest` now use `sample_weight` to draw
the samples instead of forwarding them multiplied with a uniformly sampling
Copy link
Member

@ogrisel ogrisel May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I made a typo in my past suggestion:

Suggested change
the samples instead of forwarding them multiplied with a uniformly sampling
the samples instead of forwarding them multiplied by a uniformly sampled

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

2 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.