Consolidate _incremental_weighted_mean_and_var into _incremental_mean_and_var (Issue18573) #19422


Merged: 27 commits into scikit-learn:main on Feb 10, 2021

Conversation

norbusan (Member)

Reference Issues/PRs

Fixes: #18573

What does this implement/fix? Explain your changes.

The function _incremental_weighted_mean_and_var is consolidated into
_incremental_mean_and_var by adding a sample_weight parameter: when it is
None, the function computes the unweighted mean and variance; otherwise it
computes the weighted ones.
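
For illustration, a minimal sketch of how the consolidated helper is meant to be called; the keyword names below (last_mean, last_variance, last_sample_count, sample_weight) are assumed from the private API and may differ from the final code:

import numpy as np
from sklearn.utils.extmath import _incremental_mean_and_var

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = rng.uniform(size=100)

# Unweighted update: sample_weight=None reproduces the previous behaviour.
mean, var, count = _incremental_mean_and_var(
    X, last_mean=np.zeros(3), last_variance=np.zeros(3),
    last_sample_count=np.zeros(3), sample_weight=None)

# Weighted update: same helper, now driven by per-sample weights.
wmean, wvar, wcount = _incremental_mean_and_var(
    X, last_mean=np.zeros(3), last_variance=np.zeros(3),
    last_sample_count=np.zeros(3), sample_weight=w)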

Any other comments?

The work is based on the initial commit by maikia in commit e37c3db,
with adjustments to the tests and fixes to the semantics of the NaN
treatment.

We also made the docstring clearer about the parameters.

ogrisel (Member) commented Feb 10, 2021

I did a quick benchmark using the public API to be able to run it both on the main branch and this PR:

Here is the setup:

import numpy as np
from sklearn.utils import gen_batches
from sklearn.preprocessing import StandardScaler


def scale_with_partial_fit(X, sample_weight=None):
    batch_slices = gen_batches(X.shape[0], batch_size=10_000)
    scaler = StandardScaler()
    for batch_slice in batch_slices:
        if sample_weight is None:
            scaler.partial_fit(X[batch_slice])
        else:
            scaler.partial_fit(
                X[batch_slice],
                sample_weight=sample_weight[batch_slice],
            )
    return scaler.mean_, scaler.var_


rng = np.random.default_rng(42)
dtype = np.float32

X = rng.normal(size=(int(1e7), 10)).astype(dtype)
X[:, 0] *= 2.
X[:, 0] += 42.

m, v = scale_with_partial_fit(X)
print(m)
print(v)
print(m.dtype, v.dtype)

sample_weight = np.ones(X.shape[0]).astype(dtype)
m, v = scale_with_partial_fit(X, sample_weight=sample_weight)
print(m)
print(v)
print(m.dtype, v.dtype)

The output of this script looks good (both on main and this PR):

[ 4.19990837e+01 -2.43609946e-04  7.35753603e-04  4.21836518e-04
  2.12059465e-04  1.43233399e-04  5.28809473e-04 -5.39972862e-04
  3.66971185e-04  5.40268179e-04]
[3.99929559 1.0000367  0.99964327 0.99920002 0.99979069 1.0005247
 1.00048552 1.00038451 1.00041132 0.99944344]
float64 float64
[ 4.19990806e+01 -2.43610432e-04  7.35752622e-04  4.21837066e-04
  2.12059725e-04  1.43232462e-04  5.28809468e-04 -5.39972877e-04
  3.66971617e-04  5.40268941e-04]
[3.99929566 1.0000367  0.99964327 0.99920002 0.99979069 1.0005247
 1.00048552 1.00038451 1.00041132 0.99944344]
float64 float64

Benchmark on the main branch

>>> %timeit scale_with_partial_fit(X, sample_weight=None)
1.04 s ± 5.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit scale_with_partial_fit(X, sample_weight=sample_weight)
773 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark on this PR's branch

>>> %timeit scale_with_partial_fit(X, sample_weight=None)
1.03 s ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit scale_with_partial_fit(X, sample_weight=sample_weight)
813 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark on this PR's branch (updated for the sample_weight is None case)

>>> %timeit scale_with_partial_fit(X, sample_weight=None)
677 ms ± 3.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit scale_with_partial_fit(X, sample_weight=sample_weight)
893 ms ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Comments

  • There seems to be a slight performance regression in this PR on my machine (Apple M1). I wonder if you observe similar results.
  • But more importantly, the sample_weight=None case is significantly slower, both on master and this branch, while I would have expected the opposite. It would be good to understand why this is the case.

Review comment on this code in sklearn/utils/extmath.py:

                                              (X - T)**2, axis=0)
    else:
        new_unnormalized_variance = (
            _safe_accumulator_op(np.nanvar, X, axis=0) * new_sample_count)

ogrisel (Member) commented Feb 10, 2021

This is potentially the cause of the slowdown for the sample_weight=None case:

>>> %timeit np.nansum((X - m) ** 2, axis=0)
420 ms ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit np.nanvar(X, axis=0)
528 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

100 ms difference on the full X. We might want to always use np.nansum in both branches and report the suboptimal performance of np.nanvar to the numpy issue tracker.

Or maybe this is not a bug in numpy, and it's just that np.nanvar has to recompute the mean internally and is therefore costlier.

ogrisel (Member) commented Feb 10, 2021

Interestingly the non-nan variants of those functions have opposite relative performance profiles:

>>> %timeit np.sum((X - m) ** 2, axis=0)
257 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit np.var(X, axis=0)
247 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not sure why: probably the extra mean computation is negligible in this case compared to the extra RAM transfers and temporary memory allocations caused by evaluating the (X - m) ** 2 expression.
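
One way to probe the temporary-allocation part of that hypothesis (a sketch, not something that was tried in this thread) is to compute the same sum of squared deviations into a preallocated buffer and compare timings:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(int(1e7), 10)).astype(np.float32)
m = X.mean(axis=0)
buf = np.empty_like(X)  # reused scratch space, ~400 MB for this X

def sum_sq_dev_naive(X, m):
    # Allocates (X - m) and then its square as fresh temporaries.
    return np.sum((X - m) ** 2, axis=0)

def sum_sq_dev_preallocated(X, m, buf):
    # Same arithmetic, but writes into the reused buffer instead of
    # allocating temporaries on every call.
    np.subtract(X, m, out=buf)
    np.square(buf, out=buf)
    return buf.sum(axis=0)

# %timeit sum_sq_dev_naive(X, m)
# %timeit sum_sq_dev_preallocated(X, m, buf)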

A member commented:

Is there a reason not to do the following?

Suggested change:

-            _safe_accumulator_op(np.nanvar, X, axis=0) * new_sample_count)
+    T = new_sum / new_sample_count
+    if sample_weight is not None:
+        new_unnormalized_variance = np.nansum(sample_weight[:, None] *
+                                              (X - T)**2, axis=0)
+    else:
+        new_unnormalized_variance = np.nansum((X - T)**2, axis=0)
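
As a sanity check (illustrative only, not part of the PR), the two code paths agree in the unweighted case because np.nanvar divides the same sum of squared deviations by the per-column count of non-NaN values:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
X[rng.uniform(size=X.shape) < 0.05] = np.nan  # sprinkle some NaNs

new_sample_count = np.sum(~np.isnan(X), axis=0)
new_sum = np.nansum(X, axis=0)
T = new_sum / new_sample_count  # per-column mean of the batch

# Current code path: nanvar scaled back up by the per-column sample count.
via_nanvar = np.nanvar(X, axis=0) * new_sample_count

# Suggested code path: direct sum of squared deviations from the batch mean.
via_nansum = np.nansum((X - T) ** 2, axis=0)

np.testing.assert_allclose(via_nanvar, via_nansum)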

A member commented:

@jeremiedbb do you mean to compute the variance by hand? If that is the case, I would rather use np.nanvar and add a comment that np.nanvar does not support sample_weight, which is why we fall back to computing it by hand?

nuka137 (Contributor) commented Feb 10, 2021

I ran @ogrisel 's script (10 times) in my environment (Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz).

                                             Main branch          This PR
scale_with_partial_fit(sample_weight=None)   1.975517841684632    1.9935310563072561
scale_with_partial_fit(sample_weight)        2.9529636044986547   2.231434507598169

In my environment, the PR version is much faster when sample_weight is given,
but for the sample_weight=None case the performance is slightly worse.
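
For reference, a mean-of-10-runs measurement like the one above could be scripted roughly as follows, assuming ogrisel's setup script has already been run in the same session (the exact harness used here is not shown in the thread):

import timeit
import numpy as np

# Assumes scale_with_partial_fit, X and sample_weight from the setup script
# above are already defined.
runs = timeit.repeat(lambda: scale_with_partial_fit(X, sample_weight=None),
                     number=1, repeat=10)
print("sample_weight=None :", np.mean(runs), "s (mean of 10 runs)")

runs = timeit.repeat(lambda: scale_with_partial_fit(X, sample_weight=sample_weight),
                     number=1, repeat=10)
print("with sample_weight :", np.mean(runs), "s (mean of 10 runs)")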

A member commented:

@nuka137 thanks for the benchmark! +1 for updating this PR accordingly.

A contributor commented:

@ogrisel Thanks, I committed @jeremiedbb's suggested code.

norbusan (Member Author) commented:

Interestingly, on my machine I consistently get better times for sample_weight=None, but when sample_weight is given, main is faster for me than both the PR and the updated PR:

main
%timeit scale_with_partial_fit(X, sample_weight=sample_weight)
1.26 s ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR pre update
%timeit scale_with_partial_fit(X, sample_weight=sample_weight)
1.4 s ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR post update
%timeit scale_with_partial_fit(X, sample_weight=sample_weight)
1.4 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

No idea why ...

norbusan (Member Author) commented:

It looks to me like there is quite a bit of statistical fluctuation here ;-)

norbusan (Member Author) commented:

@ogrisel can we consider this review part as resolved?

ogrisel (Member) left a comment:

LGTM, the slight perf regression for the case where sample_weight is not None is fine. I prefer this simpler version of the code.

Before merging, it would be great to implement the optimization I suggested in the comment above, and merge once you can confirm that it's also faster on your machines.

Review thread on sklearn/utils/extmath.py (outdated, resolved)
ogrisel (Member) left a comment:

I re-ran my benchmark snippet and updated the results. I confirm this is significantly faster for the sample_weight is None case.

@jeremiedbb I think this is ready for merge.

Just a last style nitpick ;)

Review thread on sklearn/utils/extmath.py (outdated, resolved)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
ogrisel (Member) commented Feb 10, 2021

@maikia you might also want to give a quick look to this PR before we merge it.

jeremiedbb (Member) left a comment:

LGTM. Just a few nitpicks.

I hoped that numpy would do nothing when we ask it to multiply by 1, which would simplify everything, but it still multiplies every element by 1 :(
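
A small sketch illustrating that point (not from the thread): numpy has no fast path for a weight vector of all ones, so the elementwise multiply costs about as much as with arbitrary weights, and only an explicit if sample_weight is None branch avoids it:

import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
X = rng.normal(size=(int(1e7), 10)).astype(np.float32)
ones = np.ones(X.shape[0], dtype=np.float32)
weights = rng.uniform(size=X.shape[0]).astype(np.float32)

# Both multiplies touch every element; the all-ones case is not skipped.
print("multiply by ones:   ", timeit(lambda: X * ones[:, None], number=10))
print("multiply by weights:", timeit(lambda: X * weights[:, None], number=10))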

Three review threads on sklearn/utils/extmath.py (resolved)
nuka137 and others added 6 commits February 10, 2021 23:45
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
maikia (Contributor) commented Feb 10, 2021

LGTM. Looks so much simpler now :-) Thanks for taking care of this.

Review comment on this code in sklearn/utils/extmath.py:

                                                axis=0)
    else:
        new_sum = _safe_accumulator_op(np.nansum, X, axis=0)
        new_sample_count = np.sum(~np.isnan(X), axis=0)

A member commented:

If we are really digging for performance, np.count_nonzero(~np.isnan(X), axis=0) can be much faster
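
A quick sketch (not from the thread) to check that the two expressions agree and to compare their cost directly; np.count_nonzero has accepted an axis argument since NumPy 1.12:

import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
X = rng.normal(size=(int(1e7), 10)).astype(np.float32)
X[X > 2.5] = np.nan  # introduce some missing values

# Both give the per-column count of non-NaN entries.
mask = ~np.isnan(X)
assert np.array_equal(np.sum(mask, axis=0),
                      np.count_nonzero(mask, axis=0))

print("np.sum:          ", timeit(lambda: np.sum(~np.isnan(X), axis=0), number=10))
print("np.count_nonzero:", timeit(lambda: np.count_nonzero(~np.isnan(X), axis=0), number=10))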

norbusan (Member Author) commented:

Slightly better:

without above change
%timeit scale_with_partial_fit(X, sample_weight=None)
1.13 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

with above change
%timeit scale_with_partial_fit(X, sample_weight=None)
1.08 s ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

norbusan (Member Author) commented:

Should we go with that?

norbusan (Member Author) commented:

@jeremiedbb Could you add this as a suggested change so that the authorship is properly recorded? Thanks.

A member commented:

Hmm, it's weird: when I first compared the timings I got a huge speed-up, but when I tried again I couldn't reproduce it. Let's leave it as is and merge :)

norbusan (Member Author) commented:

Fine with me, thanks for checking again

jeremiedbb merged commit 8066086 into scikit-learn:main on Feb 10, 2021
jeremiedbb (Member) commented:

Thanks @norbusan and @nuka137 !

norbusan (Member Author) commented:

Thanks a lot!

nuka137 (Contributor) commented Feb 11, 2021

Thanks!

glemaitre mentioned this pull request on Apr 22, 2021

Successfully merging this pull request may close these issues.

Implementation of incremental weighted mean and variance for dense case