Consolidate _incremental_weighted_mean_and_var into _incremental_mean_and_var (Issue18573) #19422
Conversation
I did a quick benchmark using the public API to be able to run it both on the main branch and this PR. Here is the setup:

import numpy as np
from sklearn.utils import gen_batches
from sklearn.preprocessing import StandardScaler


def scale_with_partial_fit(X, sample_weight=None):
    batch_slices = gen_batches(X.shape[0], batch_size=10_000)
    scaler = StandardScaler()
    for batch_slice in batch_slices:
        if sample_weight is None:
            scaler.partial_fit(X[batch_slice])
        else:
            scaler.partial_fit(
                X[batch_slice],
                sample_weight=sample_weight[batch_slice],
            )
    return scaler.mean_, scaler.var_


rng = np.random.default_rng(42)
dtype = np.float32
X = rng.normal(size=(int(1e7), 10)).astype(dtype)
X[:, 0] *= 2.
X[:, 0] += 42.

m, v = scale_with_partial_fit(X)
print(m)
print(v)
print(m.dtype, v.dtype)

sample_weight = np.ones(X.shape[0]).astype(dtype)
m, v = scale_with_partial_fit(X, sample_weight=sample_weight)
print(m)
print(v)
print(m.dtype, v.dtype)

The output of this script looks good (both on main and this PR).
Benchmark on the main branch:

>>> %timeit scale_with_partial_fit(X, sample_weight=None)
1.04 s ± 5.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit scale_with_partial_fit(X, sample_weight=sample_weight)
773 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark on this PR's branch:

>>> %timeit scale_with_partial_fit(X, sample_weight=None)
1.03 s ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit scale_with_partial_fit(X, sample_weight=sample_weight)
813 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark on this PR's branch (updated for the
sklearn/utils/extmath.py (outdated diff)
                (X - T)**2, axis=0)
        else:
            new_unnormalized_variance = (
                _safe_accumulator_op(np.nanvar, X, axis=0) * new_sample_count)
This is potentially the cause of the slowdown for the sample_weight=None case:

>>> %timeit np.nansum((X - m) ** 2, axis=0)
420 ms ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit np.nanvar(X, axis=0)
528 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is a 100 ms difference on the full X. We might want to always use np.nansum in both branches and report the suboptimal performance of np.nanvar to the numpy issue tracker. Or maybe this is not a bug in numpy, and it's just that np.nanvar has to recompute the mean internally and is therefore costlier.
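As a side note, here is a minimal standalone timing sketch of that comparison (illustrative only, not part of the PR: the array is smaller than in the benchmark script above and X/m are redefined locally):

```python
# Standalone timing sketch: compare np.nansum of squared deviations against
# np.nanvar scaled by the sample count. For fully finite data both yield the
# same unnormalized variance, so any timing gap is pure overhead.
import timeit

import numpy as np

X = np.random.default_rng(0).normal(size=(10**6, 10)).astype(np.float32)
m = np.nanmean(X, axis=0)

t_nansum = timeit.timeit(lambda: np.nansum((X - m) ** 2, axis=0), number=10)
t_nanvar = timeit.timeit(lambda: np.nanvar(X, axis=0) * X.shape[0], number=10)
print(f"nansum-based: {t_nansum / 10:.3f} s/run, nanvar-based: {t_nanvar / 10:.3f} s/run")
```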
Interestingly, the non-NaN variants of those functions have the opposite relative performance profile:

>>> %timeit np.sum((X - m) ** 2, axis=0)
257 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit np.var(X, axis=0)
247 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not sure why: probably because the extra mean computation is negligible in this case compared to the extra RAM transfers and temporary memory allocations caused by evaluating the (X - m) ** 2 expression.
Is there a reason not to do the following instead of the np.nanvar-based else branch?

        T = new_sum / new_sample_count
        if sample_weight is not None:
            new_unnormalized_variance = np.nansum(sample_weight[:, None] *
                                                  (X - T)**2, axis=0)
        else:
            new_unnormalized_variance = np.nansum((X - T)**2, axis=0)
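To make the suggestion concrete, here is a small self-contained sanity check (illustrative only; the names T, new_sum, new_sample_count and new_unnormalized_variance mirror the diff above, the rest is made up for the example) that the hand-computed weighted statistics match a np.average-based reference on finite data:

```python
# Consistency check for the hand-computed weighted mean and unnormalized
# variance against np.average (which supports weights but not NaNs).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
sample_weight = rng.uniform(0.5, 2.0, size=1000)

new_sum = np.nansum(sample_weight[:, None] * X, axis=0)
new_sample_count = np.sum(sample_weight)
T = new_sum / new_sample_count
new_unnormalized_variance = np.nansum(sample_weight[:, None] * (X - T) ** 2, axis=0)

ref_mean = np.average(X, weights=sample_weight, axis=0)
ref_var = np.average((X - ref_mean) ** 2, weights=sample_weight, axis=0)
assert np.allclose(T, ref_mean)
assert np.allclose(new_unnormalized_variance / new_sample_count, ref_var)
```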
@jeremiedbb do you mean to compute the variance by hand? If that is the case, I would rather keep np.nanvar and add a comment explaining that np.nanvar does not support sample_weight, which is why the weighted case is computed by hand?
I ran @ogrisel's script (10 times) in my environment (Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz):

|  | Main branch | This PR |
|---|---|---|
| scale_with_partial_fit(sample_weight=None) | 1.975517841684632 | 1.9935310563072561 |
| scale_with_partial_fit(sample_weight=sample_weight) | 2.9529636044986547 | 2.231434507598169 |

In my environment, the PR version is much faster for the weighted case, but for the sample_weight=None case the performance is slightly worse.
@nuka137 thanks for the benchmark! +1 for updating this PR accordingly.
@ogrisel Thanks, I committed @jeremiedbb's suggested change.
Interestingly, on my machine I get consistently better times for sample_weight=None, but when sample_weight is provided, main is faster for me than both the PR and the updated PR:

main:
%timeit scale_with_partial_fit(X, sample_weight=sample_weight)
1.26 s ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR before the update:
%timeit scale_with_partial_fit(X, sample_weight=sample_weight)
1.4 s ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR after the update:
%timeit scale_with_partial_fit(X, sample_weight=sample_weight)
1.4 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

No idea why...
It looks to me like there is quite a bit of statistical fluctuation here ;-)
@ogrisel can we consider this part of the review resolved?
LGTM, the slight perf regression for the sample_weight != None case is fine; I prefer this simpler version of the code.

Before merging, it would be great to implement the optimization I suggested in the comment above, and merge it if you can confirm that it is also faster on your machines.
I re-ran my benchmark snippet and updated the results. I confirm this is significantly faster for the sample_weight is None case.

@jeremiedbb I think this is ready for merge.

Just one last style nitpick ;)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@maikia you might also want to take a quick look at this PR before we merge it.
LGTM. Just a few nitpicks.

I hoped that numpy would do nothing when we ask it to multiply by 1, which would simplify everything, but it still multiplies every element by 1 :(
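For what it's worth, a tiny sketch (assumed setup, not PR code) of why that matters: multiplying by a vector of ones still touches every element, so an explicit sample_weight is None branch saves real time.

```python
# Multiplying by an all-ones weight vector is not a no-op in numpy: it still
# allocates a temporary and touches every element, hence the separate branch.
import timeit

import numpy as np

X = np.ones((10**6, 10), dtype=np.float32)
w = np.ones(X.shape[0], dtype=np.float32)

t_plain = timeit.timeit(lambda: np.nansum(X, axis=0), number=20)
t_weighted = timeit.timeit(lambda: np.nansum(w[:, None] * X, axis=0), number=20)
print(f"unweighted: {t_plain:.3f} s, weighted by ones: {t_weighted:.3f} s")
```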
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
This reverts commit c32d147.
This reverts commit 3e2e7ca.
This reverts commit 97b3e0e.
LGTM. Looks so much simpler now :-) Thanks for taking care of this.
                                   axis=0)
    else:
        new_sum = _safe_accumulator_op(np.nansum, X, axis=0)
        new_sample_count = np.sum(~np.isnan(X), axis=0)
If we are really digging for performance, np.count_nonzero(~np.isnan(X), axis=0) can be much faster.
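A quick micro-benchmark sketch of just that expression (array size chosen arbitrarily for illustration, not the full scale_with_partial_fit benchmark):

```python
# Compare the two ways of counting non-NaN entries per column.
import timeit

import numpy as np

X = np.random.default_rng(0).normal(size=(10**6, 10)).astype(np.float32)

t_sum = timeit.timeit(lambda: np.sum(~np.isnan(X), axis=0), number=20)
t_cnz = timeit.timeit(lambda: np.count_nonzero(~np.isnan(X), axis=0), number=20)
print(f"np.sum: {t_sum:.3f} s, np.count_nonzero: {t_cnz:.3f} s")
```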
Slightly better:

Without the above change:
%timeit scale_with_partial_fit(X, sample_weight=None)
1.13 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With the above change:
%timeit scale_with_partial_fit(X, sample_weight=None)
1.08 s ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Should we go with that?
@jeremiedbb Could you add this as a suggested change so that the authorship is properly recorded? Thanks.
Hmm, it's weird: when I first compared the timings I got a huge speed-up, but when I try again I can't reproduce it. Let's leave it as is and merge :)
Fine with me, thanks for checking again
Thanks a lot!
Thanks!
Reference Issues/PRs
Fixes: #18573
What does this implement/fix? Explain your changes.
The function _incremental_weighted_mean_and_var is consolidated into _incremental_mean_and_var by adding a sample_weight parameter which, when None, makes the function compute the unweighted mean/variance, and the weighted ones otherwise.
Any other comments?
The work is based on the initial commit by maikia in commit e37c3db, with adjustments to the tests and fixes to the semantics with respect to NaN treatment. We also changed the docstring to be clearer about the parameters.
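For reviewers, a rough usage sketch of the consolidated helper (this assumes the post-PR signature with last mean/variance/count arrays and an optional sample_weight keyword; treat the exact names as approximate rather than authoritative):

```python
# Rough sketch of two successive incremental updates, one unweighted and one
# weighted, using the consolidated private helper (post-PR signature assumed).
import numpy as np
from sklearn.utils.extmath import _incremental_mean_and_var

rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 3))
X2 = rng.normal(size=(100, 3))

mean = np.zeros(3)
var = np.zeros(3)
count = np.zeros(3)

# Unweighted update (sample_weight defaults to None).
mean, var, count = _incremental_mean_and_var(X1, mean, var, count)

# Weighted update on the next batch.
w = np.ones(X2.shape[0])
mean, var, count = _incremental_mean_and_var(X2, mean, var, count, sample_weight=w)
print(mean, var, count)
```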