Consolidate _incremental_weighted_mean_and_var into _incremental_mean_and_var (Issue18573) #19422
Conversation
I did a quick benchmark using the public API to be able to run it both on the main branch and this PR. Here is the setup:

import numpy as np
from sklearn.utils import gen_batches
from sklearn.preprocessing import StandardScaler


def scale_with_partial_fit(X, sample_weight=None):
    batch_slices = gen_batches(X.shape[0], batch_size=10_000)
    scaler = StandardScaler()
    for batch_slice in batch_slices:
        if sample_weight is None:
            scaler.partial_fit(X[batch_slice])
        else:
            scaler.partial_fit(
                X[batch_slice],
                sample_weight=sample_weight[batch_slice],
            )
    return scaler.mean_, scaler.var_


rng = np.random.default_rng(42)
dtype = np.float32
X = rng.normal(size=(int(1e7), 10)).astype(dtype)
X[:, 0] *= 2.
X[:, 0] += 42.

m, v = scale_with_partial_fit(X)
print(m)
print(v)
print(m.dtype, v.dtype)

sample_weight = np.ones(X.shape[0]).astype(dtype)
m, v = scale_with_partial_fit(X, sample_weight=sample_weight)
print(m)
print(v)
print(m.dtype, v.dtype)

The output of this script looks good (both on main and this PR).
Benchmark on the main branch:

>>> %timeit scale_with_partial_fit(X, sample_weight=None)
1.04 s ± 5.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit scale_with_partial_fit(X, sample_weight=sample_weight)
773 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark on this PR's branch:

>>> %timeit scale_with_partial_fit(X, sample_weight=None)
1.03 s ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit scale_with_partial_fit(X, sample_weight=sample_weight)
813 ms ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark on this PR's branch (updated for the
sklearn/utils/extmath.py (outdated diff)
                (X - T)**2, axis=0)
        else:
            new_unnormalized_variance = (
                _safe_accumulator_op(np.nanvar, X, axis=0) * new_sample_count)
This is potentially the cause of the slowdown for the sample_weight=None case:

>>> %timeit np.nansum((X - m) ** 2, axis=0)
420 ms ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit np.nanvar(X, axis=0)
528 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is a 100 ms difference on the full X. We might want to always use np.nansum in both branches and report the suboptimal performance of np.nanvar to the numpy issue tracker. Or maybe this is not a bug in numpy, and it's just that np.nanvar has to recompute the mean internally and is therefore costlier.
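As a side note, here is a minimal standalone timing sketch of that comparison (illustrative only, not part of the PR: the array is smaller than in the benchmark script above and X/m are redefined locally):

```python
# Standalone timing sketch: compare np.nansum of squared deviations against
# np.nanvar scaled by the sample count. For fully finite data both yield the
# same unnormalized variance, so any timing gap is pure overhead.
import timeit

import numpy as np

X = np.random.default_rng(0).normal(size=(10**6, 10)).astype(np.float32)
m = np.nanmean(X, axis=0)

t_nansum = timeit.timeit(lambda: np.nansum((X - m) ** 2, axis=0), number=10)
t_nanvar = timeit.timeit(lambda: np.nanvar(X, axis=0) * X.shape[0], number=10)
print(f"nansum-based: {t_nansum / 10:.3f} s/run, nanvar-based: {t_nanvar / 10:.3f} s/run")
```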
Interestingly, the non-NaN variants of those functions have the opposite relative performance profile:

>>> %timeit np.sum((X - m) ** 2, axis=0)
257 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit np.var(X, axis=0)
247 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not sure why: probably because the extra mean computation is negligible in this case compared to the extra RAM transfers and temporary memory allocations caused by evaluating the (X - m) ** 2 expression.
Is there a reason not to do the following instead of the np.nanvar-based else branch?

        T = new_sum / new_sample_count
        if sample_weight is not None:
            new_unnormalized_variance = np.nansum(sample_weight[:, None] *
                                                  (X - T)**2, axis=0)
        else:
            new_unnormalized_variance = np.nansum((X - T)**2, axis=0)
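To make the suggestion concrete, here is a small self-contained sanity check (illustrative only; the names T, new_sum, new_sample_count and new_unnormalized_variance mirror the diff above, the rest is made up for the example) that the hand-computed weighted statistics match a np.average-based reference on finite data:

```python
# Consistency check for the hand-computed weighted mean and unnormalized
# variance against np.average (which supports weights but not NaNs).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
sample_weight = rng.uniform(0.5, 2.0, size=1000)

new_sum = np.nansum(sample_weight[:, None] * X, axis=0)
new_sample_count = np.sum(sample_weight)
T = new_sum / new_sample_count
new_unnormalized_variance = np.nansum(sample_weight[:, None] * (X - T) ** 2, axis=0)

ref_mean = np.average(X, weights=sample_weight, axis=0)
ref_var = np.average((X - ref_mean) ** 2, weights=sample_weight, axis=0)
assert np.allclose(T, ref_mean)
assert np.allclose(new_unnormalized_variance / new_sample_count, ref_var)
```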
@jeremiedbb do you mean to compute the variance by hand? If that is the case, I would rather keep np.nanvar and add a comment explaining that np.nanvar does not support sample_weight, which is why the weighted case is computed by hand?
I ran @ogrisel's script (10 times) in my environment (Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz):

|  | Main branch | This PR |
|---|---|---|
| scale_with_partial_fit(sample_weight=None) | 1.975517841684632 | 1.9935310563072561 |
| scale_with_partial_fit(sample_weight=sample_weight) | 2.9529636044986547 | 2.231434507598169 |

In my environment, the PR version is much faster for the weighted case, but for the sample_weight=None case the performance is slightly worse.
@nuka137 thanks for the benchmark! +1 for updating this PR accordingly.
@ogrisel Thanks, I committed @jeremiedbb's suggested change.
Interestingly, on my machine I get consistently better times for sample_weight=None, but when sample_weight is provided, main is faster for me than both the PR and the updated PR:

main:
%timeit scale_with_partial_fit(X, sample_weight=sample_weight)
1.26 s ± 21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR before the update:
%timeit scale_with_partial_fit(X, sample_weight=sample_weight)
1.4 s ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR after the update:
%timeit scale_with_partial_fit(X, sample_weight=sample_weight)
1.4 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

No idea why...
It looks to me like there is quite a bit of statistical fluctuation here ;-)
@ogrisel can we consider this part of the review resolved?
LGTM, the slight perf regression for the sample_weight != None case is fine; I prefer this simpler version of the code.

Before merging, it would be great to implement the optimization I suggested in the comment above, and merge it if you can confirm that it is also faster on your machines.
I re-ran my benchmark snippet and updated the results. I confirm this is significantly faster for the sample_weight is None case.

@jeremiedbb I think this is ready for merge.

Just one last style nitpick ;)
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@maikia you might also want to take a quick look at this PR before we merge it.
LGTM. Just a few nitpicks.

I hoped that numpy would do nothing when we ask it to multiply by 1, which would simplify everything, but it still multiplies every element by 1 :(
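For what it's worth, a tiny sketch (assumed setup, not PR code) of why that matters: multiplying by a vector of ones still touches every element, so an explicit sample_weight is None branch saves real time.

```python
# Multiplying by an all-ones weight vector is not a no-op in numpy: it still
# allocates a temporary and touches every element, hence the separate branch.
import timeit

import numpy as np

X = np.ones((10**6, 10), dtype=np.float32)
w = np.ones(X.shape[0], dtype=np.float32)

t_plain = timeit.timeit(lambda: np.nansum(X, axis=0), number=20)
t_weighted = timeit.timeit(lambda: np.nansum(w[:, None] * X, axis=0), number=20)
print(f"unweighted: {t_plain:.3f} s, weighted by ones: {t_weighted:.3f} s")
```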
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
This reverts commit c32d147.
This reverts commit 3e2e7ca.
This reverts commit 97b3e0e.
LGTM. Looks so much simpler now :-) Thanks for taking care of this.
                                   axis=0)
    else:
        new_sum = _safe_accumulator_op(np.nansum, X, axis=0)
        new_sample_count = np.sum(~np.isnan(X), axis=0)
If we are really digging for performance, np.count_nonzero(~np.isnan(X), axis=0) can be much faster.
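A quick micro-benchmark sketch of just that expression (array size chosen arbitrarily for illustration, not the full scale_with_partial_fit benchmark):

```python
# Compare the two ways of counting non-NaN entries per column.
import timeit

import numpy as np

X = np.random.default_rng(0).normal(size=(10**6, 10)).astype(np.float32)

t_sum = timeit.timeit(lambda: np.sum(~np.isnan(X), axis=0), number=20)
t_cnz = timeit.timeit(lambda: np.count_nonzero(~np.isnan(X), axis=0), number=20)
print(f"np.sum: {t_sum:.3f} s, np.count_nonzero: {t_cnz:.3f} s")
```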
Slightly better:

Without the above change:
%timeit scale_with_partial_fit(X, sample_weight=None)
1.13 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With the above change:
%timeit scale_with_partial_fit(X, sample_weight=None)
1.08 s ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Should we go with that?
@jeremiedbb Could you add this as a suggested change so that the authorship is properly recorded? Thanks.
Hmm, it's weird: when I first compared the timings I got a huge speed-up, but when I try again I can't reproduce it. Let's leave it as is and merge :)
Fine with me, thanks for checking again
Thanks a lot!
Thanks!
Reference Issues/PRs
Fixes: #18573
What does this implement/fix? Explain your changes.
The function _incremental_weighted_mean_and_var is consolidated into _incremental_mean_and_var by adding a sample_weight parameter which, when None, makes the function compute the unweighted mean/variance, and the weighted ones otherwise.
Any other comments?
The work is based on the initial commit by maikia in commit e37c3db, with adjustments to the tests and fixes to the semantics with respect to NaN treatment. We also changed the docstring to be clearer about the parameters.
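For reviewers, a rough usage sketch of the consolidated helper (this assumes the post-PR signature with last mean/variance/count arrays and an optional sample_weight keyword; treat the exact names as approximate rather than authoritative):

```python
# Rough sketch of two successive incremental updates, one unweighted and one
# weighted, using the consolidated private helper (post-PR signature assumed).
import numpy as np
from sklearn.utils.extmath import _incremental_mean_and_var

rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 3))
X2 = rng.normal(size=(100, 3))

mean = np.zeros(3)
var = np.zeros(3)
count = np.zeros(3)

# Unweighted update (sample_weight defaults to None).
mean, var, count = _incremental_mean_and_var(X1, mean, var, count)

# Weighted update on the next batch.
w = np.ones(X2.shape[0])
mean, var, count = _incremental_mean_and_var(X2, mean, var, count, sample_weight=w)
print(mean, var, count)
```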