Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Commit a1e948e

Browse filesBrowse files
authored
Add cumulative option for the new statistics.kde() function. (#117033)
1 parent d610d82 commit a1e948e
Copy full SHA for a1e948e

File tree

Expand file treeCollapse file tree

3 files changed

+75
-21
lines changed
Filter options
Expand file treeCollapse file tree

3 files changed

+75
-21
lines changed

‎Doc/library/statistics.rst

Copy file name to clipboardExpand all lines: Doc/library/statistics.rst
+8-5Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -261,11 +261,12 @@ However, for reading convenience, most of the examples show sorted sequences.
261261
Added support for *weights*.
262262

263263

264-
.. function:: kde(data, h, kernel='normal')
264+
.. function:: kde(data, h, kernel='normal', *, cumulative=False)
265265

266266
`Kernel Density Estimation (KDE)
267267
<https://www.itm-conferences.org/articles/itmconf/pdf/2018/08/itmconf_sam2018_00037.pdf>`_:
268-
Create a continuous probability density function from discrete samples.
268+
Create a continuous probability density function or cumulative
269+
distribution function from discrete samples.
269270

270271
The basic idea is to smooth the data using `a kernel function
271272
<https://en.wikipedia.org/wiki/Kernel_(statistics)>`_.
@@ -280,11 +281,13 @@ However, for reading convenience, most of the examples show sorted sequences.
280281
as much as the more influential bandwidth smoothing parameter.
281282

282283
Kernels that give some weight to every sample point include
283-
*normal* or *gauss*, *logistic*, and *sigmoid*.
284+
*normal* (*gauss*), *logistic*, and *sigmoid*.
284285

285286
Kernels that only give weight to sample points within the bandwidth
286-
include *rectangular* or *uniform*, *triangular*, *parabolic* or
287-
*epanechnikov*, *quartic* or *biweight*, *triweight*, and *cosine*.
287+
include *rectangular* (*uniform*), *triangular*, *parabolic*
288+
(*epanechnikov*), *quartic* (*biweight*), *triweight*, and *cosine*.
289+
290+
If *cumulative* is true, will return a cumulative distribution function.
288291

289292
A :exc:`StatisticsError` will be raised if the *data* sequence is empty.
290293

‎Lib/statistics.py

Copy file name to clipboardExpand all lines: Lib/statistics.py
+52-15Lines changed: 52 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -138,7 +138,7 @@
138138
from itertools import count, groupby, repeat
139139
from bisect import bisect_left, bisect_right
140140
from math import hypot, sqrt, fabs, exp, erf, tau, log, fsum, sumprod
141-
from math import isfinite, isinf, pi, cos, cosh
141+
from math import isfinite, isinf, pi, cos, sin, cosh, atan
142142
from functools import reduce
143143
from operator import itemgetter
144144
from collections import Counter, namedtuple, defaultdict
@@ -803,9 +803,9 @@ def multimode(data):
803803
return [value for value, count in counts.items() if count == maxcount]
804804

805805

806-
def kde(data, h, kernel='normal'):
807-
"""Kernel Density Estimation: Create a continuous probability
808-
density function from discrete samples.
806+
def kde(data, h, kernel='normal', *, cumulative=False):
807+
"""Kernel Density Estimation: Create a continuous probability density
808+
function or cumulative distribution function from discrete samples.
809809
810810
The basic idea is to smooth the data using a kernel function
811811
to help draw inferences about a population from a sample.
@@ -820,20 +820,22 @@ def kde(data, h, kernel='normal'):
820820
821821
Kernels that give some weight to every sample point:
822822
823-
normal or gauss
823+
normal (gauss)
824824
logistic
825825
sigmoid
826826
827827
Kernels that only give weight to sample points within
828828
the bandwidth:
829829
830-
rectangular or uniform
830+
rectangular (uniform)
831831
triangular
832-
parabolic or epanechnikov
833-
quartic or biweight
832+
parabolic (epanechnikov)
833+
quartic (biweight)
834834
triweight
835835
cosine
836836
837+
If *cumulative* is true, will return a cumulative distribution function.
838+
837839
A StatisticsError will be raised if the data sequence is empty.
838840
839841
Example
@@ -847,7 +849,8 @@ def kde(data, h, kernel='normal'):
847849
848850
Compute the area under the curve:
849851
850-
>>> sum(f_hat(x) for x in range(-20, 20))
852+
>>> area = sum(f_hat(x) for x in range(-20, 20))
853+
>>> round(area, 4)
851854
1.0
852855
853856
Plot the estimated probability density function at
@@ -876,6 +879,13 @@ def kde(data, h, kernel='normal'):
876879
9: 0.009 x
877880
10: 0.002 x
878881
882+
Estimate P(4.5 < X <= 7.5), the probability that a new sample value
883+
will be between 4.5 and 7.5:
884+
885+
>>> cdf = kde(sample, h=1.5, cumulative=True)
886+
>>> round(cdf(7.5) - cdf(4.5), 2)
887+
0.22
888+
879889
References
880890
----------
881891
@@ -888,6 +898,9 @@ def kde(data, h, kernel='normal'):
888898
Interactive graphical demonstration and exploration:
889899
https://demonstrations.wolfram.com/KernelDensityEstimation/
890900
901+
Kernel estimation of cumulative distribution function of a random variable with bounded support
902+
https://www.econstor.eu/bitstream/10419/207829/1/10.21307_stattrans-2016-037.pdf
903+
891904
"""
892905

893906
n = len(data)
@@ -903,45 +916,56 @@ def kde(data, h, kernel='normal'):
903916
match kernel:
904917

905918
case 'normal' | 'gauss':
906-
c = 1 / sqrt(2 * pi)
907-
K = lambda t: c * exp(-1/2 * t * t)
919+
sqrt2pi = sqrt(2 * pi)
920+
sqrt2 = sqrt(2)
921+
K = lambda t: exp(-1/2 * t * t) / sqrt2pi
922+
I = lambda t: 1/2 * (1.0 + erf(t / sqrt2))
908923
support = None
909924

910925
case 'logistic':
911926
# 1.0 / (exp(t) + 2.0 + exp(-t))
912927
K = lambda t: 1/2 / (1.0 + cosh(t))
928+
I = lambda t: 1.0 - 1.0 / (exp(t) + 1.0)
913929
support = None
914930

915931
case 'sigmoid':
916932
# (2/pi) / (exp(t) + exp(-t))
917-
c = 1 / pi
918-
K = lambda t: c / cosh(t)
933+
c1 = 1 / pi
934+
c2 = 2 / pi
935+
K = lambda t: c1 / cosh(t)
936+
I = lambda t: c2 * atan(exp(t))
919937
support = None
920938

921939
case 'rectangular' | 'uniform':
922940
K = lambda t: 1/2
941+
I = lambda t: 1/2 * t + 1/2
923942
support = 1.0
924943

925944
case 'triangular':
926945
K = lambda t: 1.0 - abs(t)
946+
I = lambda t: t*t * (1/2 if t < 0.0 else -1/2) + t + 1/2
927947
support = 1.0
928948

929949
case 'parabolic' | 'epanechnikov':
930950
K = lambda t: 3/4 * (1.0 - t * t)
951+
I = lambda t: -1/4 * t**3 + 3/4 * t + 1/2
931952
support = 1.0
932953

933954
case 'quartic' | 'biweight':
934955
K = lambda t: 15/16 * (1.0 - t * t) ** 2
956+
I = lambda t: 3/16 * t**5 - 5/8 * t**3 + 15/16 * t + 1/2
935957
support = 1.0
936958

937959
case 'triweight':
938960
K = lambda t: 35/32 * (1.0 - t * t) ** 3
961+
I = lambda t: 35/32 * (-1/7*t**7 + 3/5*t**5 - t**3 + t) + 1/2
939962
support = 1.0
940963

941964
case 'cosine':
942965
c1 = pi / 4
943966
c2 = pi / 2
944967
K = lambda t: c1 * cos(c2 * t)
968+
I = lambda t: 1/2 * sin(c2 * t) + 1/2
945969
support = 1.0
946970

947971
case _:
@@ -952,6 +976,9 @@ def kde(data, h, kernel='normal'):
952976
def pdf(x):
953977
return sum(K((x - x_i) / h) for x_i in data) / (n * h)
954978

979+
def cdf(x):
980+
return sum(I((x - x_i) / h) for x_i in data) / n
981+
955982
else:
956983

957984
sample = sorted(data)
@@ -963,9 +990,19 @@ def pdf(x):
963990
supported = sample[i : j]
964991
return sum(K((x - x_i) / h) for x_i in supported) / (n * h)
965992

966-
pdf.__doc__ = f'PDF estimate with {h=!r} and {kernel=!r}'
993+
def cdf(x):
994+
i = bisect_left(sample, x - bandwidth)
995+
j = bisect_right(sample, x + bandwidth)
996+
supported = sample[i : j]
997+
return sum((I((x - x_i) / h) for x_i in supported), i) / n
967998

968-
return pdf
999+
if cumulative:
1000+
cdf.__doc__ = f'CDF estimate with {h=!r} and {kernel=!r}'
1001+
return cdf
1002+
1003+
else:
1004+
pdf.__doc__ = f'PDF estimate with {h=!r} and {kernel=!r}'
1005+
return pdf
9691006

9701007

9711008
# Notes on methods for computing quantiles

‎Lib/test/test_statistics.py

Copy file name to clipboardExpand all lines: Lib/test/test_statistics.py
+15-1Lines changed: 15 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2379,6 +2379,18 @@ def integrate(func, low, high, steps=10_000):
23792379
area = integrate(f_hat, -20, 20)
23802380
self.assertAlmostEqual(area, 1.0, places=4)
23812381

2382+
# Check CDF against an integral of the PDF
2383+
2384+
data = [3, 5, 10, 12]
2385+
h = 2.3
2386+
x = 10.5
2387+
for kernel in kernels:
2388+
with self.subTest(kernel=kernel):
2389+
cdf = kde(data, h, kernel, cumulative=True)
2390+
f_hat = kde(data, h, kernel)
2391+
area = integrate(f_hat, -20, x, 100_000)
2392+
self.assertAlmostEqual(cdf(x), area, places=4)
2393+
23822394
# Check error cases
23832395

23842396
with self.assertRaises(StatisticsError):
@@ -2395,6 +2407,8 @@ def integrate(func, low, high, steps=10_000):
23952407
kde(sample, h='str') # Wrong bandwidth type
23962408
with self.assertRaises(StatisticsError):
23972409
kde(sample, h=1.0, kernel='bogus') # Invalid kernel
2410+
with self.assertRaises(TypeError):
2411+
kde(sample, 1.0, 'gauss', True) # Positional cumulative argument
23982412

23992413
# Test name and docstring of the generated function
24002414

@@ -2403,7 +2417,7 @@ def integrate(func, low, high, steps=10_000):
24032417
f_hat = kde(sample, h, kernel)
24042418
self.assertEqual(f_hat.__name__, 'pdf')
24052419
self.assertIn(kernel, f_hat.__doc__)
2406-
self.assertIn(str(h), f_hat.__doc__)
2420+
self.assertIn(repr(h), f_hat.__doc__)
24072421

24082422
# Test closed interval for the support boundaries.
24092423
# In particular, 'uniform' should non-zero at the boundaries.

0 commit comments

Comments
0 (0)
Morty Proxy This is a proxified and sanitized view of the page, visit original site.