Description
Describe the bug
The uniform strategy of KBinsDiscretizer uses numpy.linspace to compute the bin edges.
numpy.linspace works out a step size: delta = (max - min) / num_bins
Edge n is then computed as: min + delta * n
The problem is that the floating point multiplication introduces rounding noise in the low bits.
For example, consider the case of floating point sample values from zero to one and five bins. Then:
delta = 1/5 = 0.2
The right edge of bin 2 (zero indexed) should be 0.6 = 0.2 * 3 but (in my tests) it's 0.6000000000000001
Example python calculation:
>>> 1/5 * 3
0.6000000000000001
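To make the source of the noise visible, here is a small sketch using only the standard library. It assumes nothing about scikit-learn or numpy; it only inspects the float arithmetic above. The names `step`, `product` and `step_error` are illustrative.

```python
import math
from fractions import Fraction

step = 1 / 5        # nearest double to 0.2, not exactly one fifth
product = step * 3  # the multiplication from the example above

# Fraction(float) converts the stored binary value exactly,
# so this is the true error of the stored step.
step_error = Fraction(step) - Fraction(1, 5)
print(f"stored step minus 1/5: {float(step_error):.3e}")

# The product is not the double nearest to 0.6 ...
print(product == 0.6)
# ... it is the next representable double above it.
print(product == math.nextafter(0.6, 1.0))
```

The stored step is slightly larger than one fifth, and after multiplying by 3 the result is rounded to the double just above 0.6 rather than to 0.6 itself.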
Because of this rounding, a sample value of 0.6 gets assigned to bin 2, but it should be in bin 3.
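The effect on bin assignment can be sketched without scikit-learn. The snippet below assumes that a sample is placed by a right-sided search over the interior edges (similar in spirit to numpy.searchsorted with side="right"); `assign_bin` is a hypothetical stand-in, not the library's code.

```python
from bisect import bisect_right

n_bins = 5
step = 1 / n_bins

# Edges built by multiplication, as described above.
noisy_edges = [step * n for n in range(n_bins + 1)]
# Edges written as the intended decimal literals.
exact_edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def assign_bin(x, edges):
    """Hypothetical helper: right-sided search over the interior edges."""
    return bisect_right(edges[1:-1], x)


print(noisy_edges[3])                # 0.6000000000000001
print(assign_bin(0.6, noisy_edges))  # 2
print(assign_bin(0.6, exact_edges))  # 3
```

With the noisy edge, 0.6 is strictly below the edge and stays in bin 2; with the exact edge it falls on the boundary and moves to bin 3.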
One workaround is to use the fractions module or, better still, the decimal module. The code below demonstrates the issue:
#!/usr/bin/env python
import decimal
import fractions
import sys
from typing import NoReturn


def test_float_fractions():
    # check floating point multiplication
    step = 1 / 5
    f_step = fractions.Fraction(1, 5)
    d_step = decimal.Decimal(1) / decimal.Decimal(5)

    print('float vs fractions')
    for n in range(101):
        float_value = step * n
        fraction_value = f_step * n
        fraction_float = float(fraction_value)
        if float_value != fraction_float:
            fraction_str = str(fraction_value)
            print(f'{n:2} float {float_value:20.16f} fraction {fraction_float:20.16f} {fraction_str:>5}')
    print('')

    print('float vs decimals')
    for n in range(101):
        float_value = step * n
        decimal_value = d_step * n
        if float_value != decimal_value:
            print(f'{n:2} float {float_value:23.20f} decimal {decimal_value:23.20f}')


def main(argv) -> NoReturn:
    m = 0
    try:
        test_float_fractions()
    except Exception as e:
        print(f'Exception: {e}')
    sys.exit(m)


if __name__ == '__main__':
    main(sys.argv[1:])
Running the above yields the output below:
float vs fractions
3 float 0.6000000000000001 fraction 0.6000000000000000 3/5
6 float 1.2000000000000002 fraction 1.2000000000000000 6/5
7 float 1.4000000000000001 fraction 1.3999999999999999 7/5
12 float 2.4000000000000004 fraction 2.3999999999999999 12/5
14 float 2.8000000000000003 fraction 2.7999999999999998 14/5
17 float 3.4000000000000004 fraction 3.3999999999999999 17/5
19 float 3.8000000000000003 fraction 3.7999999999999998 19/5
23 float 4.6000000000000005 fraction 4.5999999999999996 23/5
24 float 4.8000000000000007 fraction 4.7999999999999998 24/5
28 float 5.6000000000000005 fraction 5.5999999999999996 28/5
29 float 5.8000000000000007 fraction 5.7999999999999998 29/5
33 float 6.6000000000000005 fraction 6.5999999999999996 33/5
34 float 6.8000000000000007 fraction 6.7999999999999998 34/5
38 float 7.6000000000000005 fraction 7.5999999999999996 38/5
39 float 7.8000000000000007 fraction 7.7999999999999998 39/5
41 float 8.2000000000000011 fraction 8.1999999999999993 41/5
46 float 9.2000000000000011 fraction 9.1999999999999993 46/5
48 float 9.6000000000000014 fraction 9.5999999999999996 48/5
51 float 10.2000000000000011 fraction 10.1999999999999993 51/5
53 float 10.6000000000000014 fraction 10.5999999999999996 53/5
56 float 11.2000000000000011 fraction 11.1999999999999993 56/5
58 float 11.6000000000000014 fraction 11.5999999999999996 58/5
61 float 12.2000000000000011 fraction 12.1999999999999993 61/5
63 float 12.6000000000000014 fraction 12.5999999999999996 63/5
66 float 13.2000000000000011 fraction 13.1999999999999993 66/5
68 float 13.6000000000000014 fraction 13.5999999999999996 68/5
71 float 14.2000000000000011 fraction 14.1999999999999993 71/5
73 float 14.6000000000000014 fraction 14.5999999999999996 73/5
76 float 15.2000000000000011 fraction 15.1999999999999993 76/5
78 float 15.6000000000000014 fraction 15.5999999999999996 78/5
82 float 16.4000000000000021 fraction 16.3999999999999986 82/5
87 float 17.4000000000000021 fraction 17.3999999999999986 87/5
92 float 18.4000000000000021 fraction 18.3999999999999986 92/5
96 float 19.2000000000000028 fraction 19.1999999999999993 96/5
97 float 19.4000000000000021 fraction 19.3999999999999986 97/5
float vs decimals
1 float 0.20000000000000001110 decimal 0.20000000000000000000
2 float 0.40000000000000002220 decimal 0.40000000000000000000
3 float 0.60000000000000008882 decimal 0.60000000000000000000
4 float 0.80000000000000004441 decimal 0.80000000000000000000
6 float 1.20000000000000017764 decimal 1.20000000000000000000
7 float 1.40000000000000013323 decimal 1.40000000000000000000
8 float 1.60000000000000008882 decimal 1.60000000000000000000
9 float 1.80000000000000004441 decimal 1.80000000000000000000
11 float 2.20000000000000017764 decimal 2.20000000000000000000
12 float 2.40000000000000035527 decimal 2.40000000000000000000
13 float 2.60000000000000008882 decimal 2.60000000000000000000
14 float 2.80000000000000026645 decimal 2.80000000000000000000
16 float 3.20000000000000017764 decimal 3.20000000000000000000
17 float 3.40000000000000035527 decimal 3.40000000000000000000
18 float 3.60000000000000008882 decimal 3.60000000000000000000
19 float 3.80000000000000026645 decimal 3.80000000000000000000
21 float 4.20000000000000017764 decimal 4.20000000000000000000
22 float 4.40000000000000035527 decimal 4.40000000000000000000
23 float 4.60000000000000053291 decimal 4.60000000000000000000
24 float 4.80000000000000071054 decimal 4.80000000000000000000
26 float 5.20000000000000017764 decimal 5.20000000000000000000
27 float 5.40000000000000035527 decimal 5.40000000000000000000
28 float 5.60000000000000053291 decimal 5.60000000000000000000
29 float 5.80000000000000071054 decimal 5.80000000000000000000
31 float 6.20000000000000017764 decimal 6.20000000000000000000
32 float 6.40000000000000035527 decimal 6.40000000000000000000
33 float 6.60000000000000053291 decimal 6.60000000000000000000
34 float 6.80000000000000071054 decimal 6.80000000000000000000
36 float 7.20000000000000017764 decimal 7.20000000000000000000
37 float 7.40000000000000035527 decimal 7.40000000000000000000
38 float 7.60000000000000053291 decimal 7.60000000000000000000
39 float 7.80000000000000071054 decimal 7.80000000000000000000
41 float 8.20000000000000106581 decimal 8.20000000000000000000
42 float 8.40000000000000035527 decimal 8.40000000000000000000
43 float 8.59999999999999964473 decimal 8.60000000000000000000
44 float 8.80000000000000071054 decimal 8.80000000000000000000
46 float 9.20000000000000106581 decimal 9.20000000000000000000
47 float 9.40000000000000035527 decimal 9.40000000000000000000
48 float 9.60000000000000142109 decimal 9.60000000000000000000
49 float 9.80000000000000071054 decimal 9.80000000000000000000
51 float 10.20000000000000106581 decimal 10.20000000000000000000
52 float 10.40000000000000035527 decimal 10.40000000000000000000
53 float 10.60000000000000142109 decimal 10.60000000000000000000
54 float 10.80000000000000071054 decimal 10.80000000000000000000
56 float 11.20000000000000106581 decimal 11.20000000000000000000
57 float 11.40000000000000035527 decimal 11.40000000000000000000
58 float 11.60000000000000142109 decimal 11.60000000000000000000
59 float 11.80000000000000071054 decimal 11.80000000000000000000
61 float 12.20000000000000106581 decimal 12.20000000000000000000
62 float 12.40000000000000035527 decimal 12.40000000000000000000
63 float 12.60000000000000142109 decimal 12.60000000000000000000
64 float 12.80000000000000071054 decimal 12.80000000000000000000
66 float 13.20000000000000106581 decimal 13.20000000000000000000
67 float 13.40000000000000035527 decimal 13.40000000000000000000
68 float 13.60000000000000142109 decimal 13.60000000000000000000
69 float 13.80000000000000071054 decimal 13.80000000000000000000
71 float 14.20000000000000106581 decimal 14.20000000000000000000
72 float 14.40000000000000035527 decimal 14.40000000000000000000
73 float 14.60000000000000142109 decimal 14.60000000000000000000
74 float 14.80000000000000071054 decimal 14.80000000000000000000
76 float 15.20000000000000106581 decimal 15.20000000000000000000
77 float 15.40000000000000035527 decimal 15.40000000000000000000
78 float 15.60000000000000142109 decimal 15.60000000000000000000
79 float 15.80000000000000071054 decimal 15.80000000000000000000
81 float 16.19999999999999928946 decimal 16.20000000000000000000
82 float 16.40000000000000213163 decimal 16.40000000000000000000
83 float 16.60000000000000142109 decimal 16.60000000000000000000
84 float 16.80000000000000071054 decimal 16.80000000000000000000
86 float 17.19999999999999928946 decimal 17.20000000000000000000
87 float 17.40000000000000213163 decimal 17.40000000000000000000
88 float 17.60000000000000142109 decimal 17.60000000000000000000
89 float 17.80000000000000071054 decimal 17.80000000000000000000
91 float 18.19999999999999928946 decimal 18.20000000000000000000
92 float 18.40000000000000213163 decimal 18.40000000000000000000
93 float 18.60000000000000142109 decimal 18.60000000000000000000
94 float 18.80000000000000071054 decimal 18.80000000000000000000
96 float 19.20000000000000284217 decimal 19.20000000000000000000
97 float 19.40000000000000213163 decimal 19.40000000000000000000
98 float 19.60000000000000142109 decimal 19.60000000000000000000
99 float 19.80000000000000071054 decimal 19.80000000000000000000
My suggestion is to use the decimal module for uniform discretisation.
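A minimal sketch of what that could look like is shown below. `uniform_edges_decimal` is a hypothetical helper, not an existing scikit-learn or numpy function, and it assumes the bounds should be interpreted through their shortest decimal representation (via `repr`), which is a design choice rather than something the library does today.

```python
from decimal import Decimal


def uniform_edges_decimal(lo, hi, n_bins):
    """Hypothetical helper: uniform bin edges computed in decimal arithmetic.

    The bounds are converted through repr() so that 0.2 is read as the
    decimal 0.2 rather than as its binary approximation.
    """
    d_lo = Decimal(repr(float(lo)))
    d_hi = Decimal(repr(float(hi)))
    width = d_hi - d_lo
    # Each edge is computed in decimal and rounded to float only once.
    return [float(d_lo + width * n / n_bins) for n in range(n_bins + 1)]


edges = uniform_edges_decimal(0.0, 1.0, 5)
print(edges)  # [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
```

For bin counts such as 3, the decimal quotient does not terminate and is rounded to the decimal context precision before the conversion to float, so this is an illustration of the idea rather than a complete fix.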
Steps/Code to Reproduce
#!/usr/bin/env python
import decimal
import fractions
import sys
from typing import NoReturn

import numpy
import sklearn.preprocessing


def test_sklearn_uniform_bug():
    # sample values
    values = numpy.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
    # expected quantised bin
    # right side of bin <.2 <.2 <.4 <.4 <.6 <.6 <.8 <.8 <1 <1 <1
    expected = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 4]
    # reshape to list of lists
    reshaped = numpy.reshape(values, shape=(-1, 1))
    qnt = sklearn.preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
    qnt.fit(reshaped)
    fitted = qnt.transform(reshaped)
    # reshape to list of bins
    quantised = numpy.reshape(fitted, shape=(-1))

    # check bin assignment
    for i in range(len(expected)):
        if quantised[i] != expected[i]:
            print(f'bin for {values[i]} {quantised[i]} != {expected[i]}')

    # check bin edges
    expected_bins = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
    bin_edges = qnt.bin_edges_[0]
    for i in range(len(expected_bins)):
        if expected_bins[i] != bin_edges[i]:
            print(f'bin edge {expected_bins[i]} != {bin_edges[i]}')

    # check floating point multiplication
    step = 1/5
    bin_3 = step * 3
    if bin_3 != 0.6:
        print(f'floating point multiplication {bin_3} != 0.6')
    else:
        print('floating point multiplication ok')

    # check fractions multiplication
    f_step = fractions.Fraction(1, 5)
    f_bin_3 = 3 * f_step
    if float(f_bin_3) != 0.6:
        print(f'fractions multiplication {f_bin_3} != 0.6')
    else:
        print('fractions multiplication ok')

    # check decimal multiplication
    d_step = decimal.Decimal(1) / decimal.Decimal(5)
    d_bin_3 = 3 * d_step
    if float(d_bin_3) != 0.6:
        print(f'decimal multiplication {d_bin_3} != 0.6')
    else:
        print('decimal multiplication ok')


def main(argv) -> NoReturn:
    m = 0
    try:
        test_sklearn_uniform_bug()
    except Exception as e:
        print(f'Exception: {e}')
    sys.exit(m)


if __name__ == '__main__':
    main(sys.argv[1:])
Expected Results
floating point multiplication ok
fractions multiplication ok
decimal multiplication ok
Actual Results
bin for 0.6 2.0 != 3
bin edge 0.6 != 0.6000000000000001
floating point multiplication 0.6000000000000001 != 0.6
fractions multiplication ok
decimal multiplication ok
Versions
System:
    python: 3.12.8 (main, Dec 30 2024, 15:10:22) [Clang 16.0.0 (clang-1600.0.26.6)]
    executable: /Users/simonb/.pyenv/versions/nmc/bin/python
    machine: macOS-15.3.1-arm64-arm-64bit

Python dependencies:
    sklearn: 1.6.1
    pip: 25.0
    setuptools: 75.8.0
    numpy: 2.2.2
    scipy: 1.15.1
    Cython: None
    pandas: 2.2.3
    matplotlib: 3.10.0
    joblib: 1.4.2
    threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
    user_api: openmp
    internal_api: openmp
    num_threads: 11
    prefix: libomp
    filepath: /Users/simonb/.pyenv/versions/3.12.8/envs/nmc/lib/python3.12/site-packages/sklearn/.dylibs/libomp.dylib
    version: None