Description
Describe the workflow you want to enable
I would like a new class, sklearn.cluster.KMedians (or an option on sklearn.cluster.KMeans), that updates the cluster centers using medians instead of means.
K-medians clustering can give noticeably better results on data that contains outliers, because the median is much less sensitive to extreme values than the mean.
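As a small illustration of the claim above, here is a sketch with made-up one-dimensional values (the data is hypothetical, chosen only to include one obvious outlier). It compares the mean and the median as candidate cluster centers:

```python
from statistics import mean, median

# Hypothetical 1-D cluster; 100.0 is a deliberate outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

center_mean = mean(values)      # pulled toward the outlier
center_median = median(values)  # stays with the bulk of the data

print(center_mean)    # 22.0
print(center_median)  # 3.0
```

The mean lands far from every non-outlier point, while the median stays inside the main group.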
Describe your proposed solution
Create a new class sklearn.cluster.KMedians that works the same way as sklearn.cluster.KMeans but computes each new centroid as the median, rather than the mean, of the points assigned to it.
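To make the request more concrete, here is a rough, standard-library-only sketch of what such an estimator might look like. Everything in it is an assumption for illustration: the class is hypothetical and not part of scikit-learn, the `init_centers` and `max_iter` parameters are invented names, and it uses a per-coordinate median with Manhattan (L1) distance, which is a common k-medians formulation but not necessarily what a scikit-learn implementation would choose.

```python
from statistics import median


class KMedians:
    """Hypothetical sketch of a k-medians estimator (not a scikit-learn API)."""

    def __init__(self, init_centers, max_iter=100):
        # init_centers and max_iter are illustrative parameter names.
        self.init_centers = [list(c) for c in init_centers]
        self.max_iter = max_iter

    @staticmethod
    def _l1(a, b):
        # Manhattan distance between two points.
        return sum(abs(p - q) for p, q in zip(a, b))

    def _nearest(self, point):
        # Index of the closest current center.
        dists = [self._l1(point, c) for c in self.cluster_centers_]
        return dists.index(min(dists))

    def fit(self, X):
        self.cluster_centers_ = [list(c) for c in self.init_centers]
        for _ in range(self.max_iter):
            labels = [self._nearest(p) for p in X]
            new_centers = []
            for k, old in enumerate(self.cluster_centers_):
                members = [p for p, lab in zip(X, labels) if lab == k]
                if members:
                    # Per-coordinate median of the assigned points.
                    new_centers.append([median(col) for col in zip(*members)])
                else:
                    # Keep the previous center if a cluster is empty.
                    new_centers.append(old)
            if new_centers == self.cluster_centers_:
                break
            self.cluster_centers_ = new_centers
        self.labels_ = [self._nearest(p) for p in X]
        return self

    def predict(self, X):
        return [self._nearest(p) for p in X]


X = [[1, 2], [2, 1], [1, 3], [5, 4], [6, 3], [7, 2], [6, 1]]
model = KMedians(init_centers=[[2, 2], [3, 4], [6, 2]]).fit(X)
print(model.labels_)           # [0, 0, 0, 1, 2, 2, 2]
print(model.cluster_centers_)  # [[1, 2], [5, 4], [6, 2]]
```

The `fit`/`predict` methods and the trailing-underscore attributes only mimic the general shape of scikit-learn estimators; a real implementation would also need initialization strategies, NumPy array handling, and input validation.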
Describe alternatives you've considered, if relevant
Using a version I wrote:
import enum
import math
# Assumption: `median` is statistics.median; the original snippet did not show
# its imports. Applied to a list of points, it sorts them lexicographically
# and returns the middle one, so it only works for clusters with an odd
# number of points and is not a per-coordinate median.
from statistics import median


class mu_type(enum.IntEnum):
    mean = 1
    median = 2


#
# Euclidean distance between two vectors
#
def euclidean_distance(row1, row2):
    distance = 0.0
    for i in range(len(row1)):
        distance += (row1[i] - row2[i])**2
    return math.sqrt(distance)


#
# k-means and k-medians clustering
# x is the list of 2-D points, c is the list of initial center points
# mu defines if the algorithm runs as k-means or k-medians
#
def k_m_clustering_2(x, c, mu=mu_type.mean):
    d = [[0]*len(c) for i1 in range(len(x))]  # d[i][j]: distance from point i to center j
    l = [0]*len(x)                            # l[i]: index of the center closest to point i
    c_last = [0]*len(c)

    while c_last != c:
        # Save the last list of center points to compare for updates later
        c_last = c.copy()

        #
        # Assign each point to its nearest center
        #
        for i in range(len(x)):
            for j in range(len(c)):
                if DEBUG:
                    print('distance between: ', end='')
                    print(x[i], end='')
                    print(' and ', end='')
                    print(c[j], end='')
                d[i][j] = euclidean_distance(x[i], c[j])
                if DEBUG:
                    print(' = ', end='')
                    print(d[i][j])
            l[i] = d[i].index(min(d[i]))

        if DEBUG:
            print("L = ", end='')
            print(l)
            print()
            print()

        if DEBUG:
            print("Old center points: ", end='')
            print(c)
            print()

        #
        # Update center points
        #
        for i in range(len(c)):
            #
            # Compute mean or median for new center points
            # (neither branch handles a cluster with no assigned points)
            #
            if mu == mu_type.mean:
                count = 0
                _x = []
                _y = []
                for j in range(len(l)):
                    if l[j] == i:
                        count += 1
                        _x.append(x[j][0])
                        _y.append(x[j][1])
                c[i] = [(sum(_x)/count), (sum(_y)/count)]
            elif mu == mu_type.median:
                x_y = []
                for j in range(len(l)):
                    if l[j] == i:
                        x_y.append([x[j][0], x[j][1]])
                c[i] = median(x_y)

        if DEBUG:
            print("New center points: ", end='')
            print(c)
            print()

    if DEBUG:
        print("Center points have not changed\n")
        print("Final: ", end='')
        print(l)
        print()

    return c, l


# Call it:
DEBUG = True

c = [[2, 2], [3, 4], [6, 2]]
x = [[1, 2], [2, 1], [1, 3], [5, 4], [6, 3], [7, 2], [6, 1]]

c_arr, l_arr = k_m_clustering_2(x, c, mu_type.median)

for i in range(len(x)):
    print('x' + str(i+1) + ' = ' + str(l_arr[i]))
print('\n')
Output:
distance between: [1, 2] and [2, 2] = 1.0
distance between: [1, 2] and [3, 4] = 2.8284271247461903
distance between: [1, 2] and [6, 2] = 5.0
distance between: [2, 1] and [2, 2] = 1.0
distance between: [2, 1] and [3, 4] = 3.1622776601683795
distance between: [2, 1] and [6, 2] = 4.123105625617661
distance between: [1, 3] and [2, 2] = 1.4142135623730951
distance between: [1, 3] and [3, 4] = 2.23606797749979
distance between: [1, 3] and [6, 2] = 5.0990195135927845
distance between: [5, 4] and [2, 2] = 3.605551275463989
distance between: [5, 4] and [3, 4] = 2.0
distance between: [5, 4] and [6, 2] = 2.23606797749979
distance between: [6, 3] and [2, 2] = 4.123105625617661
distance between: [6, 3] and [3, 4] = 3.1622776601683795
distance between: [6, 3] and [6, 2] = 1.0
distance between: [7, 2] and [2, 2] = 5.0
distance between: [7, 2] and [3, 4] = 4.47213595499958
distance between: [7, 2] and [6, 2] = 1.0
distance between: [6, 1] and [2, 2] = 4.123105625617661
distance between: [6, 1] and [3, 4] = 4.242640687119285
distance between: [6, 1] and [6, 2] = 1.0
L = [0, 0, 0, 1, 2, 2, 2]
Old center points: [[2, 2], [3, 4], [6, 2]]
New center points: [[1, 3], [5, 4], [6, 3]]
distance between: [1, 2] and [1, 3] = 1.0
distance between: [1, 2] and [5, 4] = 4.47213595499958
distance between: [1, 2] and [6, 3] = 5.0990195135927845
distance between: [2, 1] and [1, 3] = 2.23606797749979
distance between: [2, 1] and [5, 4] = 4.242640687119285
distance between: [2, 1] and [6, 3] = 4.47213595499958
distance between: [1, 3] and [1, 3] = 0.0
distance between: [1, 3] and [5, 4] = 4.123105625617661
distance between: [1, 3] and [6, 3] = 5.0
distance between: [5, 4] and [1, 3] = 4.123105625617661
distance between: [5, 4] and [5, 4] = 0.0
distance between: [5, 4] and [6, 3] = 1.4142135623730951
distance between: [6, 3] and [1, 3] = 5.0
distance between: [6, 3] and [5, 4] = 1.4142135623730951
distance between: [6, 3] and [6, 3] = 0.0
distance between: [7, 2] and [1, 3] = 6.082762530298219
distance between: [7, 2] and [5, 4] = 2.8284271247461903
distance between: [7, 2] and [6, 3] = 1.4142135623730951
distance between: [6, 1] and [1, 3] = 5.385164807134504
distance between: [6, 1] and [5, 4] = 3.1622776601683795
distance between: [6, 1] and [6, 3] = 2.0
L = [0, 0, 0, 1, 2, 2, 2]
Old center points: [[1, 3], [5, 4], [6, 3]]
New center points: [[1, 3], [5, 4], [6, 3]]
Center points have not changed
Final: [0, 0, 0, 1, 2, 2, 2]
x1 = 0
x2 = 0
x3 = 0
x4 = 1
x5 = 2
x6 = 2
x7 = 2
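One detail in the output above is worth checking: the first cluster's new center is [1, 3], not the per-coordinate median [1, 2]. Assuming the snippet's `median` is `statistics.median` (the imports are not shown, so this is a guess, though it matches the reported centers), the function sorts the points as whole lists and returns the middle one. A short sketch of the difference, using the first cluster's points:

```python
from statistics import median

# Points assigned to cluster 0 in the run above.
cluster = [[1, 2], [2, 1], [1, 3]]

# statistics.median sorts the lists lexicographically and picks the middle one.
lexicographic = median(cluster)

# Per-coordinate median: take the median of each column separately.
componentwise = [median(col) for col in zip(*cluster)]

print(lexicographic)  # [1, 3]
print(componentwise)  # [1, 2]
```

A KMedians implementation would most likely want the per-coordinate form, which also avoids the error that `statistics.median` raises on a list of points when the cluster has an even number of members.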
Additional context
No response