k-Means Clustering

Problem:

Given a set of ten points D = {(-6,4), (-3,7), (1,6), (-4,0), (0,-1), (11,7), (8,3), (13,3), (8,-1), (12,-2)} as shown in the figure,

assume two clusters (k = 2) and use the k-means algorithm to find them. The starting points (seeds) can be created arbitrarily or chosen from the given points.

We will repeat the algorithm several times from different starting points.

  1. The first pair of starting points is (-3,7) and (11,7). Compute the two clusters and show the separating line at each iteration.
  2. The second pair of starting points is (8,3) and (8,-1). Compute the two clusters and show the separating line at each iteration.
  3. The third pair of starting points is (0,-1) and (11,7). Compute the two clusters and show the separating line at each iteration.
  4. From the figure of the ten points it looks like there are two clusters – the five points on the left (“left” cluster) and the five points on the right (“right” cluster).
    • There are a total of 45 possible starting pairs of points, the combinations of 10 points taken two at a time. How many of the 45 pairs do you think will yield the “left” and “right” clusters?
    • The usually recommendation for the k-means algorithm is to take the starting points far from each other. Therefore, taking only the pairs of points having an x1 or x2 difference of ten or more units (about 21 pairs), how many of these pairs do you think will yield the “left” and “right” clusters?
In [3]:
# Figure
plt.scatter(all_data[:,0], all_data[:,1]);plt.grid()

Solutions:

note:

  • The separating line is not plotted in the answers below; instead, the line between the final means is plotted with its midpoint where the separating line can be drawn perpendicularly.

  • Solutions of intermediate iterations are not plotted (only final solution).

  • To reproduce the solutions, run the cells in the same order they are labeled in In[..]

__author__ = "A.Aziz Altowayan"
__email__ = "aa10212w@pace.edu"
__date__ = "Wed Dec  3 23:55:05 EST 2014"
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import math
from __future__ import division
In [2]:
all_data = np.array([[-6,4], [-3,7], [1,6], [-4,0], [0,-1], [11,7], [8,3], [13,3], [8,-1], [12,-2]])

1)

In [5]:
# exmple 1
seed1, seed2 = [-3,7], [11,7] # ex1: LEFT-RIGHT clusters
c1, c2, m, mid = cluster(seed1, seed2)
report_result(seed1, seed2, c1, c2 ,m , mid)
ITERATION:
	starting seed: [-3, 7][11, 7] means: [-2.4  3.2] [ 10.4   2. ]
FINAL:
pair: [-3, 7][11, 7]
means: [-2.4  3.2][ 10.4   2. ]
mid: [ 4.   2.6]
cluster1:[array([-6,  4]), array([-3,  7]), array([1, 6]), array([-4,  0]), array([ 0, -1])]
cluster2:[array([11,  7]), array([8, 3]), array([13,  3]), array([ 8, -1]), array([12, -2])]

2)

In [6]:
# example 2
seed1, seed2 = [8,3], [8,-1] # ex2: UP-DOWN clusters
c1, c2, m, mid = cluster(seed1, seed2)
report_result(seed1, seed2, c1, c2 ,m , mid)
ITERATION:
	starting seed: [8, 3][8, -1] means: [ 4.  5.] [ 4. -1.]
FINAL:
pair: [8, 3][8, -1]
means: [ 4.  5.][ 4. -1.]
mid: [ 4.  2.]
cluster1:[array([-6,  4]), array([-3,  7]), array([1, 6]), array([11,  7]), array([8, 3]), array([13,  3])]
cluster2:[array([-4,  0]), array([ 0, -1]), array([ 8, -1]), array([12, -2])]

3)

In [7]:
# example 3
seed1, seed2 = [0,-1], [11,7] # ex3: LEFT-RIGHT clusters
c1, c2, m, mid = cluster(seed1, seed2)
report_result(seed1, seed2, c1, c2 ,m , mid)
ITERATION:
	starting seed: [0, -1][11, 7] means: [-0.67  2.5 ] [ 11.     2.75]
ITERATION:
	starting seed: [-0.67  2.5 ][ 11.     2.75] means: [-2.4  3.2] [ 10.4   2. ]
FINAL:
pair: [0, -1][11, 7]
means: [-2.4  3.2][ 10.4   2. ]
mid: [ 4.   2.6]
cluster1:[array([-6,  4]), array([-3,  7]), array([1, 6]), array([-4,  0]), array([ 0, -1])]
cluster2:[array([11,  7]), array([8, 3]), array([13,  3]), array([ 8, -1]), array([12, -2])]

4)

hard-coded

In [9]:
lr = len([f for f in mids if np.all(f == [4,2.6])])
ud = len([f for f in mids if np.all(f == [4,2])])
print("Number of pairs yield LEFT-RIGHT clusters: {}".format(lr))
print("Number of pairs yield UP-DOWN clusters: {}".format(ud))
Number of pairs yield LEFT-RIGHT clusters: 41
Number of pairs yield UP-DOWN clusters: 4

In [8]:
from itertools import combinations
i = 0
mids = []
for a,b in combinations(all_data, 2):
    c1, c2, m, mid = cluster(a, b)
    print("Pair{}: {}{}".format(i,a,b)),
    print("\tfinal means: {} {}\tmidpoint: {}".format(m[0], m[1], mid)),
    if np.all(mid == [4,2]):
        print(":: UP and DOWN clusters ")
    elif np.all(mid == [4,2.6]):
        print(":: LEFT and RIGHT clusters ")
    mids.append(mid)
    i+=1
Pair0: [-6  4][-3  7] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair1: [-6  4][1 6] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair2: [-6  4][-4  0] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair3: [-6  4][ 0 -1] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair4: [-6  4][11  7] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair5: [-6  4][8 3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair6: [-6  4][13  3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair7: [-6  4][ 8 -1] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair8: [-6  4][12 -2] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair9: [-3  7][1 6] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair10: [-3  7][-4  0] 	final means: [ 4.  5.] [ 4. -1.]	midpoint: [ 4.  2.] :: UP and DOWN clusters 
Pair11: [-3  7][ 0 -1] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair12: [-3  7][11  7] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair13: [-3  7][8 3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair14: [-3  7][13  3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair15: [-3  7][ 8 -1] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair16: [-3  7][12 -2] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair17: [1 6][-4  0] 	final means: [ 10.4   2. ] [-2.4  3.2]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair18: [1 6][ 0 -1] 	final means: [ 4.  5.] [ 4. -1.]	midpoint: [ 4.  2.] :: UP and DOWN clusters 
Pair19: [1 6][11  7] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair20: [1 6][8 3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair21: [1 6][13  3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair22: [1 6][ 8 -1] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair23: [1 6][12 -2] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair24: [-4  0][ 0 -1] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair25: [-4  0][11  7] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair26: [-4  0][8 3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair27: [-4  0][13  3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair28: [-4  0][ 8 -1] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair29: [-4  0][12 -2] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair30: [ 0 -1][11  7] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair31: [ 0 -1][8 3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair32: [ 0 -1][13  3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair33: [ 0 -1][ 8 -1] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair34: [ 0 -1][12 -2] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair35: [11  7][8 3] 	final means: [ 10.4   2. ] [-2.4  3.2]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair36: [11  7][13  3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair37: [11  7][ 8 -1] 	final means: [ 10.4   2. ] [-2.4  3.2]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair38: [11  7][12 -2] 	final means: [ 4.  5.] [ 4. -1.]	midpoint: [ 4.  2.] :: UP and DOWN clusters 
Pair39: [8 3][13  3] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair40: [8 3][ 8 -1] 	final means: [ 4.  5.] [ 4. -1.]	midpoint: [ 4.  2.] :: UP and DOWN clusters 
Pair41: [8 3][12 -2] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair42: [13  3][ 8 -1] 	final means: [ 10.4   2. ] [-2.4  3.2]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair43: [13  3][12 -2] 	final means: [ 10.4   2. ] [-2.4  3.2]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 
Pair44: [ 8 -1][12 -2] 	final means: [-2.4  3.2] [ 10.4   2. ]	midpoint: [ 4.   2.6] :: LEFT and RIGHT clusters 

In [4]:
# k-means algorithm 'recursively'
def cluster(seed1, seed2):

    c1 = []
    c2 = []
    # cluster each data point to its closest seed
    for p in all_data:
        nearest = get_closer(seed1, seed2, p)
        if np.all(seed1 == nearest):
            c1.append([p[0],p[1]])
        else:
            c2.append([p[0],p[1]])
    c1, c2 = np.array(c1), np.array(c2)
    
    # get the new means
    m11 = float("{0:.2f}".format(c1[:,0].mean()))
    m12 = float("{0:.2f}".format(c1[:,1].mean()))
    m21 = float("{0:.2f}".format(c2[:,0].mean()))
    m22 = float("{0:.2f}".format(c2[:,1].mean()))
    
    # update seeds as the new means
    c1_seed = np.array([m11, m12])
    c2_seed = np.array([m21, m22])

    # return when converge
    if np.all(seed1 == c1_seed) and np.all(seed2 == c2_seed):
        m = np.array([seed1, seed2])
        mid = get_midpoint(m[0], m[1])
        return c1, c2, m, mid

    # if not converge, recurse with new means as the seeds
    print "ITERATION:\n\t", "starting seed: {}{} means: {} {}".format(seed1, seed2, c1_seed, c2_seed)
    return cluster(c1_seed, c2_seed)


# Return distance between two points 'Euclidean'
def distance(p0, p1):
    return math.sqrt((p0[0] - p1[0])**2 + (p0[1] - p1[1])**2)


# Return the point (c1 or c2) closer to p
def get_closer(c1, c2, p):
    dc1 = distance(c1, p)
    dc2 = distance(c2, p)
    if dc1 < dc2:
        closer = c1
    else:
        closer = c2
    return closer

# Return midpoint between p1 and p2
def get_midpoint(p1, p2):
    mid = (p1[0] + p2[0])/2, (p1[1] + p2[1])/2
    return np.array(mid)


# plot and print results
def report_result(seed1, seed2, c1, c2 ,m , mid):
    s1 = np.array([seed1, seed2])
    print("FINAL:\npair: {}{}\nmeans: {}{}\nmid: {}\ncluster1:{}\ncluster2:{}".format(seed1,seed2,m[0],m[1], mid,[c for c in c1],[c for c in c2]))
    # plot
    plt.scatter(c1[:,0], c1[:,1], c='r', label="cluster 1")
    plt.scatter(c2[:,0], c2[:,1], c='b', label="cluster 2")
    plt.plot(s1[:,0], s1[:,1], '--g', label="starting pair")
    plt.scatter(m.T[0], m.T[1], marker="d", c='y', label="means: 1 and 2")
    plt.plot(m.T[0], m.T[1], c='g') # line between means
    plt.scatter(mid[0], mid[1], marker='x', label='midpoint')
    plt.grid();plt.legend(bbox_to_anchor = (2, 1));plt.show()