Given a set of ten points D = {(-6,4), (-3,7), (1,6), (-4,0), (0,-1), (11,7), (8,3), (13,3), (8,-1), (12,-2)} as shown in the figure,

assume two clusters (k = 2) and use the k-means algorithm to find them. The starting points (seeds) can be created arbitrarily or chosen from the given points.

We will repeat the algorithm several times from different starting points.

- The first pair of starting points is (-3,7) and (11,7). Compute the two clusters and show the separating line at each iteration.
- The second pair of starting points is (8,3) and (8,-1). Compute the two clusters and show the separating line at each iteration.
- The third pair of starting points is (0,-1) and (11,7). Compute the two clusters and show the separating line at each iteration.
- From the figure of the ten points it looks like there are two clusters – the five points on the left (“left” cluster) and the five points on the right (“right” cluster).
- There are a total of 45 possible starting pairs of points, the combinations of 10 points taken two at a time. How many of the 45 pairs do you think will yield the “left” and “right” clusters?
- The usually recommendation for the k-means algorithm is to take the starting points far from each other. Therefore, taking only the pairs of points having an x1 or x2 difference of ten or more units (about 21 pairs), how many of these pairs do you think will yield the “left” and “right” clusters?

In [3]:

```
# Figure
plt.scatter(all_data[:,0], all_data[:,1]);plt.grid()
```

note:

The separating line is not plotted in the answers below; instead, the line between the final means is plotted with its midpoint where the separating line can be drawn perpendicularly.

Solutions of intermediate iterations are not plotted (only final solution).

To reproduce the solutions, run the cells in the same order they are labeled in

`In[..]`

```
__author__ = "A.Aziz Altowayan"
__email__ = "aa10212w@pace.edu"
__date__ = "Wed Dec 3 23:55:05 EST 2014"
```

In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
import math
from __future__ import division
```

In [2]:

```
all_data = np.array([[-6,4], [-3,7], [1,6], [-4,0], [0,-1], [11,7], [8,3], [13,3], [8,-1], [12,-2]])
```

In [5]:

```
# exmple 1
seed1, seed2 = [-3,7], [11,7] # ex1: LEFT-RIGHT clusters
c1, c2, m, mid = cluster(seed1, seed2)
report_result(seed1, seed2, c1, c2 ,m , mid)
```

In [6]:

```
# example 2
seed1, seed2 = [8,3], [8,-1] # ex2: UP-DOWN clusters
c1, c2, m, mid = cluster(seed1, seed2)
report_result(seed1, seed2, c1, c2 ,m , mid)
```

In [7]:

```
# example 3
seed1, seed2 = [0,-1], [11,7] # ex3: LEFT-RIGHT clusters
c1, c2, m, mid = cluster(seed1, seed2)
report_result(seed1, seed2, c1, c2 ,m , mid)
```

hard-coded

In [9]:

```
lr = len([f for f in mids if np.all(f == [4,2.6])])
ud = len([f for f in mids if np.all(f == [4,2])])
print("Number of pairs yield LEFT-RIGHT clusters: {}".format(lr))
print("Number of pairs yield UP-DOWN clusters: {}".format(ud))
```

In [8]:

```
from itertools import combinations
i = 0
mids = []
for a,b in combinations(all_data, 2):
c1, c2, m, mid = cluster(a, b)
print("Pair{}: {}{}".format(i,a,b)),
print("\tfinal means: {} {}\tmidpoint: {}".format(m[0], m[1], mid)),
if np.all(mid == [4,2]):
print(":: UP and DOWN clusters ")
elif np.all(mid == [4,2.6]):
print(":: LEFT and RIGHT clusters ")
mids.append(mid)
i+=1
```

In [4]:

```
# k-means algorithm 'recursively'
def cluster(seed1, seed2):
c1 = []
c2 = []
# cluster each data point to its closest seed
for p in all_data:
nearest = get_closer(seed1, seed2, p)
if np.all(seed1 == nearest):
c1.append([p[0],p[1]])
else:
c2.append([p[0],p[1]])
c1, c2 = np.array(c1), np.array(c2)
# get the new means
m11 = float("{0:.2f}".format(c1[:,0].mean()))
m12 = float("{0:.2f}".format(c1[:,1].mean()))
m21 = float("{0:.2f}".format(c2[:,0].mean()))
m22 = float("{0:.2f}".format(c2[:,1].mean()))
# update seeds as the new means
c1_seed = np.array([m11, m12])
c2_seed = np.array([m21, m22])
# return when converge
if np.all(seed1 == c1_seed) and np.all(seed2 == c2_seed):
m = np.array([seed1, seed2])
mid = get_midpoint(m[0], m[1])
return c1, c2, m, mid
# if not converge, recurse with new means as the seeds
print "ITERATION:\n\t", "starting seed: {}{} means: {} {}".format(seed1, seed2, c1_seed, c2_seed)
return cluster(c1_seed, c2_seed)
# Return distance between two points 'Euclidean'
def distance(p0, p1):
return math.sqrt((p0[0] - p1[0])**2 + (p0[1] - p1[1])**2)
# Return the point (c1 or c2) closer to p
def get_closer(c1, c2, p):
dc1 = distance(c1, p)
dc2 = distance(c2, p)
if dc1 < dc2:
closer = c1
else:
closer = c2
return closer
# Return midpoint between p1 and p2
def get_midpoint(p1, p2):
mid = (p1[0] + p2[0])/2, (p1[1] + p2[1])/2
return np.array(mid)
# plot and print results
def report_result(seed1, seed2, c1, c2 ,m , mid):
s1 = np.array([seed1, seed2])
print("FINAL:\npair: {}{}\nmeans: {}{}\nmid: {}\ncluster1:{}\ncluster2:{}".format(seed1,seed2,m[0],m[1], mid,[c for c in c1],[c for c in c2]))
# plot
plt.scatter(c1[:,0], c1[:,1], c='r', label="cluster 1")
plt.scatter(c2[:,0], c2[:,1], c='b', label="cluster 2")
plt.plot(s1[:,0], s1[:,1], '--g', label="starting pair")
plt.scatter(m.T[0], m.T[1], marker="d", c='y', label="means: 1 and 2")
plt.plot(m.T[0], m.T[1], c='g') # line between means
plt.scatter(mid[0], mid[1], marker='x', label='midpoint')
plt.grid();plt.legend(bbox_to_anchor = (2, 1));plt.show()
```