K-Means is an algorithm for unsupervised clustering: that is, finding clusters in data based on the data attributes alone (not the labels).
K-Means searches for cluster centers that are the mean of the points assigned to them, such that every point is closer to its own cluster center than to any other.
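To make the idea concrete, here is a minimal NumPy sketch of the underlying loop (Lloyd's algorithm). The function name and the convergence check are illustrative only; this is not how scikit-learn implements K-Means internally.
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    #pick k distinct points as the initial cluster centers
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        #assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        #update step: each center moves to the mean of its assigned points
        #(empty clusters are not handled in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels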
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import seaborn
from pprint import pprint
np.random.seed(sum(map(ord, "aesthetics")))
seaborn.set_context('notebook')
#Generate some random samples (blobs): make_blobs returns random samples X, y with 3 clusters (the centers parameter)
#and a small cluster_std, which makes the clusters fairly easy to spot
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=3,
                  random_state=0, cluster_std=0.60)
print(X.shape, y.shape) #y takes values 0, 1, 2 for the 3 clusters, but we're going to ignore y for the time being
# Visualizing the blobs as a scatter plot
plt.scatter(X[:, 0], X[:, 1], s=30);
#The clusters are fairly easy to identify just by looking at it. But we'll now try to identify this using
#the K-Means algorithm.
from sklearn.cluster import KMeans
ktest = KMeans(n_clusters=3) #number of clusters to predict 3
ktest.fit(X) #fitting the model to X
y_pred = ktest.predict(X) #predicting labels (y) and saving to y_pred
#Plotting predicted labels as a scatter
plt.scatter(X[:,0], X[:,1], c=y_pred, cmap=plt.cm.Paired)
plt.show()
#That's pretty impressive, it identifies the clusters neatly
#Let's try changing around some parameters
#If we insist on finding 4 clusters instead of 3
ktest = KMeans(n_clusters=4) #number of clusters to predict 4
ktest.fit(X) #fitting the model to X
y_pred = ktest.predict(X) #predicting labels (y) and saving to y_pred
#Plotting predicted labels as a scatter
plt.scatter(X[:,0], X[:,1], c=y_pred, cmap=plt.cm.Paired)
plt.show()
#K-Means does its best to split the data into 4 groups even though there are only 3 natural clusters
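#As an aside, one rough way to sanity-check the choice of n_clusters is the "elbow" method: fit K-Means for a
#range of k and plot inertia_ (the within-cluster sum of squared distances), looking for where the curve flattens.
#This is just an illustrative sketch; the rest of the notebook keeps n_clusters=3.
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in range(1, 8)]
plt.plot(range(1, 8), inertias, 'o-')
plt.xlabel('n_clusters')
plt.ylabel('inertia')
plt.show()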
#Let's try changing around some parameters
#Setting max iterations to 100 instead of the default 300
ktest = KMeans(n_clusters=3, max_iter=100) #3 clusters, max iterations reduced to 100
ktest.fit(X) #fitting the model to X
y_pred = ktest.predict(X) #predicting labels (y) and saving to y_pred
#Plotting predicted labels as a scatter
plt.scatter(X[:,0], X[:,1], c=y_pred, cmap=plt.cm.Paired)
plt.show()
#Result doesn't look much different
#Let's try to make it harder to find the clusters using the cluster_std parameter
X, y = make_blobs(n_samples=300, centers=3,
                  random_state=0, cluster_std=0.85)
print(X.shape, y.shape) #y still takes values 0, 1, 2 for the 3 clusters; we continue to ignore it for now
# Visualizing the blobs as a scatter plot
plt.scatter(X[:, 0], X[:, 1], s=30)
plt.show()
#Now again running K-Means with just 100 iterations
ktest = KMeans(n_clusters=3, max_iter=100) #3 clusters, max iterations 100
ktest.fit(X) #fitting the model to X
y_pred = ktest.predict(X) #predicting labels (y) and saving to y_pred
#Plotting predicted labels as a scatter
plt.scatter(X[:,0], X[:,1], c=y_pred, cmap=plt.cm.Paired)
plt.show()
#It's actually done a pretty good job of separating the data
#Let's quickly compare cluster labels and original labels (y) using scatter plots
fig, ax=plt.subplots(1,2,figsize=(8,4))
ax[0].scatter(X[:,0], X[:,1],c=y, cmap=plt.cm.Paired)
ax[1].scatter(X[:,0], X[:,1],c=y_pred, cmap=plt.cm.Paired)
plt.show()
#Looks very close, K-Means has done a great job finding the clusters.
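#Because K-Means assigns arbitrary cluster numbers, the predicted labels need not match the values in y directly.
#A permutation-invariant metric such as adjusted_rand_score gives a quick quantitative check; the exact score
#will depend on the random state, so treat this as an illustrative sketch.
from sklearn.metrics import adjusted_rand_score
print(adjusted_rand_score(y, y_pred)) #1.0 would be a perfect match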
Example 1: Digits data available in scikit-learn
from sklearn.datasets import load_digits
digits = load_digits()
print(digits.keys())
#As the keys indicate, the digits dataset stores the images of digits under data and the actual digit under target
#There are 1797 images in total
print(digits.data.shape)
print(digits.target)
#each 8x8 pixel image is flattened into a 64-dimensional vector so the ML algorithms can work with it
print(digits.images.shape)
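#As a quick sanity check on that flattening (a small illustrative sketch), a row of digits.data reshaped back
#to 8x8 should match the corresponding entry in digits.images.
print(np.array_equal(digits.data[0].reshape(8, 8), digits.images[0])) #expected: True
plt.imshow(digits.data[0].reshape(8, 8), cmap=plt.cm.binary)
plt.show()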
#Let's ignore the targets for the time being and use K-Means to find clusters in these images. We know there
#should be a total of 10 clusters, one for each digit 0-9
X,y=digits.data, digits.target
k_digits = KMeans(n_clusters=10)
y_pred=k_digits.fit_predict(X)
#Let's check the parameter cluster centers of the estimator
print(k_digits.cluster_centers_.shape)
#The shape tells us these are the centroids, that is the center of each cluster - in essence, 8x8 images of each of the
#10 digits (at least we expect them to be). Let's find out by visualizing them with imshow.
fig = plt.figure(figsize=(8,3))
for i in range(10):
    ax = fig.add_subplot(2, 5, i+1, xticks=[], yticks=[])
    ax.imshow(np.reshape(k_digits.cluster_centers_[i], (8, 8)), cmap=plt.cm.binary)
#That's pretty great: K-Means found recognizable centroids for the 10 digits.
#The 8 doesn't look very clear, but everything else looks very good.
#Let's visualize predicted labels and original labels to see if they're close.
#Now, X is a 64 dimension dataset. If we need to visualize this in 2D, we'll need to use PCA to reduce
#the dimensionality. Refer to << >> for a demonstration of PCA.
from sklearn.decomposition import PCA
pca = PCA(n_components=2, svd_solver='randomized').fit(X)
X_proj = pca.transform(X)
fig, ax = plt.subplots(1, 2, figsize=(8,4))
ax[0].scatter(X_proj[:,0], X_proj[:,1], c=y_pred, cmap=plt.cm.Paired)
ax[0].set_title('Clusters reduced to 2D with PCA', fontsize=10)
ax[1].scatter(X_proj[:,0], X_proj[:,1], c=y, cmap=plt.cm.Paired)
ax[1].set_title('Original Dataset reduced to 2D with PCA', fontsize=10)
plt.show()
#This looks pretty reasonable considering that, unlike supervised learning, we never used the original labels during training:
#the clusters were identified automatically and they reasonably resemble the targets.
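#If we want a rough number to go with the picture, one common trick is to relabel each cluster with the most
#frequent true digit it contains and then compute accuracy. This is only an illustrative sketch (the exact score
#depends on the random initialization), not a substitute for a proper evaluation.
from sklearn.metrics import accuracy_score
mapped = np.zeros_like(y_pred)
for cluster in range(10):
    mask = (y_pred == cluster)
    mapped[mask] = np.bincount(y[mask]).argmax() #most common true digit in this cluster
print(accuracy_score(y, mapped))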
Example 2: Color compression with clustering
One interesting application of clustering is color image compression. For example, imagine you have an image with millions of colors. In most images, a large number of the possible colors go unused, and conversely a large number of pixels have similar or identical colors. Clustering can find N color clusters in the data, and we can use this to create a new image where each true input color is replaced by the color of the closest cluster center.
Scikit-learn has a number of images that we can play with, accessed through the datasets module. Let's give this a whirl.
from sklearn.datasets import load_sample_image
with seaborn.axes_style('dark'):
    img = load_sample_image("china.jpg")
    plt.imshow(img)
#china.jpg is the sample image most commonly used for this kind of illustration.
#We can envision this image as a cloud of 427*640 = 273,280 points in a 3-dimensional RGB color space (819,840 values in total)
print(img.shape)
#We'll rescale the colors so they lie between 0 and 1, then reshape the array to be a typical scikit-learn input:
#rescaling converts the 0-255 integer values to floats between 0 and 1, which is the conventional and numerically friendlier input
img_r = (img / 255.0).reshape(-1, 3)
print(img_r.shape)
#OK now the task is to reduce the huge number of colors in this dataset to a smaller number, then use that instead
#to render the image and compare to the original.
k_colors = KMeans(n_clusters=64).fit(img_r)
y_pred=k_colors.predict(img_r)
#Great, now let's look at the cluster centers. There should be 64 centroids, so cluster_centers_ should have shape
#(64, 3): 64 colors, each a point in RGB space
print(k_colors.cluster_centers_.shape) #these are the cluster centers, i.e. the centroid of each of the 64 color clusters
#labels_ holds, for each pixel in the original array, the index of the cluster that pixel was assigned to.
#It is a label, not a location: if we want to replace each pixel's color with its closest match among the 64
#selected colors, this label tells us which centroid to use.
#The array's length equals the number of pixels in the original image.
print(k_colors.labels_.shape)
#Here is a neat trick using fancy (array) indexing: indexing cluster_centers_ with labels_ produces an array
#that holds the centroid color for every pixel in the original image.
newimg=k_colors.cluster_centers_[k_colors.labels_]
print(newimg.shape) #great, we're pretty close; we just need to reshape this so it matches img
newimg = np.reshape(newimg, img.shape)
print(newimg.shape) #we used img's shape to reshape - almost there!
#Now let's plot the two images side by side and examine the differences.
fig = plt.figure(figsize=(10,10))
ax=fig.add_subplot(1,2,1,xticks=[],yticks=[],title='Original Image')
ax.imshow(img)
ax=fig.add_subplot(1,2,2,xticks=[],yticks=[],title='Color Compressed Image using K-Means')
ax.imshow(newimg)
plt.show()
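#As a rough, back-of-the-envelope illustration of why this counts as compression (ignoring real file-format
#encoding such as PNG or JPEG): the recoded image needs only a 6-bit cluster label per pixel plus a small
#64-color palette, versus 24 bits per pixel for the original.
n_pixels = img.shape[0] * img.shape[1]
original_bits = n_pixels * 3 * 8        #24 bits per pixel
palette_bits = 64 * 3 * 8               #the 64 centroid colors
label_bits = n_pixels * 6               #6 bits index one of 64 colors
compressed_bits = palette_bits + label_bits
print('original  : %d bits' % original_bits)
print('compressed: %d bits (about %.1fx smaller)' % (compressed_bits, original_bits / compressed_bits))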