Python Basics: K-means with Python


The K-means algorithm is one of the most basic (yet effective) clustering algorithms. In this tutorial, we will take a quick look at what clustering is and how to run a k-means with Python.

Clustering algorithm

The goal of clustering is to detect patterns in an unlabeled dataset. For instance, clustering should detect similar clients, i.e. clients with similar characteristics, and should also tell apart clients that are not alike at all. Similar clients will be put in the same group, whereas different clients should end up in different groups. In the end, each group gives us an average representation of its members.

Different types of clustering algorithms exist, such as k-means, k-medoids, hierarchical clustering, …

The k-means algorithm iteratively builds a given number of groups around their means. You will find more detail on the algorithm in a future post. You can already learn more about it here.
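To give a rough idea of what "building groups around their means" looks like, here is a minimal sketch of the two alternating steps (assignment and mean update) written with plain NumPy. This is only an illustration of the idea, not the implementation used by scikit-learn; the function name and defaults are mine.

##Toy k-means sketch (illustration only)
import numpy as np

def simple_kmeans(X, k, n_iter=10, seed=0):
    """Toy k-means: assign points to the closest center, then move each center to the mean of its points."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen observations as initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the closest center for every observation
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each center becomes the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels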

The Dataset

For this tutorial, we are going to use the digits dataset from scikit-learn. This dataset contains the greyscale intensity of handwritten digits from 0 to 9. Each of the 1,797 observations is labeled. Each digit is represented by an 8 by 8 matrix, with the grey intensity encoded from 0 to 16.

We are using a labeled dataset since it makes it easy to show how k-means works and performs. Intuitively, all the 0s should end up in the same cluster, all the 3s should end up in the same cluster, and so on.

First, let’s load the data and plot some digits. We will only load three classes (digits 0 to 2) to make the clustering easier.

###Loading data for three classes (digits 0 to 2)
import numpy as np
from sklearn.datasets import load_digits
X, Y = load_digits(n_class=3, return_X_y=True)
X = np.array(X)
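Before going further, a quick sanity check on what we just loaded can be helpful. This small snippet is an addition of mine and only inspects shapes and label counts:

##Quick sanity check on the loaded data
print(X.shape)                            # (number of images, 64 pixels each)
print(np.unique(Y, return_counts=True))   # the three digits and how often each appears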

Now we can plot the first four observations:

##Plotting the first four observations
import matplotlib.pyplot as plt
for i in range(4):
    print('This is a', Y[i])
    # Reshape the 64-pixel vector back into an 8x8 image
    plt.pcolor(np.reshape(X[i, :], (8, 8)))
    plt.gray()
    plt.show()


K-means with Python

Now we can run k-means on our dataset. scikit-learn provides a KMeans class:

##Init
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=123)
##Fit
kmeans.fit(X)

The n_clusters argument stands for the number of clusters we want. Here the choice is easy since we know there are three classes. The random_state is used to initialise the k-means at a given state, for reproducibility: the initial centers, and hence the final ones, change with the random seed.
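Once the model is fitted, a few attributes are worth a look. This short check is an addition of mine, not part of the original code:

##Inspect the fitted model
print(kmeans.cluster_centers_.shape)  # (3, 64): one 64-pixel center per cluster
print(kmeans.labels_[:10])            # cluster index assigned to the first observations
print(kmeans.inertia_)                # within-cluster sum of squared distances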

Visualising results

First, let’s visualize the three centers computed by k-means.

for i in range(3):
    # Each center is a 64-pixel vector; reshape it back into an 8x8 image
    plt.pcolor(np.reshape(kmeans.cluster_centers_[i, :], (8, 8)))
    plt.gray()
    plt.show()

The first picture on the left represents the center for the 1s, the picture in the middle the center for the 2s, and the picture on the right the center for the 0s. It seems that k-means has been able to efficiently detect an averaged representation of each class!
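Note that the cluster indices (0, 1, 2) have no reason to match the digit labels, as the paragraph above shows. If you want an explicit mapping, a simple majority vote per cluster does the trick. This is an extra snippet, not in the original post:

##Map each cluster index to the most common true digit inside it (majority vote)
for cluster in range(3):
    members = Y[kmeans.labels_ == cluster]
    values, counts = np.unique(members, return_counts=True)
    print('cluster', cluster, '-> digit', values[counts.argmax()])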

K-means performance

Now, let’s see how well k-means has discriminated between the different classes. To do so, we will plot the number of observations of each class in each cluster using ggplot.

import pandas as pd
# Cluster assigned to each observation by k-means
cluster_predicted = kmeans.labels_
DF2 = pd.DataFrame({'real': Y, 'pred': cluster_predicted})
from ggplot import *
gg = ggplot(DF2, aes(x='factor(real)', fill='factor(pred)')) + geom_bar()
print(gg)
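If you prefer a purely numeric check, or if the ggplot package is not available in your environment, the same information can be read off a pandas crosstab, and scikit-learn offers clustering scores such as the adjusted Rand index. This is an additional sketch of mine, not part of the original post:

##Numeric cross-check of the clustering quality
from sklearn.metrics import adjusted_rand_score, homogeneity_score
# Contingency table: rows are the true digits, columns the predicted clusters
print(pd.crosstab(DF2['real'], DF2['pred']))
# Label-agnostic scores (1.0 means a perfect match with the true classes)
print('Adjusted Rand index:', adjusted_rand_score(Y, cluster_predicted))
print('Homogeneity:', homogeneity_score(Y, cluster_predicted))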

Model performance

As the plot shows, most of the observations from the same class end up in the same cluster. K-means even detects perfectly that all the zeros belong to the same cluster! The algorithm detected underlying structure in the data without supervision, and these structures were the correct ones.

That’s it, you now know how to implement k-means with Python!
