Kmeans is one of the most popular and simple algorithm to discover underlying structures in your data. The goal of kmeans is simple, split your data in k different groups represented by their mean. The mean of each group is assumed to be a good summary of each observation of this cluster.

## Kmeans Algorithm

We assume that we want to split the data into k groups, so we need to find and assign k centers. How to define and find these centers?

They are the solution to the equation: ** ** where** ** if the observation** i** is assigned to the center **j** and** 0** otherwise.

Basically, this equation means that we are looking for the k centers which will minimize the distance between them and the points of their cluster. This is an optimization problem, but since the function, we want to minimize is not convex and some variables are binary, it cannot be solved in classic ways with gradient descent.

The usual way to solve it is the following:

- Initialize randomly the centers by selecting k-observations
- While some convergence criterion is not met
- Assign each observation to its closest center
- Update each center. The new centers are the mean of the observation of each group.
- Update the convergence criterion.

## K-means from scratch with R

Now that we have the algorithm in pseudocode, let’s implement kmeans from scratch in R. First,we’ll create some toys data based on five 2D gaussian distributions.

require(MASS) require(ggplot2) set.seed(1234) set1=mvrnorm(n = 300, c(-4,10), matrix(c(1.5,1,1,1.5),2)) set2=mvrnorm(n = 300, c(5,7), matrix(c(1,2,2,6),2)) set3=mvrnorm(n = 300, c(-1,1), matrix(c(4,0,0,4),2)) set4=mvrnorm(n = 300, c(10,-10), matrix(c(4,0,0,4),2)) set5=mvrnorm(n = 300, c(3,-3), matrix(c(4,0,0,4),2)) DF=data.frame(rbind(set1,set2,set3,set4,set5),cluster=as.factor(c(rep(1:5,each=300)))) ggplot(DF,aes(x=X1,y=X2,color=cluster))+geom_point()

On this dataset, Kmeans will work well since each distribution has a circular shape. Here are what the data look like:

Now that we have a dataset, let’s inplement kmeans.

### Initialisation of the centroids

The initialisation of the centroids is crucial and will change how the algorithm behave. Here, we will wimply takes K random points from the data.

#Centroids initialisation centroids=data[sample.int(nrow(data),K),] ##Stopping criteria initilisation. current_stop_crit=10e10 ##Vector where the assigned centers of each points will be saved cluster=rep(0,nrow(data)) ##Has the alogrithm converged ? converged=F it=1

### Assigning points to their clusters

At each iteration, every points will be assigned to its closest cluster. To do so, the euclidian distance between each points and each centers is computed, the lowest distance and the center for which it’s reached is saved.

###Iterating over observations for (i in 1:nrow(data)) { ##Setting a high minimum distance min_dist=10e10 ##Iterating over centroids for (centroid in 1:nrow(centroids)) { ##Computing the L2 distance distance_to_centroid=sum((centroids[centroid,]-data[i,])^2) ##This centroid is the closest centroid to the point if (distance_to_centroid<=min_dist) { ##The point is assigned to this centroid/cluster cluster[i]=centroid min_dist=distance_to_centroid } } }

### Centroids update

Once each point has been assigned to the closest centroids, the coordinates of each centroid are updated. The new coordinates are the means of the observations which belongs to the cluster.

##For each centroid for (i in 1:nrow(centroids)) { ##The new coordinates are the means of the point in the cluster centroids[i,]=apply(data[cluster==i,],2,mean) }

### Stopping criterion

We do not want the algorithm to run indefinetely, hence we need a stopping criterion to stop the algorthm when we are close enough to a minimum. The criterion is simply that when the centroids stop moving, the algorithm should stop.

while(current_stop_crit>=stop_crit & converged==F) { it=it+1 if (current_stop_crit<=stop_crit) { converged=T } old_centroids=centroids ###Run previous step ####Recompute stop criterion current_stop_crit=mean((old_centroids-centroids)^2)

### Complete function

kmeans=function(data,K=4,stop_crit=10e-5) { #Initialisation of clusters centroids=data[sample.int(nrow(data),K),] current_stop_crit=1000 cluster=rep(0,nrow(data)) converged=F it=1 while(current_stop_crit>=stop_crit & converged==F) { it=it+1 if (current_stop_crit<=stop_crit) { converged=T } old_centroids=centroids ##Assigning each point to a centroid for (i in 1:nrow(data)) { min_dist=10e10 for (centroid in 1:nrow(centroids)) { distance_to_centroid=sum((centroids[centroid,]-data[i,])^2) if (distance_to_centroid<=min_dist) { cluster[i]=centroid min_dist=distance_to_centroid } } } ##Assigning each point to a centroid for (i in 1:nrow(centroids)) { centroids[i,]=apply(data[cluster==i,],2,mean) } current_stop_crit=mean((old_centroids-centroids)^2) } return(list(data=data.frame(data,cluster),centroids=centroids)) }

You can easily run the code to see your clusters:

res=kmeans(DF[1:2],K=5) res$centroids$cluster=1:5 res$data$isCentroid=F res$centroids$isCentroid=T data_plot=rbind(res$centroids,res$data) ggplot(data_plot,aes(x=X1,y=X2,color=as.factor(cluster),size=isCentroid,alpha=isCentroid))+geom_point()

## Exploring kmeans results

Now let’s try the algorithm on two different datasets. First, on the 5 Gaussians distributions:

The centroids move and split the data in clusters which are very close to the original ones. Kmeas is doing a great job here.

Now, instead of having nice gaussian distributions, we will build three rings on into another.

##Building three sets set1=data.frame(r=runif(300,0.1,0.5),theta=runif(300,0,360),set='1') set2=data.frame(r=runif(300,1,1.5),theta=runif(300,0,360),set='2') set3=data.frame(r=runif(300,3,5),theta=runif(300,0,360),set='3') ##Transformation in rings data_2=rbind(set1,set2,set3) data_2$x=data_2$r*cos(2*3.14*data_2$theta) data_2$y=(data_2$r)*sin(2*3.14*data_2$theta)

The kmeans is performing very poorly on these new data. Actually, the euclidian distance is not adapted to this kind of problem, since the data are not in a circular shape.

So before using kmeans, you should ensure that the data is in appropriate shapes, if not, you can apply transformations or change the distance you are using in the kmeans.

## The importance of a good initialisation

The kmeans algorithm only looks for a local mimimum which is often not a global optimum. Hence, different initialisation can lead to very different results.

We ran the kmean algorithm over more than 60 different starting positions.As you can see, sometimes, the algorithm results in poor centroids due to an unlucky initialization. The solution to this is simply to run kmeans several times and to take the best centroids set. The quality of initialization can also be improved with kmeans ++, the algorithm selects starting points which are less likely to perform poorly.

Want to learn more on Machine Learning ? Here is a selection of Machine Learning Explained posts:

– Dimensionality reduction

– Supervised vs unsupervised vs reinforcement learning

– Regularization in machine learning

Similar iluatration here:

https://eranraviv.com/understanding-k-means-clustering/

With some background on distance metrics,

Hello Antoine Guillot:

I am trying to follow your tutorial.

http://enhancedatascience.com/2017/10/24/machine-learning-explained-kmeans/

I am obviously not understanding what is expected here since I am getting an error

centroids <- data[sample.int(nrow(data),K),]

Error in sample.int(nrow(data), K) : object 'K' not found

Have I taken your code to literally and there is some substitution of some kind necessary?

I have tried this:

centroids=DF[sample.int(nrow(DF),k),]

with no luck

Error in sample.int(nrow(DF), k) : object 'k' not found

Please advise

Thank you