R Basics: K-means with R

The K-means algorithm is one of the basic (yet effective) clustering algorithms. In this tutorial, we will take a quick look at what clustering is and how to run a k-means with R.

Clustering algorithm

The goal of clustering is to detect patterns in an unlabeled dataset. For instance, a clustering of clients should group together clients with similar characteristics and separate clients that are not alike: similar clients end up in the same group, while different clients land in different groups. In the end, each group tells us something about its average member.

Different types of clustering algorithms exist, such as k-means, k-medoids, hierarchical clustering, …

The k-means algorithm iteratively builds a given number of groups around their means: each point is assigned to its nearest center, each center is then recomputed as the mean of its points, and the two steps repeat. A rough sketch of this loop is shown below; you will find more detail on the algorithm in a future post.
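To make the idea concrete, here is a minimal sketch of those two alternating steps, written from scratch purely for illustration (the simple_kmeans name and its parameters are made up for this post, edge cases such as empty clusters are not handled, and the actual analysis below uses R’s built-in kmeans function):

# Illustrative sketch of the k-means loop, NOT the function used later;
# assumes every cluster keeps at least one point on each iteration
simple_kmeans <- function(X, k, n_iter = 10) {
  X <- as.matrix(X)
  # Start from k randomly chosen observations as the initial centers
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (iter in seq_len(n_iter)) {
    # Assignment step: send each point to its nearest center
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- max.col(-d)
    # Update step: move each center to the mean of its points
    centers <- apply(X, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster = cluster, centers = centers)
}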

The Iris dataset

We will use the Iris dataset, which contains data on flowers from three different species of Iris: setosa, versicolor and virginica. Each observation contains four variables: petal width, petal length, sepal width and sepal length. The dataset is labeled, since it records the species of each flower. Let’s see if the unsupervised k-means algorithm can detect the species on its own!
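A first check of the raw table confirms the setup: 150 observations, four measurements, and a perfectly balanced design with 50 flowers per species.

head(iris)            # first rows: four measurements plus the species label
table(iris$Species)   # 50 observations for each of the three species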

Now, let’s take a visual look at the data:

# data.table for data handling, GGally for the scatterplot matrix
# built on top of ggplot2
require(data.table)
require(ggplot2)
require(GGally)

data_iris <- data.table(iris)
# Scatterplot matrix of all variables, colored by species
ggpairs(data_iris, columns = 1:5, mapping = aes(colour = Species))

Iris scatterplot matrix

This dataset should be easy for k-means to deal with: the different groups are fairly distinct.

K-means on Iris

Now let’s run k-means. The kmeans function runs the algorithm on the data; since we know we want three groups, we set the number of centers to three:

# The initial centers are random, so fix the seed for reproducibility
set.seed(123)
# Cluster the four numeric columns into three groups
not_norm_cluster <- kmeans(data_iris[, 1:4, with = FALSE], centers = 3)
data_iris$cluster <- as.factor(not_norm_cluster$cluster)

We need to specify a seed for reproducibility, since the initial centers of the algorithm are chosen at random.
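Besides the cluster assignments, the object returned by kmeans contains other useful pieces, such as the coordinates of the fitted centers and the size of each cluster:

not_norm_cluster$centers  # one row of variable means per cluster
not_norm_cluster$size     # number of observations assigned to each cluster

And the scatterplot matrix again, colored by cluster this time: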

ggpairs(data_iris, columns = 1:5, mapping = aes(colour = cluster))

Cluster results

Each color represents a different cluster. In the lower right corner, you can see how each species is split between the clusters. All the setosa flowers are in the same cluster, almost all the versicolor share a cluster, while some virginica end up in the versicolor cluster. Overall, k-means did a great job of detecting the different species.
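A simple cross-tabulation of species against clusters makes this comparison explicit (note that the cluster numbering is arbitrary and can change between runs):

# Rows: true species; columns: assigned cluster
table(data_iris$Species, data_iris$cluster)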

Scaling

Here, the scales and variances of the four variables are roughly comparable. In some datasets, however, a few variables have huge variances while the others have small ones, and those high-variance variables end up dominating the distance computations. Scaling can therefore improve the clustering when all variables should carry the same importance. On this dataset, scaling does not improve the clustering.
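A quick way to check how comparable the spreads actually are is to compute the per-variable variances:

# Variance of each of the four measurements on the original scale
sapply(data_iris[, 1:4, with = FALSE], var)

With that in mind, here is the scaled version: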

# Standardise each variable (zero mean, unit variance) before clustering
norm_cluster <- kmeans(scale(data_iris[, 1:4, with = FALSE]), centers = 3)
data_iris$cluster <- as.factor(norm_cluster$cluster)
ggpairs(data_iris, columns = 1:5, mapping = aes(colour = cluster))

Choosing the right number of centers

In this example, we had to specify the number of clusters. That was easy, since we knew there were three underlying groups in the data. When you have no idea of the ‘real’ number of clusters, the best compromise between model complexity and model performance is often the right pick.

The performance metric used here is the between-group sum of squares divided by the total sum of squares. In a good model, most of the variance is explained by the clusters, so the between-group sum of squares is close to the total sum of squares. To explore the right number of groups, k-means was run for 1 to 100 centers, and each run was repeated 30 times to account for the randomness of the algorithm.
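As a sanity check on the metric: for any fitted kmeans object, the between-group sum of squares equals the total sum of squares minus the within-group sum of squares, so the two ways of writing the ratio agree.

# Both expressions give the share of variance explained by the clusters
with(not_norm_cluster, c(betweenss / totss, 1 - tot.withinss / totss))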

set.seed(456)
performance <- c()
# For each number of centers from 1 to 100, fit k-means 30 times and
# record the share of variance explained (between-SS / total-SS)
for (i in rep(1:100, times = 30)) {
  clust <- kmeans(data_iris[, 1:4, with = FALSE], centers = i)
  performance <- c(performance, 1 - clust$tot.withinss / clust$totss)
}
perf_df <- data.frame(metrics = performance,
                      number_of_center = rep(1:100, times = 30))
ggplot(perf_df, aes(x = number_of_center, y = metrics)) +
  geom_point(alpha = 0.2) +
  geom_vline(xintercept = 3, color = 'red')

And here is the plot:

Number of centers vs. performance metric

So, with 3 clusters, we already account for more than 75% of the total variance! Had we not known there were 3 groups, four clusters would also have been a viable choice. Beyond that, additional clusters do not reduce the error enough to justify the extra parameters.
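To read the elbow more precisely, the metric can be averaged over the 30 repetitions for each number of centers (an optional extra step):

# Mean share of explained variance for each number of centers
mean_perf <- aggregate(metrics ~ number_of_center, data = perf_df, FUN = mean)
head(mean_perf)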

K-means with R

Now you know how to perform k-means with R, from fitting to exploration and analysis. I hope you liked the tutorial!
