Nowadays most datasets have many variables and hence many dimensions. However, due to collinearity and non-linear relationships between variables, most datasets can be represented with fewer variables. PCA (principal component analysis) is an algebraic method to reduce the dimensionality of a dataset. The goal of this tutorial is to show how you can perform, analyze and visualize a PCA with R.

## What is PCA?

The main idea behind PCA is simple. The goal of PCA is to find orthogonal axes that explain most of the variance in the dataset. These axes define a new space in which the original dataset is represented.

In the plot generated by the code below, the x and y variables are strongly correlated (r² = 0.86). Hence, most of the variance in the dataset can be explained by projecting the points onto the regression line. This is the underlying idea of PCA: it seeks the linear combinations of the axes along which the points are most dispersed. These linear combinations form a new basis for the data space. The first components of this basis explain most of the variance; the last ones explain almost none.

Here is the code for the example and the plots. I won't comment on it much, since that is the goal of the later parts of this post:

```r
require(data.table)
require(ggplot2)

### Generating a random ellipse, direction pi/4
x_coord <- rnorm(1000, 10, 50)
y_coord <- rnorm(1000, 30, 10)
direction <- pi/4
DT <- data.table(x = x_coord*cos(direction) + y_coord*sin(direction),
                 y = -y_coord*sin(direction) + x_coord*cos(direction))

### Plotting the ellipse with ggplot2
ggplot(DT, aes(x, y)) + geom_point() +
  coord_fixed() + ggtitle('PCA example') + geom_smooth(method = 'lm')

## First PCA
PCA1 <- prcomp(DT, center = TRUE)

## Centering the data to plot it
DT$x <- DT$x - mean(DT$x, na.rm = TRUE)
DT$y <- DT$y - mean(DT$y, na.rm = TRUE)

## Data table containing the arrows with the principal directions
data_arrow <- data.table(x = c(0, 0), y = c(0, 0),
                         xend = c(PCA1$rotation[1, ]),
                         yend = c(PCA1$rotation[2, ]))

## Plot
ggplot(DT, aes(x, y)) + geom_point(alpha = 0.2) +
  coord_fixed() + ggtitle('PCA example') +
  geom_segment(data = data_arrow,
               aes(x = x, xend = xend*max(DT$x), y = y, yend = yend*max(DT$x)),
               arrow = arrow(length = unit(0.03, "npc")), size = 2, color = "blue")
```
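To see numerically what the arrows show visually, here is a minimal sketch (toy data, variable names are illustrative) that builds two correlated variables and checks how much of the total variance the first principal component captures:

```r
# Two strongly correlated variables
set.seed(42)
x <- rnorm(1000)
y <- 0.9 * x + rnorm(1000, sd = 0.3)

# PCA on the two columns
pca <- prcomp(cbind(x, y), center = TRUE)

# Proportion of variance explained by each component
pve <- pca$sdev^2 / sum(pca$sdev^2)

# The first component captures most of the variance,
# the second almost none
round(pve, 3)
```

With a correlation this strong, the first component alone explains well over 90% of the variance, which is exactly the "one line summarizes two variables" intuition above.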

## The dataset

For this tutorial, we will use the MNIST dataset from Kaggle. This dataset contains handwritten grayscale digits from 0 to 9. Each image is encoded as a 28*28 matrix with gray intensities from 0 to 255.

Let’s plot some digits:

```r
### Reading the dataset
require(data.table)
mnist_data <- fread('MNIST.csv')

## Building a 3*3 grid
par(mfrow = c(3, 3))
for (i in 1:9) {
  ## Converting the i-th row to a matrix
  mat <- matrix(as.numeric(mnist_data[i, 785:2, with = FALSE]),
                nrow = 28, ncol = 28, byrow = FALSE)
  ## Inverting the row order
  mat <- mat[nrow(mat):1, ]
  ## Plot
  image(mat, main = paste0('This is a ', mnist_data[i, 1, with = FALSE]))
}
```

The digits are easy for us to recognize, and some correlation between pixels is to be expected: some pixels are mainly used by a given digit. These pixels are correlated and do not bring much extra information to the dataset. PCA will help us keep only the combinations of pixels that carry information.

## Preprocessing data

In addition, some pixels are always black (they are not used by any observation). If we want to scale the observations of the dataset, no variable should be constant. Hence, we need to remove the constant pixels.

```r
which(mnist_data[, sapply(.SD, FUN = function(x){min(x) == max(x)}), .SDcols = 2:ncol(mnist_data)])
```

The code above prints the constant pixels. As one would expect, they are located on the borders. Hence, we will remove the borders.

```r
mnist_data <- mnist_data[, -(which((1:784)%%28 <= 2 | (1:784)%%28 >= 26 |
                                   (1:784)%/%28 <= 2 | (1:784)%/%28 >= 26) + 1),
                         with = FALSE]
```

The code above removes the pixels located within a few cells of the border, leaving 23*23 images. This removes all the constant pixels without cropping the digits too much.
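The index arithmetic in the mask is easy to get wrong, so here is a quick sanity check (no data needed) that the mask above keeps exactly 23*23 = 529 pixels:

```r
# Same border mask as in the filtering step above,
# applied to the 784 pixel indices
idx <- 1:784
border <- (idx %% 28 <= 2 | idx %% 28 >= 26 |
           idx %/% 28 <= 2 | idx %/% 28 >= 26)

# Number of pixels kept after removing the border
sum(!border)  # 529, i.e. a 23*23 image
```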

## PCA with R

Now that the data are ready, we can run the PCA using the prcomp function.

```r
PCA1 <- prcomp(mnist_data[, 2:ncol(mnist_data), with = FALSE], center = TRUE, scale. = FALSE)
```

The computation can take a few minutes. Now, let's plot the cumulative proportion of variance explained by the principal components:

```r
par(mfrow = c(1, 1), mar = c(2.1, 2.1, 2.1, 2.1))
## The variance explained by a component is sdev^2, not sdev
plot(cumsum(PCA1$sdev^2) / sum(PCA1$sdev^2) * 100,
     main = 'Cumulative proportion of variance explained')
```

A relatively small number of principal components explains most of the variance. We can also see that the marginal variance explained by each additional component quickly decreases toward zero.
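A common follow-up is to pick the number of components programmatically from a variance threshold. Here is a self-contained sketch on toy data (the 80% threshold is an arbitrary choice for illustration):

```r
# Toy data: 10 columns with decreasing scales, so the variance
# is concentrated in the first few principal components
set.seed(1)
toy <- matrix(rnorm(200 * 10), ncol = 10) %*% diag(10:1)

p <- prcomp(toy, center = TRUE)
pve <- p$sdev^2 / sum(p$sdev^2)

# Smallest number of components reaching 80% of the variance
n80 <- which(cumsum(pve) >= 0.8)[1]
n80
```

On the MNIST PCA, the same `which(cumsum(...) >= threshold)[1]` idiom applied to `PCA1$sdev^2` gives the number of components to keep.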

## Visualizing the PCA

The goal is to visualize the PCA and how the projection into the new space performs. First, we need to project our data into the space created by the PCA.

```r
projected <- scale(mnist_data[, 2:ncol(mnist_data), with = FALSE],
                   PCA1$center, PCA1$scale) %*% PCA1$rotation
```

This is the R syntax to project the data onto the orthogonal basis created by the PCA.

Now, each observation is described as a linear combination of the principal components. You could use this representation of the data to fit a classification model: by keeping only the first few hundred principal components, you would still retain most of the variance while reducing overfitting.

Here we want to plot our data. To do so, we need to go back to the original pixel space by projecting a few PCs back into it.

Code:

```r
### Keeping only the three main dimensions
n_dim <- 3

## Projecting the data back using only the first 3 principal components
coord_x <- data.table(mnist_data$label,
                      projected[, 1:n_dim] %*% t(PCA1$rotation)[1:n_dim, ])

par(mfrow = c(6, 6), mar = c(0.1, 0.1, 0.1, 0.1))
## Plotting 36 observations
for (i in 1:36) {
  mat <- matrix(as.numeric(coord_x[i, 530:2, with = FALSE]),
                nrow = 23, ncol = 23, byrow = FALSE)
  mat <- mat[nrow(mat):1, ]
  image(mat, main = paste0('This is a ', coord_x[i, 1, with = FALSE]))
}
```

Representation of the first 36 observations using 3 PCs

The more principal components we keep, the more the digits look like their original selves. With only 3 PCs, too much information has been lost and the digits look like a hybrid between a 0 and a 9. With 10 PCs, it is almost possible to recognize all the digits, and 50 PCs seem to be enough to reach the same accuracy we would have with all the dimensions.
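This visual observation can be checked numerically: the reconstruction error decreases monotonically as more components are kept, and vanishes once all of them are used. A self-contained sketch on toy data (the `recon` helper is illustrative, not part of the tutorial code):

```r
set.seed(3)
m <- matrix(rnorm(100 * 6), ncol = 6)
p <- prcomp(m, center = TRUE)

# Reconstruct the data from the first k principal components,
# then undo the centering
recon <- function(k) {
  back <- p$x[, 1:k, drop = FALSE] %*% t(p$rotation[, 1:k, drop = FALSE])
  sweep(back, 2, p$center, "+")
}

# Mean squared reconstruction error for k = 1..6 components
err <- sapply(1:6, function(k) mean((m - recon(k))^2))

all(diff(err) < 0)  # each extra component lowers the error
```

With `k` equal to the full dimensionality, the reconstruction is exact up to floating-point precision, which is why the 50-PC digits above are indistinguishable from the originals.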

That’s it, you know how to perform PCA with R!
