R Basics: PCA with R


Nowadays most datasets have many variables and hence many dimensions. However, due to collinearity and non-linear relationships between the variables, most datasets can be represented with far fewer variables. PCA (principal component analysis) is an algebraic method to reduce the dimensionality of a dataset. The goal of this tutorial is to show how you can perform, analyze and visualize PCA with R.

What is PCA?

The main idea behind PCA is simple. The goal of PCA is to find orthogonal axes which explain most of the variance in the dataset. These axes define a new space in which the original dataset will be represented.

PCA example

In the plot above, the x and y variables are strongly correlated (r²=0.86). Hence, most of the variance in the dataset can be explained by projecting the points onto the line. This is the underlying idea of PCA: it seeks the linear combinations of the original axes along which the points show the most dispersion. These linear combinations form a new basis for the data space. The first components of this basis explain most of the variance, while the last ones explain almost none.

PCA done

Here is the code for both example plots. I won't comment on it much since this is the goal of the later parts of the post:

require(data.table)
### Generating a random ellipse, main axis along direction pi/4
x_coord=rnorm(1000,10,50)
y_coord=rnorm(1000,30,10)
direction=pi/4
## Rotating the point cloud by the chosen angle
DT=data.table(x=x_coord*cos(direction)-y_coord*sin(direction),y=x_coord*sin(direction)+y_coord*cos(direction))
### Plotting the ellipse with a linear regression line
require(ggplot2)
ggplot(DT,aes(x,y))+geom_point()+ coord_fixed()+ggtitle('PCA example')+geom_smooth(method='lm')

##First PCA
PCA1=prcomp(DT,center = T)
##Centering the dataframe to plot it
DT$x=DT$x-mean(DT$x,na.rm=T)
DT$y=DT$y-mean(DT$y,na.rm=T)
##Dataframe containing the arrow with principal directions
data_arrow=data.table(x= c(0,0),y= c(0,0),xend=c(PCA1$rotation[1,]),yend=c(PCA1$rotation[2,]))
##Plot
ggplot(DT,aes(x,y))+geom_point(alpha=0.2)+ coord_fixed()+ggtitle('PCA example')+
 geom_segment(data=data_arrow,aes(x = x,xend=xend*max(DT$x),y=y,yend=yend*max(DT$x)),
 arrow = arrow(length = unit(0.03, "npc")),size=2,color="blue")
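
As a side note (this check is not in the original code of the post), the principal directions found by prcomp are, up to sign, the eigenvectors of the covariance matrix of the data; you can verify it on this toy example:

## The eigenvectors of the covariance matrix should match PCA1$rotation
## (possibly with flipped signs), since PCA diagonalizes the covariance matrix
eigen(cov(DT))$vectors
PCA1$rotation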

The dataset

For this tutorial, we will use the MNIST dataset from Kaggle. This dataset contains handwritten grayscale digits from 0 to 9. Each image is encoded as a 28*28 matrix with gray intensities from 0 to 255.

Let’s plot some digits:

MNIST digits

### Reading the dataset
require(data.table)
mnist_data=fread('MNIST.csv')

## Building a 3*3 grid of plots
par(mfrow=c(3,3))
for (i in 1:9)
{
 ## Changing the i-th row to a 28*28 matrix
 mat=matrix(as.numeric(mnist_data[i,785:2,with=F]),nrow = 28,ncol=28,byrow = F)
 ## Inverting the row order for plotting with image()
 mat=mat[nrow(mat):1,]
 ## Plot
 image(mat,main=paste0('This is a ',mnist_data[i,1,with=F]))
}

The digits are easy for us to recognize. Some correlation between pixels can be expected: some pixels are mainly used by a given digit. These pixels are correlated and do not bring much extra information to the dataset. PCA will help us keep only the combinations of pixels that carry information.
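
To give a concrete example (this check is not in the original post and assumes the Kaggle column naming pixel0, ..., pixel783), you can look at the correlation between two horizontally adjacent pixels near the centre of the image:

## Correlation between two neighbouring pixels near the centre of the image
## (the column names pixel405 and pixel406 assume the Kaggle header)
cor(mnist_data$pixel405,mnist_data$pixel406)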

Preprocessing data

In addition to this, some pixels are always black (they are not used by any observation). If we want to scale the variables of the dataset, none of them should be constant. Hence, we need to remove the constant pixels.

### Indices of the pixels that never change (min equals max)
which(mnist_data[,sapply(.SD,FUN = function(x){min(x)==max(x)}),.SDcols=2:ncol(mnist_data)])

The code above prints the constant pixels. As one would expect, they are located on the borders of the image. Hence, we will remove the borders.

## Removing the pixels located near the borders of the 28*28 grid
mnist_data=mnist_data[,-(which((1:784)%%28<=2|(1:784)%%28>=26|1:784%/%28<=2|1:784%/%28>=26)+1),
 with=F]

The code above removes the pixels located within a few cells of the image border. This removes all the constant pixels without cropping the digits too much.
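
As a quick sanity check (not in the original post), you can rerun the constant-pixel test on the cropped dataset; it should not return any column any more:

## Should return an empty result if the cropping removed all the constant pixels
which(mnist_data[,sapply(.SD,FUN = function(x){min(x)==max(x)}),.SDcols=2:ncol(mnist_data)])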

PCA with R

Now that the data is ready for PCA, we can run one using the prcomp function.

PCA1=prcomp(mnist_data[,(2:ncol(mnist_data)),with=F],center = T,scale. = F)

The computation can take a few minutes. Now, let's compute the proportion of variance explained by each of the principal components:

par(mfrow=c(1,1),mar=c(2.1,2.1,2.1,2.1))
## The variance of each component is the square of its standard deviation
plot(cumsum(PCA1$sdev^2)/sum(PCA1$sdev^2)*100,main='Cumulative proportion of variance explained')

Cumulative proportion of variance explained

With only around 100 principal components out of more than 500, most of the variance is already explained! We can also see that the marginal gain of adding one more principal component quickly decreases towards zero.
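
If you want the exact number of components needed to reach a given share of the variance, you can compute it directly (the 90% threshold below is an arbitrary choice, not a figure from the original post):

## Number of principal components needed to explain at least 90% of the variance
var_explained=cumsum(PCA1$sdev^2)/sum(PCA1$sdev^2)
which(var_explained>=0.9)[1]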

Visualizing the PCA

The goal is to visualize the PCA and see how well the projection onto the new space performs. First, we need to project our data onto the space created by the PCA.

projected=scale(mnist_data[,(2:ncol(mnist_data)),with=F], PCA1$center, PCA1$scale) %*% PCA1$rotation

This is the R syntax to project the data onto the orthogonal basis created by the PCA; you can learn more about it here.
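
Note that prcomp already returns these projected coordinates in PCA1$x (when retx=TRUE, which is the default), so the manual projection should match them up to numerical precision:

## The manual projection should be identical to the scores stored by prcomp
all.equal(unname(projected),unname(PCA1$x))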

Now, each observation is described as a linear combination of the principal components. You could use this representation of the data to fit a classification model. By keeping only a few hundred principal components, you would retain most of the variance while working with far fewer features, which helps reduce overfitting.
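
As an illustration (this sketch is not part of the original post; it assumes the MASS package is available and uses the label column of the dataset), a simple classifier on the first 200 principal components could look like this:

require(MASS)
## Keeping the first 200 principal components as features
n_pc=200
train_dt=data.table(label=as.factor(mnist_data$label),projected[,1:n_pc])
## Linear discriminant analysis on the reduced representation
lda_fit=lda(label~.,data=train_dt)
## In-sample accuracy (a proper evaluation would use a held-out set)
mean(predict(lda_fit)$class==train_dt$label)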

Here we want to plot our data. To do so, we need to go back to the original pixel space by projecting back a few PC to this space.

Code:

### Keeping only the three main dimensions
n_dim=3
## Projecting the data back to pixel space using only the first 3 principal components
## (the center is not added back, so the reconstructed pixels are centered around 0)
coord_x=data.table(mnist_data$label,projected[,1:n_dim]%*%t(PCA1$rotation)[1:n_dim,])
## Building a 6*6 grid of plots
par(mfrow=c(6,6),mar=c(0.1,0.1,0.1,0.1))
## Plotting the first 36 observations
for (i in 1:36)
{
 mat=matrix(as.numeric(coord_x[i,530:2,with=F]),nrow = 23,ncol=23,byrow = F)
 mat=mat[nrow(mat):1,]
 image(mat,main=paste0('This is a ',coord_x[i,1,with=F]))
}

 

Representation of the first 36 observations using 3 PCs

 

Representation of the first 36 observations using 10 PCs

Representation of the first 36 observations using 50 PCs

The more principal components we keep, the more the digits look like their original selves. With only 3 PCs, too much information has been lost and the digits all look like a hybrid between a 0 and a 9. Using only 10 PCs, it is almost possible to recognize all the digits, and 50 PCs seem to be enough to get the same visual quality we would have with all the dimensions.
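
If you prefer a quantitative counterpart to this visual impression (this measure is not part of the original post), you can compute the mean squared reconstruction error for a few choices of the number of components:

## Mean squared reconstruction error (in centered pixel space) for several numbers of PCs
centered=scale(mnist_data[,2:ncol(mnist_data),with=F],PCA1$center,PCA1$scale)
for (n_dim in c(3,10,50))
{
 reconstructed=projected[,1:n_dim]%*%t(PCA1$rotation)[1:n_dim,]
 cat(n_dim,'PCs - MSE:',mean((centered-reconstructed)^2),'\n')
}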

That’s it, you know how to perform PCA with R!
