Dealing with a lot of dimensions can be painful for machine learning algorithms. High dimensionality will increase the computational complexity, increase the risk of overfitting (as your algorithm has more degrees of freedom) and the sparsity of the data will grow. Hence, dimensionality reduction will project the data in a space with less dimension to limit these phenomena.
In this post, we will first try to get an intuition of what is dimensionality reduction, then we will focus on the most widely used techniques.
Spaces and variables with many dimensions
Let’s say we own a shop and want to collect some data on our clients. We can collect their ages, how frequently they come to our shop, how much they spend on average and when was the last time they came to our shop. Hence each of our clients can be represented by four variables (ages, frequency, spendings and last purchase date) and can be seen as a point in a four dimension space. Later on, variables and dimensions will have the same meaning.
Now, let’s add some complexity and some variables by using images. How many dimensions are used to represent the image below?
There are 8 by 8 pixels, hence 64 pixels are needed and each of these pixels is represented by a quantitative variable. 64 variables are needed!
Even bigger, how many variables do you need to represent the picture below?
Its resolution is 640 by 426 and you need three color channels (Red, Green, and Blue). So this picture requires no more than … 640*426*3= 817,920 variables. That’s how you go from four to almost one million of dimensions (and you can go well above!)
Dimensionality reduction in everyday life
Before seeing any algorithm, everyday life provides us a great example of dimensionality reduction.
Each of these people can be represented as points in a 3 Dimensional space. With a gross approximation, each people is in a 50*50*200 (cm) cube. If we use a resolution of 1cm and three color channels, then can be represented by 1,000,000 variables.
On the other hand, the shadow is only in 2 dimensions and in black and white, so each shadow only needs 50*200=10,000 variables.
The number of variables was divided by 100 ! And if your goal is to detect human vs cat, or even men vs women, the data from the shadow may be enough.
Why should you reduce the number of dimensions?
Dimensionality reduction has several advantages from a machine learning point of view.
- Since your model has fewer degrees of freedom, the likelihood of overfitting is lower. The model will generalize more easily on new data.
- If you use feature selection or linear methods (such as PCA), the reduction will promote the most important variables which will improve the interpretability of your model.
- Most of features extraction techniques are unsupervised. You can train your autoencoder or fit your PCA on unlabeled data. This can be really helpful is you have a lot of unlabeled data (for instance text or images) and labeling is time-consuming and expensive.
Features selection as a basic reduction
The most obvious way to reduce dimensionality is to remove some dimensions and to select the more suitable variables for the problem.
Here are some ways to select variables:
- Greedy algorithms which add and remove variables until some criterion is met. For instance, the stepwise regression with forward selection will add at each step the variable which improve the fit in the most significant way
- Shrinking and penalization methods, which will add cost for having too many variables. For instance, L1 regularization will cut some variables’ coefficient to zero. Regularization limits the space where the coefficients can live in.
- Selection of models based on criteria taking the number of dimensions into accounts such as the adjusted R², AIC or BIC. Contrary to regularization, the model is not trained to optimize these criteria. The criteria are computed after fitting to compare several models.
- Filtering of variables using correlation, VIF or some “distance measure” between the features.
Features Engineering and extraction to reduce dimensionality
While features selection is efficient, it is brutal since variables are either kept or removed. However in some interval or for some modalities, removed variable may be useful while kept variable may be redundant. Features extraction (or engineering) seeks to keep only the intervals or modalities of the features which contain information.
PCA and Kernel PCA
Principal component analysis (or PCA), is a linear transformation of the data which looks for the axis where the data has the most variance. PCA will create new variables which are linear combinations of the original ones, these new variables will be orthogonal (i.e. correlation equals to zero). PCA can be seen as a rotation of the initial space to find more suitable axis to express the variability in the data.
On the new variables have been created, you can select the most important ones. The threshold is up-to-you and depends on how much variance you want to keep. You can check the tutorial below to see a working R example:
Since PCA is linear, it mostly works on linearly separable data. Hence, if you want to perform classification on other data (Like the donut below), linear PCA will probably make you lose a lot of information.
Example of Kernel PCA from Scikit-Learn
On the other hand, kernel PCA can work on nonlinearly separable data. The trick is simple:
- before applying PCA, a function is applied project the initial space in a bigger space where the data are separable. The function can be a polynomial, radial or some activation function.
- Then a PCA is applied in this new space. (NB: It is not exactly PCA but close to it).
Hence you can easily separate the two rings of the donut, which will improve the performance of classifiers. Kernel PCA raises two problems, you need to find the right kernel and the new variables are hard to understand for humans.
LDA and Kernel LDA
Linear discriminant analysis is similar to PCA but is supervised (while PCA does not require labels). The goal of the LDA is not to maximize variance but to create linear surfaces to separate the different groups. The new features will be axes that separate the data when the data are projected on them.
As with PCA, you can also apply the kernel trick to apply LDA on non linearly separable data.
ICA and Kernel ICA
Independant component analysis aims at creating independent variables from the original variables.
The typical example Cocktail party problem: you are in a room with lots of different people having different conversations and your goal is to separate the different conversations. If you have a lot of microphones in the room, each and every of them will get a linear combination of all the conversation (some kind of noise). The goal of ICA is to disentangle these conversations from the noise.
This can be seen as dimensionality reduction if you have 200 microphones but only ten independent conversations, you should be able to represent them with ten independant variables from the ICA.
Autoencoder is a powerful method to reduce the dimensionality of data. It is composed of a neural network (it can be feed-forward, convolutional or recurrent, most of the architecture can be adapted into an autoencoder) which will try to learn its input. For instance, an autoencoder trained on images will try to reconstruct these images.
In addition to this, the autoencoder has a bottleneck, the number of neurons in the autoencoder’s hidden layers will be smaller than the number of input variables. Hence, the autoencoder has to learn a compressed representation of the data where the number of dimensions is the number of neuron in the smallest hidden layer.
The main advantages of autoencoder are their non-linearity and how flexible they can be.
T-SNE or t-distributed stochastic neighbor embedding is mainly used to fit data from large dimensions in 2D or 3D space. That is the technique we used to fit and visualize the twitter data in our analysis of French election. The main idea behind T-SNE is that nearby point in the original space should be close in the low dimensional space while distant point should also be distant in the smaller space.
T-sne is highly non-linear, is originally non-parametric, dependant on the random seed and does not keep distance alike. Hence, I mainly use it to plot high dimension and to visualise cluster and similarity.
NB: There is also a parametric variant which seems less widely used than the original T-SNE.
Here ends our presentation of the most widely used dimensionality reduction techniques.