With big data and ever-larger data sets, the number of variables, and hence the number of dimensions, often far exceeds the three dimensions we can easily visualize. So how can we visualize observations that have more than 40 variables? How should we handle variables of different natures in a visualization? Is there an easy way to plot more than two dimensions?

Since visualizations are drawn on a flat surface, the most convenient number of variables to display is two. In this post, we will see several ways to represent data with more than two dimensions. Most of the visualizations use the World Happiness Report 2016 data from Kaggle.

The commented R code used to produce the plots is available here.

**Representing continuous dimensions**

A continuous color scale can encode one more continuous variable on top of the two axes. The size of the points or the width of the lines can do the same.
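As a rough sketch of these encodings, here is how they could look with ggplot2 (assuming the package is installed). The built-in mtcars data set is only a stand-in for the happiness data, and the choice of columns is purely illustrative.

```r
library(ggplot2)

# mtcars stands in for the happiness data (an assumption for illustration).
# x and y carry two variables; colour and size carry two more.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = hp, size = qsec)) +
  geom_point(alpha = 0.8) +
  scale_colour_viridis_c() +  # continuous colour scale
  labs(x = "Weight", y = "Miles per gallon",
       colour = "Horsepower", size = "Quarter-mile time")

print(p)
```

Four variables end up on one flat scatter plot, although colour and size are read far less precisely than position.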

Scatter plot matrix: instead of plotting only one variable against another, the idea is to plot every variable against every other one in a grid of scatter plots to detect patterns. Each off-diagonal cell shows the scatter plot of one variable versus another, and each cell on the diagonal shows the density of a single variable.
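A minimal sketch of such a matrix in base R, using the built-in iris data set as a stand-in (an assumption, not the data used in this post). The helper name panel_density is made up for this example.

```r
# Draw a density curve in each diagonal cell.
panel_density <- function(x, ...) {
  d <- density(x)
  usr <- par("usr")
  # Keep the cell's x range, rescale y so the curve fits in the cell.
  par(usr = c(usr[1:2], 0, 1.1 * max(d$y)))
  lines(d$x, d$y)
}

# iris is a stand-in data set; its four numeric columns give a 4 x 4 matrix.
pairs(iris[, 1:4],
      diag.panel = panel_density,
      col = as.integer(iris$Species),
      main = "Scatter plot matrix")
```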

Parallel coordinates plot: each variable is represented by an axis, and all these axes are parallel. The top of an axis represents the maximum value of its variable, and the bottom represents the minimum. An observation is represented by a line connecting all the axes. The value of a variable for this observation is given by the intersection of the line with the axis representing that variable.
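The per-axis rescaling described above can be sketched as follows. This uses parcoord() from the MASS package and the built-in iris data set as a stand-in (both are choices made for this example); rescale01 is a hypothetical helper that only makes the min-to-max mapping explicit.

```r
library(MASS)  # ships with standard R distributions

# iris is a stand-in data set (an assumption for illustration).
vars <- iris[, 1:4]

# The rescaling behind the plot: each variable is mapped to [0, 1],
# so 0 is the bottom of its axis (minimum) and 1 the top (maximum).
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(vars, rescale01))

# parcoord() applies the same per-variable scaling internally;
# each line is one observation, coloured by species.
parcoord(vars, col = as.integer(iris$Species), var.label = TRUE)
```

Because every axis is rescaled independently, the plot compares the shapes of the observations, not the raw magnitudes of the variables.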

Correlation plot: this plot represents the correlations between variables as a matrix. It is a very easy way to spot and group correlated variables before going further. It is worth keeping in mind that correlation only measures linear association, so it is a poor indicator of non-linear relationships.
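Here is a small base R sketch of a correlation plot, followed by a toy illustration of the non-linear caveat. The mtcars data set and the colour palette are stand-ins chosen for this example.

```r
# mtcars is a stand-in data set (an assumption for illustration).
corr <- cor(mtcars)  # Pearson correlation between every pair of columns

# Draw the matrix as a heat map: blue = -1, white = 0, red = +1.
n <- ncol(corr)
pal <- colorRampPalette(c("blue", "white", "red"))(21)
image(1:n, 1:n, corr, zlim = c(-1, 1), col = pal,
      axes = FALSE, xlab = "", ylab = "", main = "Correlation matrix")
axis(1, at = 1:n, labels = colnames(corr), las = 2)
axis(2, at = 1:n, labels = rownames(corr), las = 2)

# Caveat in action: y depends entirely on x, yet the linear correlation
# is essentially zero because the relationship is not linear.
x <- seq(-1, 1, length.out = 101)
y <- x^2
nonlinear_cor <- cor(x, y)
```

The value of nonlinear_cor is zero up to rounding error, even though y is a deterministic function of x.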

If you are studying time series, synchronized line graphs are also useful to see the evolution of the different variables against time.
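For example, base R already draws synchronized line graphs when given a multivariate time series. The built-in EuStockMarkets data set is used here as a stand-in (an assumption for illustration).

```r
# EuStockMarkets is a built-in multivariate time series: four stock
# indices observed over the same period.
# Plotting it draws one panel per series, stacked over a shared time axis.
plot(EuStockMarkets, main = "Synchronized line graphs")
```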

**Adding discrete or categorical dimensions**

As with continuous variables, a discrete color scale adds one more dimension. Symbols and letters, each representing a class, are another way to account for categorical variables.

Facets, implemented in ggplot2 as facet_wrap, divide the plot into several panels. Each panel shows only the observations that belong to one category.
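The following sketch combines a discrete colour scale, point symbols and facets in ggplot2 (assuming the package is installed). The iris data set is a stand-in, so the column names are not those of the happiness data.

```r
library(ggplot2)

# iris is a stand-in data set (an assumption for illustration).
# Species is shown twice on the base plot: as colour and as point shape.
base_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width,
                              colour = Species, shape = Species)) +
  geom_point()

# facet_wrap() splits the plot into one panel per species.
faceted_plot <- base_plot + facet_wrap(~ Species)

print(faceted_plot)
```

In practice you would usually keep colour and facets for two different categorical variables; using the same one for both here only keeps the example short.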

**Sequential data or how to plot a sequence of steps in a group.**

For instance, the number of visitors at each step of a conversion channel, or the sequence of pages viewed by each user, is sequential data. Since we want to plot sequences, the traditional ways of representing data are not suitable. However, some other plots do the job well.

Sunburst: this is a radial chart. The users' starting point is the center of the plot, and each step takes them one ring further from the center. The number of users in a given sequence is shown by the width of the slice.

Tree plot: this is a graph read in a single direction (for instance, left to right). The further right you go, the further along the user journey you are. The width of an edge shows how many users got as far as that step.
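Both plots rely on the same quantity: the number of users who reached each step along each path. Here is a minimal base R sketch of that count. The sessions are made up for the example, and the names sessions, prefixes and slice_width are hypothetical.

```r
# Toy sessions (made up for illustration): the ordered pages of five users.
sessions <- list(
  c("home", "search", "product", "checkout"),
  c("home", "search", "product"),
  c("home", "product"),
  c("home", "search"),
  c("home")
)

# Every prefix of a session is a step that the user reached.
prefixes <- unlist(lapply(sessions, function(s) {
  vapply(seq_along(s),
         function(k) paste(s[1:k], collapse = " > "),
         character(1))
}))

# Users per path: the slice width in a sunburst, the edge width in a tree plot.
slice_width <- table(prefixes)
print(slice_width)
```

Plotting packages generally expect some form of this path-and-count table as input, although the exact format depends on the package.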

**Hierarchical data or how to plot nested sets and groups.**

Let’s say you want to plot several nested groups. This can be hard because of the nesting and the recursive nature of the data. For instance, you may want to plot the GDP per continent, then per geographic area, then per country, and finally per state.

Hence you would have a nested list with 5 levels, which can be tough to imagine and visualize.

As with sequential data, you could use a sunburst to plot hierarchical data. However, since the sunburst does not make explicit that a sub-group belongs to its parent group, it is not my favorite way to plot this kind of data.

Treemap: each group is represented by a rectangle whose area is proportional to the group's size or total value (for example, the sum of the GDP in this group). The treemap makes the parent-child relationship between groups explicit, since a child is embedded in its parent.
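To make the area rule concrete, here is a rough base R sketch of a two-level treemap with a simple slice-and-dice layout. The values are made up, the layout is a simplification chosen for this example (dedicated treemap packages may use other layouts), and all names are hypothetical.

```r
# Toy values (made up for illustration): two continents and their countries.
gdp <- data.frame(
  continent = c("A", "A", "B", "B", "B"),
  country   = c("a1", "a2", "b1", "b2", "b3"),
  value     = c(30, 10, 20, 25, 15)
)

# Parents: each continent gets a vertical strip as wide as its share.
parent <- aggregate(value ~ continent, data = gdp, FUN = sum)
names(parent)[2] <- "parent_value"
parent$xmax <- cumsum(parent$parent_value) / sum(parent$parent_value)
parent$xmin <- c(0, head(parent$xmax, -1))

# Children: each country splits its continent's strip from bottom to top.
gdp <- merge(gdp, parent, by = "continent")
gdp$height <- gdp$value / gdp$parent_value
gdp$ymax <- ave(gdp$height, gdp$continent, FUN = cumsum)
gdp$ymin <- gdp$ymax - gdp$height
gdp$area <- (gdp$xmax - gdp$xmin) * (gdp$ymax - gdp$ymin)

plot.new()
plot.window(xlim = c(0, 1), ylim = c(0, 1))
rect(gdp$xmin, gdp$ymin, gdp$xmax, gdp$ymax, col = "grey90")  # children
rect(parent$xmin, 0, parent$xmax, 1, lwd = 3)                 # parents
text((gdp$xmin + gdp$xmax) / 2, (gdp$ymin + gdp$ymax) / 2, gdp$country)
```

Each country's rectangle lies inside its continent's strip, and its area equals its share of the overall total.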

Circle treemap: the principle is the same as for a treemap, except that circles are used instead of squares and rectangles.

## High dimensional data and dimensionality reduction

There are numerous ways to plot multivariate data and data with intrinsic relationships (such as sequential data). However, when you have a very large number of variables and the data is truly high-dimensional, you need dimensionality reduction techniques such as PCA or t-SNE. These will be the subject of another post.

