In the post How to visualize more than 2 dimension?, several plots and techniques were shown to plot data with numerous dimensions. In this post, the R code will be shown and explained.
The data come from the World Happiness Report 2016 data set on Kaggle. They are well structured and do not need much processing.
Reading the data
First, create a project and put the data file from Kaggle (2016.csv) in its directory. You can then read the data and take a look at it.
require(data.table)
# read.csv replaces spaces and parentheses in column names with dots
HappinessData = data.table(read.csv('2016.csv'))
print(head(HappinessData))
The data set has 13 columns: the first two are categorical and the other 11 are numerical. Variables 3 to 6 (“Happiness.Rank”, “Happiness.Score”, “Lower.Confidence.Interval”, “Upper.Confidence.Interval”) are strongly correlated since they all depend on the happiness score.
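Both claims can be checked quickly by inspecting the column types and the correlation matrix. The sketch below is only an illustration: it uses a small made-up table (the values are invented, not taken from 2016.csv) that mimics variables 3 to 6, and the names toy, score and margin are hypothetical.

```r
# Illustrative values only: NOT taken from 2016.csv
score  <- c(7.5, 7.1, 6.4, 5.8, 4.9, 3.6)
margin <- c(0.06, 0.08, 0.07, 0.10, 0.12, 0.15)
toy <- data.frame(
  Country                   = paste0("Country", 1:6),
  Happiness.Rank            = rank(-score),
  Happiness.Score           = score,
  Lower.Confidence.Interval = score - margin,
  Upper.Confidence.Interval = score + margin
)

# Which columns are numerical?
numeric_cols <- sapply(toy, is.numeric)
print(numeric_cols)

# Pairwise correlations between the numerical columns
score_cor <- cor(toy[, numeric_cols])
print(round(score_cor, 2))
```

On the real data, the same idea would be sapply(HappinessData, is.numeric) followed by cor(HappinessData[, 3:6, with = FALSE]).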
Quantitative variables plot
1. Colors and size to account for quantitative variables
require(ggplot2)
# Two quantitative variables on the axes, two more mapped to color and point size
ggplot(HappinessData, aes(x=Economy..GDP.per.Capita., y=Happiness.Score, color=Freedom, size=Health..Life.Expectancy.))+
  geom_point(alpha=0.4)+
  xlab('GDP per capita')+ylab('Happiness score')
This is where the ggplot magic happens: with only a few lines of code, you can easily draw a rather complicated plot.
2. Scatter plot matrix
Again, this is gg magic: the plot only needs one line.
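The one-liner itself is not shown here. A plausible candidate, and this is an assumption rather than the post's own code, is ggpairs from the GGally package, which draws a scatter plot matrix. The sketch uses the built-in iris data so that it runs without 2016.csv; the commented line shows what the call might look like on HappinessData.

```r
library(ggplot2)
library(GGally)

# iris stands in for HappinessData so the sketch runs on its own
scatter_matrix <- ggpairs(iris, columns = 1:4)
print(scatter_matrix)

# Assumed equivalent on the happiness data (score plus columns 7 to 10):
# ggpairs(HappinessData, columns = c(4, 7:10))
```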
You can go further, for instance by using a different color for each region.
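As a hedged sketch of that idea, assuming the scatter plot matrix is built with ggpairs from GGally (an assumption, since the original call is not shown), a categorical column can be mapped to color through the mapping argument. The built-in iris data and its Species column stand in for HappinessData and Region.

```r
library(ggplot2)
library(GGally)

# Species plays the role of Region: one color per group
colored_matrix <- ggpairs(iris, columns = 1:4, mapping = aes(color = Species))
print(colored_matrix)

# Assumed equivalent on the happiness data:
# ggpairs(HappinessData, columns = c(4, 7:10), mapping = aes(color = Region))
```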
3. Parallel axis plot
The GGally package provides ggparcoord, a function that returns a parallel axis plot as a ggplot object.
require('GGally')
# Columns 4 and 7 to 10 become the axes; lines are colored by column 2 (Region)
ggparcoord(HappinessData, c(4,7:10), alphaLines=0.5, groupColumn=2)+
  ggtitle('Parallel axis diagram of happiness')
Since the function returns a ggplot object, you can also easily add a facet wrap:
# Group by column 1 (Country), draw one panel per region, and hide the long legend
ggparcoord(HappinessData, c(4,7:10), alphaLines=0.5, groupColumn=1)+
  ggtitle('Parallel axis diagram of happiness')+
  facet_wrap(~as.character(Region), ncol = 2)+
  theme(legend.position="none")
4. Correlation plot
To plot a correlation matrix, the corrplot package combined with the cor function is the easiest way:
require(corrplot)
# Correlations between the happiness score (column 4) and columns 7 to 13
corrplot(cor(HappinessData[, c(4,7:13), with=FALSE]), order = 'hclust', addrect = 3)
The order option indicates the criterion used to order the variables. Here, hierarchical clustering is used. With addrect = 3, rectangles are drawn around the three clusters of closest variables according to this criterion.
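To make the ordering less of a black box, here is a minimal base R sketch of the idea. It assumes the clustering is run on the distance 1 - correlation with complete linkage, which is my understanding of what corrplot does but should be treated as an assumption. The built-in mtcars data stands in for the happiness data so the sketch runs on its own.

```r
# mtcars stands in for the happiness data
corr <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Turn correlation into a distance: highly correlated variables are close
dist_mat <- as.dist(1 - corr)

# Hierarchical clustering of the variables (complete linkage is hclust's default)
hc <- hclust(dist_mat)

# Order of the variables along the dendrogram
var_order <- hc$order
print(colnames(corr)[var_order])

# Cut the tree into 3 groups, comparable to what addrect = 3 outlines
groups <- cutree(hc, k = 3)
print(groups)
```

Variables that end up in the same group are the ones a rectangle would enclose in the correlation plot.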
Sequential data and hierarchical data
To plot these two types of data, the package D3partitionR was used, with its random option, which generates random data for the plots.
require(D3partitionR)
## Sequential plots
D3partitionR(TRUE, type = 'sunburst', trail = TRUE)
D3partitionR(TRUE, type = 'collapsibleTree', trail = TRUE)
## Hierarchical plots
D3partitionR(TRUE, type = 'treeMap', trail = TRUE)
D3partitionR(TRUE, type = 'circleTreeMap', trail = TRUE)
If you have other packages or ways to do the plot, please share and comment.
Thanks for reading.