Data processing and cleaning take most of the time in a data science process and let’s be frank, that’s not the funniest step of the process.Even if base R is already quite efficient and powerful at data processing some packages almost make data processing easy and beautiful. The most popular ones are dplyr, tidyr and data.table.
The post focus on data.table. It does intend to be very practical and to be a datatables cookbook and to answer very practical questions.
Why should I use datatable?
The first reason why datatable is great is because it’s fast. Let’s see this with a small example.
First, we create a 9 million observations data.table with 4 variables. Two are continuous while the others are grouping variables.
set.seed(123) var1=rnorm(9000000,mean=1:3,sd=1:3) var2=rexp(9000000,rate=1:10) var_group_1=rep(as.factor(paste('grp',1:3)),length(var1)/3) var_group_2=rep(as.factor(paste('sub_grp',1:10)),length(var1)/10) require(data.table) DT=data.table(grp=var_group_1,sub_grp=var_group_2,var1=var1,var2=var2) DF=data.frame(grp=var_group_1,sub_grp=var_group_2,var1=var1,var2=var2)
Now, let’s try to compute the mean of var1 for each group. The microbenchmark package will compute the average times for both data.table and data.frame.
require(microbenchmark) microbenchmark( aggregate(var1~grp,data = DF,mean), DT[,mean(var1),by=grp],times=5L )
The mean computation time is around 9200ms for the dataframe while it is only 126ms with datatables! Datatable are two orders of magnitude faster for this operation!
How can I compute a mean or any aggregated value?
Computing a mean or any aggregated value for a variable is easy with datatable:
##Mean DT[,mean(var1)] ##sum DT[,sum(var1)] ##cumulative sum DT[,cumsum(var1)]
You can do the same for several variables:
You select the columns you want to compute the function on with the .SDcols argument. The .SD indicates that you are applying the functions on the columns selected in the .SDcols.
How can I compute a mean or any aggregated value by group?
Data.table has a by option which applies the action groupwise. The groups are each of the distinct value of the variable specified by the by option.
##Mean DT[,mean(var1), by=grp] ##sum DT[,sum(var1), by=grp] ##cumulative sum DT[,mean(var2), by=sub_grp]
For instance to compute the mean by group, we only needed to provide the by=grp. You can do the same for several group. For instance, you can compute the mean of var1 for each group and subgroup. The code below computes the mean of var1 for each group and subgroup:
The output tells you that the mean of var1 for people in group 1 and in subgroup 1 is 0.9976593 while it is 1.9984373 for people in group 1 and in subgroup 2.
How can I add or multiply columns?
Let’s say we want to add var1 and var2:
This will return a list where each element is equal to var1+var2. As you can guess, multiplication works the same way:
You can also apply function easily:
if you want to do several operations on different variables:
This will return a two columns datatable, the first one will be equal to cos(var1) and the second one to sin(var2)
How can I add a new variable?
You can add a new variable using the ‘:=’ operator:
var3 will be the addition of var1 and var2. You can also add several new variables:
As previously var3 is the sum of var1 and var2, var4 is the product of the two variables.
This also works with a list of variables and values:
And you can add grouping to compute aggregated values by group:
Remark: You don’t need to reassign to datatable when using the ‘:=’ operator. The operator is already modifying the datatable. Hence you should avoid:
How can I modify a variable?
The steps are very similar to previous question, let’s say you want to add 1 to var1:
This will add 1 to var1. Using the ‘by’ argument, you can easily standardised var1 by group:
How can I cast a datatable from long to wide format ? How can I do crosstabs?
Let’s say that you want a datatable so that the first column is the subgroup, and the other columns are the means of var 2 for each group:
The dcast function (which works the same way as reshape:cast) provides a interface to do so:
dcast(DT,sub_grp~grp,fun.aggregate = mean) ##You can do the same for var1: dcast(DT,sub_grp~grp,fun.aggregate = mean,value.var='var1')
How can I compute lags and leads?
To compute lags and leads, data.table use the shift function. For this question, we will use a different data.table.
We will create fake data simulating the revenue of growing firms.
set.seed(456) time=rep(1:10,10) value=as.vector(replicate(10,(100+abs(rnorm(10,1:10,sd=2))))) firm=rep(paste0('firm_',1:10),each=10) Data_firm=data.table(firm,time,value) require(ggplot2) ggplot(Data_firm,aes(x=time,y=value,color=firm))+geom_line()
Now, we can use the lag function to compute the growth from one period to the other:
Data_firm[,growth:=(value-shift(value,type = 'lag'))/shift(value,type = 'lag') ,by=c('firm')]
And here is the growth of the firms. If you want to to compute a n-th order lag or lead, you just have to do:
shift(value,n=4L,type = 'lag') shift(value,n=4L,type = 'lead')
How can I subset the data.table?
Subsetting is easy in data.table, for instance, if you only want the observations from group 1:
##return the complete table with obs from group 1 DT[grp=='grp 1'] ##return the selected variables from observation from group 1 DT[grp=='grp 1',.(var1,var2)]
You can also subset the data.table using the row number, for instance to return the five first observation
How can I conditionally modify a variable?
Here, the goal is to add 1 the var1 when the observation belong to group 1, to add 2 when the observation belong to group 2 and 0 otherwise.
Data_firm[,var1:=var1+('grp 1'==grp)+2*('grp 2'==grp)]
The boolean returned by
are coerced to numeric by R (0 if false, 1 if true), hence the multiplication can be done.
You can also apply the function to each subset:
Data_firm['grp 1'==grp,var1:=var1+1] Data_firm['grp 2'==grp,var1:=var1+2]
How can I count the number of observations per group? How can I count the number of observations per group and subgroup?
If you want to know how many observation there are in each group and in each group intersection, you can use .N
##Number of observations per group Data_firm[,.N, by=grp] ##Number of observations in each group and subgroup intersection Data_firm[,.N, by=c('grp','sub_grp')]