Today in Machine Learning Explained, we will tackle a central (yet under-looked) aspect of Machine Learning: **vectorization**. Let’s say you want to compute the sum of the values of an array. The naive way to do so is to loop over the elements and to sequentially sum them. This naive way is slow and tends to get even slower with large amounts of data and large data structures.

With vectorization these operations can be seen as matrix operations which are often more efficient than standard loops. Vectorized versions of algorithm are several orders of magnitudes faster and are easier to understand from a mathematical perspective.

## A basic exemple of vectorization

### Preliminary exemple – Python

Let’s compare the naive way and the vectorized way of computing the sum of the elements of an array. To do so, we will create a large (100,000 elements) Numpy array and compute the sum of its element 1,000 times with each algorithm. The overall computation time will then be compared.

import numpy as np import time W=np.random.normal(0,1,100000) n_rep=1000

The naive way to compute the sum iterates over all the elements of the array and stores the sum:

start_time = time.time() for i in range(n_rep): loop_res=0 for elt in W: loop_res+=elt time_loop = time.time() - start_time

If is our vector of interest, the sum of its elements can be expressed as the dot product :

start_time = time.time() for i in range(n_rep): one_dot=np.ones(W.shape) vect_res=one_dot.T.dot(W) time_vect = time.time() - start_time

Finally, we can check that both methods yield the same results and compare their runtime. The vectorized version run approximately 100 to 200 times faster than the naive loop.

print(np.abs(vect_res-loop_res)<10e-10) print(time_loop/time_vect)

Note: The same results can be obtained with *np.sum*.The numpy version has a very similar runtime to our vectorized version. Numpy being very optimized, this show that our vectorized sum is reasonably fast.

start_time = time.time() for i in range(n_rep): vect_res=np.sum(W) time_vect_np = time.time() - start_time

### Preliminary exemple – R

The previous experiments can be replicated in R:

##Creation of the vector W=matrix(rnorm(100000)) n_rep=10000 #Naive way: library(tictoc) tic('Naive computation') for (rep in 1:n_rep) { res_loop=0 for (w_i in W) res_loop=w_i+res_loop } toc() tic('Vectorized computation') # vectorized way for (rep in 1:n_rep) { ones=rep(1,length(W)) res_vect= crossprod(ones,W) } toc() tic('built-in computation') # built-in way for (rep in 1:n_rep) { res_built_in= sum(W) } toc()

In R, the vectorized version is only an order of magnitude faster than the naive way. The built-in way achieves the best performances and is an order of magnitude faster than our vectorized way.

### Preliminary exemple – Results

Vectorization divides the computation times by several order of magnitudes and the difference with loops increase with the size of the data. Hence, if you want to deal with large amount of data, rewriting the algorithm as matrix operations may lead to important performances gains.

## Why vectorization is (often) faster

- R and Python are
**interpreted language**, this means that your instructions are analyzed and interpreted at each execution. Since they are not statically typed, the loops have to assess the type of the operands at each iteration which leads to a computational overhead. - R and Python linear algebra relies on
**optimized back-end for matrix operations and linear algebra**. These back-end are written in C/C++ and can process loops efficiently. Furthermore, while loops are sequential, these back-end can run operations in parallel which improves the computation speed on modern CPU.

**Note 1**: Though vectorization is often faster, it requires to allocate the memory of the array. If your amount of RAM is limited or if the amount of data is large, loops may be required.

**Note 2**: When you deal with large arrays or computationally intensive algebra ( like inversion, diagonalization, eigenvalues computations, ….) computations on GPU are even order of magnitudes faster than on CPU. To write efficient GPU code, the code needs to be composed of matrix operations. Hence, having vectorized code maked it easier to translate CPU code to GPU (or tensor-based frameworks).

## A small zoo of matrix operations

The goal of this part is to show some basic matrix operations/vectorization and to end on a more complex example to show the thought process which underlies vectorization.

### Column-wise matrix sum

The column wise sum (and mean) can be expressed as a matrix product. Let be our matrix of interest. Using the matrix multiplication formula, we have: . Hence, the column-wise sum of is .

**Python code:**

def colWiseSum(W): ones=np.ones((W.shape[0],1)) return ones.T.dot(W)

**R code:**

colWiseSum=function(W) { ones=rep(1,nrow(W)) crossprod(W,ones) }

### Row-wise matrix sum

Similarly, the row-wise sum is .

**Python code:**

def rowWiseSum(W): ones=np.ones((W.shape[1],1)) return W.dot(ones)

**R code:**

rowWiseSum=function(W) { ones=rep(1,ncol(W)) W%*%ones }

### Matrix sum

The sum of all the elements of a matrix is the sum of the sum of its rows. Using previous expression, the sum of all the terms of is .

**Python code:**

def matSum(W): rhs_ones=np.ones((W.shape[1],1)) lhs_ones=np.ones((W.shape[0],1)) return lhs_ones.T.dot(W).dot(rhs_ones)

**R code:**

matSum=function(W) { rhs_ones=rep(1,ncol(W)) lhs_ones=rep(1,nrow(W)) crossprod(lhs_ones,W)%*% rhs_ones }

### Similarity matrix (Gram matrix)

Let’s say we have a set of words and for each of this words we want to find the most similar words from a dictionary. We assume that the words have been projected in space of dimension (using word2vect). Let (our set of words) and (our dictionary) be two matrices resp. in and . To compute the similarity of all the observations of and we simply need to compute .

**Python code:**

def gramMatrix(X,Y): return X.dot(Y.T)

**R code:**

gramMatrix=function(X,Y) { tcrossprod(X,t(Y)) }

### L2 distance

We want to compute the pair-wise distance between two sets of vector. Let and be two matrix in and . For each vector of we need to compute the distance with all the vectors of .Hence, the output matrix should be of size .

If and are two vectors, their distance is:

. To compute all pairwise distance, some work is required on the last equality. All the matrices should be of size , then the output vector of distance will be of size , which can be reshaped into a vector of size .

The first two terms and need to be computed for each and . is the element-wise multiplication of X with itself (its elements are ). Hence, the i-th element of is the squared sum of the coordinate of the i-th observation, .

However, this is a vector and its shape is . By replicating each of its elements times, we will get a matrix of size . The replication can be done easily, if we consider the right matrix .

Let be a vector of one of size . Let be the matrix of size with repetitions of on the “diagonal”:

Then, our final vector is (The same expression holds for ). We denote a reshape operator (used to transform the previous vector in matrix). With previous part on similarity matrix, we get the following expression of the pairwise distance:

The previous expression can seem complex, but this will help us a lot to code the pairwise distance. We only have to do the translation from maths to Numpy or R.

**Python code:**

def L2dist(X,Y): n_1=X.shape[0] n_2=Y.shape[0] p=X.shape[1] ones=np.ones((p,1)) x_sq=(X**2).dot(ones)[:,0] y_sq=(Y**2).dot(ones)[:,0] delta_n1_n2=np.repeat(np.eye(n_1),n_2,axis=0) delta_n2_n1=np.repeat(np.eye(n_2),n_1,axis=0) return np.reshape(delta_n1_n2.dot(x_sq),(n_1,n_2))+np.reshape(delta_n2_n1.dot(y_sq),(n_2,n_1)).T-2*gramMatrix(X,Y)

**R Code:**

L2dist=function(X,Y) { n_1=dim(X)[1] n_2=dim(Y)[1] p=dim(X)[2] ones=rep(1,p) x_sq=X**2 %*% ones x_sq=t(matrix(diag(n_1) %x% rep(1, n_2) %*% x_sq, n_2,n_1)) y_sq=Y**2 %*% ones y_sq=matrix(diag(n_2) %x% rep(1, n_1) %*% y_sq,n_1,n_2) x_sq+y_sq-2*gramMatrix(X,Y) }

### L2 distance (improved)

Actually the previous *L2dist* is not completely optimized requires a lot of memory since has cells and is mostly empty. Using Numpy built-in function, we can circumvent this multiplication by directly repeating the vector (which reduces the memory footprints by a factor ):

**Python code:**

def L2dist_improved(X,Y): n_1=X.shape[0] n_2=Y.shape[0] p=X.shape[1] ones=np.ones((p,1)) x_sq=(X**2).dot(ones)[:,0] y_sq=(Y**2).dot(ones)[:,0] ##Replace multiplication by a simple repeat X_rpt=np.repeat(x_sq,n_2).reshape((n_1,n_2)) Y_rpt=np.repeat(y_sq,n_1).reshape((n_2,n_1)).T return X_rpt+Y_rpt-2*gramMatrix(X,Y)

**R code:**

L2dist_improved=function(X,Y) { n_1=dim(X)[1] n_2=dim(Y)[1] p=dim(X)[2] ones=rep(1,p) x_sq=X**2 %*% ones x_sq=t(matrix(rep(x_sq,each=n_2),n_2,n_1)) y_sq=Y**2 %*% ones y_sq=matrix(rep(y_sq,each=n_1),n_1,n_2) x_sq+y_sq-2*gramMatrix(X,Y) }

**Note :** Actually, this code can be made even shorter and more efficient by using a custom outer product (Thanks to Florian Privé for the solution):

L2dist_improved2 <- function(X, Y) { xn <- rowSums(X ** 2) yn <- rowSums(Y ** 2) outer(xn, yn,'+') - 2 * tcrossprod(X, Y) }

### L2 distance – benchmark

To show the interest of our previous work, let’s compare the computation speed of the vectorized L2 distance, the naive implementation and the scikit-learn implementation. The experiments are run on different size of dataset with 100 repetitions.

The vectorized implementation is 2 to 3 orders of magnitude faster than the naive implementation and on par with the scikit implementation.

Hello,

Sorry for being picky but your R code is very pythonish to my sense.

Two remarks:

– gramMatrix is tcrossprod

– not sure people use matrix operations to compute rowMeans in R

For the computation of the distance in base R, I would use:

L2dist_improved2 <- function(X, Y) {

xn <- rowSums(X ** 2)

yn <- rowSums(Y ** 2)

tcrossprod(xn, yn) – 2 * tcrossprod(X, Y)

}

Best,

Florian

Hello Florian,

Thanks a lot for your feedback. You are right my R code is very pythonish, I directly translated it from the python code. Concerning the remarks:

– crossprod and tcrossprod would be better and faster, I’ll modify the post and replace the matrix products.

– In real life, whenever possible, I would use built-in operations (rowSums, colSums, sum, pdist, …) since they are faster and less error-prone. The matrix operations are here for the sake of demonstration and I should say it more clearly.

– Thanks for the improved version, but the function does not yield the same result as pdist. I think that

`tcrossprod(xn, yn)`

computes the product instead of the sum .Again, thanks for your feedback,

Best,

Antoine

You’re completely right.

Replace `tcrossprod(xn, yn)` with `outer(xn, yn, ‘+’)` and it should be OK.

This is far more elegant than my previous approach, thanks!