Machine Learning Explained: Vectorization and matrix operations

4
compute time L2
Computation time of the L2 distance

Today in Machine Learning Explained, we will tackle a central (yet under-looked) aspect of Machine Learning: vectorization. Let’s say you want to compute the sum of the values of an array. The naive way to do so is to loop over the elements and to sequentially sum them. This naive way is slow and tends to get even slower with large amounts of data and large data structures.
With vectorization these operations can be seen as matrix operations which are often more efficient than standard loops. Vectorized versions of algorithm are several orders of magnitudes faster and are easier to understand from a mathematical perspective.

A basic exemple of vectorization

Preliminary exemple – Python

 
Let’s compare the naive way and the vectorized way of computing the sum of the elements of an array. To do so, we will create a large (100,000 elements) Numpy array and compute the sum of its element 1,000 times with each algorithm. The overall computation time will then be compared.

import numpy as np
import time
W=np.random.normal(0,1,100000)
n_rep=1000

The naive way to compute the sum iterates over all the elements of the array and stores the sum:

start_time = time.time()
for i in range(n_rep):
    loop_res=0
    for elt in W:
        loop_res+=elt
time_loop = time.time() - start_time

If \mathbf{W} is our vector of interest, the sum of its elements can be expressed as the dot product \mathbf{1}^T \mathbf{W} = \sum W_i:

start_time = time.time()
for i in range(n_rep):
    one_dot=np.ones(W.shape)
    vect_res=one_dot.T.dot(W)
time_vect = time.time() - start_time

Finally, we can check that both methods yield the same results and compare their runtime. The vectorized version run approximately 100 to 200 times faster than the naive loop.

print(np.abs(vect_res-loop_res)<10e-10)
print(time_loop/time_vect)

Note: The same results can be obtained with np.sum.The numpy version has a very similar runtime to our vectorized version. Numpy being very optimized, this show that our vectorized sum is reasonably fast.

start_time = time.time()
for i in range(n_rep):
    vect_res=np.sum(W)
time_vect_np = time.time() - start_time

Preliminary exemple – R

The previous experiments can be replicated in R:

##Creation of the vector
W=matrix(rnorm(100000))
n_rep=10000
#Naive way:
library(tictoc)
tic('Naive computation')
for (rep in 1:n_rep)
{
  res_loop=0
  for (w_i in W)
    res_loop=w_i+res_loop
}
toc()

tic('Vectorized computation')
# vectorized way
for (rep in 1:n_rep)
{
  ones=rep(1,length(W))
  res_vect= crossprod(ones,W)
}
toc()

tic('built-in computation')
# built-in way
for (rep in 1:n_rep)
{
  res_built_in= sum(W)
}
toc()

In R, the vectorized version is only an order of magnitude faster than the naive way. The built-in way achieves the best performances and is an order of magnitude faster than our vectorized way.

Preliminary exemple – Results

    \[\begin{tabular}{|l|c|c|r|}   \hline   Lang & Naive & Vectorized & Built in \\   \hline   Python & 115 s & 0.55 s & 0.28 s \\   R & 25 s & 3.8 s & 0.9 s \\   \hline \end{tabular}\]

    \[\text{Computation time of the sum of an array of 100,000 elements (10,000 rep)}\]

Vectorization divides the computation times by several order of magnitudes and the difference with loops increase with the size of the data. Hence, if you want to deal with large amount of data, rewriting the algorithm as matrix operations may lead to important performances gains.

Why vectorization is (often) faster

  • R and Python are interpreted language, this means that your instructions are analyzed and interpreted at each execution. Since they are not statically typed, the loops have to assess the type of the operands at each iteration which leads to a computational overhead.
  • R and Python linear algebra relies on optimized back-end for matrix operations and linear algebra. These back-end are written in C/C++ and can process loops efficiently. Furthermore, while loops are sequential, these back-end can run operations in parallel which improves the computation speed on modern CPU.

Note 1: Though vectorization is often faster, it requires to allocate the memory of the array. If your amount of RAM is limited or if the amount of data is large, loops may be required.
Note 2: When you deal with large arrays or computationally intensive algebra ( like inversion, diagonalization, eigenvalues computations, ….) computations on GPU are even order of magnitudes faster than on CPU. To write efficient GPU code, the code needs to be composed of matrix operations. Hence, having vectorized code maked it easier to translate CPU code to GPU (or tensor-based frameworks).

A small zoo of matrix operations

The goal of this part is to show some basic matrix operations/vectorization and to end on a more complex example to show the thought process which underlies vectorization.

Column-wise matrix sum

The column wise sum (and mean) can be expressed as a matrix product. Let \mathbf{M} be our matrix of interest. Using the matrix multiplication formula, we have: [\mathbf{1}^T\mathbf{M}]_j= \sum_i \mathbf{1}_i \mathbf{M}_{i,j} =\sum_i \mathbf{M}_{i,j}. Hence, the column-wise sum of \mathbf{M} is \mathbf{1}^T\mathbf{M}.

Python code:

def colWiseSum(W):
    ones=np.ones((W.shape[0],1))
    return ones.T.dot(W)

R code:

colWiseSum=function(W)
{
  ones=rep(1,nrow(W))
  crossprod(W,ones)
}

Row-wise matrix sum

Similarly, the row-wise sum is \mathbf{M}\mathbf{1}.

Python code:

def rowWiseSum(W):
    ones=np.ones((W.shape[1],1))
    return W.dot(ones)

R code:

rowWiseSum=function(W)
{
  ones=rep(1,ncol(W))
  W%*%ones
}

Matrix sum

The sum of all the elements of a matrix is the sum of the sum of its rows. Using previous expression, the sum of all the terms of \mathbf{M} is \mathbf{1}^T\mathbf{M}\mathbf{1}.

Python code:

def matSum(W):
    rhs_ones=np.ones((W.shape[1],1))
    lhs_ones=np.ones((W.shape[0],1))
    return lhs_ones.T.dot(W).dot(rhs_ones)

R code:

matSum=function(W)
{
  rhs_ones=rep(1,ncol(W))
  lhs_ones=rep(1,nrow(W))
  crossprod(lhs_ones,W)%*% rhs_ones
}

Similarity matrix (Gram matrix)

Let’s say we have a set of words and for each of this words we want to find the most similar words from a dictionary. We assume that the words have been projected in space of dimension p (using word2vect). Let \mathbf{X} (our set of words) and \mathbf{Y} (our dictionary) be two matrices resp. in \mathbb{R}^{n_1,p} and \mathbb{R}^{n_2,p}. To compute the similarity of all the observations of \mathbf{X} and \mathbf{Y} we simply need to compute \mathbf{X} \mathbf{Y}^T.

Python code:

def gramMatrix(X,Y):
    return X.dot(Y.T)

R code:

gramMatrix=function(X,Y)
{
  tcrossprod(X,t(Y))
}

L2 distance

We want to compute the pair-wise distance between two sets of vector. Let \mathbf{X} and \mathbf{Y} be two matrix in \mathbb{R}^{n_1,p} and \mathbb{R}^{n_2,p}. For each vector of X we need to compute the distance with all the vectors of Y.Hence, the output matrix should be of size (n_1,n_2).

If x and y are two vectors, their \mathbf{L}^2 distance is:

    \[\sum (x_i - y_i)^2 = \sum x_i^2 + \sum y_i^2 - 2 \sum x_i y_i = x^T x + y^T y - 2 x^T y\]

. To compute all pairwise distance, some work is required on the last equality. All the matrices should be of size (n_1 \times n_2,p), then the output vector of distance will be of size (n_1 \times n_2,1), which can be reshaped into a vector of size (n_1,n_2).

The first two terms x^T x and y^T y need to be computed for each x and y. \mathbf{X} \odot \mathbf{X} is the element-wise multiplication of X with itself (its elements are x_{i,j}^2). Hence, the i-th element of x_{sq}=(\mathbf{X} \odot \mathbf{X}) \mathbf{1} is the squared sum of the coordinate of the i-th observation, \sum_j x_{i,j}^2.

However, this is a vector and its shape is (n_1,1). By replicating each of its elements n_2 times, we will get a matrix of size (n_1 \times n_2,1). The replication can be done easily, if we consider the right matrix \mathbf{\Delta}_{n_2}^{n_1}.

Let \mathbf{1}_{n_2} be a vector of one of size n_2. Let \mathbf{\Delta}_{n_2}^{n_1} be the matrix of size (n_1\times n_2, n_1) with n_1 repetitions of \mathbf{1}_{n_2} on the “diagonal”:

    \[\mathbf{\Delta}_{n_2}^{n_1}=\begin{matrix}                 \mathbf{1}_{n_2} &  0  & \ldots & 0 \\                 0  & \mathbf{1}_{n_2} & \ldots & 0 \\                 \vdots & \vdots & \ddots & \vdots \\                  0  &   0       &\ldots & \mathbf{1}_{n_2} \end{matrix}\]

Then, our final vector is \Delta_{n_2}^{n_1}x_{sq} (The same expression holds for \mathbf{Y}). We denote \Phi a reshape operator (used to transform the previous vector in matrix). With previous part on similarity matrix, we get the following expression of the pairwise distance:

    \[\Phi(\mathbf{\Delta}_{n_2}^{n_1} ((\mathbf{X} \odot \mathbf{X}) \mathbf{1}_{p}))+\Phi(\mathbf{\Delta}_{n_1}^{n_2} ((\mathbf{Y} \odot \mathbf{Y}) \mathbf{1}_{p})) -2 \mathbf{X} \mathbf{Y}^T\]

The previous expression can seem complex, but this will help us a lot to code the pairwise distance. We only have to do the translation from maths to Numpy or R.

Python code:

def L2dist(X,Y):
    n_1=X.shape[0]
    n_2=Y.shape[0]
    p=X.shape[1]
    ones=np.ones((p,1))
    x_sq=(X**2).dot(ones)[:,0]
    y_sq=(Y**2).dot(ones)[:,0]
    delta_n1_n2=np.repeat(np.eye(n_1),n_2,axis=0)
    delta_n2_n1=np.repeat(np.eye(n_2),n_1,axis=0)
    return np.reshape(delta_n1_n2.dot(x_sq),(n_1,n_2))+np.reshape(delta_n2_n1.dot(y_sq),(n_2,n_1)).T-2*gramMatrix(X,Y)

R Code:

L2dist=function(X,Y)
{
  n_1=dim(X)[1]
  n_2=dim(Y)[1]
  p=dim(X)[2]
  ones=rep(1,p)
  x_sq=X**2 %*% ones
  x_sq=t(matrix(diag(n_1) %x% rep(1, n_2) %*% x_sq, n_2,n_1))
  
  y_sq=Y**2 %*% ones
  y_sq=matrix(diag(n_2) %x% rep(1, n_1) %*% y_sq,n_1,n_2)
  
  x_sq+y_sq-2*gramMatrix(X,Y)
}

L2 distance (improved)

Actually the previous L2dist is not completely optimized \mathbf{\Delta}_{n_2}^{n_1} ((\mathbf{X} \odot \mathbf{X}) \mathbf{1}_{p}) requires a lot of memory since \Delta}_{n_2}^{n_1} has n_2 \times n_1^2 cells and is mostly empty. Using Numpy built-in function, we can circumvent this multiplication by directly repeating the vector (which reduces the memory footprints by a factor max(n_1,n2)):

Python code:

def L2dist_improved(X,Y):
    n_1=X.shape[0]
    n_2=Y.shape[0]
    p=X.shape[1]
    ones=np.ones((p,1))
    x_sq=(X**2).dot(ones)[:,0]
    y_sq=(Y**2).dot(ones)[:,0]
##Replace multiplication by a simple repeat
    X_rpt=np.repeat(x_sq,n_2).reshape((n_1,n_2))
    Y_rpt=np.repeat(y_sq,n_1).reshape((n_2,n_1)).T
    return X_rpt+Y_rpt-2*gramMatrix(X,Y)

R code:

L2dist_improved=function(X,Y)
{
  n_1=dim(X)[1]
  n_2=dim(Y)[1]
  p=dim(X)[2]
  ones=rep(1,p)
  x_sq=X**2 %*% ones
  x_sq=t(matrix(rep(x_sq,each=n_2),n_2,n_1))
  
  y_sq=Y**2 %*% ones
  y_sq=matrix(rep(y_sq,each=n_1),n_1,n_2)
  
  x_sq+y_sq-2*gramMatrix(X,Y)
}

Note : Actually, this code can be made even shorter and more efficient by using a custom outer product (Thanks to Florian Privé for the solution):

L2dist_improved2 <- function(X, Y) {
  
  xn <- rowSums(X ** 2)
  yn <- rowSums(Y ** 2)

  outer(xn, yn,'+') - 2 * tcrossprod(X, Y)
}

L2 distance – benchmark

To show the interest of our previous work, let’s compare the computation speed of the vectorized L2 distance, the naive implementation and the scikit-learn implementation. The experiments are run on different size of dataset with 100 repetitions.

compute time L2
Computation time of the L2 distance

The vectorized implementation is 2 to 3 orders of magnitude faster than the naive implementation and on par with the scikit implementation.

4 COMMENTS

  1. Hello,
    Sorry for being picky but your R code is very pythonish to my sense.
    Two remarks:
    – gramMatrix is tcrossprod
    – not sure people use matrix operations to compute rowMeans in R

    For the computation of the distance in base R, I would use:
    L2dist_improved2 <- function(X, Y) {

    xn <- rowSums(X ** 2)
    yn <- rowSums(Y ** 2)

    tcrossprod(xn, yn) – 2 * tcrossprod(X, Y)
    }

    Best,
    Florian

    • Hello Florian,
      Thanks a lot for your feedback. You are right my R code is very pythonish, I directly translated it from the python code. Concerning the remarks:
      – crossprod and tcrossprod would be better and faster, I’ll modify the post and replace the matrix products.
      – In real life, whenever possible, I would use built-in operations (rowSums, colSums, sum, pdist, …) since they are faster and less error-prone. The matrix operations are here for the sake of demonstration and I should say it more clearly.
      – Thanks for the improved version, but the function does not yield the same result as pdist. I think that tcrossprod(xn, yn) computes the product x_i^2 y_j^2 instead of the sum x_i^2 + y_j^2.

      library(pdist)
      
      X=matrix(rnorm(10000),100,100)
      Y=matrix(rnorm(10000),100,100)
      
      L2dist_improved2 <- function(X, Y) {
        
        xn <- rowSums(X ** 2)
        yn <- rowSums(Y ** 2)
        
        tcrossprod(xn, yn) - 2 * tcrossprod(X, Y)
      }
      
      max(abs(sqrt(L2dist_improved2(X,Y))-as.matrix(pdist(X,Y))))
      

      Again, thanks for your feedback,
      Best,
      Antoine

LEAVE A REPLY

Please enter your comment!
Please enter your name here