Create your Machine Learning library from scratch with R ! (1/3)

0

When dealing with Machine Learning problems in R, most of the time you rely on already existing libraries. This fastens the analysis process, but do you really understand what is behind the algorithms? Could you implement a logistic regression from scratch with R?
The goal of this post is to create our own basic machine learning library from scratch with R. We will only use the linear algebra tools available in R. There will be three posts:

  1. Linear and logistic regression (this one)
  2. PCA and k-nearest neighbors classifiers and regressors
  3. Tree-based methods and SVM

Linear Regression (Least-Square)

The goal of liner regression is to estimate a continuous variable Y given a matrix of observations X. Before dealing with the code, we need to derive the solution of the linear regression.

Solution derivation of linear regression

Given a matrix of observations X and the target Y. The goal of the linear regression is to minimize the L2 norm between Y and a linear estimate of Y : \hat{Y}=WX. Hence, linear regression can be rewritten as an optimization problem: arg \; min_W ||Y-WX||^2. A closed-form solution can easily be derived and the optimal W is \hat{W}=(X^T X)^{-1}X^T Y

Linear regression in R

Using the closed-form solution, we can easily code the linear regression. Our linear model object will have three methods, an init method where the model is fitted, a predict method to work with new data and a plot method to visualize the residuals’ distribution.

###Linear model
fit_lm<-function(x,y,intercept=TRUE,lambda=0)
{
  ##Conversion to matrix if required
  if (!is.matrix(x))
  {
    x=as.matrix(x)
  }
  if (!is.matrix(y))
  {
    y=as.matrix(y)
  }
  #Add the intercept coefficient
  if (intercept)
  {
    x=cbind(x,1)
  }
  my_lm=list(intercept=intercept)
  ##Compute coefficients estimates
  my_lm[['coeffs']]=solve(t(x) %*% x) %*% t(x) %*% y
  ##Compute estimates for the train dataset
  my_lm[['preds']]=x %*% my_lm[['coeffs']]
  my_lm[['residuals']]=my_lm[['preds']]-y
  my_lm[['mse']]=mean(my_lm[['residuals']]^2)
  attr(my_lm, "class") <- "my_lm"
  return(my_lm)
}

The fit function is simple, the first few lines transform the data to matrices and add an intercept if required. Then, the ‘my_lm’ object is created and the coefficients are computed. The solve() function is used to invert the matrix and %*% denotes matrix multiplication. A the end, the residuals and the estimates are computed and the class of the object is set as ‘my_lm’.
Now let’s implement the predict and plot methods for the my_lm class:


predict.my_lm<-function(my_lm,x,..)
{
  if (!is.matrix(x))
  {
    x=as.matrix(x)
  }
  if (my_lm[["intercept"]])
  {
    x=cbind(x,1)
  }
  x%*%my_lm[["coeffs"]]
}

plot.my_lm<-function(my_lm,bins=30,..)
{
  library(ggplot2)
  qplot(my_lm[["residuals"]], geom="histogram",bins=bins) + xlab('Residuals values') + ggtitle('Residual distribution')
  
}

You can test the code on some preinstalled R dataset such as the car one. The code will give you the same coefficients estimates as the lm function. For instance, on the car dataset:

my_lm1=fit_lm(cars[,1],cars[,2])
vanilla_lm=lm(dist~speed,cars)
print(vanilla_lm[['coefficients']])
print(my_lm1[['coeffs']])

Logistic regression

Previously, we worked on regression and the estimation of a continuous variable. Now, with logistic regression, we try to estimate a binary outcome (for instance, ill vs healthy, pass vs failed, …). Again, let’s deal with the maths first:

The mathematics of logistic regression

The goal is to estimate a binary outcome Y given the observations X. We assume that Y|X follows a Bernoulli distribution of parameter \theta(X)=\frac{e^{WX}}{1+e^{WX}}=\sigma(WX). \sigma is called the sigmoid function.
Hence, we have P(y|x)=\theta(X)^y(1-\theta(X))^{1-y}.
We want to maximize the log-likelihood of the observed sample(over \theta and hence over W):

    \[l(\theta)=\sum_{x,y} \; y.log(\theta(X)) + (1-y).log(1-\theta(X)) \]

This maximization will be done using Newton’s Method. Newton’s method is a variant of gradient descent in which we try to find the optimal curvature of the function to increase the speed of convergence. If you are not familiar with the Newton method, you can just see it as a variant of batch gradient descent.. The weights updates has the following form:

    \[ W_{t+1}=W_t+(\nabla^2 l(W_t))^{-1} \nabla l(W_t) \]

with: The Hessian

    \[ \nabla^2 l(W_t))=-X^T Diag(\sigma( W^T X)_i(1-\sigma( W^T X))_i) X\]

and the gradient

    \[\nabla l(W_t)=X^T.(y-\sigma( W^T X))\]

The algorithm in R will update the weights using this update until the termination criterion is met. Here, the termination criterion is met when the mean square error is below the user-defined tolerance.

Logistic regression in R

###Sigmoid function
sigmoid=function(x) {1/(1+exp(-x))}

###Fit logistic regression
fit_logit=function(x,y,intercept=T,tol=10e-5,max_it=100)
{
  ##Type conversion
  if (!is.matrix(x))
  {
    x=as.matrix(x)
  }
  if (!is.matrix(y))
  {
    y=as.matrix(y)
  }
   ##Add intercept is required
  if (intercept)
  {
    x=cbind(x,1)
  }
  ##Algorithm initialization
  iterations=0
  converged=F
  ##Weights are initialized to 1 
  coeffs=matrix(1,dim(x)[2])
  
  ##Updates the weight until the max number of iter
  ##Or the termination criterion is met
  while (iterations<max_it& !converged)
  {
    iterations=iterations+1
    nu<-sigmoid(x %*% coeffs)
    old_pred=sigmoid(x %*% coeffs)
    nu_diag=diag(nu[,1])
    ##Weights update
    coeffs=coeffs + solve(t(x) %*% nu_diag %*% x)%*% t(x) %*% (y-nu)
    ##compute mse to check termination
    mse=mean((y-sigmoid(x%*%coeffs))^2)
    ##Stop computation if tolerance is reached
    if (mse<tol)
    {
      converged=T
    }
      
  }
  ##Creates the logit objects 
  my_logit=list(intercept=intercept)
  my_logit[['coeffs']]=coeffs
  my_logit[['preds']]=sigmoid(x%*%coeffs)
  my_logit[['residuals']]=my_logit[['preds']]-y
  my_logit[['mse']]=mean(my_logit[['residuals']]^2)
  my_logit[['iteration']]=iterations
  attr(my_logit, "class") <- "my_logit"
  return(my_logit)

}

##Predict the outcome on new data
predict.my_logit<-function(my_logit,x,probs=T,..)
{
  if (!is.matrix(x))
  {
    x=as.matrix(x)
  }
  if (my_logit[['intercept']])
  {
    x=cbind(x,1)
  }
  if (probs)
  {
    sigmoid(x %*% my_logit[['coeffs']])
  }
  else
  {
    sigmoid(x %*% my_logit[['coeffs']])>0.5
  }
}

The code is split into two part:

  • The fit part, where the logistic model is fitted using Newton’s method.
    This part has three main components. First, the data is put in the proper matrix format and, if required, the intercept is added. Then the algorithm parameters are updated (initially all the weights are set to one).
    Finally, the algorithm updates the weights until the MSE goes below the tolerance. The most important line is probably the weights update one where the update formula is used to update the weights of the model.
  • The predict method, where the outcome is estimated using the weights computed previously.

We can now use our logistic regression to predict the class of a flower from the iris dataset:

fit_logit(iris[,1:4],iris[,5]=='setosa')

As expected, the algorithm can predict efficiently if a flower is a setosa or not.

If you like this post, follow us to learn how to create your Machine Learning library from scratch with R!

LEAVE A REPLY

Please enter your comment!
Please enter your name here