When dealing with Machine Learning problems in R, most of the time you rely on already existing libraries. This fastens the analysis process, but do you really understand what is behind the algorithms? Could you implement a logistic regression from scratch with R?
The goal of this post is to create our own basic machine learning library from scratch with R. We will only use the linear algebra tools available in R. There will be three posts:
 Linear and logistic regression (this one)
 PCA and knearest neighbors classifiers and regressors
 Treebased methods and SVM
Linear Regression (LeastSquare)
The goal of liner regression is to estimate a continuous variable given a matrix of observations . Before dealing with the code, we need to derive the solution of the linear regression.
Solution derivation of linear regression
Given a matrix of observations and the target . The goal of the linear regression is to minimize the norm between and a linear estimate of : . Hence, linear regression can be rewritten as an optimization problem: . A closedform solution can easily be derived and the optimal is
Linear regression in R
Using the closedform solution, we can easily code the linear regression. Our linear model object will have three methods, an init method where the model is fitted, a predict method to work with new data and a plot method to visualize the residuals’ distribution.
###Linear model fit_lm<function(x,y,intercept=TRUE,lambda=0) { ##Conversion to matrix if required if (!is.matrix(x)) { x=as.matrix(x) } if (!is.matrix(y)) { y=as.matrix(y) } #Add the intercept coefficient if (intercept) { x=cbind(x,1) } my_lm=list(intercept=intercept) ##Compute coefficients estimates my_lm[['coeffs']]=solve(t(x) %*% x) %*% t(x) %*% y ##Compute estimates for the train dataset my_lm[['preds']]=x %*% my_lm[['coeffs']] my_lm[['residuals']]=my_lm[['preds']]y my_lm[['mse']]=mean(my_lm[['residuals']]^2) attr(my_lm, "class") < "my_lm" return(my_lm) }
The fit function is simple, the first few lines transform the data to matrices and add an intercept if required. Then, the ‘my_lm’ object is created and the coefficients are computed. The solve() function is used to invert the matrix and %*% denotes matrix multiplication. A the end, the residuals and the estimates are computed and the class of the object is set as ‘my_lm’.
Now let’s implement the predict and plot methods for the my_lm class:
predict.my_lm<function(my_lm,x,..) { if (!is.matrix(x)) { x=as.matrix(x) } if (my_lm[["intercept"]]) { x=cbind(x,1) } x%*%my_lm[["coeffs"]] } plot.my_lm<function(my_lm,bins=30,..) { library(ggplot2) qplot(my_lm[["residuals"]], geom="histogram",bins=bins) + xlab('Residuals values') + ggtitle('Residual distribution') }
You can test the code on some preinstalled R dataset such as the car one. The code will give you the same coefficients estimates as the lm function. For instance, on the car dataset:
my_lm1=fit_lm(cars[,1],cars[,2]) vanilla_lm=lm(dist~speed,cars) print(vanilla_lm[['coefficients']]) print(my_lm1[['coeffs']])
Logistic regression
Previously, we worked on regression and the estimation of a continuous variable. Now, with logistic regression, we try to estimate a binary outcome (for instance, ill vs healthy, pass vs failed, …). Again, let’s deal with the maths first:
The mathematics of logistic regression
The goal is to estimate a binary outcome given the observations . We assume that follows a Bernoulli distribution of parameter . is called the sigmoid function.
Hence, we have .
We want to maximize the loglikelihood of the observed sample(over and hence over ):
This maximization will be done using Newton’s Method. Newton’s method is a variant of gradient descent in which we try to find the optimal curvature of the function to increase the speed of convergence. If you are not familiar with the Newton method, you can just see it as a variant of batch gradient descent.. The weights updates has the following form:
with: The Hessian
and the gradient
The algorithm in R will update the weights using this update until the termination criterion is met. Here, the termination criterion is met when the mean square error is below the userdefined tolerance.
Logistic regression in R
###Sigmoid function sigmoid=function(x) {1/(1+exp(x))} ###Fit logistic regression fit_logit=function(x,y,intercept=T,tol=10e5,max_it=100) { ##Type conversion if (!is.matrix(x)) { x=as.matrix(x) } if (!is.matrix(y)) { y=as.matrix(y) } ##Add intercept is required if (intercept) { x=cbind(x,1) } ##Algorithm initialization iterations=0 converged=F ##Weights are initialized to 1 coeffs=matrix(1,dim(x)[2]) ##Updates the weight until the max number of iter ##Or the termination criterion is met while (iterations<max_it& !converged) { iterations=iterations+1 nu<sigmoid(x %*% coeffs) old_pred=sigmoid(x %*% coeffs) nu_diag=diag(nu[,1]) ##Weights update coeffs=coeffs + solve(t(x) %*% nu_diag %*% x)%*% t(x) %*% (ynu) ##compute mse to check termination mse=mean((ysigmoid(x%*%coeffs))^2) ##Stop computation if tolerance is reached if (mse<tol) { converged=T } } ##Creates the logit objects my_logit=list(intercept=intercept) my_logit[['coeffs']]=coeffs my_logit[['preds']]=sigmoid(x%*%coeffs) my_logit[['residuals']]=my_logit[['preds']]y my_logit[['mse']]=mean(my_logit[['residuals']]^2) my_logit[['iteration']]=iterations attr(my_logit, "class") < "my_logit" return(my_logit) } ##Predict the outcome on new data predict.my_logit<function(my_logit,x,probs=T,..) { if (!is.matrix(x)) { x=as.matrix(x) } if (my_logit[['intercept']]) { x=cbind(x,1) } if (probs) { sigmoid(x %*% my_logit[['coeffs']]) } else { sigmoid(x %*% my_logit[['coeffs']])>0.5 } }
The code is split into two part:

The fit part, where the logistic model is fitted using Newton’s method.
This part has three main components. First, the data is put in the proper matrix format and, if required, the intercept is added. Then the algorithm parameters are updated (initially all the weights are set to one).
Finally, the algorithm updates the weights until the MSE goes below the tolerance. The most important line is probably the weights update one where the update formula is used to update the weights of the model.  The predict method, where the outcome is estimated using the weights computed previously.
We can now use our logistic regression to predict the class of a flower from the iris dataset:
fit_logit(iris[,1:4],iris[,5]=='setosa')
As expected, the algorithm can predict efficiently if a flower is a setosa or not.
If you like this post, follow us to learn how to create your Machine Learning library from scratch with R!
Thank you very much for this article! With it you really find out what is under the bonnet of the packages. Can I suggest that in the linear regression function line 24 the MSE should not be the mean of the squares of the residuals but the sum of the squares of the residuals devided by (n2) where n is the number of observations?
Jan Trommelmans
Yes, to have an unbiased estimator of the MSE, the line 24 should compute
.
[…] analysis (PCA) using only the linear algebra available in R. Previously, we managed to implement linear regression and logistic regression from scratch and next time we will deal with K nearest neighbors […]
[…] neighbors (KNN) using only the linear algebra available in R. Previously, we managed to implement PCA and next time we will deal with SVM and decision […]
I am an absolute beginner with R and Data Science, so please excuse me if this is a very basic question. But I wanted to know what do you mean by
`if(intercept)` on Line 24 in the fit_lm function, and why are you doing a x=cbind(x,1) in that if statement
Hello Rohit,
The line adds a constant to each of the observation. For instance, if you have only one variable: