Machine Learning Explained: Overfitting

2

Welcome to this new post of Machine Learning Explained.After dealing with bagging, today, we will deal with overfitting. Overfitting is the devil of Machine Learning and Data Science and has to be avoided in all of your models.

What is overfitting?

A good model is able to learn the pattern from your training data and then to generalize it on new data (from a similar distribution). Overfitting is when a model is able to fit almost perfectly your training data but is performing poorly on new data. A model will overfit when it is learning the very specific pattern and noise from the training data, this model is not able to extract the “big picture” nor the general pattern from your data. Hence, on new and different data the performance of the overfitted model will be poor.

Overfitting and polynomial regression

Well, let’s see this through an example! We want a model to estimate the relationship between the wind speed and the quantity of ozone in the air.

The relationship does not seem linear hence, using polynomial regression may give some good results. Five polynomial regressions were fitted to the data, respectively with 1,3, 5,10 and 20 degrees. The models were trained on 70% of the data.

As we could expect, the more degree you add to the polynomial, the better the fit to the training data is. However, we can see that high order polynomial (5, 10, 20) tends to learn the patterns from some outliers and are not robust. To confirm this, let’s compute the error (Mean standard error) on the training and testing set.

The red line shows the evolution of the error on the testing set and the black line on the training set. As soon as more than 9 or 10 degrees are used the MSE seem to start growing and explodes when there are even more degrees. For the sake of visibility let’s plot this for models with 1 to 8 degrees.

As we can see, though the train error keep decreasing, the test error is not affected much by the complexity of the model. Here, the simpler models are the best choices.

And why is overfitting happening?

Overfitting happens when your model has too much freedom to fit the data. Then, it is easy for the model to fit the training data perfectly (and to minimize the loss function). Hence, more complex models are more likely to overfit:

  • For instance, a linear regression with a reasonable number of the variable will never overfit the data. The model is simple and restricted to linear relationships between the variables.
  • On the other hand, random forest or neural net can easily overfit. They have a lot of parameters which they can minimize the loss function on.

My advice would be, the more complex the model, the more careful you need to be.

How to detect and avoid overfitting?

To detect overfitting you need to see how the test error evolve. As long as the test error is decreasing, the model is still right. On the other hand, an increase in the test error indicates that you are probably overfitting.

As said before, overfitting is caused by a model having too much freedom. Hence most of the solutions to avoid overfitting add mor constraints to the model:

  • Lasso and ridge regularisation add a penalty for the parameters being to big or too numerous
  • Cross-validation assess the model performance on an independant data set
  • Early stopping stops the model when test error starts growing
  • And also: dropout, adding noise to input, …

R Code to replicate the plot

###Overfitting

require(data.table)
library(rpart)
require(ggplot2)

set.seed(456)

##Reading data
overfitting_data=data.table(airquality)
ggplot(overfitting_data,aes(Wind,Ozone))+geom_point()+ggtitle("Ozone vs wind speed")
data_test=na.omit(overfitting_data[,.(Wind,Ozone)])
train_sample=sample(1:nrow(data_test),size = 0.7*nrow(data_test))

###creation of polynomial models
degree_of_poly=1:20
degree_to_plot=c(1,3,5,10,20)
polynomial_model=list()
df_result=NULL
for (degree in degree_of_poly)
{
 fm=as.formula(paste0("Ozone~poly(Wind,",degree,",raw=T)"))
 polynomial_model=c(polynomial_model,list(lm(fm,data_test[train_sample])))
 Polynomial_degree=paste0(degree)
 data_fitted=tail(polynomial_model,1)[[1]]$fitted.values
 new_df=data.table(Wind=data_test[train_sample,Wind],Ozone_real=data_test[train_sample,Ozone],Ozone_fitted=tail(polynomial_model,1)[[1]]$fitted.values,degree=as.factor(degree))
 if (is.null(df_result))
 df_result=new_df
 else
 df_result=rbind(df_result,new_df)
}
gg=ggplot(df_result[degree%in%degree_to_plot],aes(x=Wind))+geom_point(aes(y=Ozone_real))+geom_line(aes(color=degree,y=Ozone_fitted))
gg+ggtitle('Ozone vs wind for several polynomial regressions')+ylab('Ozone')

###Computing SE
SE_train_list=c()
SE_test_list=c()

for (poly_mod in polynomial_model)
{
 print(summary(poly_mod))
 SE_train_list=c(SE_train_list,sqrt(mean(poly_mod$residuals^2)))
 SE_test=sqrt(mean((data_test[-train_sample]-predict(poly_mod,data_test[-train_sample,]))^2))
 SE_test_list=c(SE_test_list,SE_test)
}

data_plot=data.table(SE_test_list,SE_train_list,degree_of_poly)
ggplot(data_plot[degree_of_poly<=8])+geom_line(aes(x=degree_of_poly,y=SE_test_list),color='red')+geom_line(aes(x=degree_of_poly,y=SE_train_list))+ylab('MSE')+xlab('Degrees of polynomial')

Thanks for reading! You can stay in touch by following us on Twitter :

2 COMMENTS

LEAVE A REPLY

Please enter your comment!
Please enter your name here