Following the previous R Basics tutorial on linear regression, we will now deal with logistic regression with R!
The goal of logistic regression is to predict whether an outcome will be positive (aka 1) or negative (i.e: 0). Some real life example could be:
- Will Emmanuel Macron win the French Presidential election or will he lose?
- Does Mr.X has this illness or not?
- Will this visitor click on my link or not?
So, logistic regression can be used in a lot of binary classification cases and will often be run before more advanced methods. For this tutorial, we will use the diabetes detection dataset from Kaggle.
This dataset contains data from Pima Indians Women such as the number of pregnancies, the blood pressure, the skin thickness, … the goal of the tutorial is to be able to detect diabetes using only these measures.
The code of the tutorial can be found on this repository.
Exploring the data
As usual, first, let’s take a look at our data. You can download the data here then please put the file diabetes.csv in your working directory. With the summary function, we can easily summarise the different variables:
##Quick data summary
The mean of the outcome is 0.35 which shows an imbalance between the classes. However, the imbalance should not be too strong to be a problem.
To understand the relationship between variables, a Scatter Plot Matrix will be used. To plot it, the GGally package was used.
Scatter plot matrix
The correlations between explanatory variables do not seem too strong. Hence the model is not likely to suffer from multicollinearity. All explanatory variable are correlated with the Outcome. At first sight, glucose rate is the most important factor to detect the outcome.
##Scatter plot matrix
##’smooth’ adds linear fitted line to each scatter plot
ggpairs(DiabetesData,lower = list(continuous=’smooth’))
Logistic regression with R
After variable exploration, a first model can be fitted using the glm function. With stargazer, it is easy to get nice output in ASCII or even Latex.
The overall model is significant. As expected the glucose rate has the lowest p-value of all the variables. However, Age, Insulin and Skin Thickness are not good predictors of Diabetes.
A second model
Since some variables are not significant, removing them is a good way to improve model robustness. In the second model, SkinThickness, Insulin, and Age are removed.
The model AIC (Akaike information criteria) is lower with a value of 740 vs 741.445 in the first model. Hence we will choose the second model for further predictions.
Classification rate and confusion matrix
Now that we have our model, let’s access its performance.
###Correctly classified observations
Around 77.4% of all observations are correctly classified. Due to class imbalance, we need to go further with a confusion matrix.
###Confusion matrix count
RP=sum((glm2$fitted.values>=0.5)==DiabetesData$Outcome & DiabetesData$Outcome==1)
FP=sum((glm2$fitted.values>=0.5)!=DiabetesData$Outcome & DiabetesData$Outcome==0)
RN=sum((glm2$fitted.values>=0.5)==DiabetesData$Outcome & DiabetesData$Outcome==0)
FN=sum((glm2$fitted.values>=0.5)!=DiabetesData$Outcome & DiabetesData$Outcome==1)
confMat<-matrix(c(RP,FP,FN,RN),ncol = 2)
colnames(confMat)<-c("Pred Diabetes",’Pred no diabetes’)
rownames(confMat)<-c("Real Diabetes",’Real no diabetes’)
And here is the confusion matrix
The model is good to detect people who do not have diabetes. However, its performance on ill people is not great (only 154 out of 268 have been correctly classified).
You can also get the percentage of Real/False Positive/Negative:
###Confusion matrix proportion RPR=RP/sum(DiabetesData$Outcome==1)*100 FNR=FN/sum(DiabetesData$Outcome==1)*100 FPR=FP/sum(DiabetesData$Outcome==0)*100 RNR=RN/sum(DiabetesData$Outcome==0)*100 confMat<-matrix(c(RPR,FPR,FNR,RNR),ncol = 2) colnames(confMat)<-c("Pred Diabetes",'Pred no diabetes') rownames(confMat)<-c("Real Diabetes",'Real no diabetes') confMat
And here is the matrix, 57.5% of people with diabetes are correctly classified. A way to improve the false negative rate would lower the detection threshold. However, as a consequence, the false positive rate would increase.
Plots and decision boundaries
The two strongest predictors of the outcome are Glucose rate and BMI. High glucose rate and high BMI are strong indicators of Diabetes.
####Plot and decision boundaries
We can also plot both BMI and glucose against the outcomes, the other variables are taken at their mean level.
ggplot(DiabetesData,aes(x=BMI,y=Glucose,color=Outcome==(Predicted>0.5)))+geom_point(size=2,alpha=0.5) BMI_plot=data.frame(BMI=((min(DiabetesData$BMI-2)*100):(max(DiabetesData$BMI+2)*100))/100, Glucose=mean(DiabetesData$Glucose), Pregnancies=mean(DiabetesData$Pregnancies), BloodPressure=mean(DiabetesData$BloodPressure), DiabetesPedigreeFunction=mean(DiabetesData$DiabetesPedigreeFunction)) BMI_plot$Predicted=predict(glm2,BMI_plot,type = 'response') ggplot(BMI_plot,aes(x=BMI,y=Predicted))+geom_line() Glucose_plot=data.frame(Glucose=((min(DiabetesData$Glucose-2)*100):(max(DiabetesData$Glucose+2)*100))/100, BMI=mean(DiabetesData$BMI), Pregnancies=mean(DiabetesData$Pregnancies), BloodPressure=mean(DiabetesData$BloodPressure), DiabetesPedigreeFunction=mean(DiabetesData$DiabetesPedigreeFunction)) Glucose_plot$Predicted=predict(glm2,Glucose_plot,type = 'response') ggplot(Glucose_plot,aes(x=Glucose,y=Predicted))+geom_line()
Now you know most of what you need to know to do logistic regression with R! Feel free to comment and ask questions!