Logistic regression is one of the basics of data analysis and statistics. Its goal is to predict a binary outcome: will I sell my car or not? Is this bank transfer fraudulent? Is this patient ill or not?
All these outcomes can be encoded as 0 and 1: a fraudulent bank transfer could be encoded as 1, while a regular one would be encoded as 0.
As with linear regression, the input variables can be either categorical or continuous.
In this tutorial, we will create a logistic regression model to predict whether or not someone has diabetes.
The dataset that will be used is from Kaggle: Pima Indians Diabetes Database.
It has 9 variables: ‘Pregnancies’, ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, ‘BMI’, ‘DiabetesPedigreeFunction’, ‘Age’, ‘Outcome’.
Here is the variable description from Kaggle:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
All the input variables are numeric. The goal of the tutorial is to predict whether someone has diabetes (Outcome=1) from the other variables. It is worth noting that all the observations are from women at least 21 years old.
A quick look at the data
First, please download the data. Then, with pandas, we will read the CSV:
import pandas as pd
import numpy as np

Diabetes = pd.read_csv('diabetes.csv')
table1 = np.mean(Diabetes, axis=0)
table2 = np.std(Diabetes, axis=0)
To understand the data, let’s take a look at the means and standard deviations of the different variables.
The data are imbalanced: about 35% of the observations have diabetes. The standard deviations of the variables also differ widely, so the coefficients will need to be standardized before they can be compared across variables.
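As a rough sketch of how these two observations can be checked, the snippet below computes the share of positive outcomes and the standard deviation of each input column. The `class_share_and_spread` helper and the `toy` rows are hypothetical stand-ins, not the Kaggle file. With the tutorial's data you would pass `Diabetes` instead.

```python
import pandas as pd

def class_share_and_spread(df, outcome_col="Outcome"):
    """Return (share of rows with outcome 1, standard deviation of each input column)."""
    positive_share = df[outcome_col].mean()      # mean of a 0/1 column is the share of 1s
    spread = df.drop(columns=outcome_col).std()  # sample standard deviation per input
    return positive_share, spread

# Made-up stand-in rows (hypothetical values), only here so the sketch runs on its own.
toy = pd.DataFrame({
    "Glucose": [85, 168, 110, 140],
    "BMI": [26.6, 38.0, 30.1, 33.6],
    "Outcome": [0, 1, 0, 1],
})

positive_share, spread = class_share_and_spread(toy)
print(positive_share)
print(spread)
```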
The logistic regression
Now that we know the data, let’s do our logistic regression.
First, the input and output variables are selected:
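The selection code is not shown here. A minimal sketch, assuming the target is stored in the `Outcome` column and every other column is an input, could look like this. The `split_inputs_outcome` helper and the toy rows are hypothetical.

```python
import pandas as pd

def split_inputs_outcome(df, outcome_col="Outcome"):
    """Split a DataFrame into the predictor columns and the 0/1 outcome column."""
    inputs = df.drop(columns=outcome_col)
    outcome = df[outcome_col]
    return inputs, outcome

# Made-up stand-in rows (hypothetical values), only here so the sketch runs on its own.
toy = pd.DataFrame({"Glucose": [85, 168], "BMI": [26.6, 38.0], "Outcome": [0, 1]})
toy_inputs, toy_outcome = split_inputs_outcome(toy)
print(toy_inputs.columns.tolist(), toy_outcome.tolist())
```

With the tutorial's data this would read `inputData, outputData = split_inputs_outcome(Diabetes)`, which should give the same columns as `Diabetes.iloc[:, :8]` and `Diabetes.iloc[:, 8]`.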
Then, we create and fit a logistic regression model with scikit-learn LogisticRegression.
from sklearn.linear_model import LogisticRegression

logit1 = LogisticRegression()
logit1.fit(inputData, outputData)
The model's score method can quickly assess its performance.
Even though logistic regression is a simple model, around 78% of the observations are correctly classified (measured on the same data the model was trained on)!
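For a classifier, `score` returns the accuracy, that is, the share of rows whose predicted class matches the true class. The sketch below shows this on synthetic stand-in data from `make_classification` (an assumption made so the example runs without the CSV). The `max_iter=1000` setting is also my choice, not something the tutorial specifies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): replace X, y with inputData, outputData.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

accuracy = model.score(X, y)             # share of correctly classified rows
manual = np.mean(model.predict(X) == y)  # the same quantity computed by hand
print(accuracy, manual)
```

With the tutorial's objects, the equivalent call would be `logit1.score(inputData, outputData)`.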
Going deeper into model evaluation
Due to class imbalance, we need to check the model performance on each class. Not being able to classify people with diabetes would be a major problem since this is the goal of the model.
First, we will build a confusion matrix ‘by hand’.
## Observations with diabetes (Outcome == 1)
trueInput = Diabetes.loc[Diabetes['Outcome'] == 1].iloc[:, :8]
trueOutput = Diabetes.loc[Diabetes['Outcome'] == 1].iloc[:, 8]
## True positive rate
np.mean(logit1.predict(trueInput) == trueOutput)
## Returns around 55%

## Observations without diabetes (Outcome == 0)
falseInput = Diabetes.loc[Diabetes['Outcome'] == 0].iloc[:, :8]
falseOutput = Diabetes.loc[Diabetes['Outcome'] == 0].iloc[:, 8]
## True negative rate
np.mean(logit1.predict(falseInput) == falseOutput)
## Returns around 90%
Around 55% of people with diabetes would have been correctly classified. This rate could be improved by using a more complex model and by taking class imbalance into account.
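One possible way to take class imbalance into account, sketched here as an assumption rather than as the approach this tutorial uses, is scikit-learn's `class_weight="balanced"` option, which reweights observations inversely to their class frequency. The data below is synthetic (650 negatives and 350 positives on a single made-up feature), so the numbers it prints say nothing about the diabetes dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature, positives shifted to the right.
x_neg = rng.normal(loc=0.0, scale=1.0, size=650)
x_pos = rng.normal(loc=1.0, scale=1.0, size=350)
X = np.concatenate([x_neg, x_pos]).reshape(-1, 1)
y = np.concatenate([np.zeros(650, dtype=int), np.ones(350, dtype=int)])

plain = LogisticRegression().fit(X, y)
balanced = LogisticRegression(class_weight="balanced").fit(X, y)

# Recall of the positive class is the true positive rate.
recall_plain = recall_score(y, plain.predict(X))
recall_balanced = recall_score(y, balanced.predict(X))
print(recall_plain, recall_balanced)
```

Reweighting moves the decision threshold toward the minority class, so a higher true positive rate generally comes at the cost of a lower true negative rate.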
The scikit-learn library also has a confusion matrix function:
## Confusion matrix with sklearn
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

## True labels first, predictions second
confusion_matrix(outputData, logit1.predict(inputData))
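The per-class rates computed by hand earlier can also be read off the confusion matrix. In scikit-learn, `confusion_matrix` takes the true labels first and the predictions second, with rows for the true classes and columns for the predicted ones. The labels below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels (hypothetical), only here so the sketch runs on its own.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# For a binary problem, ravel() yields the four cells in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

true_positive_rate = tp / (tp + fn)  # share of actual positives that were found
true_negative_rate = tn / (tn + fp)  # share of actual negatives that were found
print(tn, fp, fn, tp)
print(true_positive_rate, true_negative_rate)
```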
AUC and ROC curve
Another metric used for classification is the AUC (area under the curve); you can find more details about it on Wikipedia. In a few words, the ROC curve compares the model's true positive and false positive rates with those of a random assignment. If the model's ROC curve is above the baseline, then the model is better than random assignment.
import matplotlib.pyplot as plt

## Computing false and true positive rates from the predicted probabilities
## (roc_curve expects the true labels first, then the scores)
fpr, tpr, _ = roc_curve(outputData, logit1.predict_proba(inputData)[:, 1], drop_intermediate=False)

plt.figure()
## Adding the ROC
plt.plot(fpr, tpr, color='red', lw=2, label='ROC curve')
## Random FPR and TPR
plt.plot([0, 1], [0, 1], color='blue', lw=2, linestyle='--')
## Title and label
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
The code should get you the following plot:
The AUC can be computed with:
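The original call is not shown here. A minimal sketch using scikit-learn's `roc_auc_score`, which takes the true labels and the predicted scores, is given below on four made-up observations. It also recomputes the same value by hand as the share of (positive, negative) pairs in which the positive observation gets the higher score.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and scores (hypothetical), only here so the sketch runs on its own.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of class 1

auc = roc_auc_score(y_true, y_score)

# By hand: share of (positive, negative) pairs ranked correctly, ties count half.
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = [(p > n) + 0.5 * (p == n) for p in pos for n in neg]
auc_by_hand = sum(pairs) / len(pairs)
print(auc, auc_by_hand)
```

With the tutorial's model, the call would presumably be `roc_auc_score(outputData, logit1.predict_proba(inputData)[:, 1])`.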
In addition to predicting the outcome, the model can be used to see which variables influence the probability of having diabetes.
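A fitted scikit-learn `LogisticRegression` stores its coefficients in `coef_` and its intercept in `intercept_`. The sketch below pairs each coefficient with a variable name. The `coefficient_table` helper, the column names and the synthetic data are all hypothetical stand-ins.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def coefficient_table(model, column_names):
    """Pair each fitted coefficient of a binary model with its variable name."""
    return pd.Series(model.coef_[0], index=column_names, name="coefficient")

# Synthetic stand-in data (assumption), only here so the sketch runs on its own.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
columns = ["x1", "x2", "x3", "x4"]
model = LogisticRegression(max_iter=1000).fit(X, y)

table = coefficient_table(model, columns)
print(table)
print(model.intercept_)
```

With the tutorial's objects this would be `coefficient_table(logit1, inputData.columns)`. A positive coefficient means the predicted probability of diabetes rises with that variable, holding the others fixed.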
Most of the variables increase the probability of having diabetes. However, it’s hard to tell which one is the “strongest” because the standard deviations of the variables are so different. To make such a comparison, we need to standardize the coefficients. The idea is to rescale each coefficient by the spread of its variable (its standard deviation).
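One common convention, assumed here because the original code is not shown, is to multiply each coefficient by the standard deviation of its variable, so that it measures the effect of a one-standard-deviation change. The synthetic example below uses two made-up features that matter equally once scale is accounted for, but whose raw coefficients differ by roughly a factor of ten.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data: same per-standard-deviation effect, different scales.
X = pd.DataFrame({
    "small_scale": rng.normal(0.0, 1.0, n),
    "large_scale": rng.normal(0.0, 10.0, n),
})
logit = 1.0 * X["small_scale"] + 0.1 * X["large_scale"]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

raw = pd.Series(model.coef_[0], index=X.columns)
standardized = raw * X.std()  # effect of a one-standard-deviation change
print(raw)
print(standardized)
```

With the tutorial's objects, the same idea would read `pd.Series(logit1.coef_[0], index=inputData.columns) * inputData.std()`.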
The standardized coefficients are different from the regular ones. Glucose, BMI and the number of pregnancies are the strongest positive predictors of diabetes. On the other hand, blood pressure is the strongest negative predictor of diabetes.
Now, let’s draw some scatter plots to see where the decision boundary is.
## Predicted probability of diabetes
plt.figure()
plt.scatter(inputData.iloc[:, 1], inputData.iloc[:, 5], c=logit1.predict_proba(inputData)[:, 1], alpha=0.4)
plt.xlabel('Glucose level')
plt.ylabel('BMI')
plt.show()

## True outcome
plt.figure()
plt.scatter(inputData.iloc[:, 1], inputData.iloc[:, 5], c=outputData, alpha=0.4)
plt.xlabel('Glucose level')
plt.ylabel('BMI')
plt.show()
The first plot shows the predicted probability of someone being diabetic, while the second one shows the true outcome. The model correctly learned that the probability of diabetes rises as glucose and BMI increase.
The complete code is available on Github here.
Thanks for reading!