Python Basics: Logistic regression with Python


Logistic regression is one of the basic tools of data analysis and statistics. Its goal is to predict a binary outcome: Will I sell my car or not? Is this bank transfer fraudulent? Is this patient ill or not?

All these outcomes can be encoded as 0 or 1: a fraudulent bank transfer could be encoded as 1, while a regular one would be encoded as 0.
As with linear regression, the input variables can be either categorical or continuous.

In this tutorial, we will create a logistic regression model to predict whether or not someone has diabetes.

The dataset

The dataset that will be used is from Kaggle: Pima Indians Diabetes Database.
It has 9 variables: ‘Pregnancies’, ‘Glucose’, ‘BloodPressure’, ‘SkinThickness’, ‘Insulin’, ‘BMI’, ‘DiabetesPedigreeFunction’, ‘Age’, ‘Outcome’.

Here is the variable description from Kaggle:

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction: Diabetes pedigree function

Age: Age (years)

Outcome: Class variable (0 or 1)

All the input variables are numeric. The goal of the tutorial is to predict whether someone has diabetes (Outcome=1) from the other variables. It is worth noting that all the observations are from women at least 21 years old.

A quick look at the data

First, please download the data. Then, with pandas, we will read the CSV:

import pandas as pd
import numpy as np
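
The read itself is a single pandas call. Here is a minimal sketch, assuming the Kaggle file was saved as diabetes.csv in the working directory; the helper name load_diabetes and the variable name diabetesData are hypothetical, chosen only for illustration:

```python
import pandas as pd

# Column names as listed in the dataset description above
EXPECTED_COLUMNS = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                    'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

def load_diabetes(path='diabetes.csv'):
    """Read the CSV and check that the nine expected columns are present."""
    data = pd.read_csv(path)
    missing = [col for col in EXPECTED_COLUMNS if col not in data.columns]
    if missing:
        raise ValueError(f'Missing columns: {missing}')
    return data

# Usage (assumes the file name above):
# diabetesData = load_diabetes('diabetes.csv')
# diabetesData.head()
```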

To understand the data, let’s take a look at the means and standard deviations of the different variables.
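
One way to get these two statistics is DataFrame.agg (describe() would return them too, along with quartiles). This is a sketch; diabetesData is an assumed name for the DataFrame read from the CSV:

```python
import pandas as pd

def summarize(data):
    """Return the mean and standard deviation of every column, one row per variable."""
    return data.agg(['mean', 'std']).T

# Usage:
# print(summarize(diabetesData))
```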

The data are imbalanced, with 35% of observations having diabetes. The standard deviations of the variables also differ widely, so the coefficients will need to be standardized before they can be compared across variables.
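
The class balance can be checked directly from the Outcome column. A small sketch, where positive_share is a hypothetical helper and diabetesData an assumed DataFrame name:

```python
import pandas as pd

def positive_share(data, target='Outcome'):
    """Share of observations whose target equals 1."""
    return float((data[target] == 1).mean())

# Usage:
# positive_share(diabetesData)
```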

The logistic regression

Now that we know the data, let’s do our logistic regression.
First, the input and output variables are selected:
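
A sketch of that selection, assuming every column except Outcome is used as an input. The names inputData and outputData match the ones used in the ROC code further down; split_inputs_output and diabetesData are hypothetical:

```python
import pandas as pd

def split_inputs_output(data, target='Outcome'):
    """Separate the predictor columns from the target column."""
    inputData = data.drop(columns=[target])
    outputData = data[target]
    return inputData, outputData

# Usage:
# inputData, outputData = split_inputs_output(diabetesData)
```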


Then, we create and fit a logistic regression model with scikit-learn LogisticRegression.

from sklearn.linear_model import LogisticRegression
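
Creating and fitting the model might look like the sketch below. The name logit1 matches the one used later in the ROC code. Raising max_iter to 1000 is an assumption on my part, since the default solver may stop before converging on unscaled inputs. Note also that scikit-learn applies L2 regularization by default:

```python
from sklearn.linear_model import LogisticRegression

def fit_logit(inputData, outputData, max_iter=1000):
    """Create and fit a logistic regression on the given inputs and outcome."""
    model = LogisticRegression(max_iter=max_iter)  # max_iter is an assumed setting
    model.fit(inputData, outputData)
    return model

# Usage:
# logit1 = fit_logit(inputData, outputData)
```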

The score function of sklearn can quickly assess the model performance.
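
For a classifier, score returns the mean accuracy. The sketch below spells out the same computation by hand (accuracy is a hypothetical helper). Keep in mind that the figure is measured on the data the model was fitted on, so it is likely optimistic:

```python
import numpy as np

def accuracy(model, inputData, outputData):
    """Share of observations whose predicted class equals the true class."""
    predictions = model.predict(inputData)
    return float(np.mean(predictions == np.asarray(outputData)))

# Usage:
# logit1.score(inputData, outputData)       # built-in
# accuracy(logit1, inputData, outputData)   # same number, spelled out
```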


Even though logistic regression is a simple model, around 78% of the observations are correctly classified!

Going deeper into model evaluation

Due to class imbalance, we need to check the model performance on each class. Not being able to classify people with diabetes would be a major problem since this is the goal of the model.

First, we will build a confusion matrix ‘by hand’.

##True positive
##True positive rate
##Return around 55%

##True negative
##True negative rate
##Return around 90%
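
The commented steps above can be spelled out with NumPy. This is a sketch assuming 0/1 labels; the function name class_rates is hypothetical:

```python
import numpy as np

def class_rates(y_true, y_pred):
    """Return (true positive rate, true negative rate) from 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Observations correctly predicted in each class
    true_pos = np.sum((y_true == 1) & (y_pred == 1))
    true_neg = np.sum((y_true == 0) & (y_pred == 0))
    # Divide by the number of observations that truly belong to each class
    tpr = true_pos / np.sum(y_true == 1)
    tnr = true_neg / np.sum(y_true == 0)
    return float(tpr), float(tnr)

# Usage:
# tpr, tnr = class_rates(outputData, logit1.predict(inputData))
```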

Around 55% of people with diabetes would have been correctly classified. This rate could be improved by using a more complex model and by taking class imbalance into account.

The scikit-learn library also has a confusion matrix function:

###Confusion matrix with sklearn
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score
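
A sketch of how confusion_matrix can be used here. With labels=[0, 1], rows are the true classes and columns the predicted ones, so ravel() yields the counts in the order tn, fp, fn, tp. The wrapper name confusion_counts is hypothetical:

```python
from sklearn.metrics import confusion_matrix

def confusion_counts(y_true, y_pred):
    """Unpack the 2x2 confusion matrix into (tn, fp, fn, tp)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(tn), int(fp), int(fn), int(tp)

# Usage:
# tn, fp, fn, tp = confusion_counts(outputData, logit1.predict(inputData))
# tp / (tp + fn)   # true positive rate
# tn / (tn + fp)   # true negative rate
```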

AUC and ROC curve

Another metric used for classification is the AUC (area under the ROC curve); you can find more details on Wikipedia. In a few words, the ROC curve compares the model’s true positive and false positive rates to those of a random assignment. If the model’s ROC curve is above the baseline, then the model is better than random assignment.

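##Computing false and true positive rates
##roc_curve expects the true labels first, then a score (here the predicted probability of class 1)
fpr, tpr, _ = roc_curve(outputData, logit1.predict_proba(inputData)[:, 1], drop_intermediate=False)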

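import matplotlib.pyplot as plt
##Adding the ROC
plt.plot(fpr, tpr, color='red', lw=2, label='ROC curve')
##Random FPR and TPR
plt.plot([0, 1], [0, 1], color='blue', lw=2, linestyle='--', label='Random assignment')
##Title and label
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend(loc='lower right')
plt.show()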

The code should get you the following plot:

Figure: The ROC curve of the model

The AUC can be computed with:
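
A sketch using roc_auc_score, which takes the true labels first and a score second. Here the score is the predicted probability of class 1; model_auc is a hypothetical wrapper:

```python
from sklearn.metrics import roc_auc_score

def model_auc(model, inputData, outputData):
    """AUC computed from the predicted probability of class 1."""
    scores = model.predict_proba(inputData)[:, 1]
    return float(roc_auc_score(outputData, scores))

# Usage:
# model_auc(logit1, inputData, outputData)
```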


Model coefficients

In addition to predicting the outcome, the model can be used to see which variables influence the probability of having diabetes.
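
The fitted coefficients live in the model’s coef_ attribute, one row per class boundary (a single row for a binary outcome). A sketch that pairs them with the variable names; coefficient_table is a hypothetical helper:

```python
import pandas as pd

def coefficient_table(model, feature_names):
    """Pair each fitted coefficient with its variable name, sorted by value."""
    return pd.Series(model.coef_[0], index=list(feature_names)).sort_values()

# Usage:
# coefficient_table(logit1, inputData.columns).plot.barh()
```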

Figure: Coefficients of the logistic regression

Most of the variables increase the probability of having diabetes. However, it’s hard to tell which one is the “strongest” because the standard deviations of the variables are so different. To make such a comparison, we need to standardize the coefficients. The idea is to scale each coefficient by the spread of its variable, so that every coefficient describes the effect of a one-standard-deviation change.
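
One common convention, assumed here because the exact formula used for the figure is not shown, is to multiply each coefficient by the standard deviation of its variable. The helper name standardized_coefficients is hypothetical:

```python
import pandas as pd

def standardized_coefficients(model, inputData):
    """Multiply each coefficient by the standard deviation of its variable."""
    coefs = pd.Series(model.coef_[0], index=inputData.columns)
    # Each result reads as the change in log-odds per one standard deviation
    return (coefs * inputData.std()).sort_values()

# Usage:
# standardized_coefficients(logit1, inputData).plot.barh()
```

Fitting the model on inputs that were standardized beforehand would be an alternative, although with scikit-learn’s default regularization the two approaches need not give identical numbers.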

Figure: Standardised coefficients

The standardized coefficients are different from the regular ones. Glucose, BMI and the number of pregnancies are the strongest positive predictors of diabetes. On the other hand, blood pressure is the strongest negative predictor of diabetes.

Visualising boundaries

Now, let’s draw some scatter plots to see where the decision boundaries are.

##First plot: predicted probabilities
plt.xlabel('Glucose level')
plt.ylabel('BMI')

##Second plot: true outcome
plt.xlabel('Glucose level')
plt.ylabel('BMI')


The first plot shows the predicted probability of someone being diabetic, while the other one shows the true outcome. The model correctly learned that the probability of diabetes rises as glucose and BMI increase.
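
The two scatter plots described above could be drawn along these lines. This is a sketch rather than the code behind the original figures; the function name, colour map and marker size are assumptions:

```python
import matplotlib.pyplot as plt

def plot_glucose_bmi(inputData, outputData, probabilities):
    """Two Glucose vs. BMI scatter plots: predicted probability, then true outcome."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 5), sharex=True, sharey=True)
    panels = [(probabilities, 'Predicted probability of diabetes'),
              (outputData, 'True outcome')]
    for ax, (colors, title) in zip(axes, panels):
        # Same colour scale (0 to 1) on both panels so they can be compared
        points = ax.scatter(inputData['Glucose'], inputData['BMI'],
                            c=colors, cmap='coolwarm', vmin=0, vmax=1, s=15)
        ax.set_xlabel('Glucose level')
        ax.set_ylabel('BMI')
        ax.set_title(title)
        fig.colorbar(points, ax=ax)
    return fig

# Usage:
# plot_glucose_bmi(inputData, outputData, logit1.predict_proba(inputData)[:, 1])
# plt.show()
```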

The complete code is available on GitHub here.

Thanks for reading!



