Python basics: Linear regression

Linear regression

Linear regression is one of the most basic statistical and machine learning methods, so it should be your first tool when it comes to estimating a quantitative, continuous variable. In this tutorial, we will work through a real case of linear regression in Python.

Our goal: predicting used car prices.

For this tutorial, you will need the Pandas, NumPy, scikit-learn and matplotlib modules. We will use a data set from Kaggle with data on used cars that were for sale on the German eBay site. You can download the data set here.

The data set contains 20 variables, but we will only use a few of them:

  • seller, the nature of the seller
  • price, the auction price
  • yearOfRegistration, the year in which the car was first registered
  • gearbox, type of gearbox
  • powerPS, the engine power in PS (metric horsepower)
  • kilometer, number of kilometers
  • fuelType, type of fuel (and if the vehicle is electric or not)
  • notRepairedDamage, whether or not the vehicle has damage that has not been repaired yet

Data preprocessing

Before running any regression, we need to clean the data set. Some cars have a very high price and would strongly influence the regression results, so we will only keep the cars with a price of less than 50,000.
The following script reads the data, selects the cars which cost less than 50,000, and builds the lists of input and output variables.

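## Import the packages
import numpy as np
import pandas as pd

## Reading the file; Latin-1 encoding is used to avoid decoding errors
## NOTE: 'autos.csv' is an assumed file name, use the path of your own download
data = pd.read_csv('autos.csv', encoding='latin-1')
## Selecting the variables
data = data[['seller', 'price', 'yearOfRegistration', 'gearbox', 'powerPS',
             'kilometer', 'fuelType', 'notRepairedDamage']]
## Selecting cars which cost less than 50,000
data = data[data['price'] < 50000]

## Creating the lists of input and output variables,
## then removing the price (index 1) from the inputs
inputVariables = list(data.columns)
outputVariables = inputVariables[1]
del inputVariables[1]
inputData = data[inputVariables].copy()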


Dummy coding of categorical variables.

The regression can only use numerical variables as inputs. Because of this, the categorical variables need to be encoded as dummy variables.
Dummy coding turns each category into its own 0/1 variable: 1 if the observation belongs to the group, 0 if it does not.

Basically, the code below selects all the variables that are strings, dummy codes them with get_dummies and then joins the result to the data frame.

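for column in inputData.columns:
    if inputData[column].dtype == object:
        ## Dummy code the string column and join the result to the data frame
        dummyCols = pd.get_dummies(inputData[column], prefix=column)
        inputData = inputData.join(dummyCols)
        del inputData[column]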
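
To see what dummy coding actually produces, here is a minimal sketch on a toy data frame. The values are made up for illustration and are not taken from the Kaggle file; the `prefix` and `dtype` arguments are optional choices, not something the tutorial requires.

```python
import pandas as pd

# Toy data frame (hypothetical values, not taken from the Kaggle file)
toy = pd.DataFrame({
    "kilometer": [150000, 90000, 50000],
    "gearbox": ["manuell", "automatik", "manuell"],
})

# One 0/1 column per category; dtype=int gives integers instead of booleans
dummies = pd.get_dummies(toy["gearbox"], prefix="gearbox", dtype=int)

# Join the dummy columns and drop the original string column
encoded = toy.join(dummies).drop(columns="gearbox")
print(encoded)
```

Each row has exactly one 1 among the gearbox columns, which is the "does or does not belong to the group" encoding described above.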

Running the linear regression

The data can now be used by scikit-learn. We will simply use the LinearRegression class from its linear_model module.

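from sklearn.linear_model import LinearRegression
## Fitting the model on the input variables, with the price as output
model_1 = LinearRegression().fit(inputData, data[outputVariables])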

Exploring the results

Let’s explore the results of the previous regression. First, let’s print the table of coefficients:
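
One possible way to build such a table is sketched below. It is an illustration on synthetic, noise-free data (the numbers are hypothetical), so that the snippet runs on its own; with the objects of this tutorial, `X` would presumably be `inputData` and `model` would be `model_1`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the input data and the price (hypothetical, noise-free)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "kilometer": rng.uniform(5000, 150000, size=50),
    "powerPS": rng.uniform(50, 200, size=50),
})
y = 20000 - 0.05 * X["kilometer"] + 30 * X["powerPS"]

model = LinearRegression().fit(X, y)

# One row per input variable; the intercept is reported separately
coef_table = pd.DataFrame({"variable": X.columns, "coefficient": model.coef_})
print(coef_table)
print("Intercept:", model.intercept_)
```

Each coefficient reads as the expected change in price for a one-unit increase in that variable, with the other variables held fixed.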



Coefficients table

And now let’s have a look at the R² and the MSE of the model:

print("Mean squared error:",
      np.mean((model_1.predict(inputData) - data[outputVariables]) ** 2))
print("R²:", model_1.score(inputData, data[outputVariables]))
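
The same two quantities can also be obtained from `sklearn.metrics`. The sketch below uses synthetic data (hypothetical coefficients and noise level) so that it runs on its own, and checks that the manual formulas agree with the library functions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (hypothetical): a linear signal plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# MSE: mean of the squared residuals
mse_manual = np.mean((pred - y) ** 2)
mse_sklearn = mean_squared_error(y, pred)

# R²: 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

print("Mean squared error:", mse_sklearn)
print("R²:", r2_score(y, pred))
```

Note that, as in the tutorial, both values are computed on the same data that was used to fit the model, so they describe the quality of the fit rather than the prediction error on new cars.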

There you have it: you are now able to run a linear regression in Python.


  1. Hi,
    I tried to run the code that encodes the categorical variables as dummy variables on my data, which has 3 columns and 56,000 rows (with 2,500 different values). However, it produces a MemoryError in Jupyter. Is there any way to avoid this?

    • Hi,
      Your one-hot data frame has something like 140,000,000 values (56,000 rows × 2,500 columns), which is a lot to handle if you don’t have enough memory.
      You may try to clean your workspace before this line, or you can rewrite the code to use a SciPy sparse matrix (your one-hot encoded matrix will mostly contain 0s).
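
      A minimal sketch of the sparse approach, assuming scikit-learn's OneHotEncoder is acceptable in place of get_dummies. The data below is randomly generated to mimic the sizes mentioned in the question (a single column, for illustration), and the column name is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one string column with many distinct values
rng = np.random.default_rng(0)
df = pd.DataFrame({"category": rng.integers(0, 2500, size=56000).astype(str)})

# OneHotEncoder returns a SciPy sparse matrix by default,
# so only the non-zero entries (one per row here) are stored
encoder = OneHotEncoder()
onehot = encoder.fit_transform(df[["category"]])

print(onehot.shape)
print(sparse.issparse(onehot), onehot.nnz)
```

      LinearRegression accepts a sparse matrix as input, so the result can be passed to fit without converting it to a dense array.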

