Linear regression is one of the most basic statistical and machine learning methods. It should therefore be your first tool when you need to estimate a continuous quantitative variable. In this tutorial, we will work through a real case of linear regression in Python.

## Our goal: predicting used car prices

For this tutorial, you will need pandas, NumPy, scikit-learn, and matplotlib. We will use a data set from Kaggle that describes used cars listed for sale on German eBay. You can download the data set here.

The data set contains 20 variables, but we will only use a few of them:

- seller, the nature of the seller
- price, the auction price
- yearOfRegistration, the year in which the car was first registered
- gearbox, type of gearbox
- powerPS, the engine power in PS (metric horsepower)
- kilometer, number of kilometers
- fuelType, type of fuel (and if the vehicle is electric or not)
- notRepairedDamage, whether the vehicle has damage that has not been repaired

## Data preprocessing

Before doing any regression, we need to clean the data set. Some cars have a very high price and would strongly influence the regression results, so we will only keep the cars priced at 50,000 or less.
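To see why a few extreme prices matter, here is a minimal sketch on made-up numbers (they do not come from the car data set). It fits a straight line with `np.polyfit` twice, once on clean data and once after replacing a single observation with an extreme value, and compares the two slopes.

```python
import numpy as np

# Toy data lying exactly on the line y = 2x + 1
# (hypothetical values, not taken from autos.csv)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0

# Least-squares line on the clean data; polyfit returns [slope, intercept]
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Replace the last observation with one extreme value and fit again
y_outlier = y.copy()
y_outlier[-1] = 1000.0
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, deg=1)

print("slope without outlier:", slope_clean)
print("slope with one outlier:", slope_outlier)
```

A single extreme point out of ten is enough to move the fitted slope far away from 2, which is the effect the price filter is meant to limit.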

The following script reads the data, keeps the variables listed above, and selects the cars that cost 50,000 or less.

    # Import the packages
    import numpy as np
    import pandas as pd

    # Read the file; Latin-1 encoding is used to avoid some decoding errors,
    # and quoting=3 (csv.QUOTE_NONE) disables special handling of quote characters
    data = pd.read_csv('autos.csv', encoding='latin-1', quoting=3)

    # Select the variables
    data = data[['seller', 'price', 'yearOfRegistration', 'gearbox',
                 'powerPS', 'kilometer', 'fuelType', 'notRepairedDamage']]

    # Keep the cars that cost 50,000 or less
    # (.loc replaces the .ix indexer, which recent pandas versions no longer provide)
    data = data.loc[data['price'] <= 50000]

    # Create the lists of input and output variables
    inputVariables = list(data)
    del inputVariables[1]             # drop 'price' from the inputs
    outputVariables = list(data)[1]   # 'price' is the output
    inputData = data[inputVariables]

## Dummy coding of categorical variables

The regression can only use numerical variables as inputs, so the categorical variables need to be encoded as dummy variables.

Dummy coding creates one column per category, set to 1 if the observation belongs to that category and to 0 otherwise.
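As a small illustration, here is what dummy coding does to a single column. The three-row data frame `cars` is made up for this sketch and only mimics the gearbox variable.

```python
import pandas as pd

# Hypothetical mini data set; the values only mimic the gearbox column
cars = pd.DataFrame({"gearbox": ["manuell", "automatik", "manuell"]})

# One 0/1 column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(cars["gearbox"], dtype=int)
print(dummies)
```

Each row has exactly one 1, in the column that matches its original category.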

The code below selects all the string variables, dummy codes them with *get_dummies*, and joins the resulting columns to the data frame.

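    for column in inputData.columns:
        # String columns are stored with the object dtype
        if inputData[column].dtype == object:
            # prefix=column keeps the dummy column names unique across variables
            dummyCols = pd.get_dummies(inputData[column], prefix=column)
            inputData = inputData.join(dummyCols)
            del inputData[column]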

## Running the linear regression

The data can now be used by scikit-learn. We will simply use the LinearRegression class from that module.

    from sklearn.linear_model import LinearRegression

    model_1 = LinearRegression()
    model_1.fit(inputData, data[outputVariables])

## Exploring the results

Let’s explore the results of the previous regression. First, let’s print the table of coefficients:

    coefficients = pd.DataFrame({'name': list(inputData), 'value': model_1.coef_})
    print(coefficients)

And now let’s have a look at the R² and the MSE of the model:

    print("Mean squared error:",
          np.mean((model_1.predict(inputData) - data[outputVariables]) ** 2))
    print("R²:", model_1.score(inputData, data[outputVariables]))
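The figures above are computed on the same data the model was fitted on. As a complement, here is a minimal sketch of the same two metrics computed by hand on a held-out test set. Everything in it is an assumption for illustration: the data are synthetic, and the split uses scikit-learn's `train_test_split`, which the tutorial itself does not use.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y depends linearly on two inputs, plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Keep 25% of the rows aside; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# MSE: the average squared residual
mse = np.mean((pred - y_test) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
r2 = 1.0 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

# Cross-check against scikit-learn's own metric functions
print("MSE:", mse, "sklearn:", mean_squared_error(y_test, pred))
print("R²:", r2, "sklearn:", r2_score(y_test, pred))
```

Metrics measured on held-out rows usually give a more honest picture of how the model would behave on cars it has not seen.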

That’s it: you are now able to run a linear regression in Python.

Hi,

I tried to run the code that encodes the categorical variables as dummy variables on my data, which has 3 columns and 56,000 rows (with 2,500 different values). However, it raises a MemoryError in Jupyter. Is there any way to avoid this?

Hi,

Your one-hot data frame has something like 140,000,000 values (56,000 rows × 2,500 columns), which is a lot to handle if you don’t have enough memory.

You can try clearing your workspace before this line, or you can rewrite the code to use a SciPy sparse matrix (your one-hot encoded matrix will contain mostly 0s).
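The sparse-matrix route could look roughly like this. It is only a sketch under assumptions, not the tutorial's code: it swaps `pd.get_dummies` for scikit-learn's `OneHotEncoder`, which returns a SciPy sparse matrix by default, and the data frame `df` with its columns `a`, `b` and `c` is made up to resemble the sizes mentioned in the question.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 3 categorical columns, 56,000 rows, many distinct values
rng = np.random.default_rng(0)
n_rows = 56_000
df = pd.DataFrame({
    "a": rng.integers(0, 800, size=n_rows).astype(str),
    "b": rng.integers(0, 800, size=n_rows).astype(str),
    "c": rng.integers(0, 800, size=n_rows).astype(str),
})

# OneHotEncoder returns a SciPy sparse matrix by default,
# so only the 1s are stored, not the zeros
encoder = OneHotEncoder()
X_sparse = encoder.fit_transform(df)

dense_cells = X_sparse.shape[0] * X_sparse.shape[1]
print("shape:", X_sparse.shape)
print("cells in the dense equivalent:", dense_cells)
print("values actually stored:", X_sparse.nnz)
```

Only one value per row and per original column is stored, and scikit-learn's `LinearRegression` accepts a sparse matrix as its input, so the matrix does not need to be converted back to a dense array before fitting.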