Linear regression is one of the most basic statistical and machine learning methods. It should therefore be your first tool when it comes to estimating a continuous quantitative variable. In this tutorial, we will work through a real case of linear regression in Python.
Our goal: predicting used car prices.
For this tutorial, you will need pandas, NumPy, scikit-learn and matplotlib. We will use a data set from Kaggle containing used cars that were listed for sale on the German eBay site. You can download the data set here.
The data set contains 20 variables, but we will only use a few of them:
- seller, the nature of the seller
- price, the auction price
- yearOfRegistration, the year in which the car was first registered
- gearbox, the type of gearbox
- powerPS, the engine power in PS (metric horsepower)
- kilometer, number of kilometers
- fuelType, type of fuel (and if the vehicle is electric or not)
- notRepairedDamage, whether the vehicle has damage that has not been repaired
Before running any regression, we need to clean the data set. Some cars have a very high price and would strongly influence the regression results, so we will only keep the cars priced at 50,000 or less.
The following script reads the data, selects the variables we need, and keeps the cars that cost 50,000 or less.
## Import the packages
import numpy as np
import pandas as pd

## Read the file; latin encoding is used to avoid some errors
data = pd.read_csv('autos.csv', encoding='latin', quoting=3)

## Select the variables
data = data[['seller', 'price', 'yearOfRegistration', 'gearbox', 'powerPS', 'kilometer', 'fuelType', 'notRepairedDamage']]

## Keep the cars that cost 50,000 or less (.loc replaces the removed .ix indexer)
data = data.loc[data['price'] <= 50000]

## Create the lists of input and output variables
inputVariables = list(data)
del inputVariables[1]            # drop 'price', the second column
outputVariables = list(data)[1]  # 'price'
inputData = data[inputVariables]
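To illustrate why very expensive cars are filtered out, here is a minimal sketch on made-up data (not the Kaggle data set). The ages, prices and the single collector's car are all assumptions chosen for illustration; the point is only that one extreme price can change the fitted slope completely.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: the price falls by 1,000 per year of age.
age = np.arange(1, 11).reshape(-1, 1)   # cars aged 1 to 10 years
price = 20000 - 1000 * age.ravel()      # exact linear relation

slope_clean = LinearRegression().fit(age, price).coef_[0]

# Add one very expensive collector's car: 30 years old, priced at 500,000.
age_out = np.vstack([age, [[30]]])
price_out = np.append(price, 500000)

slope_outlier = LinearRegression().fit(age_out, price_out).coef_[0]

print("slope without the outlier:", slope_clean)
print("slope with the outlier:   ", slope_outlier)
```

Without the extra car the slope is the expected -1,000 per year; with it, the slope becomes positive, which would wrongly suggest that cars gain value as they age.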
Dummy coding of categorical variables.
The regression can only use numerical variables as its input data. Because of this, the categorical variables need to be encoded as dummy variables.
Dummy coding turns each category into its own 0/1 column: 1 if the observation belongs to that category, 0 if it does not.
The code below selects all the variables that are strings, dummy-codes them with get_dummies, and joins the resulting columns to the data frame.
for column in inputData.columns:
    # String columns have the object dtype
    if inputData[column].dtype == object:
        # One 0/1 column per category
        dummyCols = pd.get_dummies(inputData[column])
        inputData = inputData.join(dummyCols)
        del inputData[column]
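To make the effect of dummy coding concrete, here is a small sketch on a hypothetical three-car sample. The data frame `cars` and its values are made up for illustration; only `pd.get_dummies`, `drop` and `join` are real pandas calls.

```python
import pandas as pd

# Hypothetical three-car sample; the values are illustrative only.
cars = pd.DataFrame({
    "kilometer": [150000, 90000, 125000],
    "gearbox": ["manuell", "automatik", "manuell"],
})

# One 0/1 column per category. dtype=int keeps the output as integers
# (recent pandas versions return booleans by default).
dummies = pd.get_dummies(cars["gearbox"], dtype=int)
encoded = cars.drop(columns="gearbox").join(dummies)

print(encoded)
```

The gearbox column is replaced by two columns, automatik and manuell, and each row has a 1 in exactly one of them.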
Running the linear regression
The data can now be used by scikit-learn. We will simply use the LinearRegression class from its linear_model module.
from sklearn.linear_model import LinearRegression

model_1 = LinearRegression()
model_1.fit(inputData, data[outputVariables])
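As a quick sanity check of what fit and predict do, the sketch below fits LinearRegression on synthetic data generated from a known formula. The formula, the value ranges and the name `toy_model` are assumptions for illustration and have nothing to do with the real car prices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: price = 30000 - 0.1 * kilometer + 50 * powerPS
rng = np.random.default_rng(0)
kilometer = rng.uniform(5000, 150000, size=200)
power = rng.uniform(50, 300, size=200)
X = np.column_stack([kilometer, power])
y = 30000 - 0.1 * kilometer + 50 * power

toy_model = LinearRegression().fit(X, y)

# Predict the price of a car with 100,000 km and 120 PS.
predicted = toy_model.predict([[100000, 120]])[0]
print(predicted)
```

Because the synthetic prices contain no noise, the prediction matches the formula: 30000 - 10000 + 6000 = 26000, up to floating-point error.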
Exploring the results
Let’s explore the results of the previous regression. First, let’s print the table of coefficients.
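One possible way to build such a table, shown here as a sketch rather than the original code, is to pair the column names with the coef_ attribute of the fitted model. The data frame `toy_inputs` and the prices are hypothetical stand-ins for inputData and the price column. The sketch assumes the model was fit on a single target column, so that coef_ is one-dimensional.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for inputData and the price column.
toy_inputs = pd.DataFrame({
    "kilometer": [150000, 90000, 125000, 50000, 20000],
    "powerPS": [75, 110, 90, 150, 200],
})
toy_price = 5000 - 0.02 * toy_inputs["kilometer"] + 40 * toy_inputs["powerPS"]

toy_model = LinearRegression().fit(toy_inputs, toy_price)

# Pair each column name with its estimated coefficient.
coefficients = pd.Series(toy_model.coef_, index=toy_inputs.columns,
                         name="coefficient")

print(coefficients)
print("intercept:", toy_model.intercept_)
```

With the real data, the same pattern would use model_1.coef_ and inputData.columns; each coefficient is the estimated change in price for a one-unit increase in that variable, the others being held fixed.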
And now let’s have a look at the R² and the MSE of the model:
print("Mean squared error:",
      np.mean((model_1.predict(inputData) - data[outputVariables]) ** 2))
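Both metrics can also be obtained from scikit-learn directly. The sketch below uses noisy toy data (an assumption, not the car data) and shows that mean_squared_error matches the manual formula and that the score method of the model returns the R². As in the tutorial, both are computed on the data used for fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical noisy toy data (not the Kaggle data).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1, size=100)

toy_model = LinearRegression().fit(X, y)
predictions = toy_model.predict(X)

mse = mean_squared_error(y, predictions)
r2 = r2_score(y, predictions)

print("Mean squared error:", mse)
print("R²:", r2)
# For a regressor, score() returns the same R² as r2_score.
print("R² via score():", toy_model.score(X, y))
```

With the real model, model_1.score(inputData, data[outputVariables]) would give the R² in one call.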
And there you have it: you can now run a linear regression in Python.