Understand and Implement Linear Regression Using Scikit-Learn

Introduction

Regression and Classification are the two main types of supervised machine learning algorithms. The main difference between them is that regression predicts continuous output values, while classification predicts discrete output values. You can think of predicting whether a person has diabetes or not as a classification problem and predicting the oil price as a regression problem. Regression is useful when you want to estimate a continuous output value using a set of predictors (inputs). Another example of a regression problem is predicting the water consumption of a household given the weather conditions, time of day, and number of residents in that household. In this article, we will learn the theory behind linear regression and how we can implement it using Python's Scikit-Learn library.

Understanding Linear Regression

First of all, we need to understand the meaning of linearity and regression. A “linear” relationship means that the relation can be graphically represented as a straight line. Now what does regression mean? Regression is a statistical technique for estimating the relationships among variables. Let us make this clear by considering an example.

Assume that there is a clothing company that spends a significant amount of money on advertising its brand. Its sales and advertising data are given below.

Advertising    Sales
14.1           220
16.3           330
18             420
15.2           325
17.4           410

Our aim is to determine the linear relationship between the amount spent on advertising and the sales. With this information, we want to find out: given the amount spent on advertising, what sales can we expect?

When we plot this data in a two-dimensional space, taking the advertisement spend (independent variable) on the x-axis and sales (dependent variable) on the y-axis, simple linear regression gives us a best-fitting line that generalizes/represents the entire data, as shown below.

Linear regression Graph

The general equation of a straight line is y = mx + b

where b is the intercept of the line and m is the slope. We can draw several lines across the data by changing the values of b and m, but not all lines represent the entire data well. Note that the y and x variables remain the same, since they are the data features and cannot be changed. The values that we can control are the intercept and the slope. Out of all these possible lines, linear regression chooses the optimal values for the intercept (b) and slope (m), and thereby we get the best-fitting line to represent our data.

As shown in the graph, it is generally impossible to draw a straight line that connects all the data points with zero error. All we can do is identify a line that passes through the data points and bring it as close to them as possible, so there will inevitably be errors. In simple terms, the error is the aggregate of the differences between the line and the data points. We will discuss the error in more detail in the coming sections. The optimal line is the one that generalizes our data well; in other words, the line with the least error.

As mentioned earlier, linear regression finds the optimal values for the slope (m) and intercept (b). Let us assume that in our scenario linear regression chooses the slope m = 10 and the intercept b = 5.

So our equation of the selected straight line, Sales = m * Advertisement + b, becomes Sales = 10 * Advertisement + 5.

Now our simple linear regression model is ready. With this equation, if we know the amount planned for advertisement, we can predict the future sales. This is the core concept of simple linear regression.
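As a quick illustration, here is a minimal Python sketch that applies this equation; the slope (10) and intercept (5) are just the assumed values from our example above:

slope = 10      # assumed value chosen by the regression in our example
intercept = 5   # assumed value chosen by the regression in our example

def predict_sales(advertisement):
    # Sales = 10 * Advertisement + 5
    return slope * advertisement + intercept

print(predict_sales(16.3))  # predicted sales for an advertising spend of 16.3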

What about dealing with multiple variables? The same concept can be extended to more than two variables. Consider a portion of our modified advertisement data with multiple variables below.

TV Ads    Radio Ads    Newspaper Ads    Sales
230.1     37.8         69.2             22.1
44.5      39.3         45.1             10.4
17.2      45.9         69.3             9.3
151.5     41.3         58.5             18.5
180.8     10.8         58.4             12.9

Let us assume that our total advertising amount is the sum of the different types of advertisements: Radio Ads, TV Ads and Newspaper Ads. Now we have three independent variables and one dependent variable (Sales), and we have to predict the Sales based on these three variables.

The general equation of a linear regression model having multiple variables is given by:   

                                                    y = m0 + m1*x1 + m2*x2 + m3*x3 + … + mn*xn

This equation actually represents a hyperplane. Note that a linear regression in two dimensions is a straight line; in three dimensions it is a plane, and in more than three dimensions, a hyperplane. Previously we had only one variable and hence we needed to determine only the slope and the intercept. But here, we have multiple variables and each has a coefficient. The values m1, m2, m3, …, mn are called the regression coefficients, and m0 is the regression intercept.

Let us consider our scenario of the advertisement dataset, where we have 3 independent variables. The regression equation can be written as

                                                  Sales = m0 + m1 * TV Ads + m2 * Radio Ads + m3 * Newspaper Ads

The goal of regression is to determine the values of the coefficients m0, m1, m2 and m3 such that the hyperplane is as close as possible to the actual data and yields the minimal error.
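To make this concrete, here is a small NumPy sketch that evaluates the hyperplane equation for one row of ad spends; the coefficient values used here are made up purely for illustration:

import numpy as np

m0 = 3.0                           # hypothetical intercept
m = np.array([0.05, 0.2, -0.01])   # hypothetical coefficients for TV, Radio, Newspaper Ads
x = np.array([230.1, 37.8, 69.2])  # one row of ad spend values from our data

# Sales = m0 + m1*TV + m2*Radio + m3*Newspaper
sales = m0 + np.dot(m, x)
print(sales)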

Implementation of Linear Regression using Python Scikit-Learn

Almost all problems that we encounter in data science have multiple variables. So, in this section, we will see how to implement linear regression with the help of Python's scikit-learn library. Before we jump into the coding part, we should have a basic understanding of the data.

Dataset :
In this linear regression task, we will predict the expected Sales based on the amount spent on advertisement. The advertisement can be Radio, TV or Newspaper advertisement, so we have 3 independent variables to determine the Sales (dependent variable). You can download the dataset from here. Let’s start!
 
Importing Libraries :
We use Jupyter notebook as the editor; you can use any convenient editor that supports Python 3. As a first step, we will import the necessary libraries such as numpy, pandas, matplotlib and seaborn.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Please note that %matplotlib inline is specific to Jupyter notebooks; it makes your plot outputs appear and be stored within the notebook.
 
Loading the Dataset:
The following command imports the dataset advertising.csv from the D drive of my local machine using pandas.
df = pd.read_csv('D:\\advertising.csv')

Now we have the data inside the dataframe ‘df’. Let us explore the data that we have.

Exploring the Dataset :

The dimensions of the data can be viewed using the shape attribute:

df.shape

This outputs the number of rows and columns in a tuple as given below.

(200, 4)

So the data frame has 200 rows and 4 columns. You can access the row and column counts individually using df.shape[0] and df.shape[1].
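For example, to print the counts separately:

print(df.shape[0])  # number of rows: 200
print(df.shape[1])  # number of columns: 4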

Now we can have a look at the data using the head() method.

df.head()

The above method retrieves the top 5 records present in the dataframe.

TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 9.3
3 151.5 41.3 58.5 18.5
4 180.8 10.8 58.4 12.9

The dataset has 3 independent variables (TV, Radio and Newspaper Ads) and Sales as the dependent variable. In the same way, if you want to view the last 5 records of the dataframe, you can use the tail() method, as shown below.
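df.tail()  # retrieves the last 5 records of the dataframe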

To see the statistical details of the dataset, we can use the describe() method.

df.describe()

All basic statistical details are obtained as output.

TV Radio Newspaper Sales
count 200.000000 200.000000 200.000000 200.000000
mean 147.042500 23.264000 30.554000 14.022500
std 85.854236 14.846809 21.778621 5.217457
min 0.700000 0.000000 0.300000 1.600000
25% 74.375000 9.975000 12.750000 10.375000
50% 149.750000 22.900000 25.750000 12.900000
75% 218.825000 36.525000 45.100000 17.400000
max 296.400000 49.600000 114.000000 27.000000

To visualize the entire dataframe effectively, we use the seaborn pair plot. We have already imported the seaborn library, so now we can call the pairplot function, passing our dataframe as the argument.

sns.pairplot(df)

The output is displayed as below.

Seaborn Pair Plot for Advertisement and Sales Data

You can see that, out of all the variables, TV advertisement in particular has a good direct relationship with Sales.
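If you prefer a numerical summary of these relationships, one optional check is the pairwise correlation of each column with Sales:

print(df.corr()['Sales'])  # correlation of each variable with Sales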

Preparing the Data:
In the next step, we split the data into attributes and labels. Attributes are the independent variables (X), while the label is the dependent variable (y) whose value is to be predicted. In our case, the label is the Sales column and the attributes are all columns except Sales.
X = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']

Next, we split the complete data into two parts: 80% of the data goes to the training set while 20% of the data goes to the test set using the code below. This is called an 80/20 split. The test_size variable is where we actually specify the proportion of the test set.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Now we have X_train and y_train prepared for training, and X_test and y_test for prediction and evaluation. We can print the dimensions of the train and test data using the code:

  print(f"Dimension of X_train :{X_train.shape} ")
  print(f"Dimension of y_train :{y_train.shape} ")
print(f"Dimension of X_test :{X_test.shape} ")
  print(f"Dimension of y_test :{y_test.shape} ")
Dimension of X_train :(160, 3) 
Dimension of y_train :(160,) 
Dimension of X_test :(40, 3) 
Dimension of y_test :(40,) 
Building and Training the Model:
To build and train the model, we need to import the LinearRegression class from Scikit-Learn. We then create an instance of the LinearRegression class, which will represent the regression model, and call its fit() method with our training data. Remember, for training we have to pass both X_train and y_train.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train) #training the algorithm
Checking the coefficients selected by the Model:

We know that linear regression will find the optimal coefficients that generalize the data quite well. We can inspect the coefficients our model has selected.

 

  print(f"The Regressor Intercept is :{regressor.intercept_} ")
  The Regressor Intercept is :2.9948930304953247 

So the intercept value (the independent term in the linear model) is approximately 2.9948.

regress_coeff = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
print(regress_coeff)
            Coefficient
TV            0.044584
Radio         0.196497
Newspaper    -0.002781

Interpreting these coefficients, we may infer that if TV ad spend increases by 1 unit, sales increase by about 0.045 units, holding the other variables constant. In the same way, a 1 unit increase in Radio Ads will increase the sales by about 0.196 units.
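As a quick consistency check, we can evaluate the regression equation by hand for the first test row using the intercept and coefficients above; this is just a sketch, and it should match the model's own prediction:

first_row = X_test.iloc[0]  # first row of the test attributes
manual_pred = regressor.intercept_ + np.dot(regressor.coef_, first_row)
print(manual_pred)                          # prediction computed by hand
print(regressor.predict(X_test.iloc[[0]]))  # prediction from the model, should match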

Predicting on Test Data:
Now our model is ready, but we need to see how well it predicts the output. For that, we first generate some predictions with the model and then compare them with the actual values in the next step. We execute the code below to make predictions on the test data.
y_pred = regressor.predict(X_test)

Now y_pred contains the predicted sales values, and we already have y_test, which contains the actual sales. We can compare the two and see how close our predictions are.

df_comp = pd.DataFrame({'Actual Sales': y_test, 'Predicted Sales': y_pred}) 
print(df_comp.head())

We can see the output below.

  Actual Sales Predicted Sales
18 11.3 10.057396
170 8.4 7.452281
107 8.7 7.019708
98 25.4 24.080297
177 11.7 12.017863
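As an optional visual check, we can plot the first few actual and predicted values side by side; this is a quick sketch using the df_comp dataframe created above:

df_comp.head(10).plot(kind='bar', figsize=(10, 6))
plt.title('Actual vs Predicted Sales')
plt.ylabel('Sales')
plt.show()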

As we can observe, our model has returned good prediction results, but we need a single value that describes the overall goodness of the model. Hence, we evaluate the performance of our algorithm using some metrics.

Evaluating the Model :
Before we evaluate based on the error metrics, we should understand the meaning of error. In linear regression, it refers to the deviation of the data points from the regression line. See the graph below to understand this better.
Error in Linear Regression
1. Mean Absolute Error (MAE): the mean of the absolute values of the errors. We sum the absolute distances from the points to the line and divide by the number of points:

   MAE = (1/n) * Σ |y_i − ŷ_i|

2. Mean Squared Error (MSE): the mean of the squared errors. We sum the squared distances from the points to the line and divide by the number of points:

   MSE = (1/n) * Σ (y_i − ŷ_i)²

3. Root Mean Squared Error (RMSE): the square root of the mean of the squared errors:

   RMSE = √MSE

Here, y_i is the actual value, ŷ_i is the predicted value, and n is the number of samples.
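To connect these formulas to code, here is a minimal NumPy sketch that computes all three metrics by hand, assuming the y_test and y_pred from the previous step; the pre-built Scikit-Learn versions follow below.

errors = y_test - y_pred       # deviation of each prediction from the actual value
mae = np.mean(np.abs(errors))  # Mean Absolute Error
mse = np.mean(errors ** 2)     # Mean Squared Error
rmse = np.sqrt(mse)            # Root Mean Squared Error
print(mae, mse, rmse)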
 

Let us find out the value of each metric. Scikit-Learn makes this process easy for us by including pre-built functions for each of these.

from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The output is as given below:

Mean Absolute Error: 1.3617813502090272
Mean Squared Error: 4.402118291449681
Root Mean Squared Error: 2.0981225634956795

The Root Mean Squared Error is about 2.09, while the mean of the sales values is around 14. The RMSE is therefore roughly 15% of the mean sales value, so the model is performing quite decently.
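You can verify this rough rule of thumb with a quick calculation (again assuming the y_test and y_pred from above):

rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(f"Mean of actual sales : {y_test.mean():.2f}")
print(f"RMSE / mean          : {rmse / y_test.mean():.2%}")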

You could also check the R-squared value to evaluate the model. 

  • R-Squared Value: It measures the degree to which your input variables explain the variation of your output (Sales) variable. So, if R-squared is 0.7, it means 70% of the variation in the output variable is explained by the input variables. In simple terms, the higher the R-squared, the more variation is explained by your input variables and hence the better your model is.
from sklearn.metrics import r2_score
r_squared = r2_score(y_test, y_pred)
print('R_square_value :', r_squared)
R_square_value : 0.9058622107532246

Finally, we get the R-squared value as 0.90, which indicates that our model fits the data well.

Conclusion

In this regression task, we have successfully implemented linear regression with Scikit-Learn using the advertisement dataset, and we got decent accuracy. This can be further improved by adding more data, using feature selection to identify only the important features, applying feature scaling, changing the train/test split, etc. Sometimes poor accuracy is due to poor data features, meaning the data may not have a good correlation with the values we are trying to predict. So you can play around with the data and the code in this article and try to get better results.
