## Introduction

Regression and classification are the two main types of supervised machine learning algorithms. The key difference between them is that regression predicts continuous output values while classification predicts discrete ones. You can think of predicting whether a person has diabetes as a classification problem, and predicting the price of oil as a regression problem. Regression is useful when you want to estimate a continuous output value from a set of predictors (inputs). Another example of a regression problem is predicting the water consumption of a household, given the weather conditions, time of day, and number of residents. In this article, we will learn the theory behind linear regression and how to implement it using Python's Scikit-Learn library.

## Understanding Linear Regression

First of all, we need to understand the meaning of linearity and regression. A “linear” relationship is one that can be graphically represented as a straight line. Now, what does regression mean? Regression is a statistical technique for estimating the relationships among variables. Let us make this clear with an example.

Assume there is a clothing company that spends a significant amount of money advertising its brand. Its advertising and sales data are given below.

Advertising | Sales |
---|---|
14.1 | 220 |
16.3 | 330 |
18 | 420 |
15.2 | 325 |
17.4 | 410 |

Our aim is to determine the linear relationship between the amount spent on advertising and the sales. With this information, we want to answer the question: given the amount spent on advertising, what sales can we expect?

If we plot this data in a two-dimensional space, taking advertising (the independent variable) on the x-axis and sales (the dependent variable) on the y-axis, simple linear regression gives us a best-fitting line that generalizes/represents the entire data, as shown below.

The general equation of the straight line is **y = mx + b**

where **b** is the intercept of the line and **m** is the slope. We can draw many different lines across the data by changing the values of **b** and **m**, but not all of them represent the data well. Note that the y and x variables remain the same, since they are the data features and cannot be changed; the values we can control are the intercept and the slope. Out of all these possible lines, linear regression chooses the optimal values for the intercept (b) and slope (m), and thereby gives us the best-fitting line to represent our data.

As the graph shows, it is generally impossible to draw a single straight line that passes exactly through all of the data points. All we can do is identify a line that passes through the data and bring it as close as possible to the points, so there will inevitably be errors. In simple terms, the error is the aggregate difference between the line and the data points; we will discuss it further in the coming sections. The optimal line is the one that generalizes our data well, in other words, the line with the least error.

As mentioned earlier, linear regression finds the optimal values for the slope (m) and intercept (b). Let us assume that in our scenario linear regression chooses slope m = 10 and intercept b = 5.

So our equation of the selected straight line, **Sales = m * Advertisement + b**, becomes **Sales = 10 * Advertisement + 5**.

Now our simple linear regression model is ready. With this equation, if we know the amount planned for advertising, we can predict the future sales. This is the core concept of simple linear regression.
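As a quick sanity check, we can fit a line to the small advertising table above with NumPy. Note that the slope and intercept below are whatever least squares produces for these five points; the m = 10, b = 5 values used above were purely illustrative.

```python
import numpy as np

# Toy data from the table above
advertising = np.array([14.1, 16.3, 18.0, 15.2, 17.4])
sales = np.array([220, 330, 420, 325, 410])

# np.polyfit with degree 1 performs ordinary least-squares line fitting;
# it returns the coefficients from highest degree down, i.e. (slope, intercept)
m, b = np.polyfit(advertising, sales, 1)
print(f"Sales = {m:.2f} * Advertising + {b:.2f}")

# Predict sales for a planned advertising spend of 16.0
predicted = m * 16.0 + b
print(f"Predicted sales at 16.0: {predicted:.1f}")
```

This is the same idea scikit-learn's `LinearRegression` applies later in the article, just on a single predictor.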

What about dealing with multiple variables? The same concept can be extended to more than two variables. Consider a portion of our modified advertising data with multiple variables below.

TV Ads | Radio Ads | Newspaper Ads | Sales |
---|---|---|---|
230.1 | 37.8 | 69.2 | 22.1 |
44.5 | 39.3 | 45.1 | 10.4 |
17.2 | 45.9 | 69.3 | 9.3 |
151.5 | 41.3 | 58.5 | 18.5 |
180.8 | 10.8 | 58.4 | 12.9 |

Let us assume that our total advertising amount is the sum of the different types of advertisements: TV Ads, Radio Ads, and Newspaper Ads. We now have three independent variables and one dependent variable (Sales), and we have to predict Sales based on these three variables.

The general equation of a linear regression model having multiple variables is given by:

**y = m₀ + m₁x₁ + m₂x₂ + m₃x₃ + … + mₙxₙ**

This equation represents a hyperplane. Note that a linear regression in two dimensions is a straight line; in three dimensions it is a plane; and in more than three dimensions, a hyperplane. Previously we had only one variable, so we needed to determine only the slope and the intercept. Here we have multiple variables, and each has its own coefficient. The values m₁, m₂, m₃, …, mₙ are called the regression coefficients, and m₀ is the regression intercept.

Let us consider our advertising dataset, where we have three independent variables. The regression equation can be written as

**Sales = m₀ + m₁ * TV Ads + m₂ * Radio Ads + m₃ * Newspaper Ads**

The goal of regression is to determine the values of the coefficients m₀, m₁, m₂, and m₃ such that this hyperplane is as close as possible to the actual data and yields the minimal error.
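To make the hyperplane equation concrete, here is a minimal sketch that plugs coefficient values into the multiple-regression equation for one row of the advertising data. The coefficients here are made up for illustration; the real values are learned by the model in the next section.

```python
import numpy as np

# Hypothetical coefficients (illustrative only, not fitted values)
m0 = 3.0                         # intercept
m = np.array([0.05, 0.2, 0.0])   # coefficients for TV, Radio, Newspaper

# One row of predictors: TV Ads, Radio Ads, Newspaper Ads
x = np.array([230.1, 37.8, 69.2])

# Sales = m0 + m1*TV + m2*Radio + m3*Newspaper,
# i.e. the intercept plus a dot product of coefficients and predictors
sales = m0 + np.dot(m, x)
print(round(sales, 3))  # prints 22.065
```

Fitting the model is precisely the process of searching for the m values that make such predictions as close as possible to the observed sales across all rows.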

## Implementation of Linear Regression using Python Scikit-Learn

Almost all problems that we encounter in data science have multiple variables. So, in this section, we will see how to implement linear regression with the help of Python's scikit-learn library. Before we jump into the coding part, we should have a basic understanding of the data.

##### Dataset :

The dataset used here is an advertising dataset (advertising.csv) containing 200 records of the amounts spent on TV, Radio, and Newspaper advertising, together with the resulting Sales.

##### Importing Libraries :

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
```

##### Loading the Dataset:

**df = pd.read_csv('D:\\advertising.csv')**

Now we have the data inside the dataframe ‘df’. Let us explore the data that we have.

##### Exploring the Dataset :

The dimensions of the data can be viewed using the shape attribute:

**df.shape**

This outputs the number of rows and columns in a tuple as given below.

**(200, 4)**

So the data frame has 200 rows and 4 columns. You can access the rows and columns individually using df.shape[0] and df.shape[1].

Now we can have a look at the data using the head() method.

**df.head()**

The above method retrieves the top 5 records present in the dataframe.

Index | TV | Radio | Newspaper | Sales |
---|---|---|---|---|
0 | 230.1 | 37.8 | 69.2 | 22.1 |
1 | 44.5 | 39.3 | 45.1 | 10.4 |
2 | 17.2 | 45.9 | 69.3 | 9.3 |
3 | 151.5 | 41.3 | 58.5 | 18.5 |
4 | 180.8 | 10.8 | 58.4 | 12.9 |

The dataset has three independent variables (TV, Radio, and Newspaper Ads) and Sales as the dependent variable. Similarly, if you want to print the last 5 records of the dataframe, you can use the tail method.
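As a quick illustration of tail (using a small stand-in dataframe with the same columns, since the full CSV is not reproduced here), it works like head but reads from the end:

```python
import pandas as pd

# Small stand-in dataframe with the same columns as the advertising data
df = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2, 151.5, 180.8, 8.6],
    'Radio': [37.8, 39.3, 45.9, 41.3, 10.8, 48.9],
    'Newspaper': [69.2, 45.1, 69.3, 58.5, 58.4, 75.0],
    'Sales': [22.1, 10.4, 9.3, 18.5, 12.9, 7.2],
})

print(df.tail())   # last 5 rows by default
print(df.tail(2))  # or pass the number of rows you want
```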

To see the statistical details of the dataset, we can use the describe method.

**df.describe()**

All basic statistical details are obtained as output.

Statistic | TV | Radio | Newspaper | Sales |
---|---|---|---|---|
count | 200.000000 | 200.000000 | 200.000000 | 200.000000 |
mean | 147.042500 | 23.264000 | 30.554000 | 14.022500 |
std | 85.854236 | 14.846809 | 21.778621 | 5.217457 |
min | 0.700000 | 0.000000 | 0.300000 | 1.600000 |
25% | 74.375000 | 9.975000 | 12.750000 | 10.375000 |
50% | 149.750000 | 22.900000 | 25.750000 | 12.900000 |
75% | 218.825000 | 36.525000 | 45.100000 | 17.400000 |
max | 296.400000 | 49.600000 | 114.000000 | 27.000000 |

To visualize the entire dataframe effectively, we use the seaborn pair plot. We have already imported the seaborn library, so we can call the pairplot function, passing our dataframe as the argument.

**sns.pairplot(df)**

The output is displayed as below.

You can see that, out of all the variables, TV advertising in particular has a strong direct relationship with sales.

##### Preparing the Data :

First, we separate the predictors (X) from the target (y):

```python
X = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']
```

Next, we split the data into two parts: 80% goes to the training set and 20% to the test set, using the code below. This is called an 80/20 split. The test_size parameter is where we specify the proportion of the test set.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

Now X_train and y_train are prepared for training, and X_test and y_test for prediction and evaluation. We can print the dimensions of the train and test data using the following code:

```python
print(f"Dimension of X_train :{X_train.shape}")
print(f"Dimension of y_train :{y_train.shape}")
print(f"Dimension of X_test :{X_test.shape}")
print(f"Dimension of y_test :{y_test.shape}")
```

```
Dimension of X_train :(160, 3)
Dimension of y_train :(160,)
Dimension of X_test :(40, 3)
Dimension of y_test :(40,)
```

##### Building and Training the Model:

```python
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)  # training the algorithm
```

##### Checking the coefficients selected by the Model:

We know that linear regression finds the optimal coefficients that generalize the data well. Let us look at the coefficients our model has selected:

print(f"The Regressor Intercept is :{regressor.intercept_} ")

The Regressor Intercept is :2.9948930304953247

So the intercept value (the independent term in the linear model) is about 2.9949.

```python
regress_coeff = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
print(regress_coeff)
```

```
           Coefficient
TV            0.044584
Radio         0.196497
Newspaper    -0.002781
```

Interpreting these coefficients, we can infer from the above result that if TV Ads spending increases by 1 unit, sales will increase by about 0.045 units. In the same way, a 1-unit increase in Radio Ads will increase sales by about 0.196 units.
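The fitted equation can also be applied by hand: a prediction is just the intercept plus a weighted sum of the predictors. The numbers below are the intercept and coefficients printed above for this particular run; your values may differ slightly.

```python
import numpy as np

intercept = 2.9948930304953247
coefs = np.array([0.044584, 0.196497, -0.002781])  # TV, Radio, Newspaper

# First row of the dataset: TV=230.1, Radio=37.8, Newspaper=69.2
x = np.array([230.1, 37.8, 69.2])

# Sales = intercept + m1*TV + m2*Radio + m3*Newspaper
manual_pred = intercept + np.dot(coefs, x)
print(round(manual_pred, 3))  # prints 20.489 (the observed value for this row is 22.1)
```

This is exactly what `regressor.predict` computes internally for each row.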

##### Predicting on Test Data:

**y_pred = regressor.predict(X_test)**

Now y_pred contains the predicted sales values, and we already have y_test, which contains the actual sales. We can compare the two and see how close our predictions are.

```python
df_comp = pd.DataFrame({'Actual Sales': y_test, 'Predicted Sales': y_pred})
print(df_comp.head())
```

We can see the output below.

Index | Actual Sales | Predicted Sales |
---|---|---|
18 | 11.3 | 10.057396 |
170 | 8.4 | 7.452281 |
107 | 8.7 | 7.019708 |
98 | 25.4 | 24.080297 |
177 | 11.7 | 12.017863 |

We can observe that our model has returned good prediction results, but we need a single value that describes the goodness of the model. Hence, we evaluate the performance of our algorithm using some metrics.

##### Evaluating the Model :

**1. Mean Absolute Error (MAE):** It is the mean of the absolute values of the errors. Here, we average the absolute distances from the points to the line to get the Mean Absolute Error (MAE).

**2. Mean Squared Error (MSE):** It is the mean of the squared errors, obtained by averaging the squares of the distances from the points to the line.

**3. Root Mean Squared Error (RMSE):** It is the square root of the mean of the squared errors.
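These three definitions can be sketched directly with NumPy on a tiny made-up pair of actual/predicted arrays (the values here are illustrative, not our model's output):

```python
import numpy as np

y_true = np.array([11.3, 8.4, 8.7, 25.4, 11.7])   # actual values (made up)
y_hat = np.array([10.1, 7.5, 7.0, 24.1, 12.0])    # predicted values (made up)

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error

print(mae, mse, rmse)
```

Scikit-Learn's metric functions, used below, compute exactly these quantities.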

Let us find the value of each metric. Scikit-Learn makes this easy by providing a pre-built function for each of them.

```python
from sklearn import metrics

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```

The output is as given below:

```
Mean Absolute Error: 1.3617813502090272
Mean Squared Error: 4.402118291449681
Root Mean Squared Error: 2.0981225634956795
```

The Root Mean Squared Error is about 2.09, while the mean of the sales values is around 14. The RMSE is therefore only slightly greater than 10% of the mean sales value, so the model is performing quite decently.

You could also check the R-squared value to evaluate the model.

**R-Squared Value**: It explains the degree to which your input variables explain the variation of your output (Sales) variable. So, if R-squared is 0.7, it means 70% of the variation in the output variable is explained by the input variables. In simple terms, the higher the R-squared, the more variation is explained by your input variables and hence the better your model.
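For reference, R-squared can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. Here is a tiny sketch on made-up actual/predicted arrays (illustrative values, not our model's output):

```python
import numpy as np

y_true = np.array([11.3, 8.4, 8.7, 25.4, 11.7])   # actual values (made up)
y_hat = np.array([10.1, 7.5, 7.0, 24.1, 12.0])    # predicted values (made up)

ss_res = np.sum((y_true - y_hat) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))
```

Scikit-Learn's r2_score function, used below, performs this same calculation.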

```python
from sklearn.metrics import r2_score

r_squared = r2_score(y_test, y_pred)
print('R_square_value :', r_squared)
```

R_square_value : 0.9058622107532246

Finally, we get an R-squared value of 0.90, which indicates that our model is good.