Regression is a fundamental concept in statistics, machine learning, and neural networks. Imagine plotting the relationship between rainfall frequency and agricultural production, as you might have done in high school: more rainfall generally means higher production. Fitting a line to those points lets us predict production under different rainfall conditions. That is, in fact, the simplest form of linear regression.
In simple words, regression is the study of how to best fit a curve that summarizes a collection of data. It is one of the most powerful and well-studied types of supervised learning algorithms. In regression, we try to understand the data points by discovering the curve that might have generated them. In doing so, we seek an explanation for why the given data is scattered the way it is. The best-fit curve gives us a model for explaining how the dataset might have been produced. There are many types of regression, e.g. simple linear regression, polynomial regression, and multivariate regression. In this post, we will discuss simple linear regression only and cover the rest later. We will also provide the Python code from scratch at the end of the post.
Simple regression, as the name implies, is the simplest form of regression: we assume there is just one input, and we try to fit a line.
Data Set
Consider a data set containing age and the number of homicide deaths in the US in the year 2015:
age | num_homicide_deaths
21  | 652
22  | 633
23  | 653
24  | 644
25  | 610
If we plot the dataset along with the line that best fits it, we see:
When we talk about regression, our goal is to predict a continuous output variable given some input variables. For simple regression we have only one input variable \(x\), which in our case is age, and one desired output \(y\), which is the number of homicide deaths at each age. Our dataset then consists of many examples of \(x\) and \(y\): $$ D = \{(x_1,y_1), (x_2,y_2), \dots, (x_N,y_N)\} $$ where \(N\) is the number of examples in the data set. So, our data set will look like: $$ D=\{(21,652),(22,633), \dots,(50,197)\} $$
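For illustration, the rows shown in the table above could be stored as NumPy arrays like this (only the first five examples are listed; the full dataset runs through age 50):

```python
import numpy as np

# Inputs (age) and outputs (number of homicide deaths) from the table above.
# The full dataset continues up to age 50.
x = np.array([21, 22, 23, 24, 25], dtype=float)
y = np.array([652, 633, 653, 644, 610], dtype=float)

N = len(x)  # number of examples in the dataset
```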
Model
So, how can we mathematically model simple linear regression? Since the goal is to find the best-fitting line, let’s start by defining the model (the mathematical description of how predictions will be created) as a line. It’s very simple: we assume we have just one input, which in this case is age, and one output, the number of homicide deaths, and we are going to fit a line. And what’s the equation of a line? $$f(x) = w_0 + w_1*x$$
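As a quick sketch in Python (the full code comes at the end of the post; the function name predict is just illustrative), the model is simply this line evaluated at a given input:

```python
def predict(x, w0, w1):
    """Evaluate the line f(x) = w0 + w1 * x for an input x (scalar or NumPy array)."""
    return w0 + w1 * x
```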
What this regression model then specifies is that each one of our observations \(y_i\) is simply that function evaluated at \(x_i\), that is, \(w_0\) plus \(w_1*x_i\), plus an error term which we call \(\epsilon_i\). So this is our regression model $$y_i = w_0 + w_1*x_i + \epsilon_i$$ and to be clear, this error \(\epsilon_i\) is the distance from our specific observation down to the line. The parameters of this model, \(w_0\) and \(w_1\), are the intercept and slope, and we call them the regression coefficients.
Quality Metric
We have chosen our model with two regression coefficients \(w_0\) and \(w_1\). For our data set, there are infinitely many choices of these parameters. So our task is to find the best choice of parameters, and for that we need a way to measure the quality of the parameters, i.e. the quality of the fit. In particular, we define a loss function (also called a cost function), which measures how good or bad a particular choice of \(w_0\) and \(w_1\) is. Values of \(w_0\) and \(w_1\) that fit poorly should result in a large value of the loss function, whereas good values of \(w_0\) and \(w_1\) should result in a small value. So what is the cost of using a specific line? The loss can take many forms, but the one we will use here is the Residual Sum of Squares (RSS): $$ RSS(w_0, w_1)= \sum_{i=1}^N(y_i-[ w_0 + w_1*x_i])^2$$
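A minimal sketch of this loss in Python, using the x and y arrays defined earlier (the function name rss is illustrative):

```python
import numpy as np

def rss(x, y, w0, w1):
    """Residual Sum of Squares for the line defined by (w0, w1)."""
    residuals = y - (w0 + w1 * x)   # y_i - [w0 + w1 * x_i] for every example
    return np.sum(residuals ** 2)
```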
The Residual Sum of Squares simply adds up, over all observations, the squared error between what we estimate the relationship between \(x\) and \(y\) to be and the actual observed \(y\); this per-observation error is the \(\epsilon_i\) we talked about earlier.
Model Optimization
Our cost function is the residual sum of squares, and for any given line we can compute its cost. So, for example, if we have two different lines with two different residual sums of squares, how do we know which choice of parameters is better? The one with the minimum RSS.
Our goal is to minimize the RSS over all possible intercepts \(w_0\) and slopes \(w_1\), but the question is: how are we going to do this? The mathematical notation for this minimization over all possible \(w_0\), \(w_1\) is $$\min_{w_0,w_1}\sum_{i=1}^N(y_i - [w_0 +w_1x_i])^2$$ So we want to find the specific values of \(w_0\) and \(w_1\), which we will call \(\hat{w_0}\) and \(\hat{w_1}\) respectively, that minimize this residual sum of squares.
The red dot marked on the plot above shows where the desired minimum is. We need an algorithm to find this minimum. We will discuss two approaches: the Closed-form Solution and Gradient Descent.
Closed-form Solution for Linear Regression
From calculus, we know that at the minimum the derivatives will be \(0\). So, if we compute the gradient of our RSS: $$ \begin{aligned} RSS(w_0, w_1) &= \sum_{i=1}^N(y_i-[ w_0 + w_1*x_i])^2 \end{aligned}$$ $$\begin{aligned} &\nabla RSS(w_0, w_1) = \begin{bmatrix} \frac{\partial RSS}{\partial w_0} \\ \\ \frac{\partial RSS}{\partial w_1} \end{bmatrix} \\ &=\begin{bmatrix} -2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} \\ \\ -2\sum_{i=1}^N[y_i - (w_0 + w_1 * x_i)] *x_i \end{bmatrix} \end{aligned}$$
Take this gradient, set it equal to zero, and solve for the estimates of \(w_0\) and \(w_1\). Those are the estimates of the two parameters of our model that define the fitted line. $$ \begin{aligned} &\nabla RSS(w_0, w_1) = 0 \end{aligned} $$ implies, $$\begin{aligned} &\begin{bmatrix} -2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} \\ \\ -2\sum_{i=1}^N[y_i - (w_0 + w_1 * x_i)] *x_i \end{bmatrix} = 0 \\ &-2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} = 0, \\ &-2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} * x_i = 0 \end{aligned}$$ Solving for \(w_0\) and \(w_1\) we get, $$ \begin{aligned} \hat{w_0} = \frac{\sum_{i = 1}^N y_i}{N} - \hat{w_1}\frac{\sum_{i=1}^N x_i}{N} \\ \hat{w_1} = \frac{\sum y_i x_i - \frac{\sum y_i \sum x_i}{N}}{\sum x_i^2 - \frac{\sum x_i \sum x_i}{N}} \end{aligned} $$
Now that we have the solutions, we just have to compute \( \hat{w}_1\) and then plug that in to compute \(\hat{w}_0\). To compute \( \hat{w}_1\) we need a few terms: the sum over all of our observations \(\sum y_i\), the sum over all of our inputs \(\sum x_i\), and a couple of other terms built from our inputs and outputs, \(\sum y_i x_i\) and \(\sum x_i^2\). Plug them into these equations and we get the optimal \(\hat{w}_0\) and \(\hat{w}_1\) that minimize our residual sum of squares.
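Putting those sums together, here is a minimal sketch of the closed-form estimates in Python (reusing the x and y arrays from earlier; the function name fit_closed_form is illustrative):

```python
import numpy as np

def fit_closed_form(x, y):
    """Compute the intercept and slope that minimize RSS in closed form."""
    N = len(x)
    sum_x, sum_y = np.sum(x), np.sum(y)
    sum_xy = np.sum(x * y)
    sum_xx = np.sum(x * x)

    # Slope first, then plug it into the intercept equation.
    w1_hat = (sum_xy - sum_x * sum_y / N) / (sum_xx - sum_x * sum_x / N)
    w0_hat = sum_y / N - w1_hat * sum_x / N
    return w0_hat, w1_hat
```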
Gradient Descent for Linear Regression
The other approach we will discuss is gradient descent, where we walk down the surface of the residual sum of squares trying to reach the minimum. Of course, we might overshoot it and go back and forth, but the general idea is that we follow this iterative procedure. Writing \(\hat{y}_i(w_0,w_1) = w_0 + w_1*x_i\) for the model's prediction at \(x_i\), the gradient becomes $$\begin{aligned} &\nabla RSS(w_0, w_1) \\ &= \begin{bmatrix} -2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} \\ \\ -2\sum_{i=1}^N[y_i - (w_0 + w_1 * x_i)] *x_i \end{bmatrix} \\&= \begin{bmatrix} -2\sum_{i=1}^N{[y_i - \hat{y}_i (w_0, w_1)]} \\ \\ -2\sum_{i=1}^N[y_i - \hat{y}_i(w_0,w_1)] *x_i \end{bmatrix} \end{aligned}$$
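In code, this gradient could be computed as follows (a sketch; rss_gradient is an illustrative name):

```python
import numpy as np

def rss_gradient(x, y, w0, w1):
    """Gradient of RSS with respect to (w0, w1)."""
    errors = y - (w0 + w1 * x)        # y_i - y_hat_i(w0, w1)
    d_w0 = -2.0 * np.sum(errors)      # partial derivative with respect to w0
    d_w1 = -2.0 * np.sum(errors * x)  # partial derivative with respect to w1
    return np.array([d_w0, d_w1])
```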
Then our gradient descent algorithm will be: $$\begin{aligned}&\text{while not converged: } \begin{bmatrix} w_0^{(t+1)} \\ w_1^{(t+1)} \end{bmatrix}\\ &= \begin{bmatrix}w_0^{(t)} \\ w_1^{(t)} \end{bmatrix} - \eta* \begin{bmatrix} -2\sum{[y_i - \hat{y}_i (w_0, w_1)]}\\ \\-2\sum[y_i - \hat{y}_i(w_0,w_1)] *x_i \end{bmatrix}\\ &= \begin{bmatrix} w_0^{(t)} \\ w_1^{(t)} \end{bmatrix} +2\eta* \begin{bmatrix} \sum{[y_i - \hat{y}_i (w_0, w_1)]} \\ \\ \sum[y_i - \hat{y}_i(w_0,w_1)] *x_i \end{bmatrix} \end{aligned}$$
So gradient descent repeatedly updates our weights: set \(W\) to \(W\) minus \(\eta\) times the gradient, where \(W\) is the vector of parameters \((w_0, w_1)\). We repeat this until the algorithm converges. Here \(\eta\) is the learning rate, which controls how big a step we take on each iteration of gradient descent, and the gradient is the update, i.e. the change we want to make to the parameters \(W\).
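A minimal sketch of this loop in Python, assuming the rss_gradient helper above; the step size eta, the tolerance, and the iteration cap are illustrative values that usually need tuning:

```python
import numpy as np

def fit_gradient_descent(x, y, eta=1e-4, tol=1e-6, max_iters=1_000_000):
    """Minimize RSS by gradient descent, starting from w0 = w1 = 0."""
    w = np.zeros(2)  # w[0] is the intercept w0, w[1] is the slope w1
    for _ in range(max_iters):
        grad = rss_gradient(x, y, w[0], w[1])
        w = w - eta * grad                  # step against the gradient
        if np.linalg.norm(grad) < tol:      # stop once the gradient is tiny
            break
    return w[0], w[1]
```

Note that on unscaled inputs like the ages here, the RSS surface is poorly conditioned, so the learning rate has to be small and convergence can be slow; centering or rescaling \(x\) before running the loop is a common fix.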
Prediction for Linear Regression
After all the hard work, we now need to test our machine learning model. The dataset we work on is generally split into two parts: one part, called the training data, is used for training, and the other, called the test data, is used to evaluate the model. We have developed the equations for training, and with them we have obtained a calibrated set of weights. We then use these weights to predict the output for new data using the equation $$ Prediction = \hat{w}_0 + \hat{w}_1 * data $$ where \( \hat{w}_0\) and \(\hat{w}_1\) are the optimized weights.
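For example, using the coefficients fitted by either approach to predict the number of homicide deaths at a new age might look like this (the age 30 is just an illustrative input):

```python
w0_hat, w1_hat = fit_closed_form(x, y)   # or fit_gradient_descent(x, y)

age = 30
prediction = w0_hat + w1_hat * age
print(f"Predicted number of homicide deaths at age {age}: {prediction:.1f}")
```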
Now that we have finished the theoretical part of the tutorial, you can look at the code and try to understand its different blocks.