A single-variable linear regression model can learn to predict an output variable \(y\) when there is only one input variable \(x\) and there is a linear relationship between \(y\) and \(x\), that is, \(y \approx w_0 + w_1 x\). In most cases, though, that is not a great predictive model. For example, let’s assume we are starting a real estate business and want to use machine learning to predict house prices. In particular, we have some houses that we want to list for sale, but we don’t know their value. So we’re going to look at other houses that sold in the recent past. Looking at how much they sold for and at the different characteristics of those houses, we will use that data to inform the listing price for the house we’d like to sell.

Now, the price of a house depends on many things. The very first thing we might think about is the relationship between the square footage and the price of the house, and we could fit a simple linear regression between them.

But we might go into the data set and notice that there are other houses that have very similar square footage but are fundamentally different houses. For example, one house has only one bathroom but another has three. All else being equal, the house with three bathrooms should have a higher value than the one with just one. So we need to add more inputs to our regression model.

| Price | Bed_rooms | Bath_rooms | … | Sqft_living |
|---|---|---|---|---|
| 221900 | 3 | 1 | … | 1180 |
| 538000 | 3 | 2.25 | … | 2570 |
| 180000 | 2 | 1 | … | 770 |
| 604000 | 4 | 3 | … | 1960 |
| 510000 | 3 | 2 | … | 1680 |
| 1.23E+06 | 4 | 4.5 | … | 5420 |
| 257500 | 3 | 2.25 | … | 1715 |

So instead of just looking at square feet and using that to predict the house value, we’re going to look at other inputs as well. For example, we’re going to record the number of bathrooms in the house and use both inputs to predict the house price. In this higher-dimensional space, we’re going to fit a function that models the relationship between the number of square feet, the number of bathrooms, and the output, the value of the house. One simple choice is to model this function as $$f(x) = w_0 + w_1 x_1 + w_2 x_2$$ where \(x_1\) is the number of square feet and \(x_2\) is the number of bathrooms.
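As a minimal sketch of this two-input model: the function name `predict_price` and the weight values below are hypothetical, chosen only to show how the bathroom count changes the prediction for two houses with the same square footage. They are not fitted to any data.

```python
def predict_price(sqft, bathrooms, w0, w1, w2):
    """Two-input linear model f(x) = w0 + w1*x1 + w2*x2."""
    return w0 + w1 * sqft + w2 * bathrooms

# Hypothetical weights, for illustration only (not fitted to any data).
w0, w1, w2 = 50_000.0, 200.0, 30_000.0

# Two houses with the same square footage but different bathroom counts.
price_one_bath = predict_price(1500, 1, w0, w1, w2)
price_three_bath = predict_price(1500, 3, w0, w1, w2)
print(price_one_bath, price_three_bath)
```

With a positive \(w_2\), the second house gets the higher predicted price even though \(x_1\) is identical.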

We have just talked about square feet and number of bathrooms as the inputs to our regression model. But associated with any house there are lots of different attributes that we can use as inputs, and this is where multivariable regression comes into play.

When we have multiple inputs, the simplest model we can think of is just a function directly of the inputs themselves. The input \(\textbf{x}\) is a \(d\)-dimensional vector and the output \(y\) is a scalar: $$\textbf{x} = (\textbf{x}[1], \textbf{x}[2], \dots , \textbf{x}[d])$$ where \(\textbf{x}[1]\), \( \textbf{x}[2] \), \(\dots\), \(\textbf{x}[d]\) are the different inputs, e.g. number of square feet, number of bathrooms, number of bedrooms, etc. Plugging these inputs directly into our linear model, together with the noise term \(\epsilon_i\), we get the output \(y_i\) for the \(i^{ \text{th}}\) data point:

$$y_i = w_0 + w_1 \textbf{x}_i[1] + w_2 \textbf{x}_i[2] + \dots + w_d \textbf{x}_i[d] + \epsilon_i$$ where the first feature in our model is just one, the constant feature. The second feature is the first input, for example the number of square feet, and the third feature is our second input, for example the number of bathrooms. This goes on until we get to our last input, which is the \((d+1)^{\text{th}}\) feature, for example lot size. More generically, instead of just a simple hyperplane (a single line, in the one-input case), we can fit a polynomial or some other curve built from features \(h_0, \dots, h_D\) of the inputs: $$\begin{aligned}y_i &= w_0 h_0( \textbf{x}_i)+ w_1 h_1( \textbf{x}_i) + \dots + w_D h_D( \textbf{x}_i) + \epsilon_i \\ &= \sum_{j=0}^D w_j h_j( \textbf{x}_i) + \epsilon_i\end{aligned}$$

Here we assume there are features \(h_0\) through \(h_D\) of these multiple inputs. As an example, our zeroth feature might be just the constant term 1, which is pretty typical. That just shifts up and down where the curve lies in the space. Our first feature might be just our first input, as in the hyperplane example. The second feature could be the second input, again as in the hyperplane example, or it could be some other function of any of the inputs. Maybe we want to take the log of the seventh input, which happens to be the number of bedrooms, times the number of bathrooms.

In that case, the second feature of the model relates the log of the number of bedrooms times the number of bathrooms to the output. We continue like this all the way up to our last feature \(h_D\), which is some function of any of the inputs to our regression model.

$$\begin{aligned} feature \; 1 &= h_0(\textbf{x}) \dots e.g., 1 \\ feature \; 2 &= h_1(\textbf{x}) \dots e.g. , \textbf{x}[1] =sq. \;ft. \\ feature \; 3 &= h_2(\textbf{x}) \dots \textbf{x}[2] = \#bath \; or, \log(\textbf{x}[7]) \textbf{x}[2] = \log(\#bed) * \#bath \\ \vdots \\ feature \; D+1 &= h_D( \textbf{x}) \dots \text{some other function of} \; \textbf{x}[1], \dots, \textbf{x}[d]\end{aligned}$$So this is our generic multiple regression model with multiple features.
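A small sketch of how such feature functions could be written down. The input layout `[sqft, bathrooms, bedrooms]` and the names `features`, `h`, and `predict` are assumptions made for this example. For illustration the log feature is listed as a separate fourth feature instead of replacing \(h_2\).

```python
import math

# Each feature function h_j maps a raw input vector x to a single number.
# Assumed (hypothetical) layout of x: [sqft, bathrooms, bedrooms].
features = [
    lambda x: 1.0,                     # h_0: constant feature
    lambda x: x[0],                    # h_1: square feet
    lambda x: x[1],                    # h_2: number of bathrooms
    lambda x: math.log(x[2]) * x[1],   # h_3: log(#bed) * #bath
]

def h(x):
    """Feature vector h(x) = [h_0(x), ..., h_D(x)]."""
    return [h_j(x) for h_j in features]

def predict(w, x):
    """Model output without the noise term: sum_j w_j * h_j(x)."""
    return sum(w_j * h_j for w_j, h_j in zip(w, h(x)))

# First row of the house table: 1180 sq. ft., 1 bathroom, 3 bedrooms.
print(h([1180, 1, 3]))
```

Adding or changing a feature only means editing the `features` list; the model stays linear in the weights \(w_j\).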

As with simple linear regression, we’re going to talk about two different algorithms. One is a closed-form solution and the other is gradient descent. There are several steps we have to take to build up to deriving these algorithms, and the first is simply to rewrite our regression model in matrix notation.

So, we begin by rewriting our multiple regression model for just a single observation, $$ y_i= \sum_{j=0}^D w_j h_j(\textbf{x}_i) + \epsilon_i $$ in matrix notation: $$\begin{aligned} y_i &= \begin{bmatrix}w_0 & w_1 & w_2 & \dots & w_D\end{bmatrix}\begin{bmatrix} h_0(\textbf{x}_i) \\ h_1(\textbf{x}_i) \\ h_2(\textbf{x}_i) \\ \vdots \\ h_D(\textbf{x}_i)\end{bmatrix} + \epsilon_i \\ &= w_0 h_0(\textbf{x}_i)+ w_1 h_1(\textbf{x}_i) + \dots + w_D h_D(\textbf{x}_i) + \epsilon_i \\ &= \textbf{w}^T\textbf{h}(\textbf{x}_i) + \epsilon_i \end{aligned}$$ We always think of vectors as columns; when a vector is written as a row, we call that its transpose.

Now, we are going to rewrite our model for all the observations together.

$$\begin{bmatrix}y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} h_0(x_1) & h_1(x_1) & \dots & h_D(x_1) \\ h_0(x_2) & h_1(x_2) & \dots & h_D(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_0(x_N) & h_1(x_N) & \dots & h_D(x_N) \end{bmatrix} \begin{bmatrix}w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} + \begin{bmatrix}\epsilon_1 \\ \epsilon _2 \\ \epsilon _3 \\ \vdots \\ \epsilon _N \end{bmatrix} $$ So, we get $$\textbf{y} = \textbf{Hw} + \mathbf{\epsilon}$$

Here, we write our entire regression model for \(N\) observations as the \(\textbf{y}\) vector being equal to the \(\textbf{H}\) matrix times the \(\textbf{w}\) vector, plus the \(\epsilon\) vector that represents all the errors in our model. So this is the matrix notation for our model of \(N\) observations.
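The stacking of observations into \(\textbf{H}\) can be sketched with NumPy. This is only an illustration: the helper name `build_feature_matrix` and the weights are hypothetical, the features are just a constant plus the raw inputs, and the three rows are taken from the house table above.

```python
import numpy as np

# Raw inputs for three houses from the table: [sqft_living, bath_rooms].
X = np.array([[1180.0, 1.0],
              [2570.0, 2.25],
              [770.0, 1.0]])

def build_feature_matrix(X):
    """Stack h(x_i)^T as rows: a constant column followed by the raw inputs."""
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

H = build_feature_matrix(X)            # shape (N, D+1) = (3, 3)
w = np.array([0.0, 100.0, 10000.0])    # hypothetical weights, not fitted

# One matrix-vector product gives the predictions for all N observations.
y_hat = H @ w
print(H.shape, y_hat)
```

Row \(i\) of `H` is \(\textbf{h}(\textbf{x}_i)^T\), so `H @ w` computes \(\textbf{w}^T\textbf{h}(\textbf{x}_i)\) for every observation at once.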

In the simple linear regression model, we used the residual sum of squares (RSS) as the cost function. For any given fit, we define the RSS of our parameters: $$\begin{aligned}RSS(w_0, w_1) &= \sum_{i=1}^N(y_i - [w_0 + w_1 x_i ])^2 \\ &= \sum_{i=1}^N(y_i - \hat{y}_i(w_0, w_1))^2 \end{aligned}$$ where \( \hat{y}_i\) is the predicted value for \(y_i\), and \(w_0\) and \(w_1\) are the intercept and slope respectively. Now we will explain the residual sum of squares in the case of multiple regression. The residual is the difference between the actual observation and the predicted value. So what is our predicted value for the \(i^{\text{th}}\) observation? In our vector notation, we take each of the weights in our model, multiply it by the corresponding feature for that observation, and add up the results. So

$$\begin{aligned} \hat{y}_i &= \begin{bmatrix} h_0(\textbf{x}_i) & h_1(\textbf{x}_i) & h_2(\textbf{x}_i) & \dots & h_D(\textbf{x}_i) \end{bmatrix}\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} \\ &= \textbf{h}( \textbf{x}_i)^T \textbf{w}\end{aligned}$$ This is our predicted value for the \(i^{\text{th}}\) observation. So our RSS for multiple regression is going to be: $$\begin{aligned}RSS(\textbf{w}) &= \sum_{i = 1}^N (y_i - \textbf{h}(\textbf{x}_i)^T \textbf{w})^2 \\ &= (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw}) \end{aligned}$$

So why are these two things equivalent? We’re going to break the explanation into parts. We know that \(\hat{\textbf{y}}\), the vector of all \(N\) of our predicted values, is equal to \(\textbf{H}\) times \(\textbf{w}\), that is, \(\hat{\textbf{y}} = \textbf{H}\textbf{w}\), which implies: $$\textbf{y} - \textbf{H}\textbf{w} = \textbf{y} - \hat{\textbf{y}}$$ This is equivalent to taking our vector of actual observed values and subtracting our vector of predicted values. So we take all our house sale prices and all the predicted house prices, given a set of parameters \(\textbf{w}\), and we subtract them. What is that vector?

$$ \textbf{y} - \hat{\textbf{y}} = \begin{bmatrix}residual_1 \\ residual_2\\ \vdots \\residual_N \end{bmatrix}$$ That vector is the vector of residuals, because its first entry is the difference between our first house sale price and our predicted price, which we call the residual for the first prediction, and likewise for the second, all the way up to our \(N^\text{th}\) observation. So the term \(\textbf{y} - \textbf{H}\textbf{w}\) is equivalent to the vector of the residuals from our predictions.

So, $$\begin{aligned} (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw}) &= \begin{bmatrix}residual_1 & residual_2 & \dots & residual_N \end{bmatrix} \begin{bmatrix}residual_1 \\ residual_2\\ \vdots \\residual_N \end{bmatrix} \\ &= (residual_1^2 + residual_2^2 + \dots + residual_N^2 ) \\ &= \sum _{i=1}^N residual_i^2 \\ &= RSS(\textbf{w}) \end{aligned} $$

By definition, that is exactly what residual sum of squares is using these \(\textbf{w}\) parameters.
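The equivalence of the two forms of RSS can be checked numerically. The sketch below uses three rows of the house table, with a constant feature, square feet, and bathrooms; the weights and the function names `rss_as_sum` and `rss_as_matrix` are hypothetical.

```python
import numpy as np

# Three rows from the house table: constant feature, sqft_living, bath_rooms.
H = np.array([[1.0, 1180.0, 1.0],
              [1.0, 2570.0, 2.25],
              [1.0, 770.0, 1.0]])
y = np.array([221900.0, 538000.0, 180000.0])   # observed prices
w = np.array([0.0, 200.0, 10000.0])            # hypothetical weights, not fitted

def rss_as_sum(H, y, w):
    """RSS as a sum of squared residuals, one observation at a time."""
    return sum((y_i - h_i @ w) ** 2 for h_i, y_i in zip(H, y))

def rss_as_matrix(H, y, w):
    """RSS as (y - Hw)^T (y - Hw)."""
    residuals = y - H @ w
    return residuals @ residuals

print(rss_as_sum(H, y, w), rss_as_matrix(H, y, w))
```

Both functions return the same number up to floating-point rounding; the matrix form simply avoids the explicit loop.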

Now we’re on to the final important step of the derivation, which is taking the gradient. The gradient is important both for our closed-form solution and, of course, for the gradient descent algorithm. So, the gradient is $$\begin{aligned} \nabla RSS(\textbf{w}) &= \nabla[ (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw})] \\ &= -2\textbf{H}^T(\textbf{y} - \textbf{Hw}) \end{aligned}$$
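One way to see where this gradient comes from is to expand the product first. This is a sketch using two standard matrix-calculus identities, \(\nabla_{\textbf{w}}(\textbf{w}^T\textbf{a}) = \textbf{a}\) and \(\nabla_{\textbf{w}}(\textbf{w}^T\textbf{A}\textbf{w}) = 2\textbf{A}\textbf{w}\) for symmetric \(\textbf{A}\); the intermediate steps are filled in here and are not part of the derivation above.

$$\begin{aligned} RSS(\textbf{w}) &= (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw}) \\ &= \textbf{y}^T\textbf{y} - 2\textbf{w}^T\textbf{H}^T\textbf{y} + \textbf{w}^T\textbf{H}^T\textbf{Hw} \\ \nabla RSS(\textbf{w}) &= 0 - 2\textbf{H}^T\textbf{y} + 2\textbf{H}^T\textbf{Hw} \\ &= -2\textbf{H}^T(\textbf{y} - \textbf{Hw}) \end{aligned}$$

The second identity applies because \(\textbf{H}^T\textbf{H}\) is symmetric.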

From calculus we know that at the minimum the gradient will be **zero**. So, for the closed-form solution we take our gradient, set it equal to **zero**, and solve for \(\textbf{w}\): $$\begin{aligned} \nabla RSS(\textbf{w}) = -2\textbf{H}^T(\textbf{y} - \textbf{Hw}) &= 0 \\ -2\textbf{H}^T \textbf{y} + 2\textbf{H}^T\textbf{Hw} &= 0 \\ \textbf{H}^T\textbf{Hw} &= \textbf{H}^T\textbf{y} \\ \hat{\textbf{w}} &= (\textbf{H}^T \textbf{H})^{-1} \textbf{H}^T\textbf{y} \end{aligned}$$ The last step assumes that \(\textbf{H}^T\textbf{H}\) is invertible. We have a whole collection of parameters, \(w_0\), \(w_1\), all the way up to \(w_D\), multiplying all the features we’re using in our multiple regression model, and in one line we are able to write the solution to the fit using matrix notation. This is why we went through all the work of writing things in matrix notation: it gives us a nice closed-form solution for all of our parameters, written very compactly.
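A minimal NumPy sketch of the closed-form fit follows. The function name `fit_closed_form` and the toy data are made up for illustration: `y` is generated exactly from known weights with no noise, so the fit should recover them.

```python
import numpy as np

def fit_closed_form(H, y):
    """Return w_hat solving the normal equations H^T H w = H^T y."""
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(H.T @ H, H.T @ y)

# Made-up toy data: first column is the constant feature.
H = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 5.0],
              [1.0, 4.0, 3.0]])
w_true = np.array([1.0, 2.0, 3.0])
y = H @ w_true                      # noise-free outputs

w_hat = fit_closed_form(H, y)
gradient_at_w_hat = -2 * H.T @ (y - H @ w_hat)
print(w_hat)                # recovers w_true up to rounding
print(gradient_at_w_hat)    # approximately the zero vector
```

Calling `np.linalg.solve` on \(\textbf{H}^T\textbf{H}\textbf{w} = \textbf{H}^T\textbf{y}\) gives the same \(\hat{\textbf{w}}\) as the inverse formula when \(\textbf{H}^T\textbf{H}\) is invertible, and it is usually the numerically safer choice.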

The alternative approach, and maybe the more useful and simpler one, is gradient descent, where we walk down the surface of the residual sum of squares, trying to get to the minimum. Of course, we might overshoot it and go back and forth, but the general idea is that we’re doing this iterative procedure: $$\begin{aligned}\text{while not converged:}\quad \textbf{w}^{(t+1)} &\leftarrow \textbf{w}^{(t)} - \eta \nabla RSS(\textbf{w}^{(t)}) \\ &= \textbf{w}^{(t)} + 2\eta\textbf{H}^T(\textbf{y} - \textbf{Hw}^{(t)})\end{aligned}$$ This version of the algorithm takes our entire \(\textbf{w}\) vector, all the regression coefficients in our model, and updates them all at once using the matrix notation shown here.
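The update rule can be sketched in NumPy as below. The stopping rule (gradient norm below `tol`), the cap `max_iter`, the starting point at zero, and the step size `eta=0.01` are assumptions made for this toy example, not values prescribed by the text; the data is made up and noise-free.

```python
import numpy as np

def gradient_descent(H, y, eta, tol=1e-6, max_iter=100_000):
    """Minimize RSS(w) = (y - Hw)^T (y - Hw) by gradient descent."""
    w = np.zeros(H.shape[1])                  # start from all-zero weights
    for _ in range(max_iter):
        gradient = -2 * H.T @ (y - H @ w)
        if np.linalg.norm(gradient) < tol:    # assumed convergence test
            break
        w = w - eta * gradient                # update every weight at once
    return w

# Made-up toy data: y is generated exactly from known weights.
H = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 5.0],
              [1.0, 4.0, 3.0]])
w_true = np.array([1.0, 2.0, 3.0])
y = H @ w_true

w_gd = gradient_descent(H, y, eta=0.01)
print(w_gd)   # close to w_true
```

If `eta` is too large the iterates diverge, and if it is too small convergence is slow, so in practice the step size has to be tuned to the scale of the features.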

Now that we have finished the theoretical part of the tutorial, you can look at the code and try to understand its different blocks.

In simple words, regression is the study of how to best fit a curve to summarize a collection of data. It’s one of the most powerful and well-studied types of supervised learning algorithms. In regression, we try to understand the data points by discovering the curve that might have generated them. In doing so, we seek an explanation for why the given data is scattered the way it is. The best-fit curve gives us a model for explaining how the dataset might have been produced. There are many types of regression, e.g. simple linear regression, polynomial regression, and multivariate regression. In this post, we will discuss simple linear regression only; the rest we will discuss later. We will also provide Python code from scratch at the end of the post.

Simple regression, as the name implies, is just a very simple form of regression, where we assume that we have just one input and we’re just trying to fit a line.

Consider a data set containing age and the number of homicide deaths in the US in the year 2015:

| age | num_homicide_deaths |
|---|---|
| 21 | 652 |
| 22 | 633 |
| 23 | 653 |
| 24 | 644 |
| 25 | 610 |

If we plot the dataset and the line that best fits it, we see:

When we are talking about regression, our goal is to predict a continuous output variable given some input variables. For simple regression, we have only one input variable \(x\), which is the age in our case, and our desired output \(y\), which is the number of homicide deaths for each age. Our dataset then consists of many examples of \(x\) and \(y\): $$ D = \{(x_1,y_1), (x_2,y_2), \dots, (x_N,y_N)\} $$ where \(N\) is the number of examples in the data set. So our data set will look like: $$ D=\{(21,652),(22,633), \dots,(50,197)\} $$

So, how can we mathematically model simple linear regression? Since the goal is to find the best line, let’s start by defining the **model** (the mathematical description of how predictions will be created) as a line. It’s very simple. We’re assuming we have just one input, which in this case is the age of people, and one output, which is the number of homicide deaths, and we’re just going to fit a line. And what’s the equation of a line? $$f(x) = w_0 + w_1*x$$

What this regression model then specifies is that each one of our observations \( y_i\) is simply that function evaluated at \(x_i\), so that’s \(w_0\) plus \(w_1*x_i\), plus the error term, which we call \(\epsilon_i\). So this is our regression model $$y_i = w_0 + w_1*x_i + \epsilon_i$$ and to be clear, this error, \(\epsilon_i\), is the vertical distance from our specific observation to the line. The parameters of this model, \(w_0\) and \(w_1\), are the intercept and slope, and we call these the regression coefficients.

We have chosen our model with two regression coefficients \(w_0\) and \(w_1\). For our data set, there can be infinitely many choices of these parameters. So our task is to find the best choice of parameters and we have to know how to measure the quality of the parameters or measure the quality of the fit. So in particular, we define a loss function (also called a cost function), which measures how good or bad a particular choice of \(w_0\) and \(w_1\) is. Values of \(w_0\) and \(w_1\) that seem poor should result in a large value of the loss function, whereas good values of \(w_0\) and \(w_1\) should result in small values of the loss function. So what’s the cost of using a specific line? It has many forms. But the one we’re gonna talk about here is Residual Sum of Squares (RSS): $$ RSS(w_0, w_1)= \sum_{i=1}^N(y_i-[ w_0 + w_1*x_i])^2$$

What the residual sum of squares does is add up the squared errors between what we’ve estimated the relationship between \(x\) and \(y\) to be and what the actual observation \(y\) is. We talked about this error as \(\epsilon_i\).

Our cost is the residual sum of squares, and for any given line, we can compute the cost of that line. So, for example, two different lines give two different residual sums of squares. How do we know which choice of parameters is better? The one with the smaller RSS.
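To make the comparison concrete, here is a small sketch that computes the RSS of two candidate lines on the five rows of the table above. The two lines are hypothetical guesses, not fitted values, and the function name `rss` is chosen just for this example.

```python
ages = [21, 22, 23, 24, 25]
deaths = [652, 633, 653, 644, 610]

def rss(w0, w1, xs, ys):
    """Residual sum of squares of the line y = w0 + w1 * x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))

# Two hypothetical candidate lines.
rss_flat = rss(640.0, 0.0, ages, deaths)     # horizontal line at 640
rss_steep = rss(0.0, 30.0, ages, deaths)     # line through the origin, slope 30

print(rss_flat, rss_steep)   # 1278.0 27958.0
```

The horizontal line has the much smaller RSS, so of these two guesses it is the better fit to these five points.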

Our goal is to minimize over all possible intercepts \(w_0\) and slopes \(w_1\), but the question is, how are we going to do this? The mathematical notation for this minimization over all possible \(w_0\), \(w_1\) is $$\min_{w_0,w_1}\sum_{i=1}^N(y_i - [w_0 +w_1x_i])^2$$ So we want to find the specific values of \(w_0\) and \(w_1\), which we’ll call \(\hat{w}_0\) and \(\hat{w}_1\) respectively, that minimize this residual sum of squares.

The red dot on the plot marks the desired minimum. We need an algorithm to find this minimum. We will discuss two approaches: a **Closed-form Solution** and **Gradient Descent**.

From calculus, we know that at the minimum the derivatives will be \(0\). So, we compute the gradient of our RSS: $$ \begin{aligned} RSS(w_0, w_1) &= \sum_{i=1}^N(y_i-[ w_0 + w_1*x_i])^2 \end{aligned}$$ $$\begin{aligned} &\nabla RSS(w_0, w_1) = \begin{bmatrix} \frac{\partial RSS}{\partial w_0} \\ \\ \frac{\partial RSS}{\partial w_1} \end{bmatrix} \\ &=\begin{bmatrix} -2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} \\ \\ -2\sum_{i=1}^N[y_i - (w_0 + w_1 * x_i)] *x_i \end{bmatrix} \end{aligned}$$

Take this gradient, set it equal to zero, and find the estimates for \(w_0\) and \(w_1\). Those are going to be the estimates of the two parameters of our model that define our fitted line. $$ \begin{aligned} &\nabla RSS(w_0, w_1) = 0 \end{aligned} $$ implies $$\begin{aligned} &\begin{bmatrix} -2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} \\ \\ -2\sum_{i=1}^N[y_i - (w_0 + w_1 * x_i)] *x_i \end{bmatrix} = 0 \\ &-2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} = 0, \\ &-2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} * x_i = 0 \end{aligned}$$ Solving for \(w_0\) and \(w_1\) we get $$ \begin{aligned} \hat{w}_0 &= \frac{\sum_{i = 1}^N y_i}{N} - \hat{w}_1\frac{\sum_{i=1}^N x_i}{N} \\ \hat{w}_1 &= \frac{\sum y_i x_i - \frac{\sum y_i \sum x_i}{N}}{\sum x_i^2 - \frac{\sum x_i \sum x_i}{N}} \end{aligned} $$

Now that we have the solutions, we just have to compute \( \hat{w}_1\) and then plug that in to compute \(\hat{w}_0\). To compute \( \hat{w}_1\) we need a couple of terms: the sum over all of our observations \(\sum y_i\), the sum over all of our inputs \(\sum x_i\), and two other terms that are sums of products, \(\sum y_i x_i\) and \(\sum x_i^2\). Plug them into these equations and we get the optimal \(\hat{w}_0\) and \(\hat{w}_1\) that minimize our residual sum of squares.
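These two formulas translate almost line by line into plain Python. The sketch below applies them to the five rows shown in the table; the function name `fit_simple_closed_form` is an assumption for this example, and because only five rows are used, the result is not the fit for the full dataset.

```python
def fit_simple_closed_form(xs, ys):
    """Closed-form estimates (w0_hat, w1_hat) for the line y = w0 + w1 * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope first, then plug it into the intercept formula.
    w1_hat = (sum_xy - sum_y * sum_x / n) / (sum_xx - sum_x * sum_x / n)
    w0_hat = sum_y / n - w1_hat * sum_x / n
    return w0_hat, w1_hat

ages = [21, 22, 23, 24, 25]
deaths = [652, 633, 653, 644, 610]

w0_hat, w1_hat = fit_simple_closed_form(ages, deaths)
print(w0_hat, w1_hat)   # roughly 806.3 and -7.3 on these five rows
```

The function only needs the four sums named above, so it makes a single pass over the data for each of them and needs no iteration.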

The other approach that we will discuss is gradient descent, where we walk down the surface of the residual sum of squares, trying to get to the minimum. Of course, we might overshoot it and go back and forth, but the general idea is that we’re doing this iterative procedure. The gradient is $$\begin{aligned} &\nabla RSS(w_0, w_1) \\ &= \begin{bmatrix} -2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} \\ \\ -2\sum_{i=1}^N[y_i - (w_0 + w_1 * x_i)] *x_i \end{bmatrix} \\&= \begin{bmatrix} -2\sum_{i=1}^N{[y_i - \hat{y}_i (w_0, w_1)]} \\ \\ -2\sum_{i=1}^N[y_i - \hat{y}_i(w_0,w_1)] *x_i \end{bmatrix} \end{aligned}$$

Then our gradient descent algorithm will be: $$\begin{aligned}&\text{while not converged:} \\ &\begin{bmatrix} w_0^{(t+1)} \\ w_1^{(t+1)} \end{bmatrix} = \begin{bmatrix}w_0^{(t)} \\ w_1^{(t)} \end{bmatrix} - \eta* \begin{bmatrix} -2\sum{[y_i - \hat{y}_i (w_0, w_1)]}\\ \\-2\sum[y_i - \hat{y}_i(w_0,w_1)] *x_i \end{bmatrix}\\ &\qquad = \begin{bmatrix} w_0^{(t)} \\ w_1^{(t)} \end{bmatrix} +2\eta* \begin{bmatrix} \sum{[y_i - \hat{y}_i (w_0, w_1)]} \\ \\ \sum[y_i - \hat{y}_i(w_0,w_1)] *x_i \end{bmatrix} \end{aligned}$$

So gradient descent works like this: we repeatedly update our weights, setting \(W\) to \(W\) minus \(\eta\) times the gradient, where \(W\) is the vector of weights \((w_0, w_1)\). We repeat that until the algorithm converges. Here \(\eta\) is the learning rate, which controls how big a step we take on each iteration of gradient descent, and the gradient term is the update, or the change, we want to make to the parameters \(W\).
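The loop can be sketched in plain Python as follows. The toy data (points lying exactly on \(y = 1 + 2x\)), the learning rate `eta=0.01`, the tolerance, and the iteration cap are all made up for illustration. Inputs on a larger scale, such as the ages above, would need a much smaller learning rate or rescaled inputs.

```python
def gradient_descent_simple(xs, ys, eta, tol=1e-6, max_iter=100_000):
    """Fit y = w0 + w1 * x by gradient descent on the RSS."""
    w0, w1 = 0.0, 0.0                          # start from zero weights
    for _ in range(max_iter):
        residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        grad_w0 = -2 * sum(residuals)
        grad_w1 = -2 * sum(r * x for r, x in zip(residuals, xs))
        # Assumed convergence test: stop when the gradient is nearly zero.
        if (grad_w0 ** 2 + grad_w1 ** 2) ** 0.5 < tol:
            break
        w0 -= eta * grad_w0
        w1 -= eta * grad_w1
    return w0, w1

# Made-up toy data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w0_gd, w1_gd = gradient_descent_simple(xs, ys, eta=0.01)
print(w0_gd, w1_gd)   # close to 1 and 2
```

Both weights are updated from the same residuals, computed before either weight changes, which matches the simultaneous vector update in the equation above.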

After all the hard work, we now need to test our machine learning model. The dataset we work on is generally split into two parts. One part is called the training data, where we do all the training, and the other is called the test data, where we test our model. We have developed equations for training, and using them we have obtained a fitted set of weights. We will then use this set of weights to predict the result for new data using the equation $$ Prediction = \hat{w}_0 + \hat{w}_1 * data $$ where \( \hat{w}_0\) and \(\hat{w}_1\) are the optimized weights.
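A minimal sketch of this train-then-test workflow, under stated assumptions: the data is made up and lies exactly on the line \(y = 3 + 0.5x\), the split simply takes the first four points for training and the last two for testing (a random split is more common in practice), and the helper names are hypothetical.

```python
def fit_simple_closed_form(xs, ys):
    """Closed-form estimates (w0_hat, w1_hat) for the line y = w0 + w1 * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    w1_hat = (sum_xy - sum_y * sum_x / n) / (sum_xx - sum_x * sum_x / n)
    w0_hat = sum_y / n - w1_hat * sum_x / n
    return w0_hat, w1_hat

def predict(w0_hat, w1_hat, x):
    """Prediction = w0_hat + w1_hat * data."""
    return w0_hat + w1_hat * x

# Made-up toy data on the line y = 3 + 0.5x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.5, 4.0, 4.5, 5.0, 5.5, 6.0]

# Fixed split for illustration: first four points train, last two test.
train_x, test_x = xs[:4], xs[4:]
train_y, test_y = ys[:4], ys[4:]

w0_hat, w1_hat = fit_simple_closed_form(train_x, train_y)
predictions = [predict(w0_hat, w1_hat, x) for x in test_x]
test_rss = sum((y - p) ** 2 for y, p in zip(test_y, predictions))
print(predictions, test_rss)
```

The weights are fitted on the training points only; the test points are used just to measure how well the fitted line predicts data it has not seen.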

Now that we have finished the theoretical part of the tutorial, you can look at the code and try to understand its different blocks.
