Previously, we briefly discussed simple linear regression. Here we will discuss multiple regression (also called multivariable regression) and how to solve for its parameters. At the end of the post, we provide Python code from scratch for multivariable regression.
Motivation
A single-variable linear regression model can learn to predict an output variable \(y\) when there is only one input variable \(x\) and there is a linear relationship between them, that is, \(y \approx w_0 + w_1 x\). That might not be a great predictive model in most cases. For example, let’s assume we are starting a real estate business and want to use machine learning to predict house prices. In particular, we have some houses that we want to list for sale, but we don’t know what they are worth. So, we’re going to look at other houses that sold in the recent past. Looking at how much they sold for and the characteristics of those houses, we will use that data to inform the listing price of the house we’d like to sell.
Now, the price of a house depends on many attributes. But as a first step we might look at the relationship between the square footage and the price of the house and fit a simple linear regression between them.
But we might go into the data set and notice that there are other houses with very similar square footage that are nonetheless fundamentally different. For example, one house has only one bathroom while another has three bathrooms; the second house should, of course, be worth more than the one with just one bathroom. So, we need to add more inputs to our regression model.
| Price | Bed_rooms | Bath_rooms | … | Sqft_living |
|---|---|---|---|---|
| 221900 | 3 | 1 | … | 1180 |
| 538000 | 3 | 2.25 | … | 2570 |
| 180000 | 2 | 1 | … | 770 |
| 604000 | 4 | 3 | … | 1960 |
| 510000 | 3 | 2 | … | 1680 |
| 1230000 | 4 | 4.5 | … | 5420 |
| 257500 | 3 | 2.25 | … | 1715 |
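To make the later derivations concrete, the code sketches in this post will assume the small sample above is available as NumPy arrays, one array per column (a minimal setup; the full from-scratch code comes at the end of the post):

```python
import numpy as np

# The sample house sales from the table above, one array per column.
price     = np.array([221900, 538000, 180000, 604000, 510000, 1230000, 257500], dtype=float)
bedrooms  = np.array([3, 3, 2, 4, 3, 4, 3], dtype=float)
bathrooms = np.array([1, 2.25, 1, 3, 2, 4.5, 2.25])
sqft      = np.array([1180, 2570, 770, 1960, 1680, 5420, 1715], dtype=float)
```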
So instead of just looking at square feet and using that to predict the house value, we’re going to look at other inputs as well. For example, we’re going to record the number of bathrooms in the house and use both of these inputs to predict the house price. In particular, in this higher-dimensional space we’re going to fit some function that models the relationship between the number of square feet, the number of bathrooms, and the output, the value of the house. One simple function we can think about is $$f(x) = w_0 + w_1 x_1 + w_2 x_2$$ where \(x_1\) is the number of square feet and \(x_2\) is the number of bathrooms.
We have just talked about square feet and the number of bathrooms as the inputs to our regression model. But associated with any house there are lots of different attributes, lots of things we could use as inputs, and this is where multivariable regression comes into play.
Model
When we have these multiple inputs, the simplest model we can think of is just a function of the inputs themselves. The input \(\textbf{x}\) is a d-dimensional vector and the output \(y\) is a scalar: $$\textbf{x} = (\textbf{x}[1], \textbf{x}[2], \dots , \textbf{x}[d])$$ where \(\textbf{x}[1]\), \( \textbf{x}[2] \), \(\dots\), \(\textbf{x}[d]\) are the components of the input holding the different features, e.g. square footage, number of bathrooms, number of bedrooms, etc. Plugging these inputs directly into our linear model with a noise term \(\epsilon_i\), we get the output \(y_i\) for the \(i^{\text{th}}\) data point:
$$y_i = w_0 + w_1 \textbf{x}_i[1] + w_2 \textbf{x}_i[2] + \dots + w_d \textbf{x}_i[d] + \epsilon_i$$ where the first feature in our model is just one, the constant feature. The second feature is the first input, for example the number of square feet, and the third feature is our second input, for example the number of bathrooms. And this goes on and on until we get to our last input, the \((d+1)^{\text{th}}\) feature, for example lot size. More generally, instead of fitting just a hyperplane (a single line in the one-input case), we can fit a polynomial or some other D-dimensional curve: $$\begin{aligned}y_i &= w_0 h_0( \textbf{x}_i)+ w_1 h_1( \textbf{x}_i) + \dots + w_D h_D( \textbf{x}_i) + \epsilon_i \\ &= \sum_{j=0}^D w_j h_j( \textbf{x}_i) + \epsilon_i\end{aligned}$$
Here we assume there are capital \(D\) different features built from these multiple inputs. As an example, maybe our zeroth feature is just the constant term, which is pretty typical; it simply shifts the curve up and down in the space. Our first feature might be just our first input, as in the hyperplane example. The second feature could be the second input, again as in the hyperplane example, or it could be some other function of any of the inputs. Maybe we want to take the log of the seventh input, which happens to be the number of bedrooms, times the number of bathrooms.
So, in this case, that feature of the model relates the log of the number of bedrooms times the number of bathrooms to the output. And we go all the way up to our capital \(D^{\text{th}}\) feature, which can be some function of any of the inputs to our regression model.
$$\begin{aligned} \text{feature } 1 &= h_0(\textbf{x}) \quad \text{e.g., } 1 \\ \text{feature } 2 &= h_1(\textbf{x}) \quad \text{e.g., } \textbf{x}[1] = \text{sq. ft.} \\ \text{feature } 3 &= h_2(\textbf{x}) \quad \text{e.g., } \textbf{x}[2] = \#\text{bath, or } \log(\textbf{x}[7]) \cdot \textbf{x}[2] = \log(\#\text{bed}) \cdot \#\text{bath} \\ &\;\;\vdots \\ \text{feature } D+1 &= h_D( \textbf{x}) \quad \text{some other function of } \textbf{x}[1], \dots, \textbf{x}[d]\end{aligned}$$

So this is our generic multiple regression model with multiple features.
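As a rough illustration of the feature functions \(h_j\) (a sketch, not the post’s final code; the choice of exactly these three features is just an example), here is how we might build them from the raw input arrays defined earlier:

```python
def build_features(sqft, bedrooms, bathrooms):
    """Stack the feature functions h_0(x), ..., h_D(x) for every observation.

    h_0(x) = 1 (constant), h_1(x) = sq. ft., h_2(x) = log(#bed) * #bath.
    Any other function of the raw inputs can be appended as an extra column.
    """
    h0 = np.ones_like(sqft)            # constant feature
    h1 = sqft                          # first input used directly
    h2 = np.log(bedrooms) * bathrooms  # log(#bed) * #bath
    return np.column_stack([h0, h1, h2])
```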
Matrix Notation
Like the simple linear regression, we’re going to talk about two different algorithms: one is a closed-form solution and the other is gradient descent. There are multiple steps we have to take to derive these algorithms, and the first is simply to rewrite our regression model in matrix notation.
So, we will begin by rewriting our multiple regression model in matrix notation for just a single observation $$ y_i= \sum_{j=0}^D w_j h_j(\textbf{x}_i) + \epsilon_i $$ and write it as: $$\begin{aligned} y_i &= \begin{bmatrix}w_0 & w_1 & w_2 & \dots & w_D\end{bmatrix}\begin{bmatrix} h_0(\textbf{x}_i) \\ h_1(\textbf{x}_i) \\ h_2(\textbf{x}_i) \\ \vdots \\ h_D(\textbf{x}_i)\end{bmatrix} + \epsilon_i \\ &= w_0 h_0(\textbf{x}_i)+ w_1 h_1(\textbf{x}_i) + \dots + w_D h_D(\textbf{x}_i) + \epsilon_i \\ &= \textbf{w}^T\textbf{h}(\textbf{x}_i) + \epsilon_i \end{aligned}$$ In particular, we always think of vectors as columns; when we need a row, we write the transpose.
Now, we are going to rewrite our model for all the observations together.
$$\begin{bmatrix}y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} h_0(x_1) & h_1(x_1) & \dots & h_D(x_1) \\ h_0(x_2) & h_1(x_2) & \dots & h_D(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_0(x_N) & h_1(x_N) & \dots & h_D(x_N) \end{bmatrix} \begin{bmatrix}w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} + \begin{bmatrix}\epsilon_1 \\ \epsilon _2 \\ \epsilon _3 \\ \vdots \\ \epsilon _N \end{bmatrix} $$ So, we get $$\textbf{y} = \textbf{Hw} + \mathbf{\epsilon}$$
Here, we can write our entire regression model for \(N\) observations as the vector \(\textbf{y}\) equal to the matrix \(\textbf{H}\) times the vector \(\textbf{w}\), plus the vector \(\epsilon\) that collects all the errors in our model. So this is the matrix notation for our model of \(N\) observations.
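The feature matrix here is exactly what `build_features` above returns: row \(i\) holds \(h_0(\textbf{x}_i), \dots, h_D(\textbf{x}_i)\). Continuing the sketch (with an arbitrary candidate weight vector, purely for illustration), all \(N\) predictions come from a single matrix-vector product:

```python
H = build_features(sqft, bedrooms, bathrooms)  # shape (N, D+1) = (7, 3)
y = price                                      # the observed outputs
w = np.array([0.0, 300.0, 10000.0])            # arbitrary candidate weights

y_hat = H @ w                  # all N predictions at once: y_hat = Hw
print(H.shape, y_hat.shape)    # (7, 3) (7,)
```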
Quality Metric
In the simple linear regression model, we used the residual sum of squares (RSS) as the cost function. For any given fit, the RSS of our parameters is $$\begin{aligned}RSS(w_0, w_1) &= \sum_{i=1}^N(y_i - [w_0 + w_1 x_i ])^2 \\ &= \sum_{i=1}^N(y_i - \hat{y}_i(w_0, w_1))^2 \end{aligned}$$ where \( \hat{y}_i\) is the predicted value for \(y_i\), and \(w_0\) and \(w_1\) are the intercept and slope respectively. Now we will extend the residual sum of squares to the case of multiple regression. The residual is the difference between the actual observation and the predicted value. So what is our predicted value for the \(i^{\text{th}}\) observation? In vector notation, we take each of the weights in our model and multiply the corresponding feature of that observation by that weight. So
$$\begin{aligned} \hat{y}_i &= \begin{bmatrix} h_0(\textbf{x}_i) & h_1(\textbf{x}_i) & h_2(\textbf{x}_i) & \dots & h_D(\textbf{x}_i) \end{bmatrix}\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} \\ &= \textbf{h}(\textbf{x}_i)^T \textbf{w}\end{aligned}$$ This is our predicted value for the \(i^{\text{th}}\) observation. So the RSS for multiple regression is: $$\begin{aligned}RSS(\textbf{w}) &= \sum_{i = 1}^N (y_i - \textbf{h}(\textbf{x}_i)^T \textbf{w})^2 \\ &= (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw}) \end{aligned}$$
So why are these two expressions equivalent? We’ll break the explanation into parts. We know that \(\hat{\textbf{y}}\), the vector of all \(N\) predicted observations, is equal to \(\textbf{H}\) times \(\textbf{w}\), i.e. \(\hat{\textbf{y}} = \textbf{H} \textbf{w}\), which implies: $$\textbf{y} - \textbf{H}\textbf{w} = \textbf{y} - \hat{\textbf{y}}$$ This is equivalent to taking our vector of actual observed values and subtracting our vector of predicted values. So we take all our house sale prices, look at all the predicted house prices given a set of parameters \(\textbf{w}\), and subtract them. What is that vector?
$$ \textbf{y} - \hat{\textbf{y}} = \begin{bmatrix}\text{residual}_1 \\ \text{residual}_2\\ \vdots \\ \text{residual}_N \end{bmatrix}$$That is the vector of residuals, because each entry is the difference between an actual house sale price and its predicted price: we call the first entry the residual for the first prediction, and likewise for the second, all the way up to our \(N^\text{th}\) observation. So the term \(\textbf{y} - \textbf{H}\textbf{w}\) is the vector of the residuals from our predictions.
So, $$\begin{aligned} (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw}) &= \begin{bmatrix}\text{residual}_1 & \text{residual}_2 & \dots & \text{residual}_N \end{bmatrix} \begin{bmatrix}\text{residual}_1 \\ \text{residual}_2\\ \vdots \\ \text{residual}_N \end{bmatrix} \\ &= \text{residual}_1^2 + \text{residual}_2^2 + \dots + \text{residual}_N^2 \\ &= \sum _{i=1}^N \text{residual}_i^2 \\ &= RSS(\textbf{w}) \end{aligned} $$
By definition, that is exactly the residual sum of squares for these \(\textbf{w}\) parameters.
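We can check this equivalence numerically with the arrays from the sketch above: computing the RSS as an explicit sum of squared residuals and in matrix form gives the same number up to floating-point error.

```python
def rss(w, H, y):
    """Residual sum of squares in matrix form: (y - Hw)^T (y - Hw)."""
    residuals = y - H @ w
    return residuals @ residuals

# Element-wise version: sum_i (y_i - h(x_i)^T w)^2
rss_elementwise = sum((y_i - h_i @ w) ** 2 for h_i, y_i in zip(H, y))

print(np.isclose(rss(w, H, y), rss_elementwise))  # True
```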
Closed form solution for Multiple Regression
Now we’re on to the final important step of the derivation: taking the gradient. The gradient is important both for the closed-form solution and, of course, for the gradient descent algorithm. Just as in the scalar case, where the derivative of \((y - hw)^2\) with respect to \(w\) is \(-2h(y - hw)\), the matrix version follows from the chain rule: $$\begin{aligned} \nabla RSS(\textbf{w}) &= \nabla[ (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw})] \\ &= -2\textbf{H}^T(\textbf{y} - \textbf{Hw}) \end{aligned}$$
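This gradient translates directly into code; a minimal sketch, reusing `H`, `y`, and `w` from above:

```python
def rss_gradient(w, H, y):
    """Gradient of RSS(w): -2 H^T (y - Hw)."""
    return -2.0 * H.T @ (y - H @ w)

print(rss_gradient(w, H, y))  # one partial derivative per coefficient w_0, ..., w_D
```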
From calculus we know that the gradient is zero at the minimum. So, for the closed-form solution we take the gradient, set it equal to zero, and solve for \(\textbf{w}\): $$\begin{aligned} \nabla RSS(\textbf{w}) = -2\textbf{H}^T(\textbf{y} - \textbf{Hw}) &= 0 \\ -2\textbf{H}^T \textbf{y} + 2\textbf{H}^T\textbf{Hw} &= 0 \\ \textbf{H}^T\textbf{Hw} &= \textbf{H}^T\textbf{y} \\ \hat{\textbf{w}} &= (\textbf{H}^T \textbf{H})^{-1} \textbf{H}^T\textbf{y} \end{aligned}$$ We have a whole collection of parameters, \(w_0\), \(w_1\), all the way up to \(w_D\), multiplying all the features in our multiple regression model, and in one line we can write the solution to the fit using matrix notation. This motivates why we went through all the work of writing things in matrix notation: it gives us a nice closed-form solution for all of our parameters, written very compactly.
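In code, the closed-form estimate takes a couple of lines. A sketch: rather than forming \((\textbf{H}^T\textbf{H})^{-1}\) explicitly, it is numerically preferable to solve the normal equations \(\textbf{H}^T\textbf{H}\hat{\textbf{w}} = \textbf{H}^T\textbf{y}\) directly.

```python
# Solve H^T H w = H^T y instead of inverting H^T H explicitly.
w_hat = np.linalg.solve(H.T @ H, H.T @ y)

print(w_hat)             # fitted intercept and feature weights
print(rss(w_hat, H, y))  # the smallest RSS achievable with these features
```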
Gradient Descent solution for Multiple Regression
An alternative, and often more practical, approach is gradient descent, where we walk down the surface of the residual sum of squares trying to reach the minimum. Of course, we might overshoot it and go back and forth, but the general idea is this iterative procedure: $$\begin{aligned}&\text{while not converged:} \\ &\qquad \textbf{w}^{(t+1)} \leftarrow \textbf{w}^{(t)} - \eta \nabla RSS(\textbf{w}^{(t)}) = \textbf{w}^{(t)} + 2\eta\textbf{H}^T(\textbf{y} - \textbf{H}\textbf{w}^{(t)})\end{aligned}$$ where \(\eta\) is the step size and \(t\) indexes the iterations. What this version of the algorithm does is take our entire \(\textbf{w}\) vector, all the regression coefficients in our model, and update them all at once using the matrix notation shown here.
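And a sketch of the gradient descent version, with illustrative choices of step size and stopping tolerance (in practice the features are usually rescaled first so that one step size works well for all coefficients):

```python
def gradient_descent(H, y, eta=1e-9, tol=1e-3, max_iter=100_000):
    """Minimize RSS(w) with the update w <- w + 2*eta*H^T(y - Hw)."""
    w = np.zeros(H.shape[1])            # start from all-zero coefficients
    for _ in range(max_iter):
        grad = -2.0 * H.T @ (y - H @ w)
        w = w - eta * grad              # step downhill on the RSS surface
        if np.linalg.norm(grad) < tol:  # stop once the gradient is (near) zero
            break
    return w

# With a suitable eta, enough iterations, and typically rescaled features,
# this approaches the closed-form w_hat computed above.
w_gd = gradient_descent(H, y)
```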
Now that we have finished the theoretical part of the tutorial, you can go through the code below and try to understand its different blocks.