We will start with the gradient descent algorithm, which is the most basic and widely used optimization algorithm for neural networks. Then we will explain mini-batch and Stochastic Gradient Descent (SGD), which are modifications of gradient descent. After that, we will introduce momentum, which is the core concept behind many modern optimization algorithms. Finally, we will derive RMSProp and Adam, two highly efficient algorithms.

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. In a neural network, this function is called the loss function or cost function. Suppose we want to solve a classification problem using a neural network. For simplicity, suppose the neural network contains only a single hidden layer.

Suppose we have a set of inputs or features \({X}\) with associated classification labels \({Y}\) $$\begin{aligned}X &= \begin{bmatrix} x ^{(1)} & x^{(2)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y ^{(1)} & y^{(2)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned}$$ where \(x ^{(1)}, x^{(2)}, \ldots , x^{(m)} \) are the individual training examples and \(m\) is the number of training examples. Then the output of the first layer, \(A^{[1]}\), will be $$\begin{aligned} Z^{[1]} &= W^{[1]}.X + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[1]} &= \begin{bmatrix} z ^{[1](1)} & z^{[1](2)} & \ldots & z^{[1](m)}\end{bmatrix}\\ A^{[1]} &= \begin{bmatrix} a ^{[1](1)} & a^{[1](2)} & \ldots & a^{[1](m)}\end{bmatrix} \end{aligned}$$ where \(a^{[l](i)}\) denotes the output of layer \(l\) for the \(i\)-th example. Weights of this network are initialized with \(0\) or randomly.

Then we will use a loss function \(E\) to measure the error between the prediction and the real value. Our goal is to minimize this error and make predictions as close as possible to the real values by updating the weights. A common loss function is the Sum of Squared Errors (SSE), defined as $$\begin{aligned} E &= \frac{1}{2} \sum \left[y - \hat{y}\right]^2 \\ E &= \frac{1}{2} \sum \left[y - \sigma\left(\sum w_i.x_i\right)\right]^2\end{aligned}$$ where \(\hat{y}\) is the prediction and \(y\) is the true value. One thing to notice is that the error is a function of the weights. We have to tune the weights to alter the network’s prediction, which in turn changes the overall error. Our goal is to find the weights that minimize the error, and to do that we use gradients. Suppose we plot the weight on the x-axis and the error \((E)\) on the y-axis to get a curve.

Here we show a simple depiction of the error as a function of one weight. Our goal is to find the weight that minimizes the error. We start with a random weight and step towards the minimum. The direction of each step is opposite to the gradient, or slope. After taking several steps in this direction we will eventually reach the minimum of the error function. To update the weights we use $$\begin{aligned}w_i &= w_i + \Delta w_i \\ \Delta w_i &\propto - \frac{\partial E}{\partial w_i} \\ &= - \eta \frac{\partial E}{\partial w_i}\end{aligned}$$ Here \(\eta \) is a constant called the learning rate of the neural network. The learning rate is a hyperparameter that needs to be tuned. Expanding the derivative term for a single example, with \(z = \sum w_i.x_i\) and \(\hat{y} = f(z)\), $$\begin{aligned}\frac{\partial E}{\partial w_i} &= -(y - \hat{y})\frac{\partial {\hat {y}}}{\partial w_i} \\& = -(y - \hat{y}). f^{'} (z) \frac{\partial}{\partial w_i} \sum{w_i.x_i} \\ &= -(y - \hat{y}). f^{'} (z).x_i\end{aligned}$$ We can simplify the equation by defining another term \(\delta\), called the error term $$\begin{aligned} w_i &= w_i - \eta \frac{\partial E}{\partial w_i}\\ &= w_i + \eta (y - \hat{y}). f^{'} (z).x_i \\ &= w_i + \eta \delta x_i\\ \text{where} \; \delta &= (y - \hat{y}). f^{'} (z)\end{aligned}$$
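The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a full network: it assumes a single sigmoid unit (so \(f^{'}(z) = \hat{y}(1-\hat{y})\)), a single training example, and made-up starting weights and learning rate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sse(w, x, y):
    """Half squared error for one example."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return 0.5 * (y - y_hat) ** 2

def gradient_descent_step(w, x, y, eta):
    """One update w_i <- w_i + eta * delta * x_i."""
    y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    # Error term: delta = (y - y_hat) * f'(z), with f'(z) = y_hat * (1 - y_hat)
    delta = (y - y_hat) * y_hat * (1.0 - y_hat)
    return [wi + eta * delta * xi for wi, xi in zip(w, x)]

w = [0.1, -0.2]            # illustrative starting weights
x, y = [1.0, 2.0], 1.0     # one toy training example
errors = []
for _ in range(100):
    errors.append(sse(w, x, y))
    w = gradient_descent_step(w, x, y, eta=0.5)
final_error = sse(w, x, y)
```

Each pass moves the weights against the gradient, so the recorded error shrinks from one iteration to the next on this toy example.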

In the gradient descent algorithm above, we used all the training examples at once, which is also known as batch gradient descent. But there are some problems with batch gradient descent. If the number of training examples is very large, a single epoch takes a long time and requires a lot of memory. For example, if we have ten million training examples, we have to process all of them before taking a single step towards the minimum. To resolve this issue we use mini-batch gradient descent, splitting the training examples into small chunks called batches and training on one batch at a time. Suppose we have a set of \(m\) inputs or features \({X}\) with associated classification labels \({Y}\). $$\begin{aligned}X &= \begin{bmatrix} x ^{(1)} & x^{(2)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y ^{(1)} & y^{(2)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned}$$ In mini-batch gradient descent, instead of processing all the training examples together, we split them into small batches, e.g. 1000 training examples each. $$\begin{aligned}X &= \begin{bmatrix} x ^{(1)} & x^{(2)} & \ldots & x^{(1000)}|x^{(1001)} & x^{(1002)}&\ldots & x^{(2000)} &| x^{(2001)} & x^{(2002)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y ^{(1)} & y^{(2)} & \ldots & y^{(1000)} | y^{(1001)} & y^{(1002)} &\ldots & y^{(2000)} &| y^{(2001)} & y^{(2002)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned}$$ For simplicity, we can denote mini-batches as $$\begin{aligned}X^{\{1\}} &= x ^{(1)} \; x^{(2)} \; \ldots \; x^{(1000)}\\X^{\{2\}} &= x ^{(1001)} \; x^{(1002)} \; \ldots \; x^{(2000)} \\ \vdots\\ X^{\{t\}} &= x ^{(1000(t-1) + 1)} \; x^{(1000(t-1) + 2)} \; \ldots \; x^{(m)}\\ X &= [X^{\{1\}} X^{\{2\}} \ldots X^{\{t\}}]\end{aligned}$$ The training process is now similar to batch gradient descent. We pass the batches in one at a time and update the weights after every single batch. 
The forward propagation and loss function for mini-batch gradient descent will be $$\begin{aligned} Z^{[1]} &= W^{[1]}.X^{\{t\}} + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[1]} &= \begin{bmatrix} z ^{[1](1)} & z^{[1](2)} & … & z^{[1](m)}\end{bmatrix}\\ A^{[1]} &= \begin{bmatrix} a ^{[1](1)} & a^{[1](2)} & … & a^{[1](m)}\end{bmatrix} \\ E &= \frac{t}{N} \sum \left[y – \sigma\left(\sum w_i.x_i\right)\right]^2 \\ \frac{\partial E}{\partial w_i} &=- \frac{t}{N}(y – \hat{y}). f^{‘} (z).x_i \\N &= total \; number \; of \; sample \\ t &= number \; of \;batches \end{aligned}$$ Like before, we can define error term \(\delta\) for simplification and write the equations as $$\begin{aligned} \delta &= \frac{t}{N}(y – \hat{y}). f^{‘} (z) \\ w_i &= w_i + \eta\delta x_i\end{aligned}$$

When the mini-batch size is \(1\) the method is called stochastic gradient descent.

- If \(t=1\) (a single batch containing all \(N\) examples), Batch Gradient Descent
- If \(1<t<N\), Mini-Batch Gradient Descent
- If \(t=N\) (\(N\) batches of one example each), Stochastic Gradient Descent
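The splitting into mini-batches can be sketched as follows. The helper name `make_mini_batches` and the toy data are assumptions for illustration; a real pipeline would usually also shuffle the examples before splitting.

```python
def make_mini_batches(X, Y, batch_size):
    """Split paired examples into consecutive mini-batches."""
    return [
        (X[i:i + batch_size], Y[i:i + batch_size])
        for i in range(0, len(X), batch_size)
    ]

X = list(range(1, 2501))      # 2500 toy "examples"
Y = [x % 2 for x in X]        # toy labels
batches = make_mini_batches(X, Y, batch_size=1000)
batch_sizes = [len(xb) for xb, _ in batches]

# The two extremes: one batch with everything, or one example per batch.
n_batches_full = len(make_mini_batches(X, Y, batch_size=len(X)))
n_batches_sgd = len(make_mini_batches(X, Y, batch_size=1))
```

With 2500 examples and a batch size of 1000, the last batch simply holds the remaining 500 examples. A batch size equal to the data set size gives batch gradient descent, and a batch size of 1 gives stochastic gradient descent.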

Gradient descent is a basic and fundamental optimization algorithm, and it has some problems. While converging to the minimum, gradient descent oscillates up and down and takes a lot of steps. These oscillations make gradient descent a lot slower and prevent us from using a larger learning rate. Another big problem is getting stuck in a local minimum instead of reaching the global minimum. Gradient descent with momentum helps to address these issues. In this method, we calculate an exponentially weighted average of our gradients and use that to update our weights.

Suppose we are trying to optimize a cost function that has contours like those above, and the red dot denotes the position of the local optimum. Starting gradient descent from the first point, we reach the second position after one iteration, the third position after another iteration, and so on. With each iteration, we move closer to the local optimum. If we look closely we can see there are two types of motion: horizontal and vertical. There are up and down oscillations in the vertical direction. Due to these oscillations, it takes longer to reach the optimum. Also, due to these oscillations, we cannot use a bigger learning rate, since the updates may diverge.

In this method, we use an exponentially weighted average, which in simple words just means taking previous values into account while updating the weights. Previously, to update weights we used $$\begin{aligned}w_i &= w_i + \Delta w_i \\ \Delta w_i &\propto - \frac{\partial E}{\partial w_i} \\ &= - \eta \frac{\partial E}{\partial w_i}\end{aligned}$$ In the momentum method, for two weights \(w_1\) and \(w_2\), we compute exponentially weighted averages of \(\Delta w_1\) and \(\Delta w_2\) and denote them \(V_{\Delta w_1}\) and \(V_{\Delta w_2}\) respectively. $$\begin{aligned}V_{\Delta w_1} &= \beta_1 \times V_{\Delta w_1} + (1 - \beta_1)\times \Delta w_1 \\ V_{\Delta w_2} &= \beta_1 \times V_{\Delta w_2} + (1 - \beta_1)\times \Delta w_2\end{aligned}$$

Here \(\beta_1\) is a hyperparameter that balances the previous and the current values and takes values in \([0,1]\). After calculating the exponentially weighted averages we update our parameters using these averages. $$\begin{aligned}w_1 &= w_1 + \eta \times V_{\Delta w_1} \\ w_2 &= w_2 + \eta \times V_{\Delta w_2}\end{aligned}$$

The intuition behind this method is quite simple. When we take the exponential average of the previous values, the up and down oscillations cancel each other out and the vertical motion gets closer to zero. In the horizontal direction, however, all the gradients point in the same direction, so the averaging does not slow down the horizontal motion.
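The cancellation effect can be checked numerically with a small sketch. The step values below are made up for illustration: the "horizontal" steps always point the same way, while the "vertical" steps flip sign on every iteration.

```python
def ema(values, beta=0.9):
    """Exponentially weighted average: V = beta * V + (1 - beta) * value."""
    v = 0.0
    for value in values:
        v = beta * v + (1 - beta) * value
    return v

horizontal_steps = [0.5] * 50                        # same direction every time
vertical_steps = [(-1.0) ** k for k in range(50)]    # alternates +1, -1, +1, ...

v_horizontal = ema(horizontal_steps)
v_vertical = ema(vertical_steps)
```

After 50 iterations the horizontal average stays close to the raw step of 0.5, while the vertical average shrinks to a small fraction of the raw magnitude of 1, which is the damping described above.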

Root Mean Square Prop, or RMSProp, is another optimization algorithm that is quite similar to gradient descent with momentum. As before, suppose that we are trying to optimize a cost function that has contours like those below, and the red dot denotes the position of the local optimum. Starting gradient descent from the first point, we reach the second position after one iteration, the third position after another iteration, and so on. With each iteration, we move closer to the local optimum. If we look closely we can see there are two types of motion: horizontal and vertical. There are up and down oscillations in the vertical direction. Due to these oscillations, it takes longer to reach the optimum. Also, due to these oscillations, we cannot use a bigger learning rate, since the updates may diverge.

For simplicity, we derive the equations in a two-dimensional space with two weights \(w_1\) and \(w_2\), where \(w_1\) moves in the horizontal direction and \(w_2\) moves in the vertical direction. In RMSProp, we use exponentially weighted averages like before, but here we average the squares of the gradients $$\begin{aligned}S_{\Delta w_1} &= \beta_2 \times S_{\Delta w_1} + (1 - \beta_2)\times {\Delta w_1}^2 \\ S_{\Delta w_2} &= \beta_2 \times S_{\Delta w_2} + (1 - \beta_2)\times {\Delta w_2}^2\end{aligned}$$ After calculating the exponentially weighted averages we update our parameters using these averages. $$\begin{aligned}w_1 &= w_1 + \eta \times \frac{\Delta w_1}{\sqrt {S_{\Delta w_1}} + \epsilon} \\ w_2 &= w_2 + \eta \times \frac{\Delta w_2}{\sqrt {S_{\Delta w_2}} + \epsilon}\end{aligned}$$

Here we use \(\epsilon\) for numerical stability; it is generally a very small number such as \(10^{-8}\). The intuition behind RMSProp is that in the horizontal direction, in our case the \(w_1\) direction, we want learning to go fast, while in the \(w_2\) direction we want to slow it down to reduce the oscillations. Since we are dividing by \(\sqrt{S_{\Delta w_1}}\) and \(\sqrt{S_{\Delta w_2}}\), we want \(S_{\Delta w_1}\) to be smaller and \(S_{\Delta w_2}\) to be bigger. If we look at the derivatives, the slope is much larger in the vertical direction and much smaller in the horizontal direction. So \(S_{\Delta w_2}\) will be relatively larger than \(S_{\Delta w_1}\). In summary, we divide the updates in the vertical direction by a much bigger number to reduce the oscillations, while dividing the updates in the horizontal direction by a smaller number, which has very little impact.
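A single RMSProp step can be sketched as below. This is an illustration under stated assumptions: \(\Delta w\) is taken to be the negative gradient, the step values are made up, and `rmsprop_step` is a hypothetical helper name.

```python
import math

def rmsprop_step(w, s, dw, eta=0.001, beta2=0.999, eps=1e-8):
    """Update the running average of squared steps, then scale each step by it."""
    new_s = [beta2 * si + (1 - beta2) * d ** 2 for si, d in zip(s, dw)]
    new_w = [
        wi + eta * d / (math.sqrt(si) + eps)
        for wi, d, si in zip(w, dw, new_s)
    ]
    return new_w, new_s

w = [0.0, 0.0]
s = [0.0, 0.0]
dw = [0.1, -10.0]    # small horizontal step, large vertical step (toy values)
new_w, s = rmsprop_step(w, s, dw)
steps = [nw - ow for nw, ow in zip(new_w, w)]
```

The raw steps differ in magnitude by a factor of 100, but after dividing by the root of the running average the applied steps come out almost equal in size, so the steep vertical direction no longer dominates the update.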

Adam or Adaptive moment estimation algorithm is another very popular optimization algorithm for different types of neural networks. This algorithm is basically using momentum and RMSProp together. Below is the algorithm for Adam optimization.

$$\begin{aligned} V_{\Delta w_1} &= 0 \; S_{\Delta w_1} = 0 \; V_{\Delta w_2} = 0 \; S_{\Delta w_2} = 0\\ on \; iteration\; t&:\\ &V_{\Delta w_1} = \beta_1 V_{\Delta w_1} + (1 - \beta_1) \Delta w_1 \; \; \; V_{\Delta w_2} = \beta_1 V_{\Delta w_2} + (1 - \beta_1) \Delta w_2\\ &S_{\Delta w_1} = \beta_2 S_{\Delta w_1} + (1 - \beta_2) {\Delta w_1}^2 \; \; \; S_{\Delta w_2} = \beta_2 S_{\Delta w_2} + (1 - \beta_2) {\Delta w_2}^2\\ &V^{\prime}_{\Delta w_1} = \frac{V_{\Delta w_1}}{1 - \beta_1^t} \; \; \; V^{\prime}_{\Delta w_2} = \frac{V_{\Delta w_2}}{1 - \beta_1^t} \\ &S^{\prime}_{\Delta w_1} = \frac{S_{\Delta w_1}}{1 - \beta_2^t} \; \; \; S^{\prime}_{\Delta w_2} = \frac{S_{\Delta w_2}}{1 - \beta_2^t} \\ &w_1 = w_1 + \eta \frac{V^{\prime}_{\Delta w_1}}{\sqrt{S^{\prime}_{\Delta w_1}} + \epsilon} \\ &w_2 = w_2 + \eta \frac{V^{\prime}_{\Delta w_2}}{\sqrt{S^{\prime}_{\Delta w_2}} + \epsilon}\end{aligned}$$
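The Adam steps can be sketched in Python as follows. This sketch takes \(\Delta w\) to be the negative gradient, in line with \(\Delta w_i \propto -\partial E / \partial w_i\), so the scaled step is added to the weight. The toy cost function, starting point, and the helper name `adam_minimize` are assumptions for illustration.

```python
import math

def adam_minimize(grad, w, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=1):
    """Run `steps` Adam iterations from the starting weights `w`."""
    v = [0.0] * len(w)
    s = [0.0] * len(w)
    w = list(w)
    for t in range(1, steps + 1):
        dw = [-g for g in grad(w)]   # assumed: delta_w is the negative gradient
        for i in range(len(w)):
            v[i] = beta1 * v[i] + (1 - beta1) * dw[i]        # momentum average
            s[i] = beta2 * s[i] + (1 - beta2) * dw[i] ** 2   # RMSProp average
            v_hat = v[i] / (1 - beta1 ** t)                  # bias correction
            s_hat = s[i] / (1 - beta2 ** t)
            w[i] = w[i] + eta * v_hat / (math.sqrt(s_hat) + eps)
    return w

def toy_grad(w):
    # Gradient of the toy cost E = 0.5 * (w1^2 + 20 * w2^2)
    return [w[0], 20.0 * w[1]]

def toy_cost(w):
    return 0.5 * (w[0] ** 2 + 20.0 * w[1] ** 2)

start = [1.0, -2.0]
after_one = adam_minimize(toy_grad, start, steps=1)
after_many = adam_minimize(toy_grad, start, steps=2000)
```

On the very first iteration the bias-corrected averages reduce to \(\Delta w\) and \({\Delta w}^2\), so each weight moves by almost exactly \(\eta\) towards the minimum regardless of how steep its direction is.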

In this post, we have discussed various optimization algorithms used in deep learning. These optimization algorithms also come with different hyperparameters. Even though there’s no straightforward rule for choosing them, there are some commonly used values for \(\beta_1\), \(\beta_2\) and \(\epsilon\).

- Learning Rate, \(\eta\) – Needs to be tuned
- \(\beta_1\) – 0.9
- \(\beta_2\) – 0.999
- \(\epsilon\) – \(10^{-8}\)

Previously, we covered linear classifier models and described the concept of decision boundaries. We took some features \(x_i\), chose some random weights \(w_i\) and calculated weighted sums. Then we used a simple sign function to determine whether the value is positive or negative: $$\begin{aligned} &\text{Model: } \hat{y}_i = \text{sign(Score(}\textbf{x}_i)) \\ &\text{Score(}\textbf{x}_i) = w_1 \textbf{x}_1+ \ldots + w_n \textbf{ x}_n + b \end{aligned}$$ If the value is positive we give a positive prediction, and if the value is negative we give a negative prediction. But there are problems with these models. The prediction function we used in linear classifiers is a simple sign function whose output depends only on the sign of the score, so it is a discrete value.

Using linear classifiers, scores like 0.0000001 and 10 lead to the same prediction. But what if we want a model that can make a prediction and at the same time tell us how confident it is about that prediction? Here’s where probability comes in. Using probability we can tell how confident we are about a prediction. Also, in order to use gradient descent, our error function needs to be both continuous and differentiable, so we need to move from discrete predictions to continuous predictions.

In logistic regression, we don’t just use discrete predictions like \(+1\) or \(-1\), we actually predict a number generally between 0 and 1 which we will consider a probability. These probabilities are extremely useful because they give us an indication of how sure we are about the predictions we make. However, a model can not predict correctly every time. Using probability we can actually predict how confident we are about a prediction.

So, we want to use a continuous error function and we want to use probability. Question is, how can we do that? The way we address both of these problems is by changing our step function from a discrete step function to a continuous step function.

This function is called the sigmoid function and its equation is: $$ \sigma(x) = \frac{1}{1+e^{-x}} $$
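As a small sketch of the difference this makes, the sigmoid can be evaluated on two scores that a sign function would treat identically. The function below is a direct transcription of the formula; for very large negative inputs a numerically safer variant would be needed.

```python
import math

def sigmoid(x):
    """Sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

barely_positive = sigmoid(0.0000001)   # a score just above zero
clearly_positive = sigmoid(10)         # a score far above zero
at_zero = sigmoid(0.0)
```

Both scores are positive, but the first maps to a probability of about one half while the second maps to a probability very close to one, which is how the model expresses its confidence.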

It gives an output between 0 and 1 and it is a continuous function. Now, we need an algorithm that will help us find the best model. The best model is the one that gives a higher probability to the events that actually happened. The method is called maximum likelihood: we pick the model that gives the existing labels the highest probability. Thus, by maximizing the probability, we can pick the best possible model.

Maximum likelihood is a quite simple but interesting concept. Suppose we have some probabilities of events \(p_1, p_2, \ldots, p_n\). Assuming the events are independent, to obtain the probability of the whole arrangement we multiply these probabilities: \( p_1 * p_2 * \ldots * p_n \). The bigger the individual probabilities are, the bigger the resulting value will be. Now, all we need to do is maximize this probability. The maximum likelihood function is: $$l(w) = \prod_{i=1}^N P(y_i | \textbf{x}_i, \textbf{w})$$

The maximum likelihood is the product of numbers and the product has got some problems. If we have a big data-set with thousands of probabilities and if these probabilities \(p_1, p_2, …, p_n\) get really small then the resulting multiplication will become really tiny since the value of the probability is always between 0 and 1. Also, if we have a product of thousands of numbers and we change one of them the product will change drastically. So, we want to avoid products and one alternative is to use sums. One easy way to do this is to use logarithms and the concept is called the cross-entropy.

In cross-entropy, we take the logarithm of the product. We know: $$\begin{aligned}\ln (a*b) = \ln (a) + \ln (b)\end{aligned} $$ So, if we take the logarithm of the product of the probabilities we get: $$ \ln(p_1 * p_2 * \ldots * p_n) = \ln(p_1) + \ln(p_2)+ \ldots +\ln(p_n)$$ This is a good solution, but there’s still a problem. Every \(\ln(p_i)\) will be negative or zero, because the logarithm of a number between 0 and 1 is always negative and \(\ln(1) = 0\). If we take the negative of the logarithms of the probabilities instead, we get positive numbers. $$\begin{aligned}\text{CE} &= -\ln(p_1) - \ln(p_2) \ldots - \ln(p_n) \\ &= - \sum \ln(p_i)\end{aligned}$$ This value is called the cross-entropy. It is a very important concept and tells us how good our model is. What cross-entropy says is: if we have some events and some corresponding probabilities, how likely are those events to happen based on the probabilities? If they are very likely, we have a small cross-entropy; if they are unlikely, the cross-entropy will be large. It is important to notice that our goal has changed from maximizing the probability to minimizing the cross-entropy. Now, we need a general formula for cross-entropy.

Let’s say we are going to do an assignment and our supervisor has suggested three books to look at to get some help. Suppose the probabilities of getting something useful from these books are \(p_1 = 0.9\), \(p_2 = 0.7\) and \(p_3 = 0.2\) respectively, and suppose these are independent events. Say the first two books turn out to be useful and the third does not. The probabilities of these outcomes are 0.9 for the first book, 0.7 for the second book and 0.8 for not having useful content in the third book. We know the cross-entropy will be the sum of the negative logarithms of the probabilities $$ -\ln(p_1) - \ln(p_2) - \ln(1-p_3) = - \ln(0.9) - \ln (0.7) - \ln(0.8)$$ Let there be another variable called \(y_i\), which will be 1 if there is something useful in the book and 0 if there is not. In our example, for the first book \(y_1 = 1\), for the second book \(y_2 = 1\) and for the third book \(y_3 = 0\). If we put all this information together, we can find a formula: $$\text{Cross-Entropy} = - \sum_{i=1}^m \left[ y_{i}\ln(p_i) + (1- y_i)\ln(1-p_i) \right]$$

If we look closely, we will notice that if there is something useful in a book, \(y_i\) will be 1 and the second term will be zero, and if there is nothing useful, \(y_i = 0\) and the first term becomes zero. That is the beauty of this formula: it gives exactly the sum of the negative logarithms of the probabilities, which we have defined as cross-entropy. If we calculate the cross-entropy of the pair: $$\text{CE }[(1,1,0),(0.9,0.7,0.2)] = 0.69$$ which is low since the event is likely. On the other hand, if we calculate $$\text{CE }[(0,0,1),(0.9,0.7,0.2)] = 5.12$$ the cross-entropy is high. By convention we actually use the average instead of the sum: $$\text{Cross-Entropy} = - \frac{1}{m}\sum_{i=1}^m \left[ y_{i}\ln(p_i) + (1- y_i)\ln(1-p_i) \right]$$
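The two values above can be reproduced with a short sketch. The function name and the `average` flag are assumptions for illustration; the function expects probabilities strictly between 0 and 1, since the logarithm is undefined at 0.

```python
import math

def cross_entropy(labels, probs, average=False):
    """Binary cross-entropy: -sum(y * ln(p) + (1 - y) * ln(1 - p))."""
    total = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probs)
    )
    return total / len(labels) if average else total

probs = [0.9, 0.7, 0.2]
likely = cross_entropy([1, 1, 0], probs)      # outcomes the model expects
unlikely = cross_entropy([0, 0, 1], probs)    # outcomes the model finds unlikely
likely_avg = cross_entropy([1, 1, 0], probs, average=True)
```

The likely arrangement gives roughly 0.69 and the unlikely one roughly 5.12, matching the values in the text; the averaged version just divides the sum by the number of events.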

In this example, we only had two classes: having something useful in the book and not having something useful in the book. What if we have more than two classes? The concept is called Multi-Class Cross-Entropy.

Suppose again, we need to do an assignment and our supervisor has suggested looking at three different books. But, in this case we have three different topics: Science, Technology, and Sports. For these cases, we use a formula which is more general. $$\text{Cross-Entropy} = -\sum_{i=1}^n\sum_{j=1}^m y_{ij} ln(p_{ij})$$

The above formula is actually the same as the formula we used for two classes. So, now that we have our error function, we can use gradient descent to minimize it.

Let’s recall that we have \(n\) points labelled \(x_1, x_2, \ldots, x_n\), and that the error formula and the prediction are given by: $$\begin{aligned}E &= -\frac{1}{n} \sum_{i=1}^n \left( y_i \ln(\hat{y_i}) + (1-y_i) \ln (1-\hat{y_i}) \right) \\ \hat{y_i} &= \sigma(Wx_i + b) \end{aligned}$$ We first need the derivative of the sigmoid:

$$\begin{aligned} \sigma(x) &= \frac{1}{1+e^{-x}} \\ \sigma'(x) & = \frac{d}{dx} \frac{1}{1+e^{-x}} \\ &= \frac{e^{-x}}{({1+e^{-x}})^2} = \frac{1}{1+e^{-x}}. \frac{e^{-x}}{({1+e^{-x}})}\\ &= \sigma(x) . (1 - \sigma(x)) \end{aligned} $$
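As a quick sanity check of this identity, the closed form can be compared with a finite-difference estimate of the slope. The step size and the test points are arbitrary choices for this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Closed form: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, h=1e-6):
    """Central finite difference, used only to check the closed form."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

checks = [
    (x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
    for x in (-2.0, 0.0, 3.0)
]
```

At each test point the analytic and numeric values agree to many decimal places, and at \(x = 0\) the derivative takes its maximum value of 0.25.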

We can use the above formula to derive: $$ \begin{aligned} \frac{\partial}{\partial w_j}\hat{y} &= \frac{\partial}{\partial w_j} \sigma(Wx+b)\\ &= \sigma(Wx+b) (1 - \sigma(Wx+b)) . \frac{\partial}{\partial w_j}(Wx+b)\\ &= \hat{y}(1-\hat{y}) . \frac{\partial}{\partial w_j}(Wx+b) \\ &= \hat{y}(1-\hat{y}) . \frac{\partial}{\partial w_j} (w_1x_1 + \ldots + w_jx_j+ \ldots + w_nx_n + b ) \\ &= \hat{y}(1-\hat{y}) . x_j \end{aligned} $$

Our goal is to calculate the gradient of E at a point \(x = (x_1, \ldots, x_n)\) given by the partial derivatives $$\nabla E =\left(\frac{\partial}{\partial w_1}E, \cdots, \frac{\partial}{\partial w_n}E, \frac{\partial}{\partial b}E \right)$$

To simplify our calculations, we’ll actually think of the error that each point produces, and calculate the derivative of this error. The total error, then, is the average of the errors at all the points. Now, we can go ahead and calculate the derivative of the error E at a point x with respect to the weight \(w_j\): $$\begin{aligned} \frac{\partial}{\partial w_j} E &= \frac{\partial}{\partial w_j} \left[- y \ln(\hat{y}) – (1-y) \ln (1-\hat{y}) \right] \\ &= – y \frac{\partial}{\partial w_j} \ln(\hat{y}) – (1-y) \frac{\partial}{\partial w_j} \ln (1-\hat{y}) \\ &= -y . \frac{1}{\hat{y}}. \frac{\partial}{\partial w_j} \hat{y} – (1-y) . \frac{1}{1-\hat{y}} . \frac{\partial}{\partial w_j} (1 – \hat{y}) \\ &= -y(1-\hat{y}).x_j + (1- y) \hat{y}. x_j \\ &= -(y-\hat{y}) x_j\end{aligned}$$ A similar calculation will show us that $$\frac{\partial}{\partial b} E = – (y – \hat{y})$$

This actually tells us something very important. For a point with coordinates \((x_1, \ldots, x_n)\), label y, and prediction \(\hat{y}\), the gradient of the error function at that point is

$$\left(-(y – \hat{y})x_1, \cdots, -(y – \hat{y})x_n, -(y – \hat{y}) \right)$$ In summary, the gradient is $$\nabla E = -(y – \hat{y}) (x_1, \ldots, x_n, 1)$$

We now have all the tools we need to describe the Logistic Regression algorithm. Here is the pseudocode for the Logistic Regression algorithm:
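One way to turn the gradient derived above into a runnable procedure is sketched below. It is an illustration rather than the author's original pseudocode: the zero initialization, the learning rate, the number of epochs, the per-point updates, and the toy data are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(points, labels, eta=0.1, epochs=1000):
    """Gradient descent on the cross-entropy error, one point at a time."""
    w = [0.0] * len(points[0])   # starting from zero is an assumption
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            # The gradient is -(y - y_hat) * (x_1, ..., x_n, 1); step against it.
            w = [wj + eta * (y - y_hat) * xj for wj, xj in zip(w, x)]
            b = b + eta * (y - y_hat)
    return w, b

def predict(w, b, x):
    """Predict 1 when the estimated probability is at least 0.5."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return 1 if p >= 0.5 else 0

# Toy, linearly separable data (made up for illustration).
points = [[0.0, 0.5], [0.5, 0.0], [1.0, 1.5], [3.0, 2.5], [2.5, 3.5], [3.5, 3.0]]
labels = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_regression(points, labels)
predictions = [predict(w, b, x) for x in points]
```

Each update nudges the weights by \(\eta (y - \hat{y}) x_j\) and the bias by \(\eta (y - \hat{y})\), so points that are already predicted well barely change the model while badly predicted points move it the most.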

A single variable linear regression model can learn to predict an output variable \(y\) when there is only one input variable \(x\) and there is a linear relationship between \(y\) and \(x\), that is, \(y \approx w_0 + w_1 x\). That might not be a great predictive model in most cases. For example, let’s assume we are going to begin a real estate business and we are going to use machine learning to predict house prices. In particular, we have some houses that we want to list for sale, but we don’t know the value of these houses. So, we’re going to look at other houses that sold in the recent past. Looking at how much they sold for and at the different characteristics of those houses, we will use that data to inform the listing price for the house that we’d like to sell.

Now, the house price depends on many aspects. At first we might think about the relationship between the square footage and the price of the house and fit a simple linear regression between them.

But we might go into the data set and notice that there are other houses that have very similar square footage but are fundamentally different. For example, one house has only one bathroom but another house has three bathrooms. The other house, of course, should have a higher value than the one with just one bathroom. So, we need to add more inputs to our regression model.

| Price | Bed_rooms | Bath_rooms | … | Sqft_living |
|---|---|---|---|---|
| 221900 | 3 | 1 | … | 1180 |
| 538000 | 3 | 2.25 | … | 2570 |
| 180000 | 2 | 1 | … | 770 |
| 604000 | 4 | 3 | … | 1960 |
| 510000 | 3 | 2 | … | 1680 |
| 1.23E+06 | 4 | 4.5 | … | 5420 |
| 257500 | 3 | 2.25 | … | 1715 |

So instead of just looking at square feet and using that to predict the house value, we’re going to look at other inputs as well. For example, we’re going to record the number of bathrooms in the house and use both of these inputs to predict the house price. In this higher dimensional space, we’re going to fit a function that models the relationship between the number of square feet, the number of bathrooms, and the output, the value of the house. One simple function that we can think about is $$f(x) = w_0 + w_1 * x_1 + w_2*x_2$$ where \(x_1\) is the number of square feet and \(x_2\) is the number of bathrooms.

We have just talked about square feet and number of bathrooms as the inputs that we’re looking at for our regression model. But, associated with any house, there are lots of different attributes and lots of things that we can use as inputs to our regression model and here the multivariable regression comes into play.

When we have these multiple inputs, the simplest model we can think of is just a function directly of the inputs themselves. The input \(\textbf{x}\) is a d-dimensional vector and the output \(y\) is a scalar $$\textbf{x} = (\textbf{x}[1], \textbf{x}[2], \dots , \textbf{x}[d])$$ where \(\textbf{x}[1]\), \( \textbf{x}[2] \), \(\dots\), \(\textbf{x}[d]\) are the different inputs, e.g. number of square feet, number of bathrooms, number of bedrooms, etc. Plugging these inputs directly into our linear model with the noise term \(\epsilon_i\), we get the output \(y_i\) for the \(i^{ \text{th}}\) data point:

$$y_i = w_0 + w_1 \textbf{x}_i[1] + w_2 \textbf{x}_i[2] + \ldots + w_d \textbf{x}_i[d] + \epsilon_i$$ where the first feature in our model is just one, the constant feature. The second feature is the first input, for example, the number of square feet, and the third feature is our second input, for example, the number of bathrooms. This goes on until we get to our last input, which is the \((d+1)^{\text{th}}\) feature, for example, lot size. More generally, instead of just a simple hyperplane (e.g. a single line), we can fit a polynomial or some D-dimensional curve. $$\begin{aligned}y_i &= w_0 h_0( \textbf{x}_i)+ w_1 h_1( \textbf{x}_i) + \ldots + w_D h_D( \textbf{x}_i) + \epsilon_i \\ &= \sum_{j=0}^D w_j h_j( \textbf{x}_i) + \epsilon_i\end{aligned}$$

Here we assume that there are \(D\) different features of these multiple inputs. As an example, our zeroth feature is typically just the constant term one, which shifts the curve up and down in the space. Our first feature might be just our first input, as in the hyperplane example. The second feature could be the second input, as in our hyperplane example, or it could be some other function of any of the inputs. Maybe we want to take the log of the seventh input, which happens to be the number of bedrooms, times the number of bathrooms.

So, in this case, the second feature of our model relates the log of the number of bedrooms times the number of bathrooms to the output. We continue in this way all the way up to our \(D^{\text{th}}\) feature, which is some function of any of the inputs to our regression model.

$$\begin{aligned} feature \; 1 &= h_0(\textbf{x}) \dots e.g., 1 \\ feature \; 2 &= h_1(\textbf{x}) \dots e.g. , \textbf{x}[1] =sq. \;ft. \\ feature \; 3 &= h_2(\textbf{x}) \dots \textbf{x}[2] = \#bath \; or, \log(\textbf{x}[7]) \textbf{x}[2] = \log(\#bed) * \#bath \\ \vdots \\ feature \; D+1 &= h_D( \textbf{x}) \dots \text{some other function of} \; \textbf{x}[1], \dots, \textbf{x}[d]\end{aligned}$$So this is our generic multiple regression model with multiple features.
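The feature functions can be sketched in Python using a few rows of the housing table. The particular choice of \(h_0, \ldots, h_3\), the dictionary keys, and the weights are made up for illustration; the weights are not fitted to the data.

```python
import math

# Rows from the housing table: (price, bedrooms, bathrooms, sqft_living).
houses = [
    (221900, 3, 1.0, 1180),
    (538000, 3, 2.25, 2570),
    (180000, 2, 1.0, 770),
]

# Hypothetical feature functions h_0 ... h_3, chosen for illustration.
features = [
    lambda x: 1.0,                               # h_0: constant feature
    lambda x: x["sqft"],                         # h_1: square feet
    lambda x: x["bath"],                         # h_2: number of bathrooms
    lambda x: math.log(x["bed"]) * x["bath"],    # h_3: log(#bed) * #bath
]

def feature_row(x):
    """Evaluate every feature function on one observation."""
    return [h(x) for h in features]

def predict(w, x):
    """y_hat = sum over j of w_j * h_j(x)."""
    return sum(wj * hj for wj, hj in zip(w, feature_row(x)))

inputs = [{"bed": bed, "bath": bath, "sqft": sqft} for _, bed, bath, sqft in houses]
H = [feature_row(x) for x in inputs]      # one row of features per observation
w = [50000.0, 150.0, 10000.0, 0.0]        # made-up weights, not fitted
y_hat = [predict(w, x) for x in inputs]
```

Each row of `H` holds the feature values for one house, so a prediction is just the weighted sum of that row.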

Like the simple linear regression, we’re going to talk about two different algorithms. One is just a closed-form solution and the other is gradient descent and there are gonna be multiple steps that we have to take to build up to deriving these algorithms and the first is simply to rewrite our regression model in the matrix notation.

So, we will begin with rewriting our multiple regression model in matrix notation for just a single observation $$ y_i= \sum_{j=0}^D w_j h_j({x}_i) + \epsilon_i $$ and we are gonna write this in matrix notation: $$\begin{aligned} y_i &= \begin{bmatrix}w_0 & w_1 & w_2 & … & w_D\end{bmatrix}\begin{bmatrix} h_0(x_i) \\ h_1(x_i) \\ h_2(x_i) \\ … \\ h_D(x_i)\end{bmatrix} + \epsilon_i \\ &= w_0 h_0({x}_i)+ w_1 h_1({x}_i) + … + w_D h_D({x}_i) + \epsilon_i \\ &= \textbf{w}^T\textbf{h}(\textbf{x}_i) + \epsilon_i \end{aligned}$$ In particular we’re going to think of vectors always as being defined as columns and if it defines a row, then we’re going to call that the transpose.

Now, we are going to rewrite our model for all the observations together.

$$\begin{bmatrix}y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} h_0(x_1) & h_1(x_1) & \dots & h_D(x_1) \\ h_0(x_2) & h_1(x_2) & \dots & h_D(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ h_0(x_N) & h_1(x_N) & \dots & h_D(x_N) \end{bmatrix} \begin{bmatrix}w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} + \begin{bmatrix}\epsilon_1 \\ \epsilon _2 \\ \epsilon _3 \\ \vdots \\ \epsilon _N \end{bmatrix} $$ So, we get $$\textbf{y} = \textbf{Hw} + \mathbf{\epsilon}$$

Here, we write our entire regression model for \(N\) observations as the \(\textbf{y}\) vector, which is equal to the \(\textbf{H}\) matrix times the \(\textbf{w}\) vector plus the \(\epsilon\) vector that represents all the errors in our model. So this is the matrix notation for our model of \(N\) observations.

In the simple linear regression model, we used the Residual Sum of Squares (RSS) as the cost function. For any given fit, we define the residual sum of squares of our parameters: $$\begin{aligned}RSS(w_0, w_1) &= \sum_{i=1}^N(y_i - [w_0 + w_1 x_i ])^2 \\ &= \sum_{i=1}^N(y_i - \hat{y}_i(w_0, w_1))^2 \end{aligned}$$ where \( \hat{y}_i\) is the predicted value for \(y_i\), and \(w_0\) and \(w_1\) are the intercept and slope respectively. Now we will explain the residual sum of squares in the case of multiple regression. The residual is the difference between the actual observation and the predicted value. So what is our predicted value for the \(i^{\text{th}}\) observation? In our vector notation, we take each of the weights in our model and multiply the corresponding feature for that observation by that weight. So

$$\begin{aligned} \hat{y}_i &= \begin{bmatrix} h_0(x_i) & h_1(x_i) & h_2(x_i) & \dots & h_D(x_i) \end{bmatrix}\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{bmatrix} \\ &= \textbf{h}^T( \textbf{x}_i) \textbf{w}\end{aligned}$$ This is our predicted value for the \(i^{\text{th}}\) observation. So our RSS for multiple regression is going to be: $$\begin{aligned}RSS(\textbf{w}) &= \sum_{i = 1}^N (y_i - \textbf{h}^T(\textbf{x}_i) \textbf{w})^2 \\ &= (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw}) \end{aligned}$$

So why are these two things equivalent? Well, we’re going to break up the explanation into parts. We know that \(\hat{\textbf{y}}\), the vector of all \(N\) of our predicted observations, is equal to \(\textbf{H}\) times \(\textbf{w}\), that is \(\hat{\textbf{y}} = \textbf{H}\textbf{w}\), which implies: $$\textbf{y} - \textbf{H}\textbf{w} = \textbf{y} - \hat{\textbf{y}}$$ This is equivalent to taking our vector of actual observed values and subtracting our vector of predicted values. So we take all our house sale prices and all the predicted house prices, given a set of parameters \(\textbf{w}\), and we subtract them. What is that vector?

$$ \textbf{y} - \hat{\textbf{y}} = \begin{bmatrix}residual_1 \\ residual_2\\ \vdots \\residual_N \end{bmatrix}$$ That vector is the vector of residuals, because its first entry is the difference between our first house sale price and our predicted price for it, which we call the residual for the first prediction, and likewise for the second, all the way up to our \(N^\text{th}\) observation. So the term \(\textbf{y} - \textbf{H}\textbf{w}\) is equivalent to the vector of the residuals from our predictions.

So, $$\begin{aligned} (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw}) &= \begin{bmatrix}residual_1 & residual_2 & \dots & residual_N \end{bmatrix} \begin{bmatrix}residual_1 \\ residual_2\\ \vdots \\residual_N \end{bmatrix} \\ &= (residual_1^2 + residual_2^2 + \dots + residual_N^2 ) \\ &= \sum _{i=1}^N residual_i^2 \\ &= RSS(\textbf{w}) \end{aligned} $$

By definition, that is exactly what residual sum of squares is using these \(\textbf{w}\) parameters.
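The equivalence of the two forms of the RSS can be checked numerically. The sketch below, in plain Python, computes the RSS once as a sum over observations and once as the residual vector dotted with itself. The data and weights are hypothetical toy values, and the function names are illustrative only.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rss_sum_form(H, y, w):
    """RSS as the sum over observations of (y_i - h(x_i)^T w)^2."""
    return sum((y_i - dot(row, w)) ** 2 for row, y_i in zip(H, y))

def rss_matrix_form(H, y, w):
    """RSS as (y - Hw)^T (y - Hw): the residual vector dotted with itself."""
    residuals = [y_i - dot(row, w) for row, y_i in zip(H, y)]
    return dot(residuals, residuals)

# Hypothetical toy data: columns of H are h_0(x) = 1 and h_1(x) = x
H = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 3.0, 5.0]
w = [0.0, 1.5]

print(rss_sum_form(H, y, w))     # 0.5
print(rss_matrix_form(H, y, w))  # 0.5
```

Here the residuals are 0.5, 0.0 and 0.5, so both forms give 0.25 + 0 + 0.25 = 0.5.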

Now we’re onto the final important step of the derivation, which is taking the gradient. The gradient is important both for our closed-form solution and, of course, for the gradient descent algorithm. So, the gradient is $$\begin{aligned} \nabla RSS(\textbf{w}) &= \nabla[ (\textbf{y} - \textbf{Hw})^T (\textbf{y} - \textbf{Hw})] \\ &= -2\textbf{H}^T(\textbf{y} - \textbf{Hw}) \end{aligned}$$
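As a sanity check on the gradient formula, the following sketch compares \(-2\textbf{H}^T(\textbf{y} - \textbf{Hw})\) with a finite-difference approximation of the gradient of the RSS. It is only an illustration on hypothetical toy data; the finite-difference helper and its step size are assumptions made for this example, not part of the derivation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rss(H, y, w):
    """Residual sum of squares for weights w."""
    return sum((y_i - dot(row, w)) ** 2 for row, y_i in zip(H, y))

def rss_gradient(H, y, w):
    """Analytic gradient: -2 H^T (y - Hw)."""
    residuals = [y_i - dot(row, w) for row, y_i in zip(H, y)]
    columns = zip(*H)  # the columns of H are the rows of H^T
    return [-2.0 * dot(col, residuals) for col in columns]

def numeric_gradient(H, y, w, step=1e-6):
    """Central finite differences, used only as a sanity check."""
    grad = []
    for j in range(len(w)):
        w_plus = w[:j] + [w[j] + step] + w[j + 1:]
        w_minus = w[:j] + [w[j] - step] + w[j + 1:]
        grad.append((rss(H, y, w_plus) - rss(H, y, w_minus)) / (2 * step))
    return grad

# Hypothetical toy data: columns of H are h_0(x) = 1 and h_1(x) = x
H = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [2.0, 3.0, 5.0]
w = [0.0, 1.5]

print(rss_gradient(H, y, w))      # [-2.0, -4.0]
print(numeric_gradient(H, y, w))  # close to the analytic values
```

Both entries are negative at this \(\textbf{w}\), which says that increasing either weight would reduce the RSS.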

From calculus we know that at the minimum the gradient will be **zero**. So, for the closed-form solution we take our gradient, set it equal to **zero**, and solve for \(\textbf{w}\): $$\begin{aligned} \nabla RSS(\textbf{w}) = -2\textbf{H}^T(\textbf{y} - \textbf{Hw}) &= 0 \\ -2\textbf{H}^T \textbf{y} + 2\textbf{H}^T\textbf{Hw} &= 0 \\ \textbf{H}^T\textbf{Hw} &= \textbf{H}^T\textbf{y} \\ \hat{\textbf{w}} &= (\textbf{H}^T \textbf{H})^{-1} \textbf{H}^T\textbf{y} \end{aligned}$$ We have a whole collection of parameters, \(w_0\), \(w_1\), all the way up to \(w_D\), multiplying the features in our multiple regression model, and in one line we are able to write the solution for all of them using matrix notation (assuming \(\textbf{H}^T\textbf{H}\) is invertible). This motivates why we went through all this work to write things in matrix notation: it gives us a closed-form solution for all of our parameters, written very compactly.
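The closed-form solution can be sketched in plain Python by forming \(\textbf{H}^T\textbf{H}\) and \(\textbf{H}^T\textbf{y}\) and solving the resulting linear system. The sketch below solves the system with Gaussian elimination rather than computing the inverse explicitly, which is a substitution made here for simplicity; it assumes \(\textbf{H}^T\textbf{H}\) is invertible, and the data are hypothetical points lying exactly on \(y = 1 + 2x\).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Assumes A is square and invertible."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def fit_closed_form(H, y):
    """Solve the normal equations (H^T H) w = H^T y for w."""
    cols = [list(c) for c in zip(*H)]  # columns of H
    HtH = [[dot(cj, ck) for ck in cols] for cj in cols]
    Hty = [dot(cj, y) for cj in cols]
    return solve_linear_system(HtH, Hty)

# Hypothetical data generated from y = 1 + 2x with no noise
H = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

w_hat = fit_closed_form(H, y)
print(w_hat)  # approximately [1.0, 2.0]
```

Because the toy data are noise-free, the fit recovers the intercept and slope that generated them, up to floating-point rounding.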

The alternative approach, often more practical, is the gradient descent method, where we walk down the surface of the residual sum of squares and try to get to the minimum. Of course, we might overshoot it and go back and forth, but the general idea is this iterative procedure: $$\begin{aligned}&\text{while not converged:} \\ &\qquad \textbf{w}^{(t+1)} \leftarrow \textbf{w}^{(t)} - \eta \nabla RSS(\textbf{w}^{(t)}) = \textbf{w}^{(t)} + 2\eta\textbf{H}^T(\textbf{y} - \textbf{H}\textbf{w}^{(t)})\end{aligned}$$ What this version of the algorithm does is take our entire \(\textbf{w}\) vector, all the regression coefficients in our model, and update them all at once using the matrix notation shown here.
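The update rule can be sketched as a short loop in plain Python. This is a minimal illustration, not a tuned implementation: the learning rate `eta`, the tolerance `tol`, the iteration cap and the convergence test on the gradient norm are all assumptions chosen to work on the hypothetical toy data below.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def gradient_descent(H, y, eta=0.02, tol=1e-8, max_iter=10000):
    """Minimize RSS(w) with the update w <- w + 2 * eta * H^T (y - H w)."""
    w = [0.0] * len(H[0])
    columns = [list(c) for c in zip(*H)]  # rows of H^T
    for _ in range(max_iter):
        residuals = [y_i - dot(row, w) for row, y_i in zip(H, y)]
        grad = [-2.0 * dot(col, residuals) for col in columns]
        # Assumed stopping rule: gradient norm below a small tolerance
        if dot(grad, grad) ** 0.5 < tol:
            break
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad)]
    return w

# Hypothetical data generated from y = 1 + 2x with no noise
H = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

w_gd = gradient_descent(H, y)
print(w_gd)  # approximately [1.0, 2.0]
```

If `eta` is chosen too large for the data, the iterates overshoot and diverge instead of settling down, which is the back-and-forth behaviour mentioned above taken to the extreme.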

Now that we have finished the theoretical part of the tutorial, you can look at the code and try to understand its different blocks.

In simple words, regression is a study of how to best fit a curve to summarize a collection of data. It’s one of the most powerful and well-studied types of supervised learning algorithms. In regression, we try to understand the data points by discovering the curve that might have generated them. In doing so, we seek an explanation for why the given data is scattered the way it is. The best-fit curve gives us a model for explaining how the dataset might have been produced. There are many types of regression, e.g. simple linear regression, polynomial regression, and multivariate regression. In this post, we will discuss simple linear regression only, and later we will discuss the rest. We will also provide the Python code from scratch at the end of the post.

Simple regression, as the name implies, is just a very simple form of regression, where we assume that we have just one input and we’re trying to fit a line.

Consider a data set containing age and the number of homicide deaths in the US in the year 2015:

| age | num_homicide_deaths |
|---|---|
| 21 | 652 |
| 22 | 633 |
| 23 | 653 |
| 24 | 644 |
| 25 | 610 |

If we plot the dataset and the line best fit to it we see:

When we are talking about regression, our goal is to predict a continuous output variable given some input variables. For simple regression, we only have one input variable \(x\), which is the age in our case, and our desired output \(y\), which is the number of homicide deaths for each age. Our dataset then consists of many examples of \(x\) and \(y\), so: $$ D = \{(x_1,y_1), (x_2,y_2), \dots, (x_N,y_N)\} $$ where \(N\) is the number of examples in the data set. So, our data set will look like: $$ D=\{(21,652),(22,633), \dots,(50,197)\} $$

So, how can we mathematically model simple linear regression? Since the goal is to find the best-fitting line, let’s start by defining the **model** (the mathematical description of how predictions will be created) as a line. It’s very simple. We’re assuming we have just one input, which in this case is the age of people, and one output, which is the number of homicide deaths, and we’re just going to fit a line. And what’s the equation of a line? $$f(x) = w_0 + w_1*x$$

What this regression model then specifies is that each one of our observations \( y_i\) is simply that function evaluated at \(x_i\), that is \(w_0\) plus \(w_1*x_i\), plus an error term which we call \(\epsilon_i\). So this is our regression model $$y_i = w_0 + w_1*x_i + \epsilon_i$$ and to be clear, this error, \(\epsilon_i\), is the vertical distance from our specific observation to the line. The parameters of this model, \(w_0\) and \(w_1\), are the intercept and slope, and we call these the regression coefficients.

We have chosen our model with two regression coefficients \(w_0\) and \(w_1\). For our data set, there can be infinitely many choices of these parameters. So our task is to find the best choice of parameters and we have to know how to measure the quality of the parameters or measure the quality of the fit. So in particular, we define a loss function (also called a cost function), which measures how good or bad a particular choice of \(w_0\) and \(w_1\) is. Values of \(w_0\) and \(w_1\) that seem poor should result in a large value of the loss function, whereas good values of \(w_0\) and \(w_1\) should result in small values of the loss function. So what’s the cost of using a specific line? It has many forms. But the one we’re gonna talk about here is Residual Sum of Squares (RSS): $$ RSS(w_0, w_1)= \sum_{i=1}^N(y_i-[ w_0 + w_1*x_i])^2$$

What the residual sum of squares does is add up the squared errors between what we’ve estimated the relationship between \(x\) and \(y\) to be and what the actual observation \(y\) is. For a given line, the error for observation \(i\) is the residual, which plays the role of the \(\epsilon_i\) we talked about above.

Our cost is the residual sum of squares, and for any given line we can compute it. So, for example, two different lines give two different residual sums of squares. How do we know which choice of parameters is better? The one with the smaller RSS.
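Comparing two candidate lines by their RSS can be sketched in a few lines of Python. The data are the five rows shown in the table above; the two candidate lines are hypothetical choices of \((w_0, w_1)\) picked only to illustrate the comparison.

```python
def rss(w0, w1, data):
    """Residual sum of squares of the line w0 + w1 * x on (x, y) pairs."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in data)

# First five (age, num_homicide_deaths) rows from the table above
data = [(21, 652), (22, 633), (23, 653), (24, 644), (25, 610)]

line_a = (800, -7)  # hypothetical downward-sloping line
line_b = (640, 0)   # hypothetical flat line

print(rss(*line_a, data))  # 735
print(rss(*line_b, data))  # 1278
```

On these five points the sloped line has the smaller RSS, so by this cost it is the better of the two fits.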

Our goal is to minimize the RSS over all possible intercepts \(w_0\) and slopes \(w_1\), but the question is, how are we going to do this? The mathematical notation for this minimization over all possible \(w_0\), \(w_1\) is $$\min_{w_0,w_1}\sum_{i=1}^N(y_i - [w_0 +w_1x_i])^2$$ So we want to find the specific values of \(w_0\) and \(w_1\), which we’ll call \(\hat{w}_0\) and \(\hat{w}_1\) respectively, that minimize this residual sum of squares.

The red dot marked on the plot shows where the desired minimum is. We need an algorithm to find this minimum. We will discuss two approaches: the **Closed-form Solution** and **Gradient Descent**.

From calculus, we know that at the minimum the derivatives will be \(0\). So, we compute the gradient of our RSS: $$ \begin{aligned} RSS(w_0, w_1) &= \sum_{i=1}^N(y_i-[ w_0 + w_1*x_i])^2 \end{aligned}$$ $$\begin{aligned} &\nabla RSS(w_0, w_1) = \begin{bmatrix} \frac{\partial RSS}{\partial w_0} \\ \\ \frac{\partial RSS}{\partial w_1} \end{bmatrix} \\ &=\begin{bmatrix} -2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} \\ \\ -2\sum_{i=1}^N[y_i - (w_0 + w_1 * x_i)] *x_i \end{bmatrix} \end{aligned}$$

Take this gradient, set it equal to zero and solve for the estimates of \(w_0\) and \(w_1\). Those are going to be the estimates of the two parameters of our model that define our fitted line. $$ \begin{aligned} &\nabla RSS(w_0, w_1) = 0 \end{aligned} $$ implies, $$\begin{aligned} &\begin{bmatrix} -2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} \\ \\ -2\sum_{i=1}^N[y_i - (w_0 + w_1 * x_i)] *x_i \end{bmatrix} = 0 \\ &-2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} = 0, \\ &-2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} * x_i = 0 \end{aligned}$$ Solving for \(w_0\) and \(w_1\) we get, $$ \begin{aligned} \hat{w}_0 &= \frac{\sum_{i = 1}^N y_i}{N} - \hat{w}_1\frac{\sum_{i=1}^N x_i}{N} \\ \hat{w}_1 &= \frac{\sum y_i x_i - \frac{\sum y_i \sum x_i}{N}}{\sum x_i^2 - \frac{\sum x_i \sum x_i}{N}} \end{aligned} $$

Now that we have the solutions, we just have to compute \( \hat{w}_1\) and then plug that in to compute \(\hat{w}_0\). To compute \( \hat{w}_1\) we need a few terms: the sum over all of our observations \(\sum y_i\), the sum over all of our inputs \(\sum x_i\), and two terms built from products of inputs and outputs, \(\sum y_i x_i\) and \(\sum x_i^2\). Plug them into these equations and we get the optimal \(\hat{w}_0\) and \(\hat{w}_1\) that minimize our residual sum of squares.
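The closed-form estimates translate directly into code. The sketch below computes the four sums and plugs them into the formulas for \(\hat{w}_1\) and \(\hat{w}_0\). It uses only the five rows shown in the table earlier, not the full dataset, so the resulting numbers are purely illustrative; the function name is hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Closed-form least-squares estimates (w0_hat, w1_hat) for y = w0 + w1 * x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope first, then the intercept, which depends on the slope
    w1 = (sum_xy - sum_y * sum_x / n) / (sum_xx - sum_x * sum_x / n)
    w0 = sum_y / n - w1 * sum_x / n
    return w0, w1

# First five (age, num_homicide_deaths) rows from the table above
ages = [21, 22, 23, 24, 25]
deaths = [652, 633, 653, 644, 610]

w0_hat, w1_hat = fit_simple_regression(ages, deaths)
print(w0_hat, w1_hat)  # approximately 806.3 and -7.3
```

For these five points the sums are \(\sum x_i = 115\), \(\sum y_i = 3192\), \(\sum y_i x_i = 73343\) and \(\sum x_i^2 = 2655\), which gives \(\hat{w}_1 = -73/10 = -7.3\) and \(\hat{w}_0 = 638.4 + 7.3 \cdot 23 = 806.3\).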

The other approach that we will discuss is gradient descent, where we walk down this surface of the residual sum of squares trying to get to the minimum. Of course, we might overshoot it and go back and forth, but the general idea is an iterative procedure. $$\begin{aligned} &\nabla RSS(w_0, w_1) \\ &= \begin{bmatrix} -2\sum_{i=1}^N{[y_i - (w_0 + w_1 * x_i)]} \\ \\ -2\sum_{i=1}^N[y_i - (w_0 + w_1 * x_i)] *x_i \end{bmatrix} \\&= \begin{bmatrix} -2\sum_{i=1}^N{[y_i - \hat{y}_i (w_0, w_1)]} \\ \\ -2\sum_{i=1}^N[y_i - \hat{y}_i(w_0,w_1)] *x_i \end{bmatrix} \end{aligned}$$

Then our gradient descent algorithm will be: $$\begin{aligned}&\text{while not converged:} \\ &\begin{bmatrix} w_0^{(t+1)} \\ w_1^{(t+1)} \end{bmatrix} \leftarrow \begin{bmatrix}w_0^{(t)} \\ w_1^{(t)} \end{bmatrix} - \eta \begin{bmatrix} -2\sum_{i=1}^N[y_i - \hat{y}_i (w_0^{(t)}, w_1^{(t)})]\\ \\-2\sum_{i=1}^N[y_i - \hat{y}_i(w_0^{(t)},w_1^{(t)})] *x_i \end{bmatrix}\\ &\qquad\qquad\quad = \begin{bmatrix} w_0^{(t)} \\ w_1^{(t)} \end{bmatrix} +2\eta \begin{bmatrix} \sum_{i=1}^N[y_i - \hat{y}_i (w_0^{(t)}, w_1^{(t)})] \\ \\ \sum_{i=1}^N[y_i - \hat{y}_i(w_0^{(t)},w_1^{(t)})] *x_i \end{bmatrix} \end{aligned}$$

So what gradient descent does is repeatedly update our weights: we set \(W\) to \(W\) minus \(\eta\) times the gradient, where \(W\) is the vector of weights \((w_0, w_1)\). We repeat that until the algorithm converges. Here \(\eta\) is the learning rate, which controls how big a step we take on each iteration of gradient descent, and the gradient term determines the update, i.e. the change we want to make to the parameters \(W\).
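A minimal sketch of this loop for the two-parameter model is shown below. The toy data (points on \(y = 1 + 2x\)), the learning rate, the tolerance and the stopping rule on the gradient norm are all assumptions made for the example. With raw, unscaled inputs such as ages in the twenties, a much smaller learning rate and many more iterations would typically be needed.

```python
def gradient_descent(xs, ys, eta=0.01, tol=1e-6, max_iter=10000):
    """Fit y = w0 + w1 * x by gradient descent on the RSS."""
    w0, w1 = 0.0, 0.0
    for _ in range(max_iter):
        residuals = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        grad0 = -2.0 * sum(residuals)
        grad1 = -2.0 * sum(r * x for r, x in zip(residuals, xs))
        # Assumed stopping rule: gradient norm below a small tolerance
        if (grad0 ** 2 + grad1 ** 2) ** 0.5 < tol:
            break
        # W <- W - eta * gradient, applied to both weights at once
        w0 -= eta * grad0
        w1 -= eta * grad1
    return w0, w1

# Hypothetical toy data generated from y = 1 + 2x with no noise
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w0_gd, w1_gd = gradient_descent(xs, ys)
print(w0_gd, w1_gd)  # approximately 1.0 and 2.0
```

Both weights are updated from the same residuals, computed before either weight changes, which matches the simultaneous vector update in the algorithm above.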

After all the hard work, we now need to test our machine learning model. The dataset we work with is generally split into two parts: the training data, on which we do all the training, and the test data, on which we test our model. We have developed the equations for training, and using them we get a fitted set of weights. We then use this set of weights to predict the result for new data using the equation $$ Prediction = \hat{w}_0 + \hat{w}_1 * data $$ where \( \hat{w}_0\) and \(\hat{w}_1\) are the optimized weights.
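The train/test workflow can be sketched as follows, using the five rows from the table earlier. Holding out the last row as the test set is an assumption made to keep the example tiny; a real split would normally be random and use far more data. The helper names are hypothetical.

```python
def fit_simple_regression(xs, ys):
    """Closed-form least-squares estimates (w0_hat, w1_hat)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    w1 = (sum_xy - sum_y * sum_x / n) / (sum_xx - sum_x * sum_x / n)
    w0 = sum_y / n - w1 * sum_x / n
    return w0, w1

def predict(w0, w1, x):
    """Prediction = w0_hat + w1_hat * data."""
    return w0 + w1 * x

# First five (age, num_homicide_deaths) rows from the table above
data = [(21, 652), (22, 633), (23, 653), (24, 644), (25, 610)]
train, test = data[:4], data[4:]  # assumed split: last row held out

w0_hat, w1_hat = fit_simple_regression([x for x, _ in train],
                                       [y for _, y in train])

test_x, test_y = test[0]
prediction = predict(w0_hat, w1_hat, test_x)
print(prediction)                  # approximately 644.5
print((test_y - prediction) ** 2)  # squared error on the held-out row
```

The held-out error measures how well the fitted line generalizes to an observation it did not see during training.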

Now that we have finished the theoretical part of the tutorial, you can look at the code and try to understand its different blocks.

In this tutorial, our case study is predicting house prices. We have a dataset which consists of house prices and the square footage of each house.

The data we use in machine learning are inherently noisy. The way the world works is that there’s some true relationship between \(x\) and \(y\), and we represent that relationship by \(f_{w(true)}\), which is the notation we’re using for that functional relationship. But of course, that’s not a perfect description of the relationship between \(x\) and \(y\), the number of square feet and the house value.

There are a lot of other contributing factors including other attributes of the house that are not included just in square feet or how a person feels when they go in and make a purchase of a house or a personal relationship they might have with the owners. Or lots and lots of other things that we can’t ever perfectly capture with just some function between square feet and value, and so that is the noise that’s inherent in this process represented by this epsilon term.

So in particular, any observation \(y_i\) is the sum of this relationship between the square feet and the value plus a noise term \(\epsilon _i\) specific to that \(i^{\text{th}}\) house. We assume that this noise has zero mean, because if it didn’t, the mean could be absorbed into the function \(f\) instead. This noise is just a property of the data. We don’t have control over it. It has nothing to do with our model or our estimation procedure; it’s just something that we have to deal with. And so this is called irreducible error, because we cannot reduce it by choosing a better model or a better estimation procedure. The things that we can control are bias and variance. We’re going to discuss these two terms.

Suppose we have a dataset that’s just a random snapshot of \(N\) houses that were sold, recorded, and tabulated. Based on that data set, we fit some function, e.g. a constant function.

But what if another set of N houses had been sold? Then we would have had a different data set that we were using. And when we went to fit our model, we would have gotten a different line.

So for one data set of size N, we get one fit and for another dataset, we have different fit associated with it. And of course, there’s a continuum of possible fits we might have gotten. And for all those possible fits, here this dashed green line below represents our average fit, averaged over all those fits weighted by how likely they were to have appeared.

Now, bias is the difference between this average fit and the true function, \(f_{w(true)}\). The gray shaded region above is the difference between the true function and our average fit. So intuitively, what bias asks is: is our model flexible enough to capture, on average, the true relationship between square feet and house value? What we see is that this very simple, low-complexity constant model has high bias; it’s not flexible enough to give a good approximation to the true relationship. $$Bias(x) = f_{w(true)}(x) - f_{\bar{w}}(x)$$ Conversely, a high-complexity model has low bias.

Variance is how much the specific fits to a given data set can differ from one another as we look at different possible data sets. In this case, when we look at just a simple constant model, the actual resulting fits don’t vary too much. When we look at the space of all possible observations, we see that the fits are fairly similar and stable. So, when we look at the variation in these fits, which is drawn with grey bars above, we see that they don’t vary very much.

So, for this low complexity model, we see that there’s low variance. To summarize, the variance is how much fits can vary. But if they could vary dramatically from one data set to the other, then we would have very erratic predictions. The predictions would just be sensitive to what data set we got. So, that would be a source of error in our predictions. And to see this, we can start looking at high-complexity models. So in particular, let’s look at this data set again. Now, let’s fit some high-order polynomial to it and when we think about looking over all possible data sets we might get, we might get some crazy set of curves. So, a high complexity model has a high variance.

Finally, we define the error term as: $$\begin{aligned}Error &= Reducible \: Error + Irreducible \; Error \\ &= {Bias}^2 + variance + Irreducible \; Error\end{aligned}$$

Here, we’re going to plot bias and variance as a function of model complexity. As we discussed previously, as our model complexity increases, our bias decreases, because we can better and better approximate the true relationship between \(x\) and \(y\). On the other hand, variance increases. So, a very simple model has very low variance, and high-complexity models have high variance.
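The trend can be illustrated with a small simulation that fits two models to many noisy data sets and estimates the bias and variance of each model’s prediction at one point. Everything in the setup is an assumption made for the sketch: the true function \(f(x) = x^2\), the noise level, the fixed input grid, the evaluation point \(x_0 = 1\), and the use of a straight line (rather than a high-order polynomial) as the more complex model.

```python
import random

def true_f(x):
    """Assumed true relationship, chosen only for this illustration."""
    return x ** 2

def fit_constant(xs, ys):
    """Low-complexity model: predict the mean of y everywhere."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Higher-complexity model: least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return lambda x: w0 + w1 * x

def bias_and_variance(fit, x0, n_datasets=2000, noise_sd=0.1, seed=0):
    """Estimate bias and variance of the prediction at x0 over many data sets."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(11)]  # fixed inputs 0.0, 0.1, ..., 1.0
    preds = []
    for _ in range(n_datasets):
        ys = [true_f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        preds.append(fit(xs, ys)(x0))
    avg_fit = sum(preds) / len(preds)  # average fit at x0
    bias = true_f(x0) - avg_fit
    variance = sum((p - avg_fit) ** 2 for p in preds) / len(preds)
    return bias, variance

bias_const, var_const = bias_and_variance(fit_constant, x0=1.0)
bias_line, var_line = bias_and_variance(fit_line, x0=1.0)
print("constant: bias^2 =", bias_const ** 2, "variance =", var_const)
print("line:     bias^2 =", bias_line ** 2, "variance =", var_line)
```

In this setup the constant model shows the larger squared bias and the line shows the larger variance, which is the pattern described above.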

What we see is there’s a natural tradeoff between bias and variance called bias-variance tradeoff. And one way to summarize this is something that’s called mean squared error. Mean squared error is simply the sum of bias squared plus variance. Machine learning is all about this bias-variance tradeoff. We’re gonna see this again and again. And the goal is finding the optimal spot. The optimal spot where we get our minimum error, the minimum contribution of bias and variance, to our prediction errors.

In the earlier era of machine learning, there used to be a lot of discussion of the bias-variance tradeoff. The reason was that we could increase bias and reduce variance, or reduce bias and increase variance, but back in the pre-deep-learning era we didn’t have as many tools that just reduce bias or just reduce variance without hurting the other one.

But in the modern deep learning, neural network or big data era, so long as we can keep training a bigger network, and so long as we can keep getting more data, which isn’t always the case for either of these, but if that’s the case, then getting a bigger network almost always just reduces bias without necessarily hurting variance, so long as we regularize appropriately. And getting more data pretty much always reduces variance and doesn’t hurt bias much.

We defined bias very explicitly relative to the true function. And when we define variance, we have to average over all possible data sets, and the same was true for bias too. But these are all the possible data sets of size \(N\) that we could have gotten from the world, and we just don’t know what they are. So, we can’t compute bias and variance exactly. But there are ways to optimize the tradeoff between bias and variance in a practical way. For example, if we underfit the data, this implies we have high bias, and if we overfit the data, this implies we have high variance.

So for the sake of argument, let’s say that we’re recognizing cats in pictures, which is something that people can do nearly perfectly.

| Train set error | 1% | 15% | 15% | 0.50% |
|---|---|---|---|---|
| Dev set error | 11% | 16% | 30% | 1% |
| Diagnosis | High variance | High bias | High bias & high variance | Low bias & low variance |

Here, we can see that when we have a small training set error and a relatively large dev set error, it implies we might have overfitted the training data and we are not generalizing well. So, we have high variance. Similarly, if we have a large training set error, it means we have underfitted the data, which implies high bias.
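The reasoning in the table can be written as a small rule of thumb. The sketch below is only a heuristic: the 5-percentage-point `gap` threshold and the assumption that the best achievable error is close to zero (as for cat recognition by people) are hypothetical choices, not established cutoffs.

```python
def diagnose(train_error, dev_error, optimal_error=0.0, gap=0.05):
    """Rough bias/variance diagnosis from train and dev error rates.

    optimal_error and gap are assumed values for illustration only.
    """
    high_bias = (train_error - optimal_error) > gap   # underfitting
    high_variance = (dev_error - train_error) > gap   # overfitting
    if high_bias and high_variance:
        return "high bias & high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias & low variance"

# The four (train error, dev error) columns from the table above
cases = [(0.01, 0.11), (0.15, 0.16), (0.15, 0.30), (0.005, 0.01)]
for train_error, dev_error in cases:
    print(train_error, dev_error, "->", diagnose(train_error, dev_error))
```

If the best achievable error for a task were much higher than zero, `optimal_error` would need to be raised accordingly before a large training error could be read as high bias.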

To fix the high bias problem we can do the following:

- Use a bigger network
- Train longer
- Find a better-suited NN architecture

To fix the high variance problem we can do the following:

- Use more data
- Apply regularization
- Find a better-suited NN architecture