We will start with gradient descent, the most basic and useful optimization algorithm for neural networks. Then we will explain the concepts of mini-batch gradient descent and Stochastic Gradient Descent (SGD), which are simple modifications of gradient descent. After that, we will introduce momentum, the core concept behind many modern optimization algorithms. Finally, we will derive the mathematics of RMSProp and Adam, two highly efficient optimization algorithms.

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. In a neural network, this function is called the loss function or cost function. Suppose we want to solve a classification problem using a neural network. For simplicity, suppose the network contains only a single hidden layer.

Suppose we have a set of inputs or features \({X}\) with associated classification labels \({Y}\) $$\begin{aligned}X &= \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y^{(1)} & y^{(2)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned}$$ where \(x^{(1)}, x^{(2)}, \ldots, x^{(m)}\) are the \(m\) training examples. Then the output of the hidden layer, \(A^{[1]}\), will be $$\begin{aligned} Z^{[1]} &= W^{[1]}.X + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[1]} &= \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \ldots & z^{[1](m)}\end{bmatrix}\\ A^{[1]} &= \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \ldots & a^{[1](m)}\end{bmatrix} \end{aligned}$$ where \(a^{[l](i)}\) denotes the output of layer \(l\) for training example \(i\). The weights of this network are initialized to \(0\) or randomly.
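The forward pass above can be sketched in NumPy; the layer sizes and random data here are hypothetical, chosen only to show the shapes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 input features, 4 hidden units, m = 5 examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))          # each column is one example x^(i)
W1 = rng.normal(size=(4, 3)) * 0.01  # small random weight initialization
b1 = np.zeros((4, 1))

Z1 = W1 @ X + b1                     # Z^[1] = W^[1] X + b^[1]
A1 = sigmoid(Z1)                     # A^[1] = sigma(Z^[1]), one column per example
print(A1.shape)
```

Each column of `A1` is the hidden-layer activation for one training example, matching the matrix notation above.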

Then we use a loss function \(E\) to measure the error between the prediction and the true value. Our goal is to minimize this error, making the predictions as close as possible to the true values by updating the weights. A common loss function is the Sum of Squared Errors (SSE), defined as $$\begin{aligned} E &= \frac{1}{2} \sum \left[y - \hat{y}\right]^2 \\ E &= \frac{1}{2} \sum \left[y - \sigma\left(\sum w_i.x_i\right)\right]^2\end{aligned}$$ where \(\hat{y}\) is the prediction and \(y\) is the true value. Notice that the error is a function of the weights: tuning the weights alters the network's predictions, which in turn changes the overall error. Our goal is to find the weights that minimize the error, and to do that we use gradients. Suppose we plot the weights on the x-axis and the error \(E\) on the y-axis to get a curve.

Here we show a simple depiction of the error with one weight. Our goal is to find the weight that minimizes the error. We start with a random weight and step in the direction of the minimum, which is opposite to the gradient (the slope). After taking several such steps we eventually reach the minimum of the error function. To update the weights we use $$\begin{aligned}w_i &= w_i + \Delta w_i \\ \Delta w_i &\propto - \frac{\partial E}{\partial w_i} \\ &= - \eta \frac{\partial E}{\partial w_i}\end{aligned}$$ Here \(\eta\) is a constant called the learning rate of the neural network. The learning rate is a hyperparameter that needs to be tuned. Deriving the derivative term, $$\begin{aligned}\frac{\partial E}{\partial w_i} &= -(y - \hat{y})\frac{\partial \hat{y}}{\partial w_i} \\ &= -(y - \hat{y}) f^{\prime}(z) \frac{\partial}{\partial w_i} \sum w_i.x_i \\ &= -(y - \hat{y}) f^{\prime}(z).x_i\end{aligned}$$ We can simplify the update by defining an error term \(\delta\): $$\begin{aligned} w_i &= w_i - \eta \frac{\partial E}{\partial w_i}\\ &= w_i + \eta (y - \hat{y}) f^{\prime}(z).x_i \\ &= w_i + \eta \delta x_i\\ \text{where } \delta &= (y - \hat{y}) f^{\prime}(z)\end{aligned}$$
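The update rule with the error term \(\delta\) can be sketched for a single sigmoid neuron; the data, starting weights, and learning rate below are arbitrary toy values, and for the sigmoid \(f'(z) = \sigma(z)(1-\sigma(z))\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single-neuron fit; data values and learning rate are arbitrary.
x = np.array([0.5, -0.2])
y = 0.4
w = np.array([0.1, 0.3])
eta = 0.5

for _ in range(1000):
    z = np.dot(w, x)
    y_hat = sigmoid(z)
    delta = (y - y_hat) * y_hat * (1 - y_hat)   # delta = (y - y_hat) f'(z)
    w = w + eta * delta * x                     # w_i <- w_i + eta * delta * x_i
```

After enough iterations the prediction \(\sigma(w \cdot x)\) moves very close to the target \(y\), which is exactly the step-by-step descent described above.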

In the gradient descent algorithm above we used all the training examples at once; this is also known as batch gradient descent. Batch gradient descent has some problems: if the number of training examples is too big, a single epoch takes a long time and requires a lot of memory. For example, if we have ten million training examples, we have to process all of them before taking a single step towards the minimum. To resolve this issue we use mini-batch gradient descent, splitting the training examples into small chunks called batches and training on one batch at a time. Suppose we have a set of \(m\) inputs or features \({X}\) with associated classification labels \({Y}\): $$\begin{aligned}X &= \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y^{(1)} & y^{(2)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned}$$ In mini-batch gradient descent, instead of processing all the training examples together, we split them into small batches of, e.g., 1000 training examples each: $$\begin{aligned}X &= \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(1000)} \,|\, x^{(1001)} & x^{(1002)} & \ldots & x^{(2000)} \,|\, x^{(2001)} & x^{(2002)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y^{(1)} & y^{(2)} & \ldots & y^{(1000)} \,|\, y^{(1001)} & y^{(1002)} & \ldots & y^{(2000)} \,|\, y^{(2001)} & y^{(2002)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned}$$ For simplicity, we can denote the mini-batches as $$\begin{aligned}X^{\{1\}} &= x^{(1)} \; x^{(2)} \; \ldots \; x^{(1000)}\\X^{\{2\}} &= x^{(1001)} \; x^{(1002)} \; \ldots \; x^{(2000)} \\ \vdots\\ X^{\{t\}} &= x^{(1000(t-1)+1)} \; x^{(1000(t-1)+2)} \; \ldots \; x^{(m)}\\ X &= [X^{\{1\}} \; X^{\{2\}} \; \ldots \; X^{\{t\}}]\end{aligned}$$ The training process is now similar to batch gradient descent: we pass the batches one at a time and update the weights after every batch.
The forward propagation and loss function for mini-batch gradient descent become $$\begin{aligned} Z^{[1]} &= W^{[1]}.X^{\{t\}} + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[1]} &= \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \ldots & z^{[1](m)}\end{bmatrix}\\ A^{[1]} &= \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \ldots & a^{[1](m)}\end{bmatrix} \\ E &= \frac{t}{N} \sum \left[y - \sigma\left(\sum w_i.x_i\right)\right]^2 \\ \frac{\partial E}{\partial w_i} &= -\frac{t}{N}(y - \hat{y}) f^{\prime}(z).x_i \end{aligned}$$ where \(N\) is the total number of samples and \(t\) is the number of batches. Like before, we can define an error term \(\delta\) for simplification and write the update as $$\begin{aligned} \delta &= \frac{t}{N}(y - \hat{y}) f^{\prime}(z) \\ w_i &= w_i + \eta\delta x_i\end{aligned}$$
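The splitting step can be sketched as a small helper; `make_minibatches` is a hypothetical name, and the toy arrays just illustrate the column slicing:

```python
import numpy as np

def make_minibatches(X, Y, batch_size):
    """Split the columns of X and Y into consecutive mini-batches X^{t}, Y^{t}."""
    m = X.shape[1]
    return [(X[:, i:i + batch_size], Y[:, i:i + batch_size])
            for i in range(0, m, batch_size)]

X = np.arange(10).reshape(1, 10)   # m = 10 toy examples
Y = np.arange(10).reshape(1, 10)
batches = make_minibatches(X, Y, batch_size=4)
print(len(batches))                # last batch is smaller: sizes 4, 4, 2
```

During training, the weight update is applied once per batch rather than once per pass over the full data.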

When the mini-batch size is \(1\), the method is called stochastic gradient descent.

- If the batch size \(= N\): Batch Gradient Descent
- If the batch size is between \(1\) and \(N\): Mini-Batch Gradient Descent
- If the batch size \(= 1\): Stochastic Gradient Descent

Gradient descent is a very basic and fundamental optimization algorithm, and it has some problems. While converging to the minimum, gradient descent oscillates up and down and takes many steps. These oscillations slow gradient descent down and prevent us from using a larger learning rate. Another big problem is getting stuck in a local minimum instead of the global minimum. Gradient descent with momentum helps address these issues: we calculate an exponentially weighted average of our gradients and use that to update our weights.

Suppose we are trying to optimize a cost function whose contours look like the figure above, where the red dot denotes the position of the local optimum. Starting gradient descent from the first point, we reach the second position after one iteration, the third after another iteration, and so on; with each iteration we move closer to the optimum. Looking closely, we can see two types of motion: horizontal and vertical. There are up-and-down oscillations in the vertical direction, and because of them it takes longer to reach the optimum. They also prevent us from using a bigger learning rate, since a bigger learning rate may diverge.

In this method we use an exponentially weighted average, which in simple words means taking previous values into account while updating the weights. Previously, to update the weights we used $$\begin{aligned}w_i &= w_i + \Delta w_i \\ \Delta w_i &\propto - \frac{\partial E}{\partial w_i} \\ &= - \eta \frac{\partial E}{\partial w_i}\end{aligned}$$ In the momentum method we use exponentially weighted averages of \(\Delta w_1\) and \(\Delta w_2\), denoted \(V_{\Delta w_1}\) and \(V_{\Delta w_2}\) respectively: $$\begin{aligned}V_{\Delta w_1} &= \beta_1 \times V_{\Delta w_1} + (1 - \beta_1)\times \Delta w_1 \\ V_{\Delta w_2} &= \beta_1 \times V_{\Delta w_2} + (1 - \beta_1)\times \Delta w_2\end{aligned}$$

Here \(\beta_1\) is a hyperparameter in the range \([0,1]\) that balances the previous values against the current one. After calculating the exponentially weighted averages we update our parameters using these averages: $$\begin{aligned}w_1 &= w_1 + \eta \times V_{\Delta w_1} \\ w_2 &= w_2 + \eta \times V_{\Delta w_2}\end{aligned}$$
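A small numerical sketch of why the averaging helps: feed the exponentially weighted average a constant step in one component and an alternating step in the other (the step values are hypothetical):

```python
import numpy as np

beta1 = 0.9
v = np.zeros(2)  # exponentially weighted averages V_dw for two weights

# Hypothetical gradient steps: the first component points the same way every
# iteration (horizontal motion), the second flips sign (vertical oscillation).
for t in range(50):
    delta_w = np.array([1.0, 1.0 if t % 2 == 0 else -1.0])
    v = beta1 * v + (1 - beta1) * delta_w

print(v)  # consistent component survives, oscillating component nearly cancels
```

The consistent (horizontal) component of the average stays close to its full value, while the oscillating (vertical) component is averaged away toward zero.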

The intuition behind this method is quite simple. When we take the exponential average of the previous values, the up-and-down oscillations cancel each other out and the vertical motion gets close to zero. In the horizontal direction, however, all the gradients point the same way, so adding up the previous values does not slow the horizontal motion down.

Root Mean Square Prop, or RMSProp, is another optimization algorithm, quite similar to gradient descent with momentum. As before, suppose we are trying to optimize a cost function whose contours look like the figure below, with the red dot denoting the local optimum, and that gradient descent oscillates up and down in the vertical direction as it moves closer to the optimum with each iteration. These oscillations make it take longer to reach the optimum and prevent us from using a bigger learning rate, since a bigger learning rate may diverge.

For simplicity, we derive the equations in a two-dimensional space with two weights, \(w_1\) and \(w_2\), where \(w_1\) moves in the horizontal direction and \(w_2\) in the vertical direction. In RMSProp we use exponentially weighted averages as before, but of the squares of the gradients: $$\begin{aligned}S_{\Delta w_1} &= \beta_2 \times S_{\Delta w_1} + (1 - \beta_2)\times {\Delta w_1}^2 \\ S_{\Delta w_2} &= \beta_2 \times S_{\Delta w_2} + (1 - \beta_2)\times {\Delta w_2}^2\end{aligned}$$ After calculating the exponentially weighted averages we update our parameters using these averages: $$\begin{aligned}w_1 &= w_1 + \eta \times \frac{\Delta w_1}{\sqrt{S_{\Delta w_1}} + \epsilon} \\ w_2 &= w_2 + \eta \times \frac{\Delta w_2}{\sqrt{S_{\Delta w_2}} + \epsilon}\end{aligned}$$

Here \(\epsilon\) is a very small number, typically \(10^{-8}\), used for numerical stability. The intuition behind RMSProp is that in the horizontal (\(w_1\)) direction we want learning to go fast, while in the vertical (\(w_2\)) direction we want to slow it down to reduce the oscillations. Since we are dividing by \(\sqrt{S_{\Delta w_1}}\) and \(\sqrt{S_{\Delta w_2}}\), we want \(S_{\Delta w_1}\) to be small and \(S_{\Delta w_2}\) to be large. Looking at the derivatives, the slope is much larger in the vertical direction than in the horizontal direction, so the squared gradients make \(S_{\Delta w_2}\) relatively larger than \(S_{\Delta w_1}\). In summary, we divide the updates in the vertical direction by a much bigger number, reducing the oscillations, while dividing the updates in the horizontal direction by a smaller number that has very little impact.
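A minimal RMSProp sketch on a toy quadratic bowl, writing the update with an explicit minus sign on the raw gradient; the constants and starting point are illustrative:

```python
import numpy as np

beta2, eta, eps = 0.999, 0.01, 1e-8

def rmsprop_step(w, grad, s):
    """RMSProp: accumulate squared gradients in s, scale the step by 1/sqrt(s)."""
    s = beta2 * s + (1 - beta2) * grad ** 2
    w = w - eta * grad / (np.sqrt(s) + eps)
    return w, s

# Minimize the toy bowl f(w1, w2) = w1^2 + w2^2 from an asymmetric start.
w, s = np.array([5.0, 0.5]), np.zeros(2)
for _ in range(500):
    w, s = rmsprop_step(w, 2 * w, s)   # gradient of the bowl is 2w
```

Because each component is divided by its own \(\sqrt{S}\), the component with larger gradients takes relatively smaller steps, which is the damping effect described above.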

Adam, or Adaptive Moment Estimation, is another very popular optimization algorithm for many types of neural networks. It essentially combines momentum and RMSProp. Below is the Adam optimization algorithm.

$$\begin{aligned} V_{\Delta w_1} &= 0, \; S_{\Delta w_1} = 0, \; V_{\Delta w_2} = 0, \; S_{\Delta w_2} = 0\\ \text{on iteration } t&:\\ &V_{\Delta w_1} = \beta_1 V_{\Delta w_1} + (1 - \beta_1) \Delta w_1 \; \; \; V_{\Delta w_2} = \beta_1 V_{\Delta w_2} + (1 - \beta_1) \Delta w_2\\ &S_{\Delta w_1} = \beta_2 S_{\Delta w_1} + (1 - \beta_2) {\Delta w_1}^2 \; \; \; S_{\Delta w_2} = \beta_2 S_{\Delta w_2} + (1 - \beta_2) {\Delta w_2}^2\\ &V^{\prime}_{\Delta w_1} = \frac{V_{\Delta w_1}}{1 - \beta_1^t} \; \; \; V^{\prime}_{\Delta w_2} = \frac{V_{\Delta w_2}}{1 - \beta_1^t} \\ &S^{\prime}_{\Delta w_1} = \frac{S_{\Delta w_1}}{1 - \beta_2^t} \; \; \; S^{\prime}_{\Delta w_2} = \frac{S_{\Delta w_2}}{1 - \beta_2^t} \\ &w_1 = w_1 - \eta \frac{V^{\prime}_{\Delta w_1}}{\sqrt{S^{\prime}_{\Delta w_1}} + \epsilon} \\ &w_2 = w_2 - \eta \frac{V^{\prime}_{\Delta w_2}}{\sqrt{S^{\prime}_{\Delta w_2}} + \epsilon}\end{aligned}$$
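The algorithm above can be sketched as a single update function; here it is run on the toy function \(f(w) = w^2\), and the constants and iteration count are illustrative:

```python
import numpy as np

beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8

def adam_step(w, grad, v, s, t):
    """One Adam update with bias-corrected first and second moments."""
    v = beta1 * v + (1 - beta1) * grad          # momentum term V
    s = beta2 * s + (1 - beta2) * grad ** 2     # RMSProp term S
    v_hat = v / (1 - beta1 ** t)                # bias corrections V', S'
    s_hat = s / (1 - beta2 ** t)
    w = w - eta * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Minimize the toy function f(w) = w^2 (gradient 2w), starting from w = 5.
w, v, s = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, v, s = adam_step(w, 2 * w, v, s, t)
print(w)
```

The bias corrections matter early on, when \(V\) and \(S\) are still close to their zero initialization and would otherwise underestimate the moments.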

In this post, we have discussed various optimization algorithms used in deep learning, each with its own hyperparameters. Even though there is no straightforward rule for choosing these hyperparameters, some common values for \(\beta_1\), \(\beta_2\) and \(\epsilon\) are:

- Learning rate, \(\eta\): needs to be tuned
- \(\beta_1\): 0.9
- \(\beta_2\): 0.999
- \(\epsilon\): \(10^{-8}\)

In this tutorial, our case study is predicting house prices. We have a dataset of house prices together with the square footage of each house.

The data we use in machine learning is inherently noisy. The way the world works is that there is some true relationship between \(x\) and \(y\), and we represent that relationship by \(f_{w(true)}\), our notation for the true functional relationship. But of course, that is not a perfect description of the relationship between \(x\) and \(y\), the number of square feet and the house value.

There are a lot of other contributing factors: attributes of the house not captured by square footage, how a person feels when they go in to purchase a house, a personal relationship they might have with the owners, and lots of other things we can never perfectly capture with just some function between square feet and value. That is the noise inherent in this process, represented by the epsilon term.

So in particular, for any observation \(y_i\), it is the sum of the relationship between square feet and value plus a noise term \(\epsilon_i\) specific to that \(i\)th house. We assume this noise has zero mean, because if it didn't, the nonzero mean could be folded into the \(f\) function instead. This noise is just a property of the data: we have no control over it, and it has nothing to do with our model or our estimation procedure. It is called irreducible error because it is nothing we can reduce by choosing a better model or a better estimation procedure. The things we can control are bias and variance, and we will discuss these two terms next.

Suppose we have a dataset that is just a random snapshot of \(N\) houses that were sold, recorded, and tabulated in our data set. Based on that data set, we fit some function, e.g. a constant function.

But what if another set of N houses had been sold? Then we would have had a different data set that we were using. And when we went to fit our model, we would have gotten a different line.

So for one data set of size \(N\) we get one fit, and for another data set we get a different fit. And of course, there is a continuum of possible fits we might have gotten. For all those possible fits, the dashed green line below represents our average fit, averaged over all those fits weighted by how likely they were to appear.

Now, bias is the difference between this average fit and the true function, \(f_{w(true)}\). The gray shaded region above is the difference between the true function and our average fit. Intuitively, bias asks whether our model is flexible enough to capture, on average, the true relationship between square feet and house value. We see that this very simple, low-complexity constant model has high bias: it is not flexible enough to approximate the true relationship well. $$Bias(x) = f_{w(true)}(x) - f_{\bar{w}}(x)$$ Similarly, a high-complexity model has low bias.

Variance is how much the fits to individual data sets differ from one another as we look across different possible data sets. In this case, when we look at the simple constant model, the resulting fits don't vary much: across the space of all possible observations the fits are fairly similar and stable. Looking at the variation in these fits, drawn with gray bars above, we see that it is small.

So, this low-complexity model has low variance. To summarize, variance is how much the fits can vary. If they vary dramatically from one data set to another, we get very erratic predictions that are sensitive to which data set we happened to get, and that is a source of error in our predictions. To see this, we can look at high-complexity models. In particular, take the same data set and fit some high-order polynomial to it; thinking about all the possible data sets we might get, we might get some crazy set of curves. So a high-complexity model has high variance.

Finally, we define the error term as: $$\begin{aligned}\text{Error} &= \text{Reducible Error} + \text{Irreducible Error} \\ &= \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\end{aligned}$$
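We can estimate these quantities numerically for the constant model by simulating many datasets; the true function, noise level, and dataset size below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def f_true(x):
    return 2.0 * x                # hypothetical true relationship f_w(true)

x0 = 0.8                          # point at which we evaluate bias and variance
fits = []
for _ in range(2000):             # many hypothetical datasets of size N = 20
    x = rng.uniform(0.0, 1.0, size=20)
    y = f_true(x) + rng.normal(0.0, 0.1, size=20)   # irreducible noise
    fits.append(y.mean())         # least-squares constant fit to this dataset
fits = np.asarray(fits)

bias = f_true(x0) - fits.mean()   # Bias(x0) = f_true(x0) - average fit
variance = fits.var()             # spread of the fits across datasets
```

The simulation shows the pattern described above: the constant model's average fit misses the true function badly (large bias) while the individual fits barely vary across datasets (small variance).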

Here, we're gonna plot bias and variance as a function of model complexity. As we discussed previously, as model complexity increases, bias decreases, because we can better and better approximate the true relationship between \(x\) and \(y\). On the other hand, variance increases. So a very simple model has very low variance, and high-complexity models have high variance.

What we see is there’s a natural tradeoff between bias and variance called bias-variance tradeoff. And one way to summarize this is something that’s called mean squared error. Mean squared error is simply the sum of bias squared plus variance. Machine learning is all about this bias-variance tradeoff. We’re gonna see this again and again. And the goal is finding the optimal spot. The optimal spot where we get our minimum error, the minimum contribution of bias and variance, to our prediction errors.

In the earlier era of machine learning, there used to be a lot of discussion of the bias-variance tradeoff, because we could trade one for the other: increase bias to reduce variance, or reduce bias at the cost of more variance. Back in the pre-deep-learning era, we didn't have many tools that could reduce just bias or just variance without hurting the other.

But in the modern deep learning, neural network or big data era, so long as we can keep training a bigger network, and so long as we can keep getting more data, which isn’t always the case for either of these, but if that’s the case, then getting a bigger network almost always just reduces bias without necessarily hurting variance, so long as we regularize appropriately. And getting more data pretty much always reduces variance and doesn’t hurt bias much.

Well, we defined bias very explicitly in terms of the relationship to the true function. And defining variance, like bias, requires averaging over all possible data sets of size \(N\) we could have gotten from the world, and we just don't know what that is. So we can't compute bias and variance exactly. But there are practical ways to manage this tradeoff: if we underfit the data, we have high bias, and if we overfit the data, we have high variance.

So for the sake of argument, let’s say that we’re recognizing cats in pictures, which is something that people can do nearly perfectly.

|  | Case 1 | Case 2 | Case 3 | Case 4 |
| --- | --- | --- | --- | --- |
| Train set error | 1% | 15% | 15% | 0.5% |
| Dev set error | 11% | 16% | 30% | 1% |
| Diagnosis | High variance | High bias | High bias & high variance | Low bias & low variance |

Here we can see that when we have a small training set error and a relatively large dev set error, it implies we may have overfitted the training data and are not generalizing well: we have high variance. Similarly, a large training set error means we have underfitted the data, which implies high bias.
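This diagnosis can be sketched as a small helper; the `diagnose` function, its 5% threshold, and the near-zero human (Bayes) error baseline are all hypothetical choices matching the cat example:

```python
def diagnose(train_error, dev_error, bayes_error=0.0, gap=0.05):
    """Rough bias/variance diagnosis from train and dev set errors.

    The thresholds are illustrative, not canonical."""
    high_bias = (train_error - bayes_error) > gap      # underfitting the train set
    high_variance = (dev_error - train_error) > gap    # not generalizing to dev
    if high_bias and high_variance:
        return "high bias & high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias & low variance"

print(diagnose(0.01, 0.11))   # small train error, large dev error
```

Running it on the four columns of the table reproduces the four diagnoses.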

To fix the high bias problem we can do the following:

- Use a bigger network
- Train longer
- Find a better-suited NN architecture

To fix the high variance problem we can do the following:

- Use more data
- Add regularization
- Find a better NN architecture

But in the modern deep learning, neural network or big data era getting a bigger network almost always just reduces bias without necessarily hurting variance, so long as we regularize appropriately. Also, getting more data always reduces variance and doesn’t hurt bias much. But we can’t always get more training data, or it could be expensive. Adding regularization will often help to prevent overfitting, or to reduce the errors in our network.

So generally, high-complexity models overfit the data, having very low bias but high variance, while low-complexity models underfit the data, having high bias but low variance. We want to balance the trade-off between bias and variance to reach the spot with good predictive performance. In this tutorial, we will show how we can use regularization to automatically balance bias and variance. We will introduce concepts like ridge and lasso regression and develop these ideas using logistic regression.

We use regularization to prevent overfitting. But how do we know that a model is overfitting? We need a quantitative measure that indicates when a model is overfitting. Consider two models:

$$\begin{aligned}\hat{y} &= \sigma(x_1+x_2) \\ \hat{y} &= \sigma(10x_1+10x_2) \end{aligned}$$

Our task is to separate two points, \((1,1)\) and \((-1,-1)\), with a line using these models. We will use the sigmoid activation function \(\sigma\). In the first model:

$$\begin{aligned} \hat{y} &= \sigma(w_1x_1 + w_2x_2 + b) \\ \hat{y} &= \sigma(1+1) = \sigma(2) = 0.880797 \\ \hat{y} &= \sigma(-1-1) = \sigma(-2) = 0.1192\end{aligned}$$

Now in the second model, for (1,1) and (-1,-1):

$$\begin{aligned}\hat{y} &= \sigma(10(1)+10(1)) = \sigma(20) = 0.99999 \\ \hat{y} &= \sigma(-10-10) = \sigma(-20) = 0.000000002 \end{aligned}$$
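These probabilities are easy to verify numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Model 1: sigma(x1 + x2) at (1, 1) and (-1, -1)
print(sigmoid(2.0), sigmoid(-2.0))
# Model 2: sigma(10*x1 + 10*x2) at the same two points
print(sigmoid(20.0), sigmoid(-20.0))
```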

Looking at the probabilities for the first time, one might think that the second model is better since it gives more confident probabilities. But that is not the case: the second model is actually overfitting.

When we apply the sigmoid to small values, as in the first model, we get a function with a nice slope for gradient descent. But when we multiply the linear function \(x_1 + x_2\) by 10 and take \(\sigma(10x_1 + 10x_2)\), our predictions are closer to 0 and 1, yet the function becomes much steeper and much harder to optimize with gradient descent: the derivatives are mostly close to zero, and very large only in the middle of the curve. Therefore, for gradient descent we prefer model 1 over model 2. Conceptually, the second model is too certain and leaves little room for gradient descent to work.

So, when models become overfit, their estimated coefficients tend to become really large in magnitude. Ridge and lasso regression quantify overfitting through this measure of the magnitude of the coefficients, and we incorporate it by modifying our cost function to account for two things:

- How well function fits data
- Magnitude of coefficients

We have seen the magnitude of our estimated coefficients as an indicator of overfitting. So we can write down a total cost that has these two terms:

$$ \textbf{Total Cost} = \textbf{Measure of Fit} + \textbf{Measure of Magnitude of Coefficients}$$

where the total cost is our new measure of the quality of the fit. A small measure of fit indicates a good fit to the data. On the other hand, if the measure of the magnitude of the coefficients is small, the coefficients are small and we are unlikely to be in the setting of a very overfit model. So, what is our measure of fit? For logistic regression, we try to minimize the cross-entropy cost function, defined as: $$\text{Cross-Entropy} = -\frac{1}{m}\sum_{i=1}^m y_{i}\ln(p_i) + (1 - y_i)\ln(1-p_i)$$ Now we just need a measure of the magnitude of the coefficients that tells us when the coefficients are big. One way is simply summing all the coefficients together: $$\textbf{Magnitude of Coefficients} = \sum_{i=1}^n w_i$$ The problem is that two large coefficients of opposite sign, e.g. \(w_1 = 1{,}527{,}301\) and \(w_2 = -1{,}605{,}253\), give a small sum \(w_1 + w_2\) despite each coefficient being quite large. We can fix this issue by taking absolute values: $$\textbf{Magnitude of Coefficients} = \mid w_1\mid + \ldots + \mid w_n\mid = \sum_{i=1}^n \mid w_i\mid$$ This well-known choice defines **lasso regression**, the \(L_1\) norm. So consider the resulting objective: we search over all possible \(w\) vectors to find the one that minimizes our cost function plus the \(L_1\) norm. That minimizer is \(\hat{w}\), our estimated model parameters. We would also like to control how much we weigh the complexity of the model, as measured by the magnitude of the coefficients, relative to the fit of the model. To **balance** these two terms, we introduce another parameter \(\lambda\), called a tuning parameter.
So, our cost function becomes $$\text{ERROR FUNCTION} = \frac{1}{m} \left(-\sum_{i=1}^m \left[y_{i}\ln(p_i) + (1-y_i)\ln(1-p_i)\right] + \lambda \sum_{i=1}^n \mid w_i \mid \right)$$
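The regularized cost can be sketched directly from the formula; the function name and the toy labels, probabilities, and weights are hypothetical:

```python
import numpy as np

def l1_regularized_cross_entropy(y, p, w, lam):
    """Cross-entropy over m examples plus an L1 penalty (lambda/m) * sum |w_i|."""
    m = len(y)
    ce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return ce + (lam / m) * np.sum(np.abs(w))

y = np.array([1.0, 0.0])          # toy labels
p = np.array([0.9, 0.1])          # toy predicted probabilities
w = np.array([2.0, -3.0])         # toy weights
print(l1_regularized_cross_entropy(y, p, w, lam=1.0))
```

With \(\lambda = 0\) the penalty vanishes and the function reduces to plain cross-entropy; larger \(\lambda\) makes large weights more expensive.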

Another choice is to consider the sum of the squares of the coefficients $$\textbf{Magnitude of Coefficients} = w_1^2 + \ldots + w_n^2 = \sum_{i=1}^n w_i^2$$ which defines **ridge regression**, the \(L_2\) norm: $$\text{ERROR FUNCTION} = \frac{1}{m} \left(-\sum_{i=1}^m \left[y_{i}\ln(p_i) + (1-y_i)\ln(1-p_i)\right] + \lambda \sum_{i=1}^n w_i^2 \right)$$

Now, there are many ways to implement ridge regression; we will simply add the \(L_2\) penalty to our cost function. Here we use \(\frac{\lambda}{2}\) instead of \(\lambda\), since the factor of 2 cancels when we differentiate, which makes the implementation easier: $$\text{ERROR FUNCTION} = \frac{1}{m} \left(-\sum_{i=1}^m \left[y_{i}\ln(p_i) + (1-y_i)\ln(1-p_i)\right] + \frac{\lambda}{2} \sum_{i=1}^n w_i^2\right)$$

So, how do we implement gradient descent with regularization? Previously, without regularization, our gradient descent update was $$\begin{aligned}w_{new} &= w - \alpha \cdot dw \\ dw &= \frac{dE}{dw} \end{aligned}$$ For \(L_1\) regularization the problem is that the gradient of the norm does not exist at \(0\), so we need to be careful. For the regularization term, if \(w_i > 0\) then \(\mid w_i\mid = w_i\) and its gradient is \(+1\); similarly, when \(w_i < 0\) the gradient is \(-1\). In summary, $$\begin{aligned}w_{new} &= w - \alpha \cdot dw \\ &= w - \alpha \left(\frac{dE}{dw} + \frac{\lambda}{m}\frac{d\mid w\mid}{dw}\right) \\ &=\left\{ \begin{array}{c l} w - \alpha \left(\frac{dE}{dw} + \frac{\lambda}{m}\right) & w>0\\ w - \alpha \left(\frac{dE}{dw} - \frac{\lambda}{m}\right) & w<0 \end{array}\right. \end{aligned}$$

For \(L_2\) regularization, $$dw = \frac{dE}{dw} + \frac{\lambda}{m}w$$ and then we just compute the update, same as before: $$\begin{aligned}w_{new} &= w - \alpha \cdot dw \\ &= w - \alpha \left(\frac{dE}{dw} + \frac{\lambda}{m}w\right) \\ &= \left(1 - \frac{\alpha \lambda}{m}\right) w - \alpha \frac{dE}{dw} \end{aligned}$$
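Both update rules can be sketched as small helpers; `l1_update` and `l2_update` are hypothetical names and the constants are illustrative:

```python
import numpy as np

alpha, lam, m = 0.1, 0.5, 100     # hypothetical learning rate, lambda, and m

def l1_update(w, dE_dw):
    """Subgradient step for L1: the penalty adds +/- lambda/m by the sign of w."""
    return w - alpha * (dE_dw + (lam / m) * np.sign(w))

def l2_update(w, dE_dw):
    """L2 step: equivalent to shrinking w by (1 - alpha*lambda/m) first."""
    return (1.0 - alpha * lam / m) * w - alpha * dE_dw

w = np.array([2.0, -2.0])
print(l1_update(w, 0.0), l2_update(w, 0.0))
```

With a zero loss gradient, \(L_1\) subtracts a constant amount from each weight's magnitude, while \(L_2\) shrinks each weight by a constant factor, which is exactly the difference between the two derivations above.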

In the above equation, if we take \(\lambda = 0\), we get the previous cost function without regularization. Now, let's take \(\lambda = \infty\), putting a massively large weight on the magnitude term. What happens to any solution where \(\hat{w} \neq 0\)? For those solutions, the total cost is \(\infty\), while for \(\hat{w} = 0\) the total cost is minimal. So the minimizing solution is always \(\hat{w} = 0\), because that minimizes the total cost over all possible \(w\). In practice, we operate in a regime where \(\lambda\) is somewhere between \(0\) and \(\infty\).

Now, let’s talk about this in the context of the bias-variance trade-off. And what we saw is when we had very large lambda, we had a solution with very high bias, but low variance and one way to see this is that, when we’re cranking lambda all the way up to infinity, in that limit, we get coefficients shrunk to be zero, and clearly that’s a model with high bias but low variance. On the other hand, when we had very small lambda, we have a model that is low bias, but high variance. And to see this think about setting lambda to zero, in which case, we just get our old solution. And there we see that for higher complexity models clearly you’re gonna have low bias but high variance. So what we see is this lambda tuning parameter controls our model complexity and controls this bias-variance trade-off.

Animal brains, even small ones like the brain of a pigeon, were more capable than digital computers with huge processing power and storage space. This puzzled scientists for many years and turned their attention to the architectural differences. Traditional computers process data very much sequentially and exactly, with no fuzziness. Animal brains, on the other hand, although apparently running at much slower rhythms, seem to process signals in parallel, and fuzziness is a feature of their computation.

The basic unit of a biological brain is the **neuron**. Neurons come in various forms, but their job is to transmit electrical signals from one end to the other, from the dendrites along the axon to the terminals. These signals are then passed from one neuron to another. This is how our body senses light, touch, pressure, heat and so on. Signals from specialized sensory neurons are transmitted along our nervous system to our brain, which itself is mostly made of neurons too. Now, the question is: why are biological brains so capable, even though they are much slower and consist of relatively few computing elements compared to modern computers?

Let's look at how a biological neuron works. It takes an electrical input and pops out another electrical signal. But can we represent neurons as linear functions? The answer is no! A biological neuron doesn't produce an output that is a linear function of the form $$Z = W.X + b.$$ Neurons don't react readily to input; instead, they suppress it until it has grown so large that it triggers an output. Here comes the idea of the **activation function**.

A function that takes the input signal and generates an output signal, but takes into account some kind of **threshold** is called an activation function. There are many such activation functions.

Here we can see that for the **step function** the output is zero for low input values, but once the input reaches the threshold, the output jumps up. We can improve on the step function in many ways. The S-shaped function shown above, called the **sigmoid** or **logistic** function, is another very popular activation function, whose equation is $$\sigma{(z)} = \frac{1}{1+e^{-z}}$$ Another very important and widely used activation function is **ReLU**, the *Rectified Linear Unit*, whose equation is $$R(z) = max(0, z)$$ Here is a brief table of different types of activation functions.
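The three activation functions named above can be sketched as:

```python
import numpy as np

def step(z, threshold=0.0):
    """Step function: output jumps from 0 to 1 once z crosses the threshold."""
    return np.where(z >= threshold, 1.0, 0.0)

def sigmoid(z):
    """Sigmoid (logistic) function: a smooth S-shaped curve between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified Linear Unit: R(z) = max(0, z)."""
    return np.maximum(0.0, z)
```

All three accept NumPy arrays, so they can be applied element-wise to a whole layer's weighted sums at once.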

The basic computational unit of a neural network is also called a neuron. It receives input from some other nodes, or from an external source, and generates an output. Each input has an associated weight \((w)\), assigned on the basis of its relative importance to the other inputs. The node applies an activation function, e.g. sigmoid, to the weighted sum of its inputs. If the combined signal is not large enough, the effect of the sigmoid threshold function is to suppress the output signal; otherwise the node fires.

In a biological neural network, electrical signals are collected by dendrites and combine to form a stronger signal. If the signal is strong enough to pass the threshold, the neuron fires a signal down the axon towards the terminals to pass it onto the next neuron’s dendrites. The important thing to notice is that each neuron takes input from many neurons before it and also provides signals to many more. One way to replicate this from nature in an artificial model is to have layers of neurons, each connected to every neuron in the preceding and subsequent layers. The following diagram illustrates this idea:

Here we can see a neural network with three layers, each with several artificial neurons or nodes, and each node connected to every node in the preceding and next layers. This is how we take the idea from the biological brain and apply it to build a neural architecture for computers. But how does this architecture actually learn? The most obvious thing to adjust is the strength of the connections between nodes. Within a node, we could instead have adjusted the summation of the inputs, or the shape of the sigmoid threshold function, but that is more complicated than simply adjusting the strength of the connections between the nodes. The diagram on the right shows the connected nodes, but this time a weight is shown associated with each connection. A low weight will de-emphasize a signal, and a high weight will amplify it.

Next, we will see how signals are calculated in a neural network, flowing from the inputs through the different layers to become the output. This is called the **forward propagation** part of a neural network.

Suppose, we have a Boolean function represented by \(F(x,y,z) = xy + \bar{z}\). The values of this function are given below which we will use to demonstrate the calculations of the neural network.

| \(x\) | \(y\) | \(z\) | \(xy\) | \(\bar{z}\) | \(F(x,y,z) = xy + \bar{z}\) |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 1 |
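As a quick sanity check, the truth table above can be reproduced in a few lines of Python:

```python
def F(x, y, z):
    # F(x, y, z) = xy + NOT z, on Boolean 0/1 inputs
    return int((x and y) or (not z))

# (x, y, z, expected F) for every row of the truth table
truth_table = [
    (1, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 0), (1, 0, 0, 1),
    (0, 1, 1, 0), (0, 1, 0, 1), (0, 0, 1, 0), (0, 0, 0, 1),
]
for x, y, z, expected in truth_table:
    assert F(x, y, z) == expected
print("all rows match")  # prints "all rows match"
```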

Let’s use the third row, \((1, 0, 1) \Rightarrow 0\), to demonstrate forward propagation.

Here, there is a neural network with three layers. The first layer is called the input layer and the last layer is called the output layer. The layers in the middle are called hidden layers. We have used one hidden layer for simplicity. The input and hidden layers contain three nodes each, and the output layer contains a single node. We now assign weights to the synapses between the input and hidden layers. The weights are taken randomly between \(0\) and \(1\) since this is the first time we’re forward propagating.

Now, for a single neuron or node, we take all the inputs, multiply each by its associated weight, and sum them. Then the node applies an activation function, e.g. sigmoid, to the weighted sum of its inputs to introduce non-linearity. $$\begin{aligned}z = \sum w_i x_i \\ \sigma(z) = \frac{1}{1 + e^{-z}}\end{aligned}$$ For several layers, this process repeats for every node in those layers. So, let’s focus on node \(1\) of the hidden layer. All the nodes in the input layer are connected to it. Those input nodes have raw values of \(1\), \(0\) and \(1\), with associated weights of \(0.9, 0.8, 0.1\) respectively. We sum the products of the inputs with their corresponding weights to arrive at the first value for the hidden layer, and do the same to get the other values of the hidden layer.

\(z_1 = w_1 * x_1 + w_4 * x_2 + w_7 * x_3 = 1 * 0.9 + 0 * 0.8 + 1 * 0.1 = 1\)

\(z_2 = w_2 * x_1 + w_5 * x_2 + w_8 * x_3 = 1 * 0.3 + 0 * 0.5 + 1 * 0.6 = 0.9\)

\(z_3 = w_3 * x_1 + w_6 * x_2 + w_9 * x_3 = 1 * 0.2 + 0 * 0.4 + 1 * 0.7 = 0.9\)

These sums are written smaller inside the circles because they’re not the final values. We can now calculate each node’s final output value using the activation function \( \sigma(z) = \frac{1}{1 + e^{-z}}\). Applying \(\sigma(z)\) to the three hidden-layer weighted sums, we get: $$ \begin{aligned}h_1 &= \sigma(1.0)& = 0.731058578630 \\ h_2 &= \sigma(0.9) &= 0.710949502625 \\ h_3 &= \sigma(0.9) &= 0.710949502625 \end{aligned}$$ We add these to our neural network as the hidden-layer results. Then, we calculate the weighted sum of the hidden-layer results with the second set of weights (also determined at random) to determine the output sum.

0.73 * 0.3 + 0.71 * 0.5 + 0.71 * 0.9 = 1.213

Finally, we apply the sigmoid activation function to get the final output result: $$ \sigma(1.213) = 0.7708293339958$$ Because we used a random set of initial weights, the value of the output neuron is off the mark; in this case by +0.77 (since the target is 0).
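The whole forward pass above can be sketched in a few lines of Python. The weight values are the ones used in the worked example; computing with full precision (instead of the rounded hidden values 0.73 and 0.71 used in the text) gives an output of about 0.771 rather than 0.7708, which is the same result up to rounding:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1, 0, 1]             # the truth-table row (1, 0, 1), whose target is 0
# Input-to-hidden weights, one row per hidden node: z_i = sum_j W1[i][j] * x[j]
W1 = [[0.9, 0.8, 0.1],    # w1, w4, w7
      [0.3, 0.5, 0.6],    # w2, w5, w8
      [0.2, 0.4, 0.7]]    # w3, w6, w9
W2 = [0.3, 0.5, 0.9]      # hidden-to-output weights

z = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
h = [sigmoid(zi) for zi in z]                       # hidden-layer activations
output = sigmoid(sum(w * hi for w, hi in zip(W2, h)))

print([round(zi, 2) for zi in z])  # [1.0, 0.9, 0.9]
print(round(output, 2))            # 0.77
```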

We can see that we are not even close to our target value. That’s because we initialized the weights randomly and have yet to calibrate them. The process we will use to calibrate the weights is called **backpropagation**, which we will cover next. But before diving into backpropagation, we need to give some idea of computing the forward propagation process using matrix computation.

A matrix is nothing but a table or a grid of numbers. For example,

\[\begin{bmatrix} w_{1,1}& w_{1,2} & w_{1,3}\\ w_{2,1}&w_{2,2}& w_{2,3} \\ w_{3,1}&w_{3,2}&w_{3,3}\\ \end{bmatrix}\] Here, the matrix values are the weights of the neural network, and we can represent the inputs of the network by another matrix \[\begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix}\] When we multiply these two matrices we get

\[\begin{aligned}X &= \begin{bmatrix} w_{1,1}& w_{1,2} & w_{1,3}\\ w_{2,1}&w_{2,2}& w_{2,3} \\ w_{3,1}&w_{3,2}&w_{3,3}\\ \end{bmatrix}^T . \begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix} \\ &= \begin{bmatrix} w_{1,1}& w_{2,1} & w_{3,1}\\ w_{1,2}&w_{2,2}& w_{3,2} \\ w_{1,3}&w_{2,3}&w_{3,3} \end{bmatrix} . \begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix} \\&= \begin{bmatrix} w_{1}& w_{4} & w_{7}\\ w_{2}&w_{5}& w_{8} \\ w_{3}&w_{6}&w_{9}\\ \end{bmatrix} . \begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix} \\ &= \begin{bmatrix}(w_{1} * input_1) + (w_{4} * input_2) + (w_{7} * input_3) \\ (w_{2} * input_1) + (w_{5} * input_2) + (w_{8} * input_3)\\ (w_{3} * input_1) + (w_{6} * input_2) + (w_{9} * input_3) \end{bmatrix}\end{aligned}\] This is the same result we found earlier as the weighted sum of the input and the hidden layer. So, we can calculate the hidden-layer output: $$\begin{aligned} H &= \sigma(W^T .x) \\ &= \begin{bmatrix}h_1 \\ h_2 \\ h_3 \end{bmatrix} \end{aligned}$$ where \(W\) is the weight matrix and \(x\) is the input matrix. This is much easier and faster to calculate, since it doesn’t require computing every node individually. This technique is called **vectorization**.
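To make the vectorization idea concrete, here is a minimal sketch in plain Python, using the weight values from the worked example. The matrix-vector product of the transposed weight matrix with the input reproduces the per-node weighted sums computed earlier:

```python
def matvec(M, v):
    # Multiply matrix M (list of rows) by column vector v.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Weight matrix W with W[i][j] = w_{i+1, j+1}; its rows are (w1, w2, w3), ...
W = [[0.9, 0.3, 0.2],
     [0.8, 0.5, 0.4],
     [0.1, 0.6, 0.7]]
x = [1, 0, 1]

WT = [list(col) for col in zip(*W)]  # transpose: rows become (w1, w4, w7), ...
z = matvec(WT, x)

print([round(zi, 2) for zi in z])    # [1.0, 0.9, 0.9]
```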

So, the general equations will be $$ \begin{aligned} z_1^{[1]} = w_1^{[1]T} x + b_1^{[1]}, a_1^{[1]} = \sigma(z_1^{[1]}) \\ z_2^{[1]} = w_2^{[1]T} x + b_2^{[1]}, a_2^{[1]} = \sigma(z_2^{[1]}) \\ z_3^{[1]} = w_3^{[1]T} x + b_3^{[1]}, a_3^{[1]} = \sigma(z_3^{[1]}) \end{aligned}$$ where \(z_i^{[l]}\) is the weighted sum of a single node, \(l\) denotes the layer number and \(i\) denotes the node number within a layer.

$$ \begin{aligned}z^{[1]} &= \begin{bmatrix} w_1^{[1]T} \\ w_2^{[1]T} \\ w_3^{[1]T} \end{bmatrix} . \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \end{bmatrix} \\ &= \begin{bmatrix} w_1^{[1]T}.x + b_1^{[1]} \\ w_2^{[1]T}.x + b_2^{[1]} \\ w_3^{[1]T}.x + b_3^{[1]} \end{bmatrix} = \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \end{bmatrix} \\ z^{[1]} &= W^{[1]}.x + b^{[1]} \\ a^{[1]} &= \sigma(z^{[1]}) \\ z^{[2]} &= W^{[2]}.a^{[1]} + b^{[2]} \\ a^{[2]} &= \sigma(z^{[2]}) \end{aligned}$$

Now, these equations are for a single training example! But in general, we will have many such examples, like the rows of the Boolean function shown above. Let $$ X = \begin{bmatrix} x ^{(1)} & x^{(2)} & … & x^{(m)}\end{bmatrix}$$ where \(x ^{(1)}, x^{(2)}, … , x^{(m)} \) are different training examples. Clearly, we have \(m\) training examples. Then $$\begin{aligned} Z^{[1]} &= W^{[1]}.X + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[2]} &= W^{[2]}.A^{[1]} + b^{[2]} \\ A^{[2]} &= \sigma(Z^{[2]}) \\ Z^{[1]} &= \begin{bmatrix} z ^{[1](1)} & z^{[1](2)} & … & z^{[1](m)}\end{bmatrix}\\ A^{[1]} &= \begin{bmatrix} a ^{[1](1)} & a^{[1](2)} & … & a^{[1](m)}\end{bmatrix} \end{aligned}$$ where \(a^{[l](m)}\) denotes the output of layer \(l\) for example \(m\).
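Batching the examples into the columns of a matrix \(X\) lets one matrix multiplication handle all of them at once. Here is a minimal sketch in plain Python, using the first-layer weights from the worked example and biases set to zero (an assumption for simplicity, since the worked example did not use a bias):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matmul(A, B):
    # (n x k) . (k x m) -> (n x m); zip(*B) iterates over the columns of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

W1 = [[0.9, 0.8, 0.1],
      [0.3, 0.5, 0.6],
      [0.2, 0.4, 0.7]]
b1 = [0.0, 0.0, 0.0]       # zero bias, assumed for simplicity
# Each column of X is one training example: here (1, 0, 1) and (1, 1, 0).
X = [[1, 1],
     [0, 1],
     [1, 0]]

Z1 = [[z + b for z in row] for row, b in zip(matmul(W1, X), b1)]
A1 = [[sigmoid(z) for z in row] for row in Z1]

# Weighted sums for the first example (first column), as computed before:
print([round(row[0], 2) for row in Z1])  # [1.0, 0.9, 0.9]
```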

To improve our model, we first have to quantify how wrong our model’s predictions are compared to the target values. Then, we adjust the weights accordingly so that the margin of error decreases. Similar to forward propagation, backpropagation calculations occur at each layer, but, as the name indicates, backwards. We begin by changing the weights between the hidden layer and the output layer.

To quantify how wrong our model is, we first have to calculate the error between the predicted values and the target values of the model. To do this we use a cost function. A cost function could be the sum of the differences between the target values and the output values: $$ E_{total} = \sum (target - output)$$ Suppose we have target values \(2, 3, 5, 9\) and output values \(1, 5, 3, 6\) respectively.

Then the **total error** becomes $$ E_{total} = 1 - 2 + 2 + 3 = 4$$ We can see that the second and third values cancel each other, so we are not getting the actual error. To make our model more accurate, we have to use a different cost function. What about the sum of the absolute values of the errors?

$$ E_{total} = \sum |target – output|$$

Then the total error becomes $$ E_{total} = 1 + 2 + 2 + 3 = 8$$ and nothing cancels. The reason this isn’t popular is that the slope isn’t continuous near the minimum, which makes gradient descent work poorly: we can bounce around the V-shaped valley that this error function has. The slope doesn’t get smaller closer to the minimum, so our steps don’t get smaller, which means they risk overshooting. A better option is to use the sum of the squares of the errors. So, we calculate the error for each output neuron using the squared error function and sum them to get the total error: $$ E_{total} = \sum \frac{1}{2}(target - output)^2$$
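The comparison of the three cost functions can be verified directly on the example values above:

```python
targets = [2, 3, 5, 9]
outputs = [1, 5, 3, 6]

# Plain sum of differences: the -2 and +2 errors cancel each other out.
plain = sum(t - o for t, o in zip(targets, outputs))
# Sum of absolute errors: nothing cancels, but the slope is not smooth.
absolute = sum(abs(t - o) for t, o in zip(targets, outputs))
# Sum of squared errors (with the conventional 1/2 factor): smooth everywhere.
squared = sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))

print(plain)     # 4
print(absolute)  # 8
print(squared)   # 9.0, i.e. 0.5 * (1 + 4 + 4 + 9)
```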

For example, the target output for our network is \(0\) but the neural network output is \(0.77\); therefore its error is: $$E_{total} = \frac{1}{2}(0 - 0.77)^2 = 0.29645$$ **Cross-entropy** is another very popular cost function, whose equation is: $$ E_{total} = - \sum target * \log(output)$$

With backpropagation, our goal is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and for the network as a whole. As with forward propagation, we will derive equations for a single neuron to update the weights and then extend the concept to the rest of the network.

Consider a network with inputs \(x_1, x_2\) and \(x_3\) and associated weights \(w_1, w_2\) and \(w_3\) respectively. We know from the forward propagation part that $$\begin{aligned}\sigma(z) &= \frac{1}{1 + e^{-z}}, \\ \text{where} \; z &= \sum w_ix_i = w_1x_1 + w_2x_2 + w_3x_3,\end{aligned}$$ which we call the output of the network. We have a target value for the network, and for an untrained network whose weights are not calibrated, there will be an error. We denote this error as \(E_{total}\), where $$ E_{total} = \frac{1}{2}(target - output)^2$$ Our job is to find out how to adjust the weights to decrease this error. Consider \(w_1\). We want to know how much a change in \(w_1\) affects the total error, \(\frac{\partial E_{total}}{\partial w_1}\). If we look closely, we can see that the error is affected by the output, the output is affected by \(z\), and \(z\) is affected by the weight \(w_1\). By applying the chain rule, we know that $$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial output} * \frac{\partial output}{\partial z} * \frac{\partial z}{\partial w_1} $$ We need to break down each piece of the equation.

$$\begin{aligned} \frac{\partial E_{total}}{\partial output} &= 2. \frac{1}{2} . (target - output)^{2-1} . -1\\ &= (output - target)\end{aligned}$$ Next, we have to find out how much the output changes with respect to its total net input, where $$\begin{aligned}output &= \sigma(z) = \frac{1}{1 + e^{-z}} \\ \frac{\partial(output)}{\partial z} &= \frac{e^{-z}}{{(1+e^{-z})}^2} \\ &= \frac{1}{1 + e^{-z}} . \frac{e^{-z}}{1 + e^{-z}} \\ &= \sigma(z).(1 - \sigma(z))\end{aligned}$$ Finally, how much the total net input \(z\) changes with respect to \(w_1\) needs to be determined: $$\begin{aligned}z &= w_1x_1 + w_2x_2 + w_3x_3 \\ \frac{\partial z}{\partial w_1} &= x_1 \end{aligned}$$ Putting all the pieces together: $$\frac{\partial E_{total}}{\partial w_1} = (output - target) * \sigma(z).(1 - \sigma(z)) * x_1$$ To adjust the weights, we then use the formula $$\begin{aligned}w_1 &= w_1 - \eta * dw_1 \\ dw_1 &= \frac{\partial E_{total}}{\partial w_1} \end{aligned}$$ where \(\eta\) is called the learning rate.
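The chain-rule derivation above can be sketched as a small helper that returns the gradient of the squared error with respect to each weight of a single sigmoid neuron (the example input, weight, and target values at the end are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x, w, target):
    # Forward pass for one sigmoid neuron: z = sum_i w_i * x_i.
    z = sum(wi * xi for wi, xi in zip(w, x))
    output = sigmoid(z)
    # Chain rule: dE/dw_i = (output - target) * sigma(z)*(1 - sigma(z)) * x_i
    delta = (output - target) * output * (1.0 - output)
    return [delta * xi for xi in x]

# Illustrative call: input (1, 0, 1), weights (0.9, 0.3, 0.2), target 0.
print([round(g, 3) for g in gradient([1, 0, 1], [0.9, 0.3, 0.2], 0.0)])
```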

This rule for updating weights is called **Gradient Descent**. Now, let’s do a worked example of updating the weight \(w_{10}\) in figure 1. Here, $$\begin{aligned} dw_{10} &= \frac{\partial E_{total}}{\partial w_{10}} \\ &= (output - target) * \sigma(z).(1 - \sigma(z)) * h_1 \\ &= (.77 - 0) * \sigma(1.2)*(1 - \sigma(1.2)) * .73 \\ &= 0.1 \\ w_{10} &= w_{10} - \eta * dw_{10} \\ &= 0.3 - (.5 * .1)\\ &= 0.25\end{aligned}$$ Similarly, we can update the other weights; this is generally a long process.
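The \(w_{10}\) update above checks out numerically. A sketch, using the rounded values \(z = 1.2\) and \(h_1 = 0.73\) from the worked example and a learning rate of \(0.5\) (the value the text’s arithmetic implies):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

output, target = 0.77, 0.0
z, h1 = 1.2, 0.73   # rounded values from the forward-pass example
eta = 0.5           # learning rate implied by the worked arithmetic

dw10 = (output - target) * sigmoid(z) * (1.0 - sigmoid(z)) * h1
w10 = 0.3 - eta * dw10

print(round(dw10, 2))  # 0.1
print(round(w10, 2))   # 0.25
```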

We have done all the hard work so far so that we can predict on new data using our neural network. The dataset we work on is generally split into two parts: one part, called the training data, is where we do all the training, and the other, called the test data, is where we test our network. We have developed equations for training, and using them we obtain a calibrated set of weights. We then use this set of weights to predict the result for new data using the equation $$ Prediction = W^T.X_{test} + b$$ where \(W\) and \(b\) are the calibrated weights and bias respectively and \(X_{test}\) is the test set split from our dataset.
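The prediction step can be sketched as a tiny helper. The weight, bias, and test-input values below are purely illustrative, and in a real classifier an activation or threshold would typically follow this linear score:

```python
def predict(W, b, X_test):
    # Linear score W^T . x + b for each test example x in X_test.
    return [sum(w * xi for w, xi in zip(W, x)) + b for x in X_test]

# Illustrative calibrated weights/bias and two made-up test examples.
scores = predict([0.5, -0.25], 0.1, [[1.0, 2.0], [0.0, 4.0]])
print([round(s, 2) for s in scores])  # [0.1, -0.9]
```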

Now that we have finished the theoretical part of the tutorial, you can look at the code and try to understand its different blocks.
