In this era, nearly everything is influenced by artificial intelligence, machine learning, and deep learning. Datasets have grown very large, and we need fast, efficient optimization algorithms to leverage all that data. Training a neural network is generally much harder than other optimization problems in deep learning, and a good optimization algorithm is both easy to implement and fast. In this post, we will explain the mathematical foundations behind the most common and effective optimization algorithms for neural networks.

We will start with gradient descent, the most basic and useful optimization algorithm for neural networks. Then we will explain mini-batch and Stochastic Gradient Descent (SGD), which are simple modifications of gradient descent. After that, we will introduce momentum, the core concept behind many modern optimization algorithms. Finally, we will derive the mathematics of RMSProp and Adam, two highly efficient algorithms.

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. In a neural network, this function is called the loss function or cost function. Suppose we want to solve a classification problem using a neural network. For simplicity, suppose the network contains only a single hidden layer.

Suppose we have a set of inputs or features $$X$$ with associated classification labels $$Y$$ \begin{aligned}X &= \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y^{(1)} & y^{(2)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned} where $$x^{(1)}, x^{(2)}, \ldots, x^{(m)}$$ are the $$m$$ training examples. Then the output $$A$$ will be \begin{aligned} Z^{[1]} &= W^{[1]}X + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[1]} &= \begin{bmatrix} z^{(1)} & z^{(2)} & \ldots & z^{(m)}\end{bmatrix}\\ A^{[1]} &= \begin{bmatrix} a^{(1)} & a^{(2)} & \ldots & a^{(m)}\end{bmatrix} \end{aligned} where $$a^{[l](m)}$$ denotes the output of layer $$l$$ for training example $$m$$. The weights of the network are initialized to $$0$$ or randomly.

Then we use a loss function $$E$$ to measure the error between the prediction and the true value. Our goal is to minimize this error, making the predictions as close as possible to the true values by updating the weights. A common loss function is the Sum of Squared Errors (SSE), defined as \begin{aligned} E &= \frac{1}{2} \sum \left[y - \hat{y}\right]^2 \\ E &= \frac{1}{2} \sum \left[y - \sigma\left(\sum w_i x_i\right)\right]^2\end{aligned} where $$\hat{y}$$ is the prediction and $$y$$ is the true value. Notice that the error is a function of the weights: tuning the weights alters the network's predictions, which in turn changes the overall error. Our goal is to find the weights that minimize the error, and to do that we use gradients. Suppose we plot a weight on the x-axis and the error $$E$$ on the y-axis to get a curve.
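As a quick illustration of the SSE loss above, here is a minimal NumPy sketch; the toy data, initial weights, and use of a sigmoid activation are assumptions made up for this example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 3 examples with 2 features each.
x = np.array([[0.5, 1.0],
              [1.5, -0.5],
              [-1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])   # true labels
w = np.array([0.1, -0.2])       # some initial weights

y_hat = sigmoid(x @ w)                  # prediction: sigma(sum_i w_i x_i)
E = 0.5 * np.sum((y - y_hat) ** 2)      # SSE: E = 1/2 sum (y - y_hat)^2
```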

Here we show a simple depiction of the error as a function of a single weight. Our goal is to find the weight that minimizes the error. We start with a random weight and step in the direction of the minimum, which is opposite to the gradient (the slope). After several such steps we eventually reach the minimum of the error function. To update the weights we use \begin{aligned}w_i &= w_i + \Delta w_i \\ \Delta w_i &\propto - \frac{\partial E}{\partial w_i} \\ &= - \eta \frac{\partial E}{\partial w_i}\end{aligned} Here $$\eta$$ is a constant called the learning rate of the neural network. The learning rate is a hyperparameter that needs to be tuned. Deriving the derivative term, \begin{aligned}\frac{\partial E}{\partial w_i} &= -(y - \hat{y})\frac{\partial \hat{y}}{\partial w_i} \\ &= -(y - \hat{y}) f'(z) \frac{\partial}{\partial w_i} \sum w_i x_i \\ &= -(y - \hat{y}) f'(z) x_i\end{aligned} We can simplify the update by defining another term $$\delta$$, called the error term: \begin{aligned} w_i &= w_i - \eta \frac{\partial E}{\partial w_i}\\ &= w_i + \eta (y - \hat{y}) f'(z) x_i \\ &= w_i + \eta \delta x_i\\ \text{where } \delta &= (y - \hat{y}) f'(z)\end{aligned}
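Putting the update rule and error term together, one gradient descent step for a single sigmoid unit can be sketched as follows; the toy data, number of iterations, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step(w, x, y, eta):
    """One batch gradient-descent update for a single sigmoid unit with SSE loss."""
    y_hat = sigmoid(x @ w)
    # error term: delta = (y - y_hat) * f'(z); for sigmoid, f'(z) = y_hat * (1 - y_hat)
    delta = (y - y_hat) * y_hat * (1.0 - y_hat)
    # w_i <- w_i + eta * delta * x_i, summed over the training examples
    return w + eta * (delta @ x)

# Hypothetical toy problem: 2 examples, 2 features.
x = np.array([[0.5, 1.0],
              [1.5, -0.5]])
y = np.array([1.0, 0.0])

w = np.zeros(2)                     # weights initialized to 0
initial_error = 0.5 * np.sum((y - sigmoid(x @ w)) ** 2)
for _ in range(100):
    w = gd_step(w, x, y, eta=0.5)
final_error = 0.5 * np.sum((y - sigmoid(x @ w)) ** 2)
```

Each iteration moves the weights a small step against the gradient, so the error shrinks over the loop.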

## Mini-Batch and Stochastic Gradient Descent (SGD)

In the gradient descent algorithm above we used all the training examples at once, which is also known as batch gradient descent. Batch gradient descent has some problems: if the number of training examples is very large, a single epoch takes a long time and requires a lot of memory. For example, if we have ten million training examples, we have to process all of them before taking a single step towards the minimum. To resolve this issue we use mini-batch gradient descent, splitting the training examples into small chunks called batches and training on one batch at a time. Suppose we have a set of $$m$$ inputs or features $$X$$ with associated classification labels $$Y$$. \begin{aligned}X &= \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y^{(1)} & y^{(2)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned} In mini-batch gradient descent, instead of processing all the training examples together, we split them into small batches of, say, 1000 training examples each. \begin{aligned}X &= \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(1000)} | x^{(1001)} & x^{(1002)} & \ldots & x^{(2000)} | x^{(2001)} & x^{(2002)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y^{(1)} & y^{(2)} & \ldots & y^{(1000)} | y^{(1001)} & y^{(1002)} & \ldots & y^{(2000)} | y^{(2001)} & y^{(2002)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned} For simplicity, we can denote the mini-batches as \begin{aligned}X^{\{1\}} &= x^{(1)} \; x^{(2)} \; \ldots \; x^{(1000)}\\X^{\{2\}} &= x^{(1001)} \; x^{(1002)} \; \ldots \; x^{(2000)} \\ \vdots\\ X^{\{t\}} &= x^{(1000(t-1)+1)} \; x^{(1000(t-1)+2)} \; \ldots \; x^{(m)}\\ X &= \begin{bmatrix}X^{\{1\}} & X^{\{2\}} & \ldots & X^{\{t\}}\end{bmatrix}\end{aligned} The training process is similar to batch gradient descent: we pass the batches one at a time and update the weights after every single batch.
The forward propagation and loss function for mini-batch gradient descent become \begin{aligned} Z^{[1]} &= W^{[1]}X^{\{t\}} + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ E &= \frac{t}{N} \sum \left[y - \sigma\left(\sum w_i x_i\right)\right]^2 \\ \frac{\partial E}{\partial w_i} &= -\frac{t}{N}(y - \hat{y}) f'(z) x_i \end{aligned} where $$N$$ is the total number of samples and $$t$$ is the number of batches, so each batch contains $$N/t$$ examples. Like before, we can define an error term $$\delta$$ for simplification and write the update as \begin{aligned} \delta &= \frac{t}{N}(y - \hat{y}) f'(z) \\ w_i &= w_i + \eta\delta x_i\end{aligned}

When the mini-batch size is $$1$$ the method is called stochastic gradient descent.

• If the batch size is $$N$$ (a single batch, $$t=1$$): batch gradient descent
• If the batch size is between $$1$$ and $$N$$: mini-batch gradient descent
• If the batch size is $$1$$ ($$t=N$$): stochastic gradient descent
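The batch splitting above can be sketched in NumPy; the helper name `make_batches` and the toy shapes are illustrative assumptions, with examples stored as columns as in the equations:

```python
import numpy as np

def make_batches(X, Y, batch_size):
    """Split m training examples (the columns of X and Y) into mini-batches."""
    m = X.shape[1]
    return [(X[:, i:i + batch_size], Y[:, i:i + batch_size])
            for i in range(0, m, batch_size)]

# Hypothetical data: 10 examples with 3 features, split into batches of 4.
X = np.random.randn(3, 10)
Y = np.random.randn(1, 10)
batches = make_batches(X, Y, batch_size=4)
# batch_size = m     -> batch gradient descent (one batch)
# 1 < batch_size < m -> mini-batch gradient descent
# batch_size = 1     -> stochastic gradient descent
```

A training loop would then run forward propagation and the weight update once per batch, rather than once per epoch.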

Gradient descent is a basic, fundamental optimization algorithm, but it has some problems. While converging to the minimum, gradient descent oscillates up and down and takes many steps. These oscillations slow gradient descent considerably and prevent us from using a larger learning rate. Another big problem is getting stuck in a local minimum instead of the global minimum. Gradient descent with momentum helps address these issues: we calculate an exponentially weighted average of our gradients and use that average to update the weights.

Suppose we are trying to optimize a cost function whose contours look like the figure above, where the red dot denotes the position of the local optimum. Starting gradient descent from the first point, we reach the second position after one iteration, the third after another, and so on; with each iteration we move closer to the optimum. Looking closely, we can see two kinds of motion: horizontal progress towards the optimum and up-and-down oscillations in the vertical direction. These oscillations make it take longer to reach the optimum, and they also prevent us from using a bigger learning rate, since the iterates may diverge.

In this method we use an exponentially weighted average, which in simple terms just means taking previous values into account while updating the weights. Previously, to update the weights we used \begin{aligned}w_i &= w_i + \Delta w_i \\ \Delta w_i &\propto - \frac{\partial E}{\partial w_i} \\ &= - \eta \frac{\partial E}{\partial w_i}\end{aligned} In the momentum method we instead use exponentially weighted averages of $$\Delta w_1$$ and $$\Delta w_2$$, denoted $$V_{\Delta w_1}$$ and $$V_{\Delta w_2}$$ respectively: \begin{aligned}V_{\Delta w_1} &= \beta_1 V_{\Delta w_1} + (1 - \beta_1) \Delta w_1 \\ V_{\Delta w_2} &= \beta_1 V_{\Delta w_2} + (1 - \beta_1) \Delta w_2\end{aligned}

Here $$\beta_1$$ is a hyperparameter in the range $$[0,1]$$ that balances the previous values against the current one. After calculating the exponentially weighted averages, we update our parameters using them: \begin{aligned}w_1 &= w_1 + \eta V_{\Delta w_1} \\ w_2 &= w_2 + \eta V_{\Delta w_2}\end{aligned}

The intuition behind this method is quite simple. When we take the exponential average of the previous values, the up-and-down oscillations cancel each other out, so the vertical motion gets closer to zero. In the horizontal direction, however, all the gradients point the same way, so averaging in the previous values does not slow the horizontal motion down.
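A minimal sketch of the momentum update, written here with `grad` as the raw gradient so the step subtracts the averaged gradient; the toy quadratic objective and learning rate are assumptions for illustration:

```python
import numpy as np

def momentum_update(w, v, grad, eta, beta1=0.9):
    """Gradient descent with momentum: v is the exponentially
    weighted average of the gradients seen so far."""
    v = beta1 * v + (1.0 - beta1) * grad
    w = w - eta * v          # step against the averaged gradient
    return w, v

# Toy example: minimize E(w) = w^2, whose gradient is 2w.
w, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, v = momentum_update(w, v, grad=2.0 * w, eta=0.1)
```

Because consecutive gradients that flip sign partially cancel inside `v`, the averaged step stays smooth even where plain gradient descent would oscillate.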

## RMSProp

Root Mean Square Prop, or RMSProp, is another optimization algorithm, quite similar to gradient descent with momentum. As before, suppose we are optimizing a cost function whose contours look like the figure below, with the red dot marking the local optimum. Each gradient descent iteration moves us closer to the optimum, but the up-and-down oscillations in the vertical direction make convergence slower and again prevent us from using a bigger learning rate, since the iterates may diverge.

For simplicity, we derive the equations in a two-dimensional space with two weights $$w_1$$ and $$w_2$$, where $$w_1$$ moves in the horizontal direction and $$w_2$$ in the vertical direction. In RMSProp we again use exponentially weighted averages, but of the squares of the gradients: \begin{aligned}S_{\Delta w_1} &= \beta_2 S_{\Delta w_1} + (1 - \beta_2) {\Delta w_1}^2 \\ S_{\Delta w_2} &= \beta_2 S_{\Delta w_2} + (1 - \beta_2) {\Delta w_2}^2\end{aligned} After calculating the exponentially weighted averages, we update our parameters using them: \begin{aligned}w_1 &= w_1 + \eta \frac{\Delta w_1}{\sqrt{S_{\Delta w_1}} + \epsilon} \\ w_2 &= w_2 + \eta \frac{\Delta w_2}{\sqrt{S_{\Delta w_2}} + \epsilon}\end{aligned}

Here $$\epsilon$$ is a very small number, typically $$10^{-8}$$, used for numerical stability. The intuition behind RMSProp is that in the horizontal direction, our $$w_1$$ direction, we want learning to go fast, while in the $$w_2$$ direction we want to slow it down to reduce the oscillations. Since we divide by $$\sqrt{S_{\Delta w_1}}$$ and $$\sqrt{S_{\Delta w_2}}$$, we want $$S_{\Delta w_1}$$ to be smaller and $$S_{\Delta w_2}$$ to be bigger. Looking at the derivatives, the slope is much larger in the vertical direction than in the horizontal direction, so the squared gradients make $$S_{\Delta w_2}$$ relatively larger than $$S_{\Delta w_1}$$. In summary, we divide the updates in the vertical direction by a much bigger number to reduce the oscillations, while dividing the updates in the horizontal direction by a smaller number that has very little impact.
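The RMSProp update can be sketched as follows, again with `grad` as the raw gradient and a toy quadratic objective as an illustrative assumption:

```python
import numpy as np

def rmsprop_update(w, s, grad, eta, beta2=0.999, eps=1e-8):
    """RMSProp: divide the step by a running RMS of the gradients,
    so directions with large oscillating gradients take smaller steps."""
    s = beta2 * s + (1.0 - beta2) * grad ** 2
    w = w - eta * grad / (np.sqrt(s) + eps)
    return w, s

# Toy example: minimize E(w) = w^2, whose gradient is 2w.
w, s = np.array([5.0]), np.zeros(1)
for _ in range(500):
    w, s = rmsprop_update(w, s, grad=2.0 * w, eta=0.01)
```

Applied per-parameter, a weight whose gradient is consistently large (the vertical direction) accumulates a large `s` and is damped, while a weight with small gradients keeps taking relatively large steps.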

## Adam Optimization

Adam combines gradient descent with momentum and RMSProp, adding a bias correction to both averages. Initializing \begin{aligned} V_{\Delta w_1} &= 0 \quad S_{\Delta w_1} = 0 \quad V_{\Delta w_2} = 0 \quad S_{\Delta w_2} = 0\\ \text{on iteration } t&:\\ &V_{\Delta w_1} = \beta_1 V_{\Delta w_1} + (1 - \beta_1) \Delta w_1 \quad V_{\Delta w_2} = \beta_1 V_{\Delta w_2} + (1 - \beta_1) \Delta w_2\\ &S_{\Delta w_1} = \beta_2 S_{\Delta w_1} + (1 - \beta_2) {\Delta w_1}^2 \quad S_{\Delta w_2} = \beta_2 S_{\Delta w_2} + (1 - \beta_2) {\Delta w_2}^2\\ &V^{\prime}_{\Delta w_1} = \frac{V_{\Delta w_1}}{1 - \beta_1^t} \quad V^{\prime}_{\Delta w_2} = \frac{V_{\Delta w_2}}{1 - \beta_1^t} \\ &S^{\prime}_{\Delta w_1} = \frac{S_{\Delta w_1}}{1 - \beta_2^t} \quad S^{\prime}_{\Delta w_2} = \frac{S_{\Delta w_2}}{1 - \beta_2^t} \\ &w_1 = w_1 - \eta \frac{V^{\prime}_{\Delta w_1}}{\sqrt{S^{\prime}_{\Delta w_1}} + \epsilon} \\ &w_2 = w_2 - \eta \frac{V^{\prime}_{\Delta w_2}}{\sqrt{S^{\prime}_{\Delta w_2}} + \epsilon}\end{aligned}
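The iteration above can be sketched for one parameter vector as follows; `grad` stands for the raw gradient, and the toy quadratic objective is an assumption for illustration:

```python
import numpy as np

def adam_update(w, v, s, grad, t, eta, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum average v and RMSProp average s,
    both bias-corrected by 1 - beta^t before the update."""
    v = beta1 * v + (1.0 - beta1) * grad
    s = beta2 * s + (1.0 - beta2) * grad ** 2
    v_hat = v / (1.0 - beta1 ** t)   # bias-corrected first moment
    s_hat = s / (1.0 - beta2 ** t)   # bias-corrected second moment
    w = w - eta * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Toy example: minimize E(w) = w^2, whose gradient is 2w.
w = np.array([5.0])
v, s = np.zeros(1), np.zeros(1)
for t in range(1, 501):
    w, v, s = adam_update(w, v, s, grad=2.0 * w, t=t, eta=0.1)
```

The bias correction matters early on: since `v` and `s` start at zero, dividing by `1 - beta**t` keeps the first few steps from being artificially small.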
In this post, we discussed several optimization algorithms used in deep learning. These algorithms come with a number of hyperparameters, and although there is no straightforward rule for choosing them, some common default values for $$\eta$$, $$\beta_1$$, $$\beta_2$$ and $$\epsilon$$ are:
• Learning Rate, $$\eta$$ – Needs to be tuned
• $$\beta_1$$ – 0.9
• $$\beta_2$$ – 0.999
• $$\epsilon$$ – $$10^{-8}$$