We will start with gradient descent, the most basic and useful optimization algorithm for neural networks. Then we will explain the concepts of mini-batch and stochastic gradient descent (SGD), which are modifications of gradient descent. After that, we will introduce momentum, a core idea behind many modern optimization algorithms. Finally, we will derive the mathematics of RMSProp and Adam, two highly efficient algorithms.

Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. In a neural network, this function is called the loss function or cost function. Suppose we want to solve a classification problem using a neural network. For simplicity, suppose the network contains only a single hidden layer.

Suppose we have a set of inputs or features \({X}\) with associated classification labels \({Y}\): $$\begin{aligned}X &= \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y^{(1)} & y^{(2)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned}$$ where \(x^{(1)}, x^{(2)}, \ldots, x^{(m)}\) are the \(m\) training examples. Then the output \(A\) will be $$\begin{aligned} Z^{[1]} &= W^{[1]}.X + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[1]} &= \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \ldots & z^{[1](m)}\end{bmatrix}\\ A^{[1]} &= \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \ldots & a^{[1](m)}\end{bmatrix} \end{aligned}$$ where \(a^{[l](m)}\) denotes the output of layer \(l\) for example \(m\). The weights of this network are initialized with \(0\) or randomly.

Then we use a loss function \(E\) to measure the error between the prediction and the true value. Our goal is to minimize this error, making the predictions as close as possible to the true values by updating the weights. A common loss function is the Sum of Squared Errors (SSE), defined as $$\begin{aligned} E &= \frac{1}{2} \sum \left[y - \hat{y}\right]^2 \\ E &= \frac{1}{2} \sum \left[y - \sigma\left(\sum w_i x_i\right)\right]^2\end{aligned}$$ where \(\hat{y}\) is the prediction and \(y\) is the true value. Notice that the error is a function of the weights: tuning the weights alters the network's prediction, which in turn changes the overall error. Our goal is to find the weights that minimize the error, and to do that we use gradients. Suppose we plot a weight on the x-axis and the error \(E\) on the y-axis to get a curve.

Here we show a simple depiction of the error as a function of one weight. Our goal is to find the weight that minimizes the error. We start with a random weight and step towards the minimum, in the direction opposite to the gradient (the slope). After several such steps we eventually reach the minimum of the error function. To update the weights we use $$\begin{aligned}w_i &= w_i + \Delta w_i \\ \Delta w_i &\propto - \frac{\partial E}{\partial w_i} \\ &= - \eta \frac{\partial E}{\partial w_i}\end{aligned}$$ Here \(\eta\) is a constant called the learning rate of the neural network. The learning rate is a hyperparameter that needs to be tuned. Deriving the derivative term: $$\begin{aligned}\frac{\partial E}{\partial w_i} &= -(y - \hat{y})\frac{\partial \hat{y}}{\partial w_i} \\ &= -(y - \hat{y}) f'(z) \frac{\partial}{\partial w_i} \sum w_i x_i \\ &= -(y - \hat{y}) f'(z) x_i\end{aligned}$$ We can simplify the update by defining an error term \(\delta\): $$\begin{aligned} w_i &= w_i - \eta \frac{\partial E}{\partial w_i}\\ &= w_i + \eta (y - \hat{y}) f'(z) x_i \\ &= w_i + \eta \delta x_i\\ \text{where } \delta &= (y - \hat{y}) f'(z)\end{aligned}$$
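As a minimal sketch of this update rule for a single sigmoid neuron (the weights, input, target and learning rate below are illustrative, not from the post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_step(w, x, y, eta=0.5):
    """One gradient-descent update for a single sigmoid neuron.

    Implements w_i <- w_i + eta * delta * x_i with
    delta = (y - y_hat) * sigma'(z), as derived above.
    """
    z = np.dot(w, x)
    y_hat = sigmoid(z)
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    delta = (y - y_hat) * y_hat * (1.0 - y_hat)
    return w + eta * delta * x

# Repeated steps push the prediction towards the target:
w = np.array([0.1, -0.2, 0.3])
x = np.array([1.0, 0.5, -1.0])
y = 1.0
for _ in range(200):
    w = gradient_descent_step(w, x, y)
```

After a few hundred steps the prediction \(\sigma(w \cdot x)\) is close to the target of 1.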

In the gradient descent algorithm we used all the training examples at once; this is also known as batch gradient descent. But batch gradient descent has some problems: if the number of training examples is very large, a single epoch takes a long time and requires a lot of memory. For example, if we have ten million training examples, we have to process all of them before taking a single step towards the minimum. To resolve this issue we use mini-batch gradient descent, splitting the training examples into small chunks called batches and training on one batch at a time. Suppose we have a set of \(m\) inputs or features \({X}\) with associated classification labels \({Y}\). $$\begin{aligned}X &= \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y^{(1)} & y^{(2)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned}$$ In mini-batch gradient descent, instead of processing all the training examples together, we split them into small batches of e.g. 1000 training examples each. $$\begin{aligned}X &= \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(1000)} \,|\, x^{(1001)} & x^{(1002)} & \ldots & x^{(2000)} \,|\, x^{(2001)} & x^{(2002)} & \ldots & x^{(m)}\end{bmatrix} \\ Y &= \begin{bmatrix} y^{(1)} & y^{(2)} & \ldots & y^{(1000)} \,|\, y^{(1001)} & y^{(1002)} & \ldots & y^{(2000)} \,|\, y^{(2001)} & y^{(2002)} & \ldots & y^{(m)}\end{bmatrix} \end{aligned}$$ For simplicity, we denote the mini-batches as $$\begin{aligned}X^{\{1\}} &= x^{(1)} \; x^{(2)} \; \ldots \; x^{(1000)}\\X^{\{2\}} &= x^{(1001)} \; x^{(1002)} \; \ldots \; x^{(2000)} \\ \vdots\\ X^{\{t\}} &= x^{(1000(t-1)+1)} \; x^{(1000(t-1)+2)} \; \ldots \; x^{(m)}\\ X &= [X^{\{1\}} \; X^{\{2\}} \; \ldots \; X^{\{t\}}]\end{aligned}$$ The training process is similar to batch gradient descent: we pass the batches one at a time and update the weights after every single batch.
The forward propagation and loss function for mini-batch gradient descent become $$\begin{aligned} Z^{[1]} &= W^{[1]}.X^{\{t\}} + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ E &= \frac{t}{N} \sum \left[y - \sigma\left(\sum w_i x_i\right)\right]^2 \\ \frac{\partial E}{\partial w_i} &= -\frac{t}{N}(y - \hat{y}) f'(z) x_i \end{aligned}$$ where \(N\) is the total number of examples and \(t\) is the number of batches. Like before, we can define an error term \(\delta\) for simplification and write the update as $$\begin{aligned} \delta &= \frac{t}{N}(y - \hat{y}) f'(z) \\ w_i &= w_i + \eta\delta x_i\end{aligned}$$

When the mini-batch size is \(1\), the method is called stochastic gradient descent (SGD). To summarize, for \(N\) total training examples:

- If the mini-batch size equals \(N\): batch gradient descent
- If the mini-batch size is between \(1\) and \(N\): mini-batch gradient descent
- If the mini-batch size equals \(1\): stochastic gradient descent
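A sketch of the mini-batch split described above (the function name and toy data are illustrative):

```python
import numpy as np

def make_minibatches(X, Y, batch_size):
    """Split column-stacked data X (features x m) and labels Y (1 x m)
    into consecutive mini-batches, as in the partition above.
    The last batch may be smaller when batch_size does not divide m."""
    m = X.shape[1]
    batches = []
    for start in range(0, m, batch_size):
        end = min(start + batch_size, m)
        batches.append((X[:, start:end], Y[:, start:end]))
    return batches

X = np.arange(20).reshape(2, 10)   # m = 10 examples, 2 features
Y = np.ones((1, 10))
batches = make_minibatches(X, Y, batch_size=4)
# 10 examples with batch size 4 -> batches of 4, 4 and 2 examples
```

With `batch_size = m` this reduces to batch gradient descent, and with `batch_size = 1` to stochastic gradient descent.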

Gradient descent is a very basic and fundamental optimization algorithm, but it has some problems. While converging to the minimum, gradient descent oscillates up and down and takes many steps. These oscillations make gradient descent much slower and prevent us from using a larger learning rate. Another big problem is getting stuck in a local minimum instead of the global minimum. Gradient descent with momentum helps address these issues: we calculate an exponentially weighted average of our gradients and use that to update the weights.

Suppose we are trying to optimize a cost function whose contours look like the figure above, where the red dot denotes the position of the local optimum. Starting gradient descent from the first point, we reach the second position after one iteration, the third position after another, and so on. With each iteration we move closer to the local optimum. Looking closely, we can see two types of motion: horizontal and vertical. The up-and-down oscillations in the vertical direction make it take longer to reach the optimum, and they also prevent us from using a bigger learning rate, since the method may diverge.

In this method we use an exponentially weighted average, which in simple words means taking previous values into account while updating the weights. Previously, to update the weights we used $$\begin{aligned}w_i &= w_i + \Delta w_i \\ \Delta w_i &\propto - \frac{\partial E}{\partial w_i} \\ &= - \eta \frac{\partial E}{\partial w_i}\end{aligned}$$ In the momentum method we use exponentially weighted averages of the gradients \(\Delta w_1 = \frac{\partial E}{\partial w_1}\) and \(\Delta w_2 = \frac{\partial E}{\partial w_2}\), denoted \(V_{\Delta w_1}\) and \(V_{\Delta w_2}\) respectively: $$\begin{aligned}V_{\Delta w_1} &= \beta_1 V_{\Delta w_1} + (1 - \beta_1) \Delta w_1 \\ V_{\Delta w_2} &= \beta_1 V_{\Delta w_2} + (1 - \beta_1) \Delta w_2\end{aligned}$$

Here \(\beta_1\) is a hyperparameter in \([0,1]\) that balances the previous values against the current one. After calculating the exponentially weighted averages, we update the parameters using them: $$\begin{aligned}w_1 &= w_1 - \eta V_{\Delta w_1} \\ w_2 &= w_2 - \eta V_{\Delta w_2}\end{aligned}$$

The intuition behind this method is quite simple. When we take the exponential average of previous values, the up-and-down oscillations cancel each other out and the vertical motion gets close to zero. In the horizontal direction, however, all the gradients point the same way, so averaging the previous values does not slow the horizontal motion down.
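A minimal sketch of the momentum update, writing `grad` for the raw gradient \(\partial E/\partial w\); the alternating gradients below mimic the vertical oscillation and show how the running average damps it:

```python
def momentum_step(w, v, grad, eta=0.1, beta1=0.9):
    """One update of gradient descent with momentum.

    v is the exponentially weighted average of the gradients;
    the weights move along -v instead of the raw gradient.
    """
    v = beta1 * v + (1.0 - beta1) * grad
    w = w - eta * v
    return w, v

# An oscillating (vertical) gradient component is damped:
w, v = 0.0, 0.0
for t in range(10):
    grad = 1.0 if t % 2 == 0 else -1.0   # alternating +1 / -1 gradients
    w, v = momentum_step(w, v, grad)
# |v| stays far below 1 because consecutive gradients cancel
```

A gradient component that always points the same way would instead build `v` up towards its full magnitude, which is exactly the asymmetry described above.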

Root Mean Square Prop, or RMSProp, is another optimization algorithm, quite similar to gradient descent with momentum. As before, suppose we are optimizing a cost function whose contours look like the figure below, with the red dot marking the local optimum. Gradient descent again oscillates up and down in the vertical direction while progressing horizontally; the oscillations slow convergence and prevent us from using a bigger learning rate, since the method may diverge.

For simplicity, we derive the equations in a two-dimensional space with two weights \(w_1\) and \(w_2\), where \(w_1\) moves in the horizontal direction and \(w_2\) in the vertical direction. In RMSProp we again use exponentially weighted averages, but of the squares of the gradients \(\Delta w_1 = \frac{\partial E}{\partial w_1}\) and \(\Delta w_2 = \frac{\partial E}{\partial w_2}\): $$\begin{aligned}S_{\Delta w_1} &= \beta_2 S_{\Delta w_1} + (1 - \beta_2) {\Delta w_1}^2 \\ S_{\Delta w_2} &= \beta_2 S_{\Delta w_2} + (1 - \beta_2) {\Delta w_2}^2\end{aligned}$$ After calculating these averages, we update the parameters as $$\begin{aligned}w_1 &= w_1 - \eta \frac{\Delta w_1}{\sqrt{S_{\Delta w_1}} + \epsilon} \\ w_2 &= w_2 - \eta \frac{\Delta w_2}{\sqrt{S_{\Delta w_2}} + \epsilon}\end{aligned}$$

Here \(\epsilon\) is used for numerical stability and is generally a very small number, e.g. \(10^{-8}\). The intuition behind RMSProp is that in the horizontal direction, in our case the \(w_1\) direction, we want learning to go fast, while in the \(w_2\) direction we want to slow it down to reduce the oscillations. Since we are dividing by \(\sqrt{S_{\Delta w_1}}\) and \(\sqrt{S_{\Delta w_2}}\), we want \(S_{\Delta w_1}\) to be small and \(S_{\Delta w_2}\) to be large. Looking at the derivatives, the slope is much larger in the vertical direction than in the horizontal direction, so the squared gradients make \(S_{\Delta w_2}\) relatively larger than \(S_{\Delta w_1}\). In summary, we divide the updates in the vertical direction by a much bigger number to reduce the oscillations, while dividing the horizontal updates by a smaller number that has little impact.
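A minimal RMSProp sketch, writing `grad` for the gradient; it shows how a large vertical gradient and a small horizontal one end up taking comparably sized first steps, because each is normalized by the root of its own squared-gradient average:

```python
import numpy as np

def rmsprop_step(w, s, grad, eta=0.01, beta2=0.999, eps=1e-8):
    """One RMSProp update: divide each gradient component by the root
    of its exponentially weighted average of squared gradients."""
    s = beta2 * s + (1.0 - beta2) * grad ** 2
    w = w - eta * grad / (np.sqrt(s) + eps)
    return w, s

w = np.array([0.0, 0.0])          # [w1 (horizontal), w2 (vertical)]
s = np.array([0.0, 0.0])
grad = np.array([0.1, 10.0])      # slope is 100x steeper in the w2 direction
w, s = rmsprop_step(w, s, grad)
# the 100x difference in gradient scale is normalized away:
# both first steps have nearly the same magnitude
```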

Adam, or Adaptive Moment Estimation, is another very popular optimization algorithm for many types of neural networks. It essentially combines momentum and RMSProp. Below is the Adam optimization algorithm.

$$\begin{aligned} V_{\Delta w_1} &= 0, \; S_{\Delta w_1} = 0, \; V_{\Delta w_2} = 0, \; S_{\Delta w_2} = 0\\ \text{on iteration } t&:\\ &V_{\Delta w_1} = \beta_1 V_{\Delta w_1} + (1 - \beta_1) \Delta w_1 \quad V_{\Delta w_2} = \beta_1 V_{\Delta w_2} + (1 - \beta_1) \Delta w_2\\ &S_{\Delta w_1} = \beta_2 S_{\Delta w_1} + (1 - \beta_2) {\Delta w_1}^2 \quad S_{\Delta w_2} = \beta_2 S_{\Delta w_2} + (1 - \beta_2) {\Delta w_2}^2\\ &V^{\prime}_{\Delta w_1} = \frac{V_{\Delta w_1}}{1 - \beta_1^t} \quad V^{\prime}_{\Delta w_2} = \frac{V_{\Delta w_2}}{1 - \beta_1^t} \\ &S^{\prime}_{\Delta w_1} = \frac{S_{\Delta w_1}}{1 - \beta_2^t} \quad S^{\prime}_{\Delta w_2} = \frac{S_{\Delta w_2}}{1 - \beta_2^t} \\ &w_1 = w_1 - \eta \frac{V^{\prime}_{\Delta w_1}}{\sqrt{S^{\prime}_{\Delta w_1}} + \epsilon} \\ &w_2 = w_2 - \eta \frac{V^{\prime}_{\Delta w_2}}{\sqrt{S^{\prime}_{\Delta w_2}} + \epsilon}\end{aligned}$$
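A minimal sketch of the Adam update above, with `grad` standing for the current gradient \(\Delta w\); here it is applied to the toy cost \(E(w) = w^2\):

```python
import numpy as np

def adam_step(w, v, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum average v, RMSProp average s,
    both bias-corrected, following the algorithm above."""
    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2
    v_hat = v / (1 - beta1 ** t)        # bias-corrected first moment
    s_hat = s / (1 - beta2 ** t)        # bias-corrected second moment
    w = w - eta * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

w, v, s = 1.0, 0.0, 0.0
for t in range(1, 201):                 # t is 1-indexed for bias correction
    grad = 2.0 * w                      # gradient of E(w) = w^2
    w, v, s = adam_step(w, v, s, grad, t)
# w moves steadily toward the minimum of w^2 at 0
```

Note the bias correction: without dividing by \(1 - \beta^t\), the moving averages start near zero and the first steps would be far too small.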

In this post we have discussed various optimization algorithms used in deep learning. These algorithms involve several hyperparameters. Even though there is no straightforward rule for choosing them, some common values for \(\beta_1\), \(\beta_2\) and \(\epsilon\) are:

- Learning Rate, \(\eta\) – Needs to be tuned
- \(\beta_1\) – 0.9
- \(\beta_2\) – 0.999
- \(\epsilon\) – \(10^{-8}\)

Animal brains, even small ones like that of a pigeon, were more capable than digital computers with huge processing power and storage space. This puzzled scientists for many years and turned attention to the architectural differences. Traditional computers process data sequentially and exactly, with no fuzziness. Animal brains, on the other hand, although apparently running at much slower rhythms, seem to process signals in parallel, and fuzziness is a feature of their computation.

The basic unit of a biological brain is the **neuron**. Neurons come in various forms, but their job is to transmit electrical signals from one end to the other, from the dendrites along the axon to the terminals. These signals are then passed from one neuron to another. This is how our body senses light, touch, pressure, heat and so on. Signals from specialized sensory neurons are transmitted along our nervous system to our brain, which itself is mostly made of neurons too. Now, the question is: why are biological brains so capable, even though they are much slower and consist of relatively few computing elements compared to modern computers?

Let's look at how a biological neuron works. It takes an electrical input and pops out another electrical signal. But can we represent neurons as linear functions? The answer is no! A biological neuron doesn't produce an output that is a linear function of the form $$Z = W.X + b$$ Neurons don't react readily; instead, they suppress the input until it has grown so large that it triggers an output. Here comes the idea of **activation functions**.

A function that takes the input signal and generates an output signal, but takes into account some kind of **threshold** is called an activation function. There are many such activation functions.

Here we can see that for the **step function** the output is zero for low input values, but once the input reaches the threshold, the output jumps up. We can improve on the step function in many ways. The S-shaped function shown above, called the **sigmoid** or **logistic** function, is another very popular activation function, whose equation is $$ \sigma{(z)} = \frac{1}{1+e^{-z}}$$ Another widely used activation function is **ReLU**, the *Rectified Linear Unit*, whose equation is $$R(z) = \max(0, z)$$ Here is a brief table of different types of activation functions.
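The three activation functions discussed above can be written in a few lines; a minimal NumPy sketch:

```python
import numpy as np

def step(z):
    """Step function: fires 1 once the input crosses the threshold at 0."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Sigmoid / logistic function: smooth S-shaped squashing to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU: passes positive inputs through, suppresses negative ones."""
    return np.maximum(0.0, z)
```

Unlike the step function, sigmoid and ReLU have useful gradients almost everywhere, which matters once we train with gradient descent.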

The basic computational unit of a neural network is also called a neuron. It receives input from some other nodes, or from an external source, and generates an output. Each input has an associated weight \(w\), which is assigned on the basis of its relative importance to the other inputs. The node applies an activation function, e.g. the sigmoid, to the weighted sum of its inputs. If the combined signal is not large enough, the effect of the sigmoid threshold function is to suppress the output signal; otherwise, the node fires.

In a biological neural network, electrical signals are collected by dendrites and combine to form a stronger signal. If the signal is strong and passes the threshold, the neuron fires a signal down the axon towards the terminals, to pass on to the next neuron's dendrites. The important thing to notice is that each neuron takes input from many neurons before it and also provides signals to many more. One way to replicate this in an artificial model is to have layers of neurons, with each connected to every one in the preceding and subsequent layers. The following diagram illustrates this idea:

We can see a neural network with three layers, each with several artificial neurons or nodes. Each node is connected to every node in the preceding and following layers. This is how we take the idea from the biological brain and apply it to build a neural architecture for computers. But how does this architecture actually learn? The most obvious approach is to adjust the strength of the connections between nodes. Within a node we could instead adjust the summation of the inputs, or the shape of the sigmoid threshold function, but that is more complicated than simply adjusting the strength of the connections. The diagram on the right shows the connected nodes, but this time with a weight associated with each connection. A low weight will de-emphasize a signal, and a high weight will amplify it.

Next, we will see the idea of calculating signals in a neural network from the inputs through the different layers to become the output. The idea is called **forward propagation** part of a neural network.

Suppose, we have a Boolean function represented by \(F(x,y,z) = xy + \bar{z}\). The values of this function are given below which we will use to demonstrate the calculations of the neural network.

| \(x\) | \(y\) | \(z\) | \(xy\) | \(\bar{z}\) | \(F(x,y,z) = xy+\bar{z}\) |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 |
| 0 | 1 | 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 1 |
| 0 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 1 |
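As a quick check, the truth table above can be generated programmatically; a short Python sketch:

```python
from itertools import product

def F(x, y, z):
    """Boolean function F(x, y, z) = xy + NOT z."""
    return (x and y) or (not z)

# Enumerate inputs in the same order as the table above (1s first):
table = [(x, y, z, int(F(x, y, z))) for x, y, z in product([1, 0], repeat=3)]
```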

Let's use the third row, \((1, 0, 1) \Rightarrow 0\), to demonstrate forward propagation.

Here we have a neural network with three layers. The first layer is called the input layer and the last is called the output layer; the layers in the middle are called hidden layers. We use one hidden layer for simplicity. The input and hidden layers contain three nodes each, and the output layer contains a single node. We now assign weights to the connections between the input and hidden layers. The weights are chosen randomly between \(0\) and \(1\), since this is the first time we're forward propagating.

For a single neuron or node, we take all the inputs, multiply each by its associated weight, and sum them. Then the node applies an activation function, e.g. the sigmoid, to the weighted sum of its inputs to introduce non-linearity. $$\begin{aligned}z &= \sum w_i x_i \\ \sigma(z) &= \frac{1}{1 + e^{-z}}\end{aligned}$$ For several layers, this process repeats at every node. So let's focus on node \(1\) of the hidden layer. All the nodes in the input layer are connected to it. Those input nodes have raw values of \(1\), \(0\) and \(1\), with associated weights of \(0.9\), \(0.8\) and \(0.1\) respectively. We sum the products of the inputs with their corresponding weights to arrive at the first value of the hidden layer, and do the same for the other hidden-layer values.

\(z_1 = w_1 * x_1 + w_4 * x_2 + w_7 * x_3 = 1 * 0.9 + 0 * 0.8 + 1 * 0.1 = 1\)

\(z_2 = w_2 * x_1 + w_5 * x_2 + w_8 * x_3 = 1 * 0.3 + 0 * 0.5 + 1 * 0.6 = 0.9\)

\(z_3 = w_3 * x_1 + w_6 * x_2 + w_9 * x_3 = 1 * 0.2 + 0 * 0.4 + 1 * 0.7 = 0.9\)

We draw these sums smaller in the circles because they are not the final values. We can now calculate each node's final output using the activation function \(\sigma(z) = \frac{1}{1 + e^{-z}}\). Applying \(\sigma(z)\) to the three hidden-layer weighted sums, we get: $$ \begin{aligned}h_1 &= \sigma(1.0) = 0.731058578630 \\ h_2 &= \sigma(0.9) = 0.710949502625 \\ h_3 &= \sigma(0.9) = 0.710949502625 \end{aligned}$$ We add these to our neural network as the hidden-layer results. Then we calculate the weighted sum of the hidden-layer results with the second set of weights (also chosen at random) to determine the output sum.

0.73 * 0.3 + 0.71 * 0.5 + 0.71 * 0.9 = 1.213

Finally, we apply the sigmoid activation function to get the final output: $$ \sigma(1.213) = 0.7708293339958$$ Because we used a random set of initial weights, the value of the output neuron is off the mark, in this case by \(+0.77\) (since the target is \(0\)).
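The numbers above can be reproduced in a few lines of NumPy, using the same input \((1, 0, 1)\) and the example's randomly chosen weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0, 1.0])          # the input row (1, 0, 1)

# Input -> hidden weights; column j holds the weights feeding hidden node j,
# so column 0 is (w1, w4, w7) = (0.9, 0.8, 0.1) as in the worked example.
W1 = np.array([[0.9, 0.3, 0.2],
               [0.8, 0.5, 0.4],
               [0.1, 0.6, 0.7]])
w2 = np.array([0.3, 0.5, 0.9])         # hidden -> output weights

z_hidden = x @ W1                       # weighted sums: [1.0, 0.9, 0.9]
h = sigmoid(z_hidden)                   # ~[0.731, 0.711, 0.711]
output = sigmoid(h @ w2)                # ~0.77, matching the text
```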

We can see that we are not even close to our target value. That's because we initialized the weights randomly and still have to calibrate them. The process we will use to calibrate the weights is called **backpropagation**, which we will cover next. But before diving into backpropagation, we need to introduce the matrix formulation of forward propagation.

A matrix is nothing but a table or a grid of numbers. For example,

\[\begin{bmatrix} w_{1,1}& w_{1,2} & w_{1,3}\\ w_{2,1}&w_{2,2}& w_{2,3} \\ w_{3,1}&w_{3,2}&w_{3,3}\\ \end{bmatrix}\] Here, matrix values are the weights of the neural network and we can represent the inputs of the network by another matrix \[\begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix}\] When we multiply these two matrices we get

\[\begin{aligned}X &= \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3}\\ w_{2,1} & w_{2,2} & w_{2,3} \\ w_{3,1} & w_{3,2} & w_{3,3} \end{bmatrix}^T . \begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix} \\ &= \begin{bmatrix} w_{1,1} & w_{2,1} & w_{3,1}\\ w_{1,2} & w_{2,2} & w_{3,2} \\ w_{1,3} & w_{2,3} & w_{3,3} \end{bmatrix} . \begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix} \\ &= \begin{bmatrix} w_{1} & w_{4} & w_{7}\\ w_{2} & w_{5} & w_{8} \\ w_{3} & w_{6} & w_{9} \end{bmatrix} . \begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix} \\ &= \begin{bmatrix}(w_{1} * input_1) + (w_{4} * input_2) + (w_{7} * input_3) \\ (w_{2} * input_1) + (w_{5} * input_2) + (w_{8} * input_3)\\ (w_{3} * input_1) + (w_{6} * input_2) + (w_{9} * input_3) \end{bmatrix}\end{aligned}\] This is the result we found above as the weighted sum of the input and the hidden layer. So we can calculate the hidden-layer output: $$\begin{aligned} H &= \sigma(W^T .x) \\ &= \begin{bmatrix}h_1 \\ h_2 \\ h_3 \end{bmatrix} \end{aligned}$$ where \(W\) is the weight matrix and \(x\) is the input vector. This is much easier and faster to compute, since it doesn't require calculating every node individually. This technique is called **vectorization**.

So the general equations will be $$ \begin{aligned} z_1^{[1]} &= w_1^{[1]T} x + b_1^{[1]}, & a_1^{[1]} &= \sigma(z_1^{[1]}) \\ z_2^{[1]} &= w_2^{[1]T} x + b_2^{[1]}, & a_2^{[1]} &= \sigma(z_2^{[1]}) \\ z_3^{[1]} &= w_3^{[1]T} x + b_3^{[1]}, & a_3^{[1]} &= \sigma(z_3^{[1]}) \end{aligned}$$ where \(z_i^{[l]}\) is the weighted sum of a single node, \(l\) denotes the layer number and \(i\) the node number within the layer.

$$ \begin{aligned}z^{[1]} &= \begin{bmatrix} w_1^{[1]T} \\ w_2^{[1]T} \\ w_3^{[1]T} \end{bmatrix} . \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \end{bmatrix} \\ &= \begin{bmatrix} w_1^{[1]T}.x + b_1^{[1]} \\ w_2^{[1]T}.x + b_2^{[1]} \\ w_3^{[1]T}.x + b_3^{[1]} \end{bmatrix} = \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \end{bmatrix} \\ z^{[1]} &= W^{[1]}.x + b^{[1]} \\ a^{[1]} &= \sigma(z^{[1]}) \\ z^{[2]} &= W^{[2]}.a^{[1]} + b^{[2]} \\ a^{[2]} &= \sigma(z^{[2]}) \end{aligned}$$

Now, these equations are for a single training example! But in general we will have many such examples, like the rows of the Boolean function shown above. Let $$ X = \begin{bmatrix} x^{(1)} & x^{(2)} & \ldots & x^{(m)}\end{bmatrix}$$ where \(x^{(1)}, x^{(2)}, \ldots, x^{(m)}\) are the \(m\) training examples. Then $$\begin{aligned} Z^{[1]} &= W^{[1]}.X + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[2]} &= W^{[2]}.A^{[1]} + b^{[2]} \\ A^{[2]} &= \sigma(Z^{[2]}) \\ Z^{[1]} &= \begin{bmatrix} z^{[1](1)} & z^{[1](2)} & \ldots & z^{[1](m)}\end{bmatrix}\\ A^{[1]} &= \begin{bmatrix} a^{[1](1)} & a^{[1](2)} & \ldots & a^{[1](m)}\end{bmatrix} \end{aligned}$$ where \(a^{[l](m)}\) denotes the output of layer \(l\) for example \(m\).
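A minimal NumPy sketch of this vectorized forward propagation; the layer sizes and random data here are illustrative:

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward(X, W1, b1, W2, b2):
    """Vectorized forward propagation over all m examples at once.

    X has shape (features, m); each column is one training example,
    matching Z = W.X + b above. Biases broadcast across columns.
    """
    Z1 = W1 @ X + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    return A1, A2

rng = np.random.default_rng(0)
X = rng.random((3, 8))                              # m = 8 examples, 3 features
W1, b1 = rng.random((4, 3)), rng.random((4, 1))     # 4 hidden nodes
W2, b2 = rng.random((1, 4)), rng.random((1, 1))     # 1 output node
A1, A2 = forward(X, W1, b1, W2, b2)                 # A2 has one column per example
```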

To improve our model, we first have to quantify how wrong our model's predictions are compared to the target values. Then we adjust the weights so that the error decreases. Similar to forward propagation, backpropagation calculations occur at each layer, but, as the name indicates, backwards. We begin by changing the weights between the hidden layer and the output layer.

To quantify how wrong our model is, we first calculate the error between the predicted values and the target values using a cost function. A cost function could be the sum of the differences between target and output values: $$ E_{total} = \sum (target - output)$$ Suppose we have target values \(2, 3, 5, 9\) and output values \(1, 5, 3, 6\) respectively.

Then the **total error** becomes $$ E_{total} = 1 - 2 + 2 + 3 = 4$$ We can see that the second and third terms cancel each other, so we are not getting the actual error. To measure the error more faithfully, we need a different cost function. What about the sum of the absolute values of the errors?

$$ E_{total} = \sum |target – output|$$

Then the total error becomes $$ E_{total} = 1 + 2 + 2 + 3 = 8$$ and nothing cancels. The reason this isn't popular is that the slope isn't continuous near the minimum, which makes gradient descent work poorly: we can bounce around the V-shaped valley of this error function. The slope doesn't get smaller closer to the minimum, so our steps don't get smaller, which means they risk overshooting. A better option is the sum of the squares of the errors. So we calculate the error for each output neuron using the squared error function and sum them to get the total error: $$ E_{total} = \sum \frac{1}{2}(target - output)^2$$

For example, the target output for our network is \(0\) but the neural network's output is \(0.77\); therefore its error is: $$E_{total} = \frac{1}{2}(0 - 0.77)^2 = 0.29645$$ **Cross entropy** is another very popular cost function, whose equation is: $$ E_{total} = - \sum target * \log(output)$$
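The three candidate cost functions can be compared directly on the example targets and outputs above:

```python
import numpy as np

targets = np.array([2.0, 3.0, 5.0, 9.0])
outputs = np.array([1.0, 5.0, 3.0, 6.0])

# Signed errors cancel: 1 - 2 + 2 + 3 = 4 (underestimates the real error)
signed_error = np.sum(targets - outputs)

# Absolute errors don't cancel: 1 + 2 + 2 + 3 = 8
absolute_error = np.sum(np.abs(targets - outputs))

# Squared errors don't cancel and give a smooth gradient:
# 0.5 * (1 + 4 + 4 + 9) = 9
squared_error = np.sum(0.5 * (targets - outputs) ** 2)
```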

With backpropagation, our goal is to update each of the weights in the network so that they bring the actual output closer to the target output, thereby minimizing the error for each output neuron and for the network as a whole. As with forward propagation, we will derive the equations for a single neuron and then extend the concept to the rest of the network.

Consider a network with inputs \(x_1, x_2\) and \(x_3\) and associated weights \(w_1, w_2\) and \(w_3\) respectively. We know from the forward propagation part that $$\begin{aligned}\sigma(z) &= \frac{1}{1 + e^{-z}} \\ \text{where } z &= \sum w_ix_i = w_1x_1 + w_2x_2 + w_3x_3\end{aligned}$$ and we call \(\sigma(z)\) the output of the network. We have a target value for the network, and for an untrained network with uncalibrated weights there will be an error, which we denote \(E_{total}\): $$ E_{total} = \frac{1}{2}(target - output)^2$$ Our job is to find out how to adjust the weights to decrease this error. Consider \(w_1\). We want to know how much a change in \(w_1\) affects the total error, i.e. \(\frac{\partial E_{total}}{\partial w_1}\). Looking closely, the error is affected by the output, the output is affected by \(z\), and \(z\) is affected by the weight \(w_1\). By the chain rule, $$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial output} * \frac{\partial output}{\partial z} * \frac{\partial z}{\partial w_1} $$ We break down each piece of the equation.

$$\begin{aligned} \frac{\partial E_{total}}{\partial output} &= 2 \cdot \frac{1}{2} \cdot (target - output)^{2-1} \cdot (-1)\\ &= (output - target)\end{aligned}$$ Next, we find how much the output changes with respect to its total net input: $$\begin{aligned}output &= \sigma(z) = \frac{1}{1 + e^{-z}} \\ \frac{\partial(output)}{\partial z} &= \frac{e^{-z}}{{(1+e^{-z})}^2} \\ &= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} \\ &= \sigma(z)(1 - \sigma(z))\end{aligned}$$ Finally, we determine how much the total net input \(z\) changes with respect to \(w_1\): $$\begin{aligned}z &= w_1x_1 + w_2x_2 + w_3x_3 \\ \frac{\partial z}{\partial w_1} &= x_1 \end{aligned}$$ Putting all the pieces together: $$\frac{\partial E_{total}}{\partial w_1} = (output - target) * \sigma(z)(1 - \sigma(z)) * x_1$$ To adjust the weights we then use $$\begin{aligned}w_1 &= w_1 - \eta * dw_1, \text{ where } \eta \text{ is the learning rate} \\ dw_1 &= \frac{\partial E_{total}}{\partial w_1} \end{aligned}$$
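The chain-rule gradient above can be verified numerically. A small sketch (the weights, inputs and target are made up for illustration) compares the analytic gradient with a central finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dE_dw(w, x, target):
    """Gradient of E = 0.5 * (target - output)^2 w.r.t. each weight,
    assembled from the three chain-rule pieces derived above."""
    z = np.dot(w, x)
    output = sigmoid(z)
    # (dE/doutput) * (doutput/dz) * (dz/dw_i)
    return (output - target) * output * (1.0 - output) * x

w = np.array([0.3, -0.1, 0.5])
x = np.array([1.0, 2.0, -0.5])
target = 1.0

# Finite-difference check on the first weight:
E = lambda w: 0.5 * (target - sigmoid(np.dot(w, x))) ** 2
eps = 1e-6
w_plus = w.copy();  w_plus[0] += eps
w_minus = w.copy(); w_minus[0] -= eps
numeric = (E(w_plus) - E(w_minus)) / (2 * eps)
# numeric agrees with dE_dw(w, x, target)[0] to high precision
```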

This rule for updating weights is called **gradient descent**. Now let's do a worked example, updating the weight \(w_{10}\) in figure 1. Here, $$\begin{aligned} dw_{10} &= \frac{\partial E_{total}}{\partial w_{10}} \\ &= (output - target) * \sigma(z)(1 - \sigma(z)) * h_1 \\ &= (0.77 - 0) * \sigma(1.213)*(1 - \sigma(1.213)) * 0.73 \\ &\approx 0.1 \\ w_{10} &= w_{10} - \eta * dw_{10} \\ &= 0.3 - (0.5 * 0.1)\\ &= 0.25\end{aligned}$$ Similarly we can update the other weights, though this is generally a long process.

We have done all the hard work so far so that we can make predictions on new data with our neural network. The dataset we work on is generally split into two parts: the training data, where we do all the training, and the test data, where we evaluate the network. The training equations give us a calibrated set of weights, which we then use to predict results for new data with $$ Prediction = W^T.X_{test} + b$$ where \(W\) and \(b\) are the calibrated weights and bias and \(X_{test}\) is the test set split from our dataset.

Now that we have finished the theoretical part of the tutorial, you can look at the code and try to understand its different blocks.

The dataset, also known as Fisher's Iris dataset, contains 150 records under five attributes: petal length, petal width, sepal length, sepal width, and species.

First, we have to prepare the dataset so that it provides the necessary information in a machine-readable way. Let's first import and examine the dataset.

```
import pandas as pd
import numpy as np
# Loading the dataset
dataset = pd.read_csv('Iris_Dataset.csv')
# Print top Five rows
dataset.head()
```

Here we can see the sepal length, sepal width, petal length and petal width for each flower.

Now we have to convert the flower species into numbers so that we can pass them into our neural network. The technique we are going to use is called one-hot encoding: we build an array where each row represents an example and each column represents a species. When an example belongs to a particular species, the corresponding column is \(1\) and the others are \(0\).

```
# One Hot Encoding
dataset = pd.get_dummies(dataset, columns=['Species'])
values = list(dataset.columns.values)
```

Now, we have to define input and output data for the Neural network and shuffle them so that there’s no bias.

```
# output data
y = dataset[values[-3:]]
y = np.array(y, dtype='float32')
# input data
X = dataset[values[1:-3]]
X = np.array(X, dtype='float32')
# Shuffle Data
indices = np.random.choice(len(X), len(X), replace=False)
X_values = X[indices]
y_values = y[indices]
```

Here we will use the simplest neural network architecture, with only one hidden layer consisting of 8 hidden nodes.

We are now ready to split the data into a training and a test set, with a test-set size of 10. We will use 500 epochs for training.

```
import tensorflow as tf
# Creating a Train and a Test Dataset
test_size = 10
X_test = X_values[-test_size:]
X_train = X_values[:-test_size]
y_test = y_values[-test_size:]
y_train = y_values[:-test_size]
# Session
sess = tf.Session()
# Interval / Epochs
interval = 50
epoch = 500
hidden_layer_nodes = 8
```

We have to create variables for the parameters of the neural network, being careful about their shapes. We also have to create placeholders, which will be fed with data later.

```
# Create variables for Neural Network layers
w1 = tf.Variable(tf.random_normal(shape=[4,hidden_layer_nodes])) # Inputs -> Hidden Layer
b1 = tf.Variable(tf.random_normal(shape=[hidden_layer_nodes])) # First Bias
w2 = tf.Variable(tf.random_normal(shape=[hidden_layer_nodes,3])) # Hidden layer -> Outputs
b2 = tf.Variable(tf.random_normal(shape=[3])) # Second Bias
# Initialize placeholders
X_data = tf.placeholder(shape=[None, 4], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 3], dtype=tf.float32)
```

We now define the operations for the hidden-layer output and the final output. Finally, we define our cost function and the optimizer for backpropagation.

```
# Operations
hidden_output = tf.nn.relu(tf.add(tf.matmul(X_data, w1), b1))
final_output = tf.nn.softmax(tf.add(tf.matmul(hidden_output, w2), b2))
# Cost Function
loss = tf.reduce_mean(-tf.reduce_sum(y_target * tf.log(final_output), axis=1))
# Optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
```

To train the model we just have to run the session for the number of epochs in a loop.

```
# Training
sess.run(tf.global_variables_initializer())  # initialize variables before training
ls = {}  # loss recorded every `interval` epochs, for plotting
print('Training the model...')
for i in range(1, epoch + 1):
    sess.run(optimizer, feed_dict={X_data: X_train, y_target: y_train})
    if i % interval == 0:
        ls[i] = sess.run(loss, feed_dict={X_data: X_train, y_target: y_train})
        print('Epoch', i, '|', 'Loss:', ls[i])
```

```
import matplotlib.pyplot as plt
# ls holds the losses recorded during training (epoch -> loss)
iterations = list(ls.keys())
costs = list(ls.values())
plt.plot(iterations, costs)
plt.ylabel('cost')
plt.xlabel('epoch')
plt.title('Learning rate = ' + str(.001))
plt.show()
```

After training the model, to make predictions we just have to use the test set.

```
# Prediction
for i in range(len(X_test)):
    prediction = sess.run(final_output, feed_dict={X_data: [X_test[i]]})
    print('Actual:', y_test[i], 'Predicted:', np.rint(prediction))
```

The complete code can be found here.
