Neural networks have become a crucial part of modern technology. They have influenced our daily lives in ways we never imagined. From e-commerce and classification problems to autonomous driving, they are used almost everywhere. In this tutorial, we are going to discuss the important aspects of neural networks in the simplest way possible, and at the end of the tutorial we provide code implemented in Python for all the parts described.

## Motivation

Animal brains, even small ones like the brain of a pigeon, were more capable than digital computers with huge processing power and storage space. This puzzled scientists for many years and turned attention to the architectural differences. Traditional computers processed data very much sequentially, and there is no fuzziness or ambiguity in their calculations. Animal brains, on the other hand, although apparently running at much slower rhythms, seemed to process signals in parallel, and fuzziness was a feature of their computation.

The basic unit of a biological brain is the neuron. Neurons come in various forms, but their job is the same: to transmit electrical signals from one end to the other, from the dendrites along the axon to the terminals. These signals are then passed from one neuron to another. This is how our body senses light, touch, pressure, heat and so on. Signals from specialized sensory neurons are transmitted along our nervous system to our brain, which itself is mostly made of neurons too. Now, the question is: why are biological brains so capable even though they are much slower and consist of relatively few computing elements compared to modern computers?

Let’s look at how a biological neuron works. It takes an electrical input and pops out another electrical signal. But can we represent neurons as linear functions? The answer is no! A biological neuron doesn’t produce an output that is a linear function of the form $$Z = W.X + b$$ Neurons don’t react readily; instead, they suppress the input until it has grown so large that it triggers an output. This is where the idea of activation functions comes in.

## Activation Function

A function that takes the input signal and generates an output signal, but takes some kind of threshold into account, is called an activation function. There are many such activation functions.

Here, we can see that for the step function the output is zero for low input values, but once the input reaches the threshold, the output jumps up. We can improve on the step function in many ways. The S-shaped function shown above, called the sigmoid or logistic function, is another very popular activation function. Its equation is $$\sigma{(z)} = \frac{1}{1+e^{-z}}$$ Another very important and widely used activation function is ReLU, the Rectified Linear Unit, whose equation is $$R(z) = \max(0, z)$$ Here is a brief table of different types of activation functions.
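
The three activation functions named above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the step threshold of 0 and the choice to fire at exactly the threshold are assumptions, not something the text fixes.

```python
import math

def step(z, threshold=0.0):
    """Step function: 0 below the threshold, 1 at or above it (threshold is an assumption)."""
    return 1.0 if z >= threshold else 0.0

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified Linear Unit: max(0, z)."""
    return max(0.0, z)

# Compare the three functions on a few sample inputs.
for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  step={step(z):.0f}  sigmoid={sigmoid(z):.4f}  relu={relu(z):.1f}")
```

Note how the sigmoid behaves like a smoothed step: it stays close to 0 for large negative inputs and close to 1 for large positive inputs.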

## Neurons

The basic computational unit of a neural network is also called a neuron, or node. It receives input from other nodes or from an external source and generates an output. Each input has an associated weight $$w$$, which is assigned on the basis of its importance relative to the other inputs. The node applies an activation function, e.g. sigmoid, to the weighted sum of its inputs. If the combined signal is not large enough, the effect of the sigmoid threshold function is to suppress the output signal; otherwise the node fires.

## Neural Network

In a biological neural network, electrical signals are collected by dendrites and these combine to form a stronger signal. If the signal is strong enough to pass the threshold, the neuron fires a signal down the axon towards the terminals to pass on to the next neuron’s dendrites. The important thing to notice is that each neuron takes input from many neurons before it and also provides signals to many more. One way to replicate this from nature in an artificial model is to have layers of neurons, with each neuron connected to every neuron in the preceding and subsequent layers. The following diagram illustrates this idea: we can see a neural network with three layers, each with several artificial neurons or nodes. Also, each node is connected to every node in the preceding and next layers. This is how we take the idea from the biological brain and apply it to build a neural architecture for computers.

But how does this architecture actually learn? The most obvious thing to adjust is the strength of the connections between nodes. Within a node, we could have adjusted the summation of the inputs, or we could have adjusted the shape of the sigmoid threshold function, but that’s more complicated than simply adjusting the strength of the connections between the nodes. The diagram on the right shows the connected nodes, but this time a weight is shown associated with each connection. A low weight will de-emphasize a signal, and a high weight will amplify it.

Next, we will see how signals are calculated in a neural network, from the inputs through the different layers to the output. This is called the forward propagation part of a neural network.

## Neural Network: Forward Propagation

Suppose we have a Boolean function represented by $$F(x,y,z) = xy + \bar{z}$$. The values of this function are given below, and we will use them to demonstrate the calculations of the neural network.

Let’s use the fourth row $$(1, 0, 1) \Rightarrow 0$$ to demonstrate the forward propagation.
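
As a quick check, the values of $$F$$ can be enumerated with a short Python sketch. The helper name `F` and the row order (plain binary counting) are assumptions made here for illustration, so the rows may appear in a different order than in the table of the tutorial.

```python
from itertools import product

def F(x, y, z):
    """Boolean function F(x, y, z) = xy + NOT z, reading + as logical OR."""
    return int((x and y) or (not z))

# Print every input combination together with the value of F.
for x, y, z in product((0, 1), repeat=3):
    print((x, y, z), "=>", F(x, y, z))
```

The input $$(1, 0, 1)$$ used in the walkthrough evaluates to $$0$$, which is the target value for the forward propagation example.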

Here, there is a neural network with three layers. The first layer is called the input layer and the last layer is called the output layer. The layer in the middle is called the hidden layer; we have used a single hidden layer for simplicity. The input and hidden layers contain three nodes each and the output layer contains a single node. We now assign weights to the synapses between the input and hidden layers. The weights are chosen randomly between $$0$$ and $$1$$, since this is the first time we’re forward propagating.

Now for a single neuron or node, we take all the inputs, multiply each by its associated weight and sum the results. Then the node applies an activation function, e.g. sigmoid, to the weighted sum of its inputs to introduce non-linearity. \begin{aligned}z &= \sum w_i x_i \\ \sigma(z) &= \frac{1}{1 + e^{-z}}\end{aligned} For several layers, this process repeats for every node in those layers. So, let’s focus on node $$1$$ of the hidden layer. All the nodes in the input layer are connected to it. The input nodes have raw values of $$1$$, $$0$$ and $$1$$, with associated weights of $$0.9$$, $$0.8$$ and $$0.1$$ respectively. We sum the products of the inputs and their corresponding weights to arrive at the first value for the hidden layer, and do the same to get the other values of the hidden layer.

$$z_1 = w_1 * x_1 + w_4 * x_2 + w_7 * x_3 = 1 * 0.9 + 0 * 0.8 + 1 * 0.1 = 1$$
$$z_2 = w_2 * x_1 + w_5 * x_2 + w_8 * x_3 = 1 * 0.3 + 0 * 0.5 + 1 * 0.6 = 0.9$$
$$z_3 = w_3 * x_1 + w_6 * x_2 + w_9 * x_3 = 1 * 0.2 + 0 * 0.4 + 1 * 0.7 = 0.9$$
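
The three weighted sums above can be reproduced with a minimal Python sketch. The helper name `weighted_sum` and the list layout (one row of weights per hidden node) are assumptions chosen for illustration.

```python
def weighted_sum(inputs, weights):
    """Weighted sum z = sum(w_i * x_i) for a single node."""
    return sum(w * x for w, x in zip(weights, inputs))

x = [1, 0, 1]  # the input row (1, 0, 1)
hidden_weights = [
    [0.9, 0.8, 0.1],  # w1, w4, w7: weights into hidden node 1
    [0.3, 0.5, 0.6],  # w2, w5, w8: weights into hidden node 2
    [0.2, 0.4, 0.7],  # w3, w6, w9: weights into hidden node 3
]

z = [weighted_sum(x, w) for w in hidden_weights]
print([round(value, 2) for value in z])
```

Rounded to two decimals, the sums agree with $$z_1$$, $$z_2$$ and $$z_3$$ computed by hand.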

We put these sums smaller in the circle because they’re not the final values. We can now finally calculate each node’s final output value using the activation function $$\sigma(z) = \frac{1}{1 + e^{-z}}$$. Applying $$\sigma(z)$$ to the three hidden layer weighted sums, we get: \begin{aligned}h_1 &= \sigma(1.0) &= 0.731058578630 \\ h_2 &= \sigma(0.9) &= 0.710949502625 \\ h_3 &= \sigma(0.9) &= 0.710949502625 \end{aligned} We add these to our neural network as the hidden layer results. Then we calculate the weighted sum of the hidden layer results with the second set of weights (also chosen at random) to determine the output sum.

$$0.73 * 0.3 + 0.71 * 0.5 + 0.71 * 0.9 = 1.213$$

Finally, we apply the sigmoid activation function to get the final output result: $$\sigma(1.213) = 0.7708293339958$$ Because we used a random set of initial weights, the value of the output neuron is off the mark; in this case by +0.77 (since the target is 0).
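
The whole forward pass for the input $$(1, 0, 1)$$ can be sketched as follows. The helper name `layer` and the nested-list weight layout are assumptions for illustration; the weight values are the ones used in the walkthrough. Because this sketch keeps the hidden values at full precision instead of rounding them to two decimals, its output is about $$0.771$$ rather than exactly $$0.7708$$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights):
    """One fully connected sigmoid layer; `weights` holds one row per node."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

x = [1, 0, 1]
hidden_weights = [[0.9, 0.8, 0.1], [0.3, 0.5, 0.6], [0.2, 0.4, 0.7]]
output_weights = [[0.3, 0.5, 0.9]]  # weights from h1, h2, h3 to the output node

hidden = layer(x, hidden_weights)
output = layer(hidden, output_weights)[0]

print("hidden:", [round(h, 4) for h in hidden])
print("output:", round(output, 4))
```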

We can see that we are not even close to our target value. That’s because we initialized the weights randomly and still have to calibrate them. The process we will use to calibrate the weights is called backpropagation, which we will cover next. But before diving into backpropagation, we need to see how the forward propagation process can be computed using matrices.

## Matrix vs Neural Network

A matrix is nothing but a table or a grid of numbers. For example,

$\begin{bmatrix} w_{1,1}& w_{1,2} & w_{1,3}\\ w_{2,1}&w_{2,2}& w_{2,3} \\ w_{3,1}&w_{3,2}&w_{3,3}\\ \end{bmatrix}$ Here, matrix values are the weights of the neural network and we can represent the inputs of the network by another matrix $\begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix}$ When we multiply these two matrices we get
\begin{aligned}Z &= \begin{bmatrix} w_{1,1}& w_{1,2} & w_{1,3}\\ w_{2,1}&w_{2,2}& w_{2,3} \\ w_{3,1}&w_{3,2}&w_{3,3}\\ \end{bmatrix}^T . \begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix} \\ &= \begin{bmatrix} w_{1,1}& w_{2,1} & w_{3,1}\\ w_{1,2}&w_{2,2}& w_{3,2} \\ w_{1,3}&w_{2,3}&w_{3,3} \end{bmatrix} . \begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix} \\&= \begin{bmatrix} w_{1}& w_{4} & w_{7}\\ w_{2}&w_{5}& w_{8} \\ w_{3}&w_{6}&w_{9}\\ \end{bmatrix} . \begin{bmatrix}input_1 \\ input_2 \\ input_3 \end{bmatrix} \\ &= \begin{bmatrix}(w_{1} * input_1) + (w_{4} * input_2 ) + (w_{7} * input_3) \\ (w_{2} * input_1) + (w_{5} * input_2 ) + (w_{8} * input_3)\\ (w_{3} * input_1) + (w_{6} * input_2 ) + (w_{9} * input_3) \end{bmatrix}\end{aligned} These are the same weighted sums $$z_1$$, $$z_2$$ and $$z_3$$ that we found earlier for the hidden layer. So, we can calculate the hidden layer output: \begin{aligned} H &= \sigma(W^T .x) \\ &= \begin{bmatrix}h_1 \\ h_2 \\ h_3 \end{bmatrix} \end{aligned} where $$W$$ is the weight matrix and $$x$$ is the input matrix. This is much easier and faster to calculate, since it doesn’t require calculating every node individually. This technique is called vectorization.
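
The matrix form of the hidden layer can be sketched with NumPy, which is an assumption here since the text does not name a library. The array `W` is laid out so that `W[i, j]` is the weight from input $$i$$ to hidden node $$j$$, which makes `W.T` the matrix with rows $$(w_1, w_4, w_7)$$, $$(w_2, w_5, w_8)$$ and $$(w_3, w_6, w_9)$$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# W[i, j]: weight from input i to hidden node j (layout is an assumption).
W = np.array([[0.9, 0.3, 0.2],
              [0.8, 0.5, 0.4],
              [0.1, 0.6, 0.7]])
x = np.array([[1.0], [0.0], [1.0]])  # column vector for the input (1, 0, 1)

Z = W.T @ x      # all three weighted sums in one matrix product
H = sigmoid(Z)   # element-wise sigmoid gives h1, h2, h3

print(Z.ravel())
print(H.ravel())
```

A single matrix product replaces the three per-node sums, which is the vectorization described above.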

So, the general equations will be \begin{aligned} z_1^{[1]} = w_1^{[1]T} x + b_1^{[1]}, \quad a_1^{[1]} = \sigma(z_1^{[1]}) \\ z_2^{[1]} = w_2^{[1]T} x + b_2^{[1]}, \quad a_2^{[1]} = \sigma(z_2^{[1]}) \\ z_3^{[1]} = w_3^{[1]T} x + b_3^{[1]}, \quad a_3^{[1]} = \sigma(z_3^{[1]}) \end{aligned} where $$z_i^{[l]}$$ is the weighted sum of a single node, $$l$$ denotes the layer number and $$i$$ denotes the node number within that layer.

\begin{aligned}z^{[1]} &= \begin{bmatrix} w_1^{[1]T} \\ w_2^{[1]T} \\ w_3^{[1]T} \end{bmatrix} . \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \end{bmatrix} \\ &= \begin{bmatrix} w_1^{[1]T}.x + b_1^{[1]} \\ w_2^{[1]T}.x + b_2^{[1]} \\ w_3^{[1]T}.x + b_3^{[1]} \end{bmatrix} = \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \end{bmatrix} \\ z^{[1]} &= W^{[1]}.x + b^{[1]} \\ a^{[1]} &= \sigma(z^{[1]}) \\ z^{[2]} &= W^{[2]}.a^{[1]} + b^{[2]} \\ a^{[2]} &= \sigma(z^{[2]}) \end{aligned}
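
The four equations for $$z^{[1]}$$, $$a^{[1]}$$, $$z^{[2]}$$ and $$a^{[2]}$$ translate almost line by line into NumPy. In this sketch the function name `forward` is hypothetical and the biases are set to zero, which is an assumption made only because the worked example earlier uses no bias terms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a network with one hidden layer, for one example x."""
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return a1, a2

# Each row of W1 is one transposed weight vector w_i^{[1]T}.
W1 = np.array([[0.9, 0.8, 0.1],
               [0.3, 0.5, 0.6],
               [0.2, 0.4, 0.7]])
b1 = np.zeros((3, 1))  # zero biases: an assumption, not part of the worked example
W2 = np.array([[0.3, 0.5, 0.9]])
b2 = np.zeros((1, 1))

x = np.array([[1.0], [0.0], [1.0]])
a1, a2 = forward(x, W1, b1, W2, b2)
print(a1.ravel(), a2.ravel())
```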

Now, these equations are for a single training example! In general, we will have many such examples, like the values of the Boolean function shown above. Let $$X = \begin{bmatrix} x ^{(1)} & x^{(2)} & \dots & x^{(m)}\end{bmatrix}$$ where $$x ^{(1)}, x^{(2)}, \dots , x^{(m)}$$ are the different training examples, so we have $$m$$ training examples in total. Then \begin{aligned} Z^{[1]} &= W^{[1]}.X + b^{[1]} \\ A^{[1]} &= \sigma(Z^{[1]}) \\ Z^{[2]} &= W^{[2]}.A^{[1]} + b^{[2]} \\ A^{[2]} &= \sigma(Z^{[2]}) \\ Z^{[1]} &= \begin{bmatrix} z ^{[1](1)} & z^{[1](2)} & \dots & z^{[1](m)}\end{bmatrix}\\ A^{[1]} &= \begin{bmatrix} a ^{[1](1)} & a^{[1](2)} & \dots & a^{[1](m)}\end{bmatrix} \end{aligned} where $$a^{[l](i)}$$ denotes the output of layer $$l$$ for the $$i$$-th training example.
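
Stacking the examples as columns lets one matrix product handle all of them at once. The sketch below builds $$X$$ from all eight Boolean inputs, so $$m = 8$$; the column order, the zero biases and the use of NumPy broadcasting to add $$b^{[1]}$$ to every column are assumptions made for illustration.

```python
from itertools import product

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each column of X is one example x^(i); shape is (3, m) with m = 8.
X = np.array(list(product((0, 1), repeat=3)), dtype=float).T

W1 = np.array([[0.9, 0.8, 0.1],
               [0.3, 0.5, 0.6],
               [0.2, 0.4, 0.7]])
b1 = np.zeros((3, 1))  # broadcast across all m columns
W2 = np.array([[0.3, 0.5, 0.9]])
b2 = np.zeros((1, 1))

Z1 = W1 @ X + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

print(A1.shape, A2.shape)  # one column per training example
```

With this column order, the example $$(1, 0, 1)$$ is column index 5 of `X`, and the matching entry of `A2` equals the single-example output computed earlier.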

## Neural Network: Back Propagation

To improve our model, we first have to quantify how wrong its predictions are compared to the target values. Then we adjust the weights so that the margin of error decreases. Similar to forward propagation, backpropagation calculations occur at each layer but, as the name indicates, in the backward direction. We begin by changing the weights between the hidden layer and the output layer.

## Cost Function

To quantify how wrong our model is, we first have to calculate the error between the target values and the output values of the model. To do this we use a cost function. A cost function can be the sum of the differences between the target values and the output values, like $$E_{total} = \sum (target - output)$$ Suppose we have target values $$2, 3, 5, 9$$ and output values $$1, 5, 3, 6$$ respectively.

Then the total error becomes $$E_{total} = 1 - 2 + 2 + 3 = 4$$ We can see that the second and third values cancel each other out, so we are not getting the actual error. To make our model more accurate, we have to use a different cost function. What about the sum of the absolute values of the errors?
$$E_{total} = \sum |target - output|$$

Then the total error becomes $$E_{total} = 1 + 2 + 2 + 3 = 8$$ and nothing cancels. The reason this isn’t popular is that the slope isn’t continuous near the minimum, which makes gradient descent work less well, because we can bounce around the V-shaped valley that this error function has. The slope doesn’t get smaller closer to the minimum, so our steps don’t get smaller, which means they risk overshooting. A better option is to use the sum of the squares of the errors. So, we will calculate the error for each output neuron using the squared error function and sum them to get the total error: $$E_{total} = \sum \frac{1}{2}(target - output)^2$$

For example, the target output for our network is $$0$$ but the neural network output is $$0.77$$, therefore its error is: $$E_{total} = \frac{1}{2}(0 - 0.77)^2 = 0.29645$$ Cross entropy is another very popular cost function, whose equation is: $$E_{total} = - \sum target * \log(output)$$
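
The cost functions discussed in this section can be compared on the same numbers with a small Python sketch. The function names are assumptions, and the one-hot values passed to `cross_entropy` at the end are hypothetical, chosen only to show a call.

```python
import math

def signed_error(targets, outputs):
    """Sum of raw differences; positive and negative errors can cancel."""
    return sum(t - o for t, o in zip(targets, outputs))

def absolute_error(targets, outputs):
    return sum(abs(t - o) for t, o in zip(targets, outputs))

def squared_error(targets, outputs):
    return sum(0.5 * (t - o) ** 2 for t, o in zip(targets, outputs))

def cross_entropy(targets, outputs):
    return -sum(t * math.log(o) for t, o in zip(targets, outputs))

targets = [2, 3, 5, 9]
outputs = [1, 5, 3, 6]

print(signed_error(targets, outputs))    # 4
print(absolute_error(targets, outputs))  # 8
print(squared_error(targets, outputs))   # 9.0
print(squared_error([0], [0.77]))        # error of the network output, about 0.29645

# Hypothetical one-hot target and predicted probabilities.
print(cross_entropy([1.0, 0.0], [0.77, 0.23]))
```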

## The Backward Pass

With backpropagation, our goal is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and for the network as a whole. As with forward propagation, we will derive the weight update equations for a single neuron and then extend the concept to the rest of the network.

Consider a network with inputs $$x_1, x_2$$ and $$x_3$$ and associated weights $$w_1, w_2$$ and $$w_3$$ respectively. We know from the forward propagation part that \begin{aligned}\sigma(z) &= \frac{1}{1 + e^{-z}} \\ \text{where} \; z &= \sum w_ix_i = w_1x_1 + w_2x_2 + w_3x_3\end{aligned} which we call the output of the network. We have a target value for the network, and for an untrained network whose weights are not calibrated there will be an error. We denote this error as $$E_{total}$$ where $$E_{total} = \frac{1}{2}(target - output)^2$$ Our job is to find out how we have to adjust the weights to decrease this error. Consider $$w_1$$. We want to know how much a change in $$w_1$$ affects the total error, that is, $$\frac{\partial E_{total}}{\partial w_1}$$. If we look closely, we can see that the error is affected by the output, the output is affected by $$z$$, and $$z$$ is affected by the weight $$w_1$$. By applying the chain rule, we get $$\frac{\partial E_{total}}{\partial w_1} = \frac{\partial E_{total}}{\partial output} * \frac{\partial output}{\partial z} * \frac{\partial z}{\partial w_1}$$ We need to break down each piece of this equation.

\begin{aligned} \frac{\partial E_{total}}{\partial output} &= 2 . \frac{1}{2} . (target - output)^{2-1} . (-1)\\ &= (output - target)\end{aligned} Next, we have to find out how much the output changes with respect to its total net input $$z$$, where \begin{aligned}output &= \sigma(z) = \frac{1}{1 + e^{-z}} \\ \frac{\partial(output)}{\partial z} &= \frac{e^{-z}}{{(1+e^{-z})}^2} \\ &= \frac{1}{1 + e^{-z}} . \frac{e^{-z}}{1 + e^{-z}} \\ &= \sigma(z).(1 - \sigma(z))\end{aligned} Finally, we need to determine how much the total net input $$z$$ changes with respect to $$w_1$$: \begin{aligned}z &= w_1x_1 + w_2x_2 + w_3x_3 \\ \frac{\partial z}{\partial w_1} &= x_1 \end{aligned} Putting all the pieces together: $$\frac{\partial E_{total}}{\partial w_1} = (output - target) * \sigma(z).(1 - \sigma(z)) * x_1$$ To adjust the weights we then use the formula \begin{aligned}w_1 &= w_1 - \eta * dw_1 \;, \text{ where } \eta \text{ is called the learning rate} \\ dw_1 &= \frac{\partial E_{total}}{\partial w_1} \end{aligned}
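
One way to gain confidence in the derived gradient is to compare it with a numerical estimate. The sketch below does this for a single sigmoid neuron using a central finite difference; the function names, the input values and the step size `eps` are all hypothetical choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(weights, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

def error(weights, inputs, target):
    return 0.5 * (target - output(weights, inputs)) ** 2

def gradient(weights, inputs, target):
    """Analytic gradient: (output - target) * sigma(z) * (1 - sigma(z)) * x_i."""
    out = output(weights, inputs)
    delta = (out - target) * out * (1.0 - out)
    return [delta * x for x in inputs]

def numerical_gradient(weights, inputs, target, eps=1e-6):
    """Central finite-difference estimate of dE/dw_i, used only as a check."""
    grads = []
    for i in range(len(weights)):
        up, down = list(weights), list(weights)
        up[i] += eps
        down[i] -= eps
        grads.append((error(up, inputs, target) - error(down, inputs, target)) / (2 * eps))
    return grads

weights = [0.9, 0.8, 0.1]  # illustrative values
inputs = [1.0, 0.5, -1.0]  # illustrative values
target = 0.0

analytic = gradient(weights, inputs, target)
numeric = numerical_gradient(weights, inputs, target)
print(analytic)
print(numeric)
```

The two lists should agree to several decimal places, which supports the chain-rule result above.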

This rule for updating weights is called gradient descent. Now, let’s work through an example of updating the weight $$w_{10}$$ in figure 1. Here, \begin{aligned} dw_{10} &= \frac{\partial E_{total}}{\partial w_{10}} \\ &= (output - target) * \sigma(z).(1 - \sigma(z)) * h_1 \\ &= (0.77 - 0) * \sigma(1.2)*(1 - \sigma(1.2)) * 0.73 \\ &\approx 0.1 \\ w_{10} &= w_{10} - \eta * dw_{10} \\ &= 0.3 - (0.5 * 0.1)\\ &= 0.25\end{aligned} Similarly, we can update the other weights; this is generally a long process.
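
The same update can be checked numerically. This sketch uses the rounded values from the walkthrough ($$h_1 = 0.73$$, output $$0.77$$, learning rate $$0.5$$) together with the output sum $$z = 1.213$$, which the worked example rounds to $$1.2$$; the variable names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

h1 = 0.73      # rounded hidden value feeding w10
output = 0.77  # rounded network output
target = 0.0
z = 1.213      # output sum from the forward pass
eta = 0.5      # learning rate used in the worked example
w10 = 0.3

dw10 = (output - target) * sigmoid(z) * (1.0 - sigmoid(z)) * h1
w10_new = w10 - eta * dw10

print(round(dw10, 4), round(w10_new, 4))
```

The gradient comes out close to $$0.1$$ and the updated weight close to $$0.25$$, matching the hand calculation up to rounding.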

## Prediction

We have done all the hard work so far so that we can make predictions for new data using our neural network. The dataset we work on is generally split into two parts. One part is called the training data, on which we do all the training, and the other is called the test data, on which we test our network. We have developed the equations for training, and by using them we get a calibrated set of weights. We then use this set of weights to predict the result for new data by running the forward propagation equations: $$Prediction = \sigma(W^{[2]}.\sigma(W^{[1]}.X_{test} + b^{[1]}) + b^{[2]})$$ where $$W^{[1]}, W^{[2]}$$ and $$b^{[1]}, b^{[2]}$$ are the calibrated weights and biases respectively and $$X_{test}$$ is the test set split from our dataset.
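
A prediction step can be sketched as a forward pass followed by a cut-off that turns the sigmoid output into a 0/1 label. The function name `predict` and the 0.5 threshold are assumptions, and the weights below are the uncalibrated ones from the forward propagation walkthrough, used only as placeholders so the sketch runs; a real prediction would use the weights obtained from training.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X_test, W1, b1, W2, b2, threshold=0.5):
    """Forward pass on X_test (one example per column), then threshold to 0/1."""
    A1 = sigmoid(W1 @ X_test + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return A2, (A2 >= threshold).astype(int)

# Placeholder weights (not calibrated): taken from the earlier worked example.
W1 = np.array([[0.9, 0.8, 0.1],
               [0.3, 0.5, 0.6],
               [0.2, 0.4, 0.7]])
b1 = np.zeros((3, 1))
W2 = np.array([[0.3, 0.5, 0.9]])
b2 = np.zeros((1, 1))

# Two hypothetical test examples as columns: (1, 0, 1) and (0, 1, 1).
X_test = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

probabilities, labels = predict(X_test, W1, b1, W2, b2)
print(probabilities)
print(labels)
```

With these untrained placeholder weights both examples are labeled 1, which shows why the weights need to be calibrated before the predictions mean anything.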

Now that we have finished the theoretical part of our tutorial, you can look at the code and try to understand its different blocks.