## #002 RNN – Architecture, Mapping, and Propagation

*Highlights*: Recurrent Neural Networks (RNNs) are sequence models that extend traditional Neural Networks to handle sequential data. From Speech Recognition to Natural Language Processing to Music Generation, RNNs continue to play a transformative role in handling sequential datasets.

In this blog post, we will learn how to build and map a Recurrent Neural Network with some interesting examples. In addition, we will represent basic RNN models using the mathematical notations of Forward Propagation and Back Propagation techniques. So, shall we begin?

**Tutorial Overview:**

- Brief Overview of RNN Fundamentals
- Mapping Input and Output
- Drawbacks of Using Standard Neural Networks
- The architecture of a Recurrent Neural Network
- Unidirectional Recurrent Neural Networks

- Forward Propagation in RNN
- Back Propagation in RNN

## 1. Brief Overview of RNN Fundamentals

Before we begin learning about how to build a Recurrent Neural Network, let us quickly recap the basic definition of RNN, as we learned in an earlier blog post.

Recurrent Neural Networks (RNN) are an alternative to traditional neural networks, especially used in cases where either the input data or the output data or both are sequential. By building sequence models such as RNNs, we can gain deeper insights into the problems of Speech Recognition, Machine Translation, Natural Language Processing, Music Generation, and many more.

Let us now move forward and understand how these sequence models, or Recurrent Neural Networks, are built by training them to map the input \(x \) to the output \(y \).

## 2. Mapping Input and Output

The first task at hand is to see how far standard neural networks can take us in solving our sequential dataset problems.

**Drawbacks of Using Standard Neural Networks**

First, let us recall the example we used in the previous blog post.

A standard neural network for this example will look something like this:

In our example, there are a total of nine input words. We could use nine One-Hot Vectors, one for each of these nine words, and, along with a few hidden layers, feed them into a standard neural network. The output of this standard neural network will be nine values, each either 0 or 1, where 1 denotes that the corresponding word is part of a person’s name.
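As a concrete sketch of this encoding step, here is how nine input words could be turned into one-hot vectors. The vocabulary and the sentence below are hypothetical placeholders, not the blog’s actual example:

```python
import numpy as np

# Hypothetical toy vocabulary; real systems use dictionaries of 10,000+ words.
vocab = ["and", "harry", "hermione", "invented", "new", "potter", "says", "spell", "a"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a one-hot column vector for a word in the vocabulary."""
    vec = np.zeros((len(vocab), 1))
    vec[word_to_index[word]] = 1.0
    return vec

# Encode a nine-word input sentence as nine one-hot vectors.
sentence = "harry potter and hermione invented a new spell says".split()
xs = [one_hot(w) for w in sentence]
print(len(xs), xs[0].shape)  # 9 (9, 1)
```

Each vector has exactly one entry set to 1, so the network sees which word occupies each position.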

However, there are a couple of problems that will arise eventually with such standard neural networks.

- Not all datasets we come across will have the same input length \(T_{x} \) and output length \(T_{y} \). We can try capping the input and output lengths at a set maximum value, but that still does not give a good, representative solution.
- For a naïve neural network architecture, sharing features learned across different text positions is important. For example, if the network has learnt that the word “*Luke*” appearing in a certain position in a sequence is part of a person’s name, it should automatically conclude that the word “*Luke*” appearing elsewhere in the sequence, say at \(x^{\left \langle t \right \rangle} \), is also part of a person’s name. This is quite similar to Convolutional Neural Networks, where we wanted results learned for one part of the image to generalize to other parts of the image as well. However, even with a modest 10,000-word vocabulary, each one-hot input vector is 10,000-dimensional, which makes the input layer very large and leaves us with an enormous number of parameters.
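To get a feel for the second problem, here is a rough back-of-the-envelope parameter count for a fully connected first layer over these one-hot inputs. The hidden-layer width is an assumed placeholder:

```python
vocab_size = 10_000       # dimensionality of each one-hot word vector
sequence_length = 9       # nine input words in our example
hidden_units = 1_000      # hypothetical width of the first hidden layer

# A standard (fully connected) network sees all words at once,
# so its input dimension is the concatenation of all one-hot vectors.
input_dim = vocab_size * sequence_length   # 90,000 inputs
weights = input_dim * hidden_units         # entries in the weight matrix
biases = hidden_units
print(f"{weights + biases:,}")  # 90,001,000
```

Ninety million parameters in the first layer alone, before the network has learned anything about sharing features across positions.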

Recurrent Neural Networks solve the above two key problems that arise if we start to think about using Standard Neural Networks to solve our sequential data problems.

We are now ready to see the framework of building a Recurrent Neural Network.

**The architecture of a Recurrent Neural Network**

Understanding why standard neural networks fail to address the problems of interpreting sequential data is important to dig deeper into the blueprint of RNNs. The advantages of using Recurrent Neural Networks are significant but first, let us understand how these RNNs are built.

Let us look at the network below, where the length of the input and output sequence is the same. We will slowly move towards more scenarios and gain deeper knowledge about the benefits that RNN provides us.

The first input to the first layer of the above Neural Network is \(x^{\left \langle 1 \right \rangle} \). This input goes through a hidden layer and predicts whether the first word is part of a person’s name or not. The Recurrent Neural Network comes into play when the second word of the input sentence, \(x^{\left \langle 2 \right \rangle} \), is read. Instead of predicting \(\hat{y}^{\left \langle 2 \right \rangle} \) using only \(x^{\left \langle 2 \right \rangle} \), the network also receives a second input in the form of an activation value from Time Step 1. The same process is repeated for the subsequent time steps, until the last one, where the input is \(x^{\left \langle T_{x} \right \rangle} \) and the output is \(\hat{y}^{\left \langle T_{y} \right \rangle} \). For the initial time step 1, the activation value is kept at 0, which is nothing but a vector of zeros associated with a dummy time step 0. Some researchers initialize the activation value \(a^{\left \langle 0 \right \rangle } \) randomly, but a vector of zeros is the most common practice.

As we mentioned earlier, and as you can see in this example, the length of the input sequence \(T_{x} \) is equal to the length of the output sequence \(T_{y} \). The architecture of the Recurrent Neural Network will change once the input and output sequence lengths start to differ.

Apart from the above graphical representation of a Recurrent Neural Network, there is another way of representing RNN as seen in some research papers. In such representations, a loop is drawn on the time step indicating that the layer feeds back to itself.

The above RNN reads the input data from left to right while sharing parameters at each time step. We will describe the parameters of the hidden layers in detail ahead, but for now, let’s look at the parameters that govern the connection between \(x^{\left \langle 1 \right \rangle} \) and the hidden layer. This parameter, which is the same for every time step, can be denoted as \(w_{ax} \). The parameter governing the activations passed from one time step to the next is denoted as \(w_{aa} \), and similarly, the common parameter that governs the output predictions is denoted as \(w_{ya} \).

Consider the output \(\hat{y}^{\left \langle 3 \right \rangle} \). This prediction is made not just from this time step’s input \(x^{\left \langle 3 \right \rangle} \), but also from information received from the previous inputs \(x^{\left \langle 1 \right \rangle} \) and \(x^{\left \langle 2 \right \rangle} \), which is passed along through the activation values and the parameters governing them.

In all the graphical representations above, we can see the flow of information is in one direction only, i.e., from left to right. Such sequence models or RNNs pose some limitations which we will learn ahead.

**Unidirectional Recurrent Neural Networks**

In our discussions above, we mentioned how information from previous inputs in the sequence is used when predicting, say, \(\hat{y}^{\left \langle 3 \right \rangle} \). However, notice that the network uses only the previous information and does not take into consideration information from the words ahead in the sequence, such as \(x^{\left \langle 4 \right \rangle} \), \(x^{\left \langle 5 \right \rangle} \), and so on.

Consider a new sentence: “*Bear Grylls loves adventure.*” For the network to judge that “*Bear*” is a part of a person’s name, it is useful to have information not only from the first two words but from the rest of the sentence as well. We say this because the sentence could also have been: “*Bear with me for some time please.*” How is a neural network supposed to judge whether “*Bear*” is part of a person’s name or just an ordinary word? In the first example, it is a name; in the second, it is not.

This limitation arises with the use of Unidirectional Neural Network architecture and can only be solved using Bidirectional Recurrent Neural Networks (BRNN), which modify the architecture slightly to incorporate information from all input words of the sentence at each time step. You will learn more about BRNNs in our future posts. For the sake of simplicity and understanding key concepts, we will be working with Unidirectional Recurrent Neural Networks only, for now.

Now we know how a basic RNN architecture is created. Let us continue building RNNs, this time using mathematical notations, and understanding how forward propagation works in Recurrent Neural Networks.

## 3. Forward Propagation in RNN

After a thorough understanding of the basic architectural framework of RNNs, we will learn how the whole process works and can be represented mathematically. We will start with the computational formulas used in the process of Forward Propagation in Recurrent Neural Networks (RNN).

Let us look at the RNN model again.

So, as we learnt earlier, the process starts off with the ‘fake’ initial activation input, \(a^{\left \langle 0 \right \rangle}= \vec{0} \). Following this initial value, the subsequent forward propagation of activation values can be written in terms of weights and biases as:

$$ a^{\left \langle 1 \right \rangle}= g\left ( w_{aa}\times a^{\left \langle 0 \right \rangle}+w_{ax}\times x^{\left \langle 1 \right \rangle}+b_{a} \right ) $$

Again, in order to compute \(\hat{y}^{\left \langle 1 \right \rangle} \), there will be yet another activation function internally, which will have the following mathematical formula:

$$ \hat{y}^{\left \langle 1 \right \rangle}= g\left ( w_{ya}\times a^{\left \langle 1 \right \rangle}+b_{y} \right ) $$
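A minimal numerical sketch of these two equations, taking \(g \) to be Tanh for the activation and Sigmoid for the binary output (choices the text discusses below); all sizes and random weights here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x = 5, 8                                 # assumed sizes: 5 hidden units, 8-dim inputs

w_aa = rng.standard_normal((n_a, n_a)) * 0.01   # activation-to-activation weights
w_ax = rng.standard_normal((n_a, n_x)) * 0.01   # input-to-activation weights
w_ya = rng.standard_normal((1, n_a)) * 0.01     # activation-to-output weights
b_a = np.zeros((n_a, 1))
b_y = np.zeros((1, 1))

a0 = np.zeros((n_a, 1))                # a<0>: the initial vector of zeros
x1 = rng.standard_normal((n_x, 1))     # x<1>: first input word vector

# a<1> = g(w_aa a<0> + w_ax x<1> + b_a), with g = tanh
a1 = np.tanh(w_aa @ a0 + w_ax @ x1 + b_a)

# y_hat<1> = sigmoid(w_ya a<1> + b_y): probability the word is part of a name
y_hat1 = 1.0 / (1.0 + np.exp(-(w_ya @ a1 + b_y)))
print(a1.shape)  # (5, 1)
```

Note how \(a^{\left \langle 1 \right \rangle} \) depends on both the zero initial activation and the first input, exactly as in the formula.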

If we look at the notation \(w_{ax} \), the subscript \(x \) means that the weight is to be multiplied by \(x^{\left \langle 1 \right \rangle} \), and the \(a \) means that \(w_{ax} \) is being used to calculate an \(a \)-like quantity. In the same way, when we write \(w_{ya} \), it means that it is going to be multiplied by an \(a \) value and will be used to calculate the value \(y \).

The activation function \(g \), as mentioned above, will most commonly be a Tanh. ReLU is sometimes used in RNNs as well, partly to mitigate the vanishing gradient problem, but Tanh remains the more common choice. Choosing an activation function for the output \(y \) is an important step too. If the output is a binary classification, we use a Sigmoid activation function; if it is a K-class classification problem, we go with Softmax. Therefore, for the Named Entity Recognition problem, where the output \(y \) was either 0 or 1, \(g \) can be taken as a Sigmoid activation function. Writing in more general terms, we can denote \(a^{\left \langle t \right \rangle}\) at time \(t \) as:

$$ a^{\left \langle t \right \rangle}= g\left ( w_{aa}\times a^{\left \langle t-1 \right \rangle}+w_{ax}\times x^{\left \langle t \right \rangle}+b_{a} \right ) $$

And \(\hat{y}^{\left \langle t \right \rangle} \) is:

$$ \hat{y}^{\left \langle t \right \rangle}= g\left ( w_{ya}\times a^{\left \langle t \right \rangle}+b_{y} \right ) $$

So, these were the essential equations that define Forward Propagation in Recurrent Neural Networks. We start with an initial activation value \(a^{\left \langle 0 \right \rangle} \) which is a vector of all zero values. Then, we use this \(a^{\left \langle 0 \right \rangle}\) and the first input \(x^{\left \langle 1 \right \rangle} \) to predict our first output \(\hat{y}^{\left \langle 1 \right \rangle} \) and the next activation value \(a^{\left \langle 1 \right \rangle} \). Similarly, in the second time step, we use the input \(x^{\left \langle 2 \right \rangle}\) and the previous activation value \(a^{\left \langle 1 \right \rangle}\) to calculate the next activation value \(a^{\left \langle 2 \right \rangle} \) and a new output \(\hat {y}^{\left \langle 2 \right \rangle} \). And so we continue with our forward propagation from left to right in our Recurrent Neural Network.
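Putting the general equations together, the full left-to-right pass can be sketched as a simple loop. Tanh and Sigmoid are the assumed activation choices, and all sizes here are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, w_aa, w_ax, w_ya, b_a, b_y):
    """Forward propagation over a whole sequence.

    xs: list of input column vectors x<1> .. x<Tx>.
    Returns the activations a<1>..a<Tx> and predictions y_hat<1>..y_hat<Tx>.
    """
    a = np.zeros((w_aa.shape[0], 1))                 # a<0> = vector of zeros
    activations, predictions = [], []
    for x in xs:
        a = np.tanh(w_aa @ a + w_ax @ x + b_a)       # a<t>
        activations.append(a)
        predictions.append(sigmoid(w_ya @ a + b_y))  # y_hat<t>
    return activations, predictions

# Tiny usage example with random data (sizes are made up for illustration).
rng = np.random.default_rng(1)
n_a, n_x, T_x = 4, 6, 9
xs = [rng.standard_normal((n_x, 1)) for _ in range(T_x)]
acts, preds = rnn_forward(
    xs,
    rng.standard_normal((n_a, n_a)) * 0.01,
    rng.standard_normal((n_a, n_x)) * 0.01,
    rng.standard_normal((1, n_a)) * 0.01,
    np.zeros((n_a, 1)),
    np.zeros((1, 1)),
)
print(len(acts), len(preds))  # 9 9
```

Each iteration consumes one input and the previous activation, mirroring the left-to-right flow described above.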

We agree the equations are a little lengthy and pose a challenge when we think about using them in complex neural network systems. Let us, therefore, simplify these equations for a much more friendly RNN Notation.

$$ a^{\left \langle t \right \rangle}= g\left ( w_{aa}\times a^{\left \langle t-1 \right \rangle}+w_{ax}\times x^{\left \langle t \right \rangle}+b_{a} \right ) $$

$$ \hat{y}^{\left \langle t \right \rangle}= g\left ( w_{ya}\times a^{\left \langle t \right \rangle}+b_{y} \right ) $$

We can even write this Forward Propagation notation in a matrix form to arrive at a neatly stacked formula such as this:

$$ a^{\left \langle t \right \rangle}= g\left ( w_{a}\begin{bmatrix}a^{\left \langle t-1 \right \rangle}\\x^{\left \langle t \right \rangle}\end{bmatrix}+b_{a} \right ),\quad w_{a}= \begin{bmatrix}w_{aa} & w_{ax}\end{bmatrix} $$

Notice how we are defining \(w_{a}\ \) by stacking two matrices \(w_{aa}\ \) and \(w_{ax}\ \) horizontally, side by side. So, for example, if the vector \(a \) is 100-dimensional and our input vector \(x \) is 10,000-dimensional (because of dictionary size), then, the resultant matrices \(w_{aa}\ \) and \(w_{ax}\ \) would be of sizes \(100\times 100 \) and \(100\times 10,000 \), respectively.

Upon stacking \(w_{aa} \) and \(w_{ax} \) as above, the number of columns becomes \(100+10,000= 10,100 \), and hence the resultant matrix \(w_{a} \) will be a \(100\times 10,100 \) dimensional matrix.

We could also write the above scenario in general form, by stacking the vectors \(a^{\left \langle t-1 \right \rangle} \) and \(x^{\left \langle t \right \rangle} \) to form a 10,100-dimensional vector, something like this:

You can even validate the above equation by working backward to see if you can reach the original Forward Propagation equation using this. Let us see how that works out.

$$ \begin{bmatrix}w_{aa} & w_{ax}\end{bmatrix}\times \begin{bmatrix}a^{\left \langle t-1 \right \rangle}\\x^{\left \langle t \right \rangle}\end{bmatrix}= w_{aa}\times a^{\left \langle t-1 \right \rangle}+w_{ax}\times x^{\left \langle t \right \rangle} $$


There you go! By working our way backward from the simplified formula, we have reached our original notation for Forward Propagation. This proves our point that rather than using two parameter matrices \(w_{aa} \) and \(w_{ax} \), we can simply compress them into a single parameter matrix \(w_{a} \) and make our lives easier while working on more complex models.
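We can also verify this equivalence numerically: stacking \(w_{aa} \) and \(w_{ax} \) side by side, and \(a^{\left \langle t-1 \right \rangle} \) on top of \(x^{\left \langle t \right \rangle} \), gives exactly the same result as the two separate products. The sizes follow the 100 / 10,000 example above:

```python
import numpy as np

rng = np.random.default_rng(2)
n_a, n_x = 100, 10_000

w_aa = rng.standard_normal((n_a, n_a))
w_ax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal((n_a, 1))
x_t = rng.standard_normal((n_x, 1))

w_a = np.hstack([w_aa, w_ax])        # shape (100, 10100): [w_aa | w_ax]
stacked = np.vstack([a_prev, x_t])   # shape (10100, 1): [a<t-1>; x<t>]

lhs = w_a @ stacked
rhs = w_aa @ a_prev + w_ax @ x_t
print(w_a.shape, np.allclose(lhs, rhs))  # (100, 10100) True
```

One matrix multiply now does the work of two, which is exactly why the compressed notation is convenient.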

Now we can rewrite our original equations using the simplified notation in the following way:

$$ \hat{y}^{\left \langle t \right \rangle}= g\left ( w_{ya}\times a^{\left \langle t \right \rangle}+b_{y} \right ) $$

$$ \hat{y}^{\left \langle t \right \rangle}= g\left ( w_{y}\times a^{\left \langle t \right \rangle}+b_{y} \right ) $$

Look how neatly we have denoted Forward Propagation by using the simple subscripts in \(w_{y} \) and \(b_{y} \). Here, the subscript \(y \) indicates that these parameters are used to compute the output \(\hat{y} \), while \(w_{a} \) and \(b_{a} \) represent the parameters used to compute the activation values.

Great! We have successfully learned how to notate Forward Propagation, i.e., moving from left to right through time, in basic Recurrent Neural Networks. However, as we have seen with other neural network models, moving forward in time alone is not enough to tell whether our model is working accurately. This is why Back Propagation is equally, if not more, important to understand, even though the programming framework takes care of it automatically in the background.

## 4. Back Propagation in RNN

We have reached an important stage in our understanding of Recurrent Neural Networks and how the basic models of RNN work. Now that we have seen how we move with time using various Input and Output parameters, we shall take the recursive path of going from right to left through time, in our basic RNN model.

In the section above, we saw how Forward Propagation works. Let us now see how the whole system of RNN model works, including Forward as well as Back Propagation.

In the graphical representation above, the blue arrows represent Forward Propagation and the red arrows represent Back Propagation. Going forward, we start with the input sequence \(x^{\left \langle 1 \right \rangle} \), \(x^{\left \langle 2 \right \rangle} \), \(x^{\left \langle 3 \right \rangle} \) up to \(x^{\left \langle T_{x} \right \rangle} \). We then use \(x^{\left \langle 1 \right \rangle} \) and \(a^{\left \langle 0 \right \rangle} \) to calculate the activation at time step 1, i.e., \(a^{\left \langle 1 \right \rangle} \). Next, we use this activation \(a^{\left \langle 1 \right \rangle} \) along with the next input \(x^{\left \langle 2 \right \rangle} \) to calculate the activation for the next time step, \(a^{\left \langle 2 \right \rangle} \). And so on, up to \(a^{\left \langle T_{x} \right \rangle} \).

Now, earlier we mentioned that the initial activation value \(a^{\left \langle 0 \right \rangle} \) is simply taken to be a vector of zeros. To compute the first activation \(a^{\left \langle 1 \right \rangle} \), however, we need the parameters \(w_{a} \) and \(b_{a} \). The same two parameters are used to calculate the activation values \(a^{\left \langle 2 \right \rangle} \), \(a^{\left \langle 3 \right \rangle} \), and so on, for all subsequent time steps.

The outputs \(\hat{y} \) are calculated in a similar fashion to the activation values. The first activation value \(a^{\left \langle 1 \right \rangle} \) is used to predict the first output \(\hat{y}^{\left \langle 1 \right \rangle} \), and the pattern follows for the other outputs \(\hat{y}^{\left \langle 2 \right \rangle} \), \(\hat{y}^{\left \langle 3 \right \rangle} \), and so on up to \(\hat{y}^{\left \langle T_{y} \right \rangle} \). For the RNN to compute these outputs \(\hat{y} \), it needs a set of weight and bias parameters too, which are represented as \(w_{y} \) and \(b_{y} \).

Now, for us to compute Back Propagation, we first need to define a Loss Function. Let us start with an element-wise loss. We know that if a certain word in our sequence is part of a person’s name, the true label \(y^{\left \langle t \right \rangle} \) will be 1. Our Neural Network outputs a probability between 0 and 1, indicating how strongly it believes that the word is part of a person’s name. This can be treated as a standard Logistic Regression loss, also called the Cross-Entropy Loss. This element-wise loss is associated with a single prediction at a certain position or time step \(t \). We can write it mathematically like this:

$$ L^{\left \langle t \right \rangle}\left ( \hat{y}^{\left \langle t \right \rangle},y^{\left \langle t \right \rangle} \right )= -y^{\left \langle t \right \rangle}\log \hat{y}^{\left \langle t \right \rangle}-\left ( 1-y^{\left \langle t \right \rangle} \right )\log\left ( 1-\hat{y}^{\left \langle t \right \rangle} \right ) $$

The above equation represents Loss for a single word of the sequence. Let us see how we can extrapolate this to compute the overall loss of the entire sequence, represented as \(L \).

$$ L\left ( \hat{y},y \right )= \sum_{t= 1}^{T_{x}}L^{\left \langle t \right \rangle}\left ( \hat{y}^{\left \langle t \right \rangle},y^{\left \langle t \right \rangle} \right ) $$
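These two loss equations can be sketched directly in code. The labels and predicted probabilities below are made-up numbers for illustration:

```python
import numpy as np

def elementwise_loss(y_hat, y, eps=1e-12):
    """Cross-entropy loss at a single time step (binary name / not-name label)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)          # guard against log(0)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

def sequence_loss(y_hats, ys):
    """Total loss: the sum of element-wise losses over all time steps."""
    return sum(elementwise_loss(y_hat, y) for y_hat, y in zip(y_hats, ys))

ys = [1, 1, 0, 0]              # first two words are part of a name
y_hats = [0.9, 0.8, 0.2, 0.1]  # hypothetical predicted probabilities
print(round(float(sequence_loss(y_hats, ys)), 4))  # 0.657
```

Confident predictions on correct labels contribute little loss; a confident wrong prediction would dominate the sum.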

The Total Loss of the entire input sequence is nothing but the sum of the element-wise losses at each time step. Also, note that in our example, \(T_{x} \) and \(T_{y} \) are equal. This is what the overall computational graph of the sequence’s Loss Function looks like:

In Back Propagation, as we can see, messages are passed in the opposite direction compared to Forward Propagation. These reverse-flowing messages allow us to compute the appropriate quantities, which in turn let us derive gradients and update all parameters using gradient descent.
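This backward message-passing can be written out explicitly for a tiny Tanh/Sigmoid RNN of the kind described in the Forward Propagation section. This is a minimal hand-rolled sketch under those assumptions, with illustrative names and sizes; in practice a deep-learning framework derives all of this automatically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bptt(xs, ys, w_aa, w_ax, w_ya, b_a, b_y):
    """One pass of back propagation through time for a binary-output RNN."""
    # Forward pass, caching all activations a<0> .. a<Tx>.
    a = [np.zeros((w_aa.shape[0], 1))]
    y_hats = []
    for x in xs:
        a.append(np.tanh(w_aa @ a[-1] + w_ax @ x + b_a))
        y_hats.append(sigmoid(w_ya @ a[-1] + b_y))
    # Backward pass: gradients flow from right to left through the time steps.
    dw_aa, dw_ax, dw_ya = np.zeros_like(w_aa), np.zeros_like(w_ax), np.zeros_like(w_ya)
    db_a, db_y = np.zeros_like(b_a), np.zeros_like(b_y)
    da_next = np.zeros_like(a[0])            # message arriving from the future
    for t in reversed(range(len(xs))):
        dz_y = y_hats[t] - ys[t]             # dL/dz for sigmoid + cross-entropy
        dw_ya += dz_y @ a[t + 1].T
        db_y += dz_y
        da = w_ya.T @ dz_y + da_next         # combine output and future messages
        dz_a = da * (1.0 - a[t + 1] ** 2)    # tanh derivative
        dw_aa += dz_a @ a[t].T
        dw_ax += dz_a @ xs[t].T
        db_a += dz_a
        da_next = w_aa.T @ dz_a              # pass the message one step back
    return dw_aa, dw_ax, dw_ya, db_a, db_y
```

The returned gradients can be sanity-checked against finite differences of the total cross-entropy loss, which is a common way to validate a hand-written backward pass.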

This algorithm is aptly named “Back Propagation Through Time”, since its key messages flow from right to left, working backwards through the time steps of the Recurrent Neural Network. If only an RNN could go back in actual time and attend Stephen Hawking’s Time Traveller Tea Party, but, oh well!

This brings us to the end of this tutorial post. We hope you have learned a great deal about how Recurrent Neural Networks are structured and mapped, and how forward and backpropagation works in basic RNN models. An important observation that we have mentioned earlier as well is that in the entire tutorial, we have considered the lengths of both input and output sequences to be equal. However, you can expect a new tutorial moving ahead, where we will understand a much wider range of RNN architectures tackling a variety of applications.

## Forward and Back Propagation in Recurrent Neural Networks

- RNNs handle sequential data better than Standard Neural Networks
- RNN architecture contains hidden layers that have memory
- Simplified Forward Propagation methods
- Element-wise loss calculated for Back Propagation
- Wide range of real-life applications

**Summary**

RNNs are really an exciting topic to venture into, aren’t they? We wish to keep continuing with our tutorials to help you build your own models soon. In no time, you will be capable enough to train your own Neural Network to identify speech, names, music, and whatever else you wish. So keep reading, keep learning, and keep propagating! See you 😊