Attention is all you need – Transformers


Highlights:

  1. What is attention?
[Figure: Example of a paper describing an algorithm for scene classification. False positive detection: a tree is classified as a car with a very high confidence of 99%.]

However, RNNs are not ideal, as they have some limitations:

  • Computation for long sequences is challenging
  • Vanishing and exploding gradients are problematic in RNN architectures
  • Combining information across long input sentences is difficult
  2. Introducing the transformer

The transformer is essentially made up of two main blocks:

  1. Encoder 
  2. Decoder

Let’s start with the encoder block and learn about the first process within the encoder, i.e., the input embedding. 

What is an input embedding?

Let’s take an example wherein we assume that our input sentence consists of 6 words/tokens.

[Figure 6: Assigning Input IDs to the words of the input sentence]

Now, the first step is to determine the Input IDs of these words from the word’s dictionary. 

In the dictionary, we may assume that we have collected all words from the English language. 

Every word will have the same Input ID regardless of where it appears in the sentence. For instance, the word ‘CAT’, in both occurrences in the sentence, will be assigned the same Input ID number. 

Figure 6 above shows the process of assigning Input IDs. Now, an Input ID can also be viewed as a one-hot vector, meaning that we have a value of 1 at the Input ID position and 0 everywhere else. This is shown in Figure 7 below.

[Figure 7: The Input IDs viewed as one-hot vectors]
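To make this step concrete, here is a minimal Python sketch of assigning Input IDs. The toy vocabulary and the example 6-word sentence are assumptions made purely for illustration; in practice the dictionary would cover the whole language.

import numpy as np

# hypothetical toy dictionary mapping words to Input IDs
vocab = {"<unk>": 0, "your": 1, "cat": 2, "is": 3, "a": 4, "lovely": 5}

sentence = "your cat is a lovely cat".split()
input_ids = [vocab.get(word, vocab["<unk>"]) for word in sentence]
print(input_ids)                           # [1, 2, 3, 4, 5, 2] -- "cat" gets the same Input ID both times

one_hot = np.eye(len(vocab))[input_ids]    # each Input ID viewed as a one-hot vector, shape (6, 6)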


Next, we map each of these Input IDs with a new vector via a process called input embedding. 

This input embedding will result in a new vector of size 512. In this way, similar words are mapped to vectors that lie close to each other, and the same word is always mapped to the same vector.

For instance, both occurrences of ‘cat’ will have the same embedded vector. These values are not fixed, though; they change during the training process according to the training algorithm.

Note that the size of the embedding vector is defined in the original paper [2] as d_model.
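As a rough illustration of the lookup itself, the sketch below uses a randomly initialised NumPy matrix in place of a trained embedding layer; the values are placeholders, not real learned embeddings.

import numpy as np

d_model = 512
vocab_size = 6                                              # the toy vocabulary from the sketch above

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, d_model))   # in a real model these values are learned

input_ids = [1, 2, 3, 4, 5, 2]                              # "your cat is a lovely cat"
embedded = embedding_matrix[input_ids]                      # shape (6, 512), i.e. (sequence length, d_model)

# the same Input ID always selects the same row, so both occurrences of "cat"
# start from identical embedding vectors
assert np.allclose(embedded[1], embedded[5])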

You may be puzzled by the embedding step. Figure 8 below illustrates the input embedding process in a simple way. Notice how similar objects (words) are relatively close to each other in the embedded space. For instance, ‘dogs’ and ‘cats’ will be relatively close to each other. On the other hand, the word “car” will be relatively far away from both of them.

What is positional encoding?

Apart from input embedding, the model needs additional information about the relative position of the words in our input sequence. 

For instance, if we want the model to be able to determine which words are close to each other in a sentence and which ones are relatively far apart, a simple process called ‘naive positional encoding’ can be applied.

It is important to understand that the position of words in a sentence can change the meaning of the sentence; that is why positional encoding is so important.

A naive positional encoding is a simple extension of the embedding vector: we concatenate the position value at the end of the vector.

In practice, however, the idea is to construct a positional encoding vector of the same size as the embedding and to sum it with the input vector. This is depicted in Figure 9.

This positional embedding is calculated only once, and then, the vectors are simply added (summed) to the original embedded input vector. Naturally, for summation to be possible, both vectors are defined as 512-element long vectors [2].

Let us briefly discuss how the positional encoding vector is calculated. For this, we will use the following formula:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

It is interesting to note that the positional encoding vectors are calculated only once, and their values are the same for every sentence. For instance, PE(pos, 2*i) or PE(pos, 2*i+1) will have the same values for different sentences at the corresponding positions. This means that PE(0, i) will be the same vector of 512 elements for the first word of every input sentence. Note that the variable pos in the formula determines the frequency of the sine/cosine oscillation, whereas the variable i can be seen as a discrete time index that moves along the vector of length 512.
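The sketch below computes this positional encoding with NumPy, following the sin/cos formula above; the function name is introduced here just for illustration.

import numpy as np

def positional_encoding(seq_len, d_model=512):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]                  # positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]               # dimension index i
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions use cosine
    return pe

pe = positional_encoding(seq_len=6)                    # computed once; identical for every sentence
# x = embedded + pe                                    # then simply summed with the 6 x 512 embedded input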

One illustration of how this can be of help can be seen in Figure 10 below, where the different sine waves are shown.

[Figure 10: Sine/cosine positional encoding values plotted against the position (x-axis, up to 2000) and the embedding dimension (y-axis, up to 512)]

Along the y-axis of the graph we have the length of the embedding vector (512), while the position variable on the x-axis is depicted up to a value of 2000. The sine/cosine values cover a continuous interval from -1 to 1, as can be seen in the legend of the graph.

Hopefully, the model will also “see” and understand these regular sine patterns and thereby better model the relative positions of the words.

4. Single Head Attention: Self-Attention

Moving on, we will now explore the concept of self-attention. Understanding this important concept is necessary before moving on to multi-head attention. Have a look at Figure 11 below, which shows the block diagram of the next processing step.

[Figure 11: Block diagram of the next processing step, the attention block]

What is self-attention?

Self-attention is a mechanism that existed even before 2017. It enables words to relate to each other. The formula currently used for self-attention is:

Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V

To illustrate self-attention with a simple example, we will assume that we have 6 words as our model’s input. Next, we apply the input embedding step and thereby obtain a matrix of size 6×512.

This means that our matrix Q (query) will be of size 6×512.

The matrix K (key) is initially the same 6×512 matrix as well. However, it needs to be transposed into K^T (of size 512×6) so that we can multiply Q and K^T. This multiplication step is shown using the following illustrated equation.

[Illustrated equation: Q (6×512) multiplied by K^T (512×6), resulting in a 6×6 matrix]

In our case, the model size d_model is the same as d_k for simplicity, since there is only one head. To normalise the result, we divide the matrix multiplication by sqrt(d_k). This results in a 6×6 matrix, which tells us how each of the 6 words relates to every other word.

In addition, we apply a softmax() function to the result along each row, so that the values in every row sum to one.

[Figure: The 6×6 attention matrix after the softmax; each row sums to one]

Finally, one additional step is to multiply this 6×6 matrix by the matrix V (value), which is of size 6×512. This again results in a 6×512 matrix, which is our resulting self-attention representation.


This representation contains one row vector for each word of the input. Both the position of each word and its relation to the other words in the sentence are captured by the self-attention matrix; that is, it tells us how the words interact.
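The following NumPy sketch reproduces the computation just described, assuming Q, K and V are all equal to the (embedded and positionally encoded) input, as in this simplified, parameter-free version.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)            # subtract the row maximum for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    q, k, v = x, x, x                                  # no learned weights in this simplified version
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                    # (6, 6): how each word relates to every other word
    weights = softmax(scores, axis=-1)                 # every row sums to 1
    return weights @ v, weights                        # (6, 512) representation and the (6, 6) attention matrix

x = np.random.default_rng(0).normal(size=(6, 512))     # stands in for embedding + positional encoding
out, attn = self_attention(x)
print(out.shape, attn.shape, attn.sum(axis=1))         # (6, 512) (6, 6) rows summing to 1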

Properties of Self-Attention

So far, we have managed to derive a self-attention matrix. Notice that, up to this point, it does not rely on any learned parameters; all the steps have been performed using only the input embedding and the positional encoding.

The attention matrix, intuitively, will have the largest values along the main diagonal. The larger an off-diagonal element is, the stronger the connection between the two corresponding words.

In addition, this is a good time to mention that sometimes we want to prevent interactions between certain words. To accomplish this, we put a value of -infinity into the corresponding entries of the matrix, which become 0 after the softmax() function is applied. We will use this later in the decoder, when we want to prevent the model from seeing “future” words.
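A small sketch of this masking idea, on made-up scores: the upper-triangular positions (the “future” words) are set to -infinity before the softmax and end up as exact zeros afterwards.

import numpy as np

seq_len = 6
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))    # stands in for Q.K^T / sqrt(d_k)

mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)          # True strictly above the main diagonal
scores[mask] = -np.inf                                                # block the "future" positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)                        # row-wise softmax
print(np.round(weights, 2))                                           # masked entries are exactly 0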

5. Multi-Head Attention

So far, we have been exploring single-head self-attention.

Now, we will expand this concept and introduce multi-head attention.

It is defined using the following formulas:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) · W^O,  where  head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)

Let’s start slowly to deconstruct these equations and systematically explain all the variables here.

In the above equations, we again assume that our input sentence consists of 6 words. After input embedding, we get a matrix of size 6×512.

Next, four copies of this matrix are sent along four different paths, as shown in Figure 12 below, where they are depicted as red paths.

[Figure 12: Four copies of the embedded input sent along the red paths into the multi-head attention block]

Next, these embedding matrices are multiplied by the corresponding W matrices, as depicted below:

Q' = Q·W^Q,  K' = K·W^K,  V' = V·W^V

Next, depending on the number of heads h, we calculate d_k = d_v = d_model / h.

Each head will see every word of the input, but it will process only a fraction (of size d_k) of each embedding vector.


Now, the formulas become more intuitive: the final result is the concatenation of the individual head_i outputs, H = Concat(head_1, ..., head_h).

This is also illustrated in the following graph.

[Figure: The individual head outputs concatenated into the matrix H]

Now, the matrix H is the result of concatenating the (seq × d_v) matrices. Hence, instead of calculating attention between the full Q', K' and V' matrices, the multi-head attention model splits them and then calculates attention separately within each split: between Q1, K1 and V1, between Q2, K2 and V2, and so on.
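The sketch below walks through exactly this split-attend-concatenate flow in NumPy; the random W matrices are stand-ins for the learned projections, not the paper's actual weights.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, h=8, d_model=512, seed=0):
    rng = np.random.default_rng(seed)
    w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))

    q, k, v = x @ w_q, x @ w_k, x @ w_v                # Q', K', V': each (seq, d_model)
    d_k = d_model // h                                 # the fraction of the embedding seen by each head

    heads = []
    for i in range(h):
        cols = slice(i * d_k, (i + 1) * d_k)           # head i works only on its slice of the columns
        qi, ki, vi = q[:, cols], k[:, cols], v[:, cols]
        weights = softmax(qi @ ki.T / np.sqrt(d_k))    # (seq, seq) attention of head i
        heads.append(weights @ vi)                     # (seq, d_k)

    h_matrix = np.concatenate(heads, axis=-1)          # H: concatenation of the heads, (seq, d_model)
    return h_matrix @ w_o                              # final linear projection W^O

x = np.random.default_rng(1).normal(size=(6, 512))
print(multi_head_attention(x).shape)                   # (6, 512)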

To summarise, the whole idea behind multi-head attention is that the heads can specialise; for instance, some heads can learn to relate to nouns, verbs, or adjectives.

Moreover, we can even visualise this attention. For instance, we can apply the softmax function when calculating the product between Q and K^T.

The final matrix in this case is (seq × seq), or (6×6), and it tells us the intensity of the relation between each pair of words.

In Figure 13, we present an illustrative example from the paper, so that you can better grasp the multi-head attention process. Notice how some of the words are more related to each other than others.

[Figure 13: Attention visualisation from the original paper [2]: the verb “making” attending to the other words in the sentence, with different colours for different heads]

In this case, we can see how the verb “making” is related to all other words in the sentence.

Different colours correspond to different heads of the model.

It seems that the pink head can see interactions between words that the other heads cannot see.

Query, Keys and Values

At first glance, these terms should already be familiar from a well-known Python data structure: the dictionary. In addition, a very interesting piece of reasoning is provided on the Stack Overflow website.


The authors relate these terms to a retrieval system (e.g. a YouTube video search). The ‘query’ can be seen as the input search phrase. The video titles and descriptions can be seen as the set of ‘keys’. Finally, the set of best-matched videos can be seen as the final result, the ‘value’; this is the output that the YouTube search engine presents to us.

The following example can help us become even more familiar with the query, key and value interpretation.
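As a toy illustration of this retrieval analogy (the titles and vectors below are invented for the example), a Python dictionary performs a “hard” lookup, while attention performs a “soft” one:

import numpy as np

# hard lookup: a Python dict returns a value only for an exact key match
videos = {"cat videos": "funny_cats.mp4", "dog videos": "good_dogs.mp4"}
print(videos["cat videos"])                                  # exact key -> exact value

# soft lookup (the attention view): compare the query against every key
# and return a weighted mixture of all the values
query = np.array([0.9, 0.1])                                 # e.g. the search phrase "kitten clips"
keys = np.array([[1.0, 0.0],                                 # "cat videos"
                 [0.0, 1.0]])                                # "dog videos"
values = np.array([[5.0, 5.0],                               # what each key retrieves
                   [-5.0, -5.0]])

scores = keys @ query                                        # similarity of the query to each key
weights = np.exp(scores) / np.exp(scores).sum()              # softmax over the keys
print(weights, weights @ values)                             # dominated by the "cat videos" value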


Masked Multi-Head Attention
