Transformers — Attention is all you need

Solomon
5 min read · May 19, 2022

This article describes how Transformers help sequence-to-sequence models capture the essence of a sequence using the idea of the dot product, without using time-step processing units such as RNNs/LSTMs.

Photo by Somchai Kongkamsri: https://www.pexels.com/photo/red-and-black-robot-statue-185725/

In terms of text processing, Transformers modify the word embeddings in such a way that each value in a word's embedding captures the context of the sentence. This is done by calculating the similarity between the words of a given sentence using dot products, and then normalising those similarities with Softmax to get a probability representation.
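To make that concrete, here is a minimal NumPy sketch of the idea, using made-up 4-dimensional embeddings for a 3-word toy sentence (not real learned vectors): it computes the dot-product similarity of one word against every word in the sentence and turns those similarities into a probability distribution with Softmax.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Toy embeddings for the words of a 3-word sentence (4 dimensions each).
embeddings = np.array([
    [1.0, 0.2, 0.0, 0.5],   # word 1
    [0.9, 0.1, 0.1, 0.4],   # word 2 (similar to word 1)
    [0.0, 1.0, 0.8, 0.0],   # word 3 (different)
])

# Dot-product similarity of word 1 against every word in the sentence.
similarities = embeddings @ embeddings[0]

# Softmax turns the raw similarities into a probability distribution.
weights = softmax(similarities)
print(weights)  # higher weight for the words similar to word 1
```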

Before going to Transformers, let us briefly touch on RNNs and LSTMs.

RNN — Recurrent Neural Network

RNNs were the first step towards neural networks that work across time, and they are the motivation behind what came after. For instance, engineers designed the LSTM, the Transformer, etc. after understanding the properties of RNNs, in order to resolve the issues seen in RNNs.

One of the main disadvantages of a feed-forward neural network is its fixed input length. We can argue that this could be fixed by one-hot encoding over the number of unique words, but in the real world this is not practical because of the huge amount of data that flows around us nowadays. Also, when data depends on previous data (for example language, music or video, which are highly correlated with previous information), the traditional neural network is not capable of capturing that historical information. This motivates creating a network that can capture historical data, called the Recurrent Neural Network.

Usually, for a model, there are two outer dimensions we consider: 1) the number of data points and 2) the number of features. In the case of an RNN, an additional dimension, the number of timesteps per data point, is also added, which in turn increases complexity as well.
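As a quick sketch of what that extra dimension looks like (the sizes below are made up), a feed-forward model sees (data points, features) while an RNN sees (data points, timesteps, features):

```python
import numpy as np

batch_size, timesteps, features = 32, 10, 8   # made-up sizes

# Feed-forward input: (data points, features)
ff_input = np.zeros((batch_size, features))

# RNN input: one extra "timesteps" dimension per data point
rnn_input = np.zeros((batch_size, timesteps, features))

print(ff_input.shape)   # (32, 8)
print(rnn_input.shape)  # (32, 10, 8)
```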

LSTM/GRU

The idea behind LSTM/GRU is to have a learnable parameter for how much of the previous history we need to incorporate. Think of a valve that decides how much of the previous data is carried into the current step. This can be achieved using Tanh or Sigmoid, where an output of 1 means consider all of the previous data and 0 means ignore the previous history. So each output of the previous step has a learnable parameter for how much of it contributes to the current output.
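Here is a minimal NumPy sketch of that valve idea. It is not the full LSTM/GRU equations, just the gating mechanism, and the weights are random placeholders standing in for learned parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size = 4
prev_hidden = np.array([0.5, -0.2, 0.8, 0.1])    # previous hidden state (toy values)
current_input = np.array([0.3, 0.7, -0.1, 0.4])  # current input (toy values)

# Learnable gate parameters (random placeholders here).
W_gate = np.random.randn(hidden_size, 2 * hidden_size) * 0.1
b_gate = np.zeros(hidden_size)

# The gate looks at the previous hidden state and the current input, and
# outputs values between 0 (ignore history) and 1 (keep all the history).
gate = sigmoid(W_gate @ np.concatenate([prev_hidden, current_input]) + b_gate)

# The gate scales how much of the previous hidden state flows through.
gated_history = gate * prev_hidden
print(gate, gated_history)
```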

Encoder Decoder Architecture

The drawback of the encoder/decoder is its single Context Vector, which is unable to capture the essence of the data when the input is a long sentence or essay.

Transformers

In the Encoder Decoder architecture, we know that two states come out after processing one word:

  1. Hidden State
  2. Cell State

These are only used to initialise the next state and are discarded immediately after that, whereas the Transformer intelligently uses such state vectors to create a context vector that can be used in the decoder module.

The Transformer is a sequence modelling technique in which the words are processed in parallel, without using time-dimension processing units such as RNN/LSTM/GRU. The core components of the Transformer are Self Attention and Positional Embedding.

Model Design

The Transformer contains two blocks: an Encoder and a Decoder.

The Encoder block has n such modules, each consisting of a “Self Attention” module and a “Feed Forward neural network” connected serially; the Decoder block has a similar setup. Both the Encoder and Decoder blocks of the Transformer have Residual connections and Normalization layers, which is a typical setup in any neural network design.
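A rough NumPy sketch of one such Encoder module is shown below. It is a simplified illustration rather than the exact Transformer equations (no multi-head split, no scaling, random placeholder weights):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each word vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(x):
    # Simplified: dot-product similarities between words, softmax, weighted sum.
    scores = x @ x.T
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    return weights @ x

def feed_forward(x, W1, W2):
    # A simple two-layer feed-forward network applied to every word.
    return np.maximum(0, x @ W1) @ W2

def encoder_module(x, W1, W2):
    # Self Attention with a residual connection and normalization...
    x = layer_norm(x + self_attention(x))
    # ...followed by the feed-forward network, again with residual + norm.
    x = layer_norm(x + feed_forward(x, W1, W2))
    return x

# Toy input: 4 words, each an 8-dimensional embedding (made-up numbers).
d_model, d_ff = 8, 16
x = np.random.randn(4, d_model)
W1 = np.random.randn(d_model, d_ff) * 0.1
W2 = np.random.randn(d_ff, d_model) * 0.1

# Stack n such modules serially, as described above (n = 2 here).
for _ in range(2):
    x = encoder_module(x, W1, W2)
print(x.shape)  # (4, 8)
```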

Self Attention

In LSTM/GRU/RNN the words are processed sequentially. Why? Because we need the hidden state of the previous word to process the next word, which leads to time complexity. The Transformer is different in that it avoids RNN/LSTM and also incorporates logic to give special weightage to the important words of the sentence it is currently processing, by finding similarity scores between the words of that sentence. This approach is called Self Attention.

Let us take language translation as an example, from English to Tamil.

English: Where are you going?

Tamil : நீ எங்கே போகிறாய்?

The idea is to give more weightage to the words “going” and “where” compared to the other words in the English sentence while we are predicting the word “போகிறாய்” in Tamil.

Positional embedding

Since the Transformer does not have time-dimension processing units such as RNN/LSTM, it has a separate module to capture the position at which each word occurs in the sentence. This idea is known as positional embedding.
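The original “Attention Is All You Need” paper does this with fixed sinusoidal encodings that are simply added to the word embeddings; a minimal sketch of that scheme (other positional embedding schemes exist) looks like this:

```python
import numpy as np

def positional_encoding(num_positions, d_model):
    # Each position gets a d_model-dimensional vector built from
    # sines and cosines of different frequencies.
    positions = np.arange(num_positions)[:, None]   # (positions, 1)
    dims = np.arange(d_model)[None, :]              # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions use cosine
    return pe

# Positional encodings are added to the word embeddings, so the model
# can tell "where" a word sits in the sentence.
word_embeddings = np.random.randn(4, 8)     # 4 words, 8 dims (toy values)
inputs = word_embeddings + positional_encoding(4, 8)
print(inputs.shape)  # (4, 8)
```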

Model Training: Feed Forward Neural Network

Given 4 words in the English sentence, we will have some representation for these words called “Value Vectors”, say v1, v2, v3, v4, and the objective is to find the proper weightage for each word, which is done through a feed-forward neural network.

The neural network will learn weightages for v1, v2, v3, v4; these weightages are called “Score Values”. They represent the importance of the other words in the sentence while we are processing one of the words in that sentence. Say the weightages for the words are s1, s2, s3, s4. If v1 and v2 are very relevant to the currently translated word, then the values of s1 and s2 will be higher than s3 and s4 at the end of model training, so the decoder will give more Attention to v1 and v2.

Score as Probability Distribution

We can think of the Score as a probability distribution over the given words, which provides the likelihood of each word contributing attention to the currently processed word in the sequence.

Context Vector for each word vs Single Context Vector

In the Encoder Decoder model, as we know, there is a single Context Vector that is used while decoding all the words, but in the case of the Transformer, a separate context vector is created for each word.

Context Vector Calculation for first word:

CV_1 = s1_1*v1 + s2_1*v2 + s3_1*v3 + s4_1*v4

Context Vector Calculation for 2nd word:

CV_2 = s1_2*v1 + s2_2*v2 + s3_2*v3 + s4_2*v4

Generally written as:

CV_i = s1_i*v1 + s2_i*v2 + s3_i*v3 + s4_i*v4

where s1_i is the score value of the first word while we are processing the ith word.

Here s1_i, s2_i, s3_i, s4_i are the score values that are learned, as discussed in the Model Training section.

This setup is nothing but the dot product of the score values with the Value Vectors.

Adding all the Context Vectors for one Sentence:

Z_i = CV_1 + CV_2 + CV_3 + … + CV_i

Z is nothing but a summary of the attention over all the words present in the sentence up to the ith word.
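Following the formulation above (with made-up score values standing in for the learned weightages, and toy 3-dimensional Value Vectors), the context vectors and Z can be computed like this:

```python
import numpy as np

# Value Vectors for the 4 words (toy 3-dimensional values).
V = np.array([
    [1.0, 0.0, 0.5],   # v1
    [0.2, 0.8, 0.1],   # v2
    [0.0, 0.3, 0.9],   # v3
    [0.7, 0.1, 0.2],   # v4
])

# Score values: S[i, j] = sj_i, the score of word j while processing word i.
# Each row is a probability distribution over the 4 words (made-up numbers).
S = np.array([
    [0.70, 0.10, 0.10, 0.10],   # scores while processing word 1
    [0.05, 0.60, 0.25, 0.10],   # scores while processing word 2
    [0.10, 0.20, 0.60, 0.10],   # scores while processing word 3
    [0.15, 0.15, 0.10, 0.60],   # scores while processing word 4
])

# CV_i = s1_i*v1 + s2_i*v2 + s3_i*v3 + s4_i*v4, i.e. the dot product of the
# score values with the Value Vectors; one context vector per word.
CV = S @ V
print(CV.shape)  # (4, 3): a separate context vector for each word

# Z_i = CV_1 + CV_2 + ... + CV_i: running summary of attention over the words.
Z = np.cumsum(CV, axis=0)
print(Z[-1])     # summary after all 4 words
```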

Score Calculation

The Jay Alammar blog below explains the step-by-step score calculation clearly with an example, in the section “Self-Attention in Detail”:

https://jalammar.github.io/illustrated-transformer/

The idea of this post is simply to convey the intuition of why such an idea makes sense.

Conclusion

Now we understand why the Transformer setup helps improve performance on sequence-to-sequence data. To learn more with examples, you can refer to the Jay Alammar blog: https://jalammar.github.io/illustrated-transformer/

