# dH #001: Understanding Transformers and BERT: The Foundation of Modern NLP

*Exploring the revolutionary architectures that power today’s generative AI systems*

## The Evolution That Changed Everything

Language modeling has evolved over the years. The breakthroughs of the past decade include the use of neural networks to represent text, such as Word2Vec in 2013, following earlier statistical approaches like n-grams.

Timeline evolution

In 2014, sequence-to-sequence models built on RNNs and LSTMs improved performance on NLP tasks such as translation and text classification. Around 2015, attention mechanisms generated a great deal of excitement and ultimately enabled the development of Transformer models.

## The Context Problem

Although the models that came before Transformers, such as neural language models, Word2Vec, n-grams, and multi-task learning approaches, were able to represent words as vectors, those vectors were static: they carried no context, even though the meaning of a word changes depending on how it is used.

Pre-transformer models

For example, before attention mechanisms came about, the word ‘bank’ in ‘river bank’ and the word ‘bank’ in ‘bank robber’ might have had the same vector representation in models like Word2Vec and n-grams.

Bank context example
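
To make the context problem concrete, here is a minimal sketch comparing the vectors a contextual model produces for ‘bank’ in two different sentences. The Hugging Face transformers library and the bert-base-uncased checkpoint are my choices for illustration (they are not mentioned above); a static model such as Word2Vec would return the same vector for ‘bank’ in both sentences.

```python
# Minimal sketch: contextual vs. static word vectors (assumes the Hugging Face
# `transformers` library and the `bert-base-uncased` checkpoint).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual vector BERT assigns to the token 'bank'."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("He sat on the river bank.")
v_money = bank_vector("She deposited the check at the bank.")

# A static embedding would make this similarity exactly 1.0; BERT's
# context-dependent vectors for 'bank' differ between the two sentences.
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```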

## Transformers: The Game Changer

The Transformer comes from a 2017 paper titled ‘Attention Is All You Need’. A Transformer is an encoder-decoder model built around the attention mechanism.

Transformer timeline

Because a Transformer processes all tokens at once, it can take advantage of hardware parallelization (GPUs/TPUs) and work through much more data in the same amount of time.

Parallelization benefits

Transformer models were built with attention mechanisms at their core. The attention mechanism helps improve the performance of applications such as machine translation.

## Architecture Deep Dive

A Transformer model consists of input and output embeddings with positional encoding, an encoder built from self-attention and feed-forward components, and a decoder built from self-attention, encoder-decoder attention, and feed-forward components.

Full transformer architecture

The encoder encodes the input sequence with its self-attention and feed-forward layers. The decoder produces a representation for the task at hand by attending to its own previous outputs through self-attention and to the encoder output through encoder-decoder attention.

The encoding component of a Transformer is a stack of encoders (six in the original paper), and the decoding component is a stack of decoders of the same number. The encoders are all identical in structure but do not share weights, and together the two stacks form the encoder-decoder architecture of the Transformer.

Encoder stack

## Inside Each Encoder

Each encoder can be broken down into two sublayers. The first is a self-attention layer, and the second is a feed-forward layer.

Encoder components

The encoder’s input first flows through the self-attention layer, which helps the encoder look at other relevant words in the input sentence as it encodes each word.

The output of the self-attention layer is then fed to the feed-forward neural network, and the exact same feed-forward network is applied independently to each position.
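
To make this structure concrete, here is a minimal sketch of one encoder layer in PyTorch, which is my choice of framework for illustration. It uses the built-in nn.MultiheadAttention module for brevity; the residual connections and layer normalization come from the original paper rather than the description above, and the dimensions and head count are only illustrative.

```python
import torch
from torch import nn

class EncoderLayer(nn.Module):
    """Illustrative Transformer encoder layer: self-attention then feed-forward."""
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # The same feed-forward network is applied independently at every position.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: each position attends to every position in the input.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)        # residual connection + layer norm
        # Position-wise feed-forward, identical weights for every position.
        x = self.norm2(x + self.ffn(x))
        return x

layer = EncoderLayer()
tokens = torch.randn(1, 10, 512)            # (batch, sequence length, d_model)
print(layer(tokens).shape)                  # torch.Size([1, 10, 512])
```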

The decoder has both the self-attention and feed-forward layers, but between them sits an encoder-decoder attention layer that helps the decoder focus on relevant parts of the input sentence.

## How Self-Attention Works

The word at each position passes through a self-attention process and then through a feed-forward neural network, the exact same network for every position, with each vector flowing through it separately.

Self-attention flow

In the self-attention layer, dependencies exist between the different positions. The feed-forward layer has no such dependencies, so the various positions can be processed in parallel as they flow through it.

In the first encoder, each input embedding is projected into query, key, and value vectors as part of the self-attention mechanism.

Query, Key, Value vectors

These vectors, the query vector (q), the key vector (k), and the value vector (v), are computed using weight matrices (W) that the Transformer learns during the training process.

Weight matrices

Concretely, the query, key, and value vectors are obtained by multiplying each input embedding by the learned weight matrices W^q, W^k, and W^v.

Applied weights
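
Here is a small numeric sketch of that projection step; the dimensions and values are arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                     # illustrative dimensions
x = rng.normal(size=(3, d_model))       # embeddings for a three-token sentence

# Learned projection matrices (random here; learned during training in practice).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = x @ W_q   # query vectors, one row per token
K = x @ W_k   # key vectors
V = x @ W_v   # value vectors
print(Q.shape, K.shape, V.shape)        # (3, 4) (3, 4) (3, 4)
```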

## The Attention Calculation Process

The next step is to score each position: the query of the current word is compared against every key via a dot product, the scores are scaled and passed through a softmax, and each value vector v is then multiplied by its softmax score, as shown in the diagram. The intuition is to keep intact the values of the words you want to focus on and drown out irrelevant words by multiplying them by tiny numbers like 0.001.

Softmax multiplication

Next, we have to sum up the weighted value vectors v, which produces the output vector z of the self-attention layer at this position for the first word.

The resulting vector z, the output of the self-attention layer at this position, is then sent along to the feed-forward neural network. In practice the self-attention layer contains multiple attention heads (#0 to #7 in the original architecture), each of which computes its own weighted sum of value vectors based on its own queries and keys.

Multi-head attention
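
Picking up from the projection sketch above, a single attention head can then be computed end to end: each query is scored against every key, the scores are scaled and normalized with a softmax, and the softmax weights are used to sum the value vectors into the output z. The numbers are again purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # how strongly each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights @ V                                 # weighted sum of value vectors -> z

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # stand-ins for the projected Q, K, V
Z = scaled_dot_product_attention(Q, K, V)
print(Z.shape)   # (3, 4): one output vector z per input position
```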

## Putting It All Together

To sum up the process of getting the final embeddings, these are the steps we take: compute weighted sums of the value vectors using queries and keys across multiple attention heads in the self-attention layer, then pass the resulting vectors z to the feed-forward neural network.

In more detail, we start with the input natural-language sentence, embed each word, perform multi-headed attention on the embedded words using multiple sets of weight matrices, and calculate the attention using the resulting Q, K, and V matrices.

Complete process

Here, Q, K, and V are obtained by multiplying the input word embeddings by the respective weight matrices W^q, W^k, and W^v.

QKV calculation
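
In equation form, the whole pipeline for one attention head, and its multi-head extension, can be written as follows (notation as in ‘Attention Is All You Need’, where X is the matrix of input word embeddings):

```latex
Q = X W^Q, \qquad K = X W^K, \qquad V = X W^V

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{head}_i = \mathrm{Attention}(X W_i^Q,\; X W_i^K,\; X W_i^V), \qquad
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O
```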

## Transformer Variations

There are multiple variations of Transformers out there now. Some use both the encoder and the decoder components from the original architecture.

Transformer variations

Some pre-trained transformer models use an encoder and decoder architecture, like BART, while others use only the decoder, like GPT-3 and GPT-2, or only the encoder, like BERT.

Model types

## BERT: The Encoder-Only Revolution

A popular encoder-only architecture is BERT. BERT is a pre-trained Transformer model that uses only the encoder stack.

BERT introduction

BERT stands for Bidirectional Encoder Representations from Transformers and was developed by Google in 2018. Since then, multiple variations of BERT have been built.

BERT acronym

## BERT’s Real-World Impact

Today, BERT helps power Google Search. You can see how different the search results are for the query ‘can you get medicine for someone pharmacy’ before and after BERT: the before result is about getting a prescription filled in general, while the after result addresses whether a family member can pick up a prescription on someone else’s behalf.

Google search improvement

## BERT’s Training Approach

BERT was trained in two variations, BERT Base and BERT Large. The model is powerful because it can handle a long input context. It was trained on the entire Wikipedia corpus and BookCorpus, for one million steps.

BERT training details

BERT was trained on different tasks simultaneously, which means it has a multi-task objective, and this is part of what makes it so powerful. Because of the kinds of tasks it was trained on, it works at both the sentence level and the token level.

In short, BERT can handle a long input context; was trained on the entire Wikipedia corpus and BookCorpus for one million steps, on TPUs, with a multi-task objective; works at both sentence-level and token-level tasks; and can be fine-tuned for many different tasks.

BERT capabilities

## BERT’s Training Tasks

BERT comes in different sizes: BERT Base, with 12 layers and 768 feed-forward hidden units, and BERT Large, with 24 layers and 1,024 feed-forward hidden units, compared with the original Transformer’s 6 layers and 512 units.

BERT versions

Task 1 is called masked language modeling (MLM), where a percentage of the input words are masked, and the model is trained to predict the masked words.

Masked language modeling

The recommended percentage for masking is 15%. The 15% masking percentage achieves a balance between too little masking, which makes the training process too expensive, and too much masking, which does not provide enough context for the model.

MLM example
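
A quick way to see masked language modeling in action is the fill-mask pipeline from the Hugging Face transformers library; the library and the bert-base-uncased checkpoint are my choices for illustration rather than anything specified above.

```python
from transformers import pipeline

# bert-base-uncased was pre-trained with the masked-language-modeling objective.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the most likely tokens for the [MASK] position using both the
# left and the right context of the sentence.
for prediction in fill_mask("The man went to the [MASK] to buy a gallon of milk."):
    print(f'{prediction["token_str"]:>12}  {prediction["score"]:.3f}')
```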

The second task is next sentence prediction (NSP), a binary classification task in which the model is given pairs of sentences.

Next sentence prediction

The NSP task teaches the model the relationships between sentences: it must predict whether the second sentence actually follows the first one or not.

For example, sentence A could be ‘The man went to the store.’, and sentence B is ‘He bought a gallon of milk.’

NPS example

This binary classification objective is what helps BERT perform well at sentence-level tasks.

Binary classification
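
For illustration, the pre-trained NSP head is exposed in the Hugging Face transformers library as BertForNextSentencePrediction; here is a minimal sketch using the sentence pair above (the library and checkpoint are my choices, not part of the original).

```python
import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sentence_a = "The man went to the store."
sentence_b = "He bought a gallon of milk."

# The tokenizer packs the pair as: [CLS] sentence A [SEP] sentence B [SEP]
inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Index 0 of the logits corresponds to "sentence B follows sentence A".
probs = torch.softmax(logits, dim=-1)
print(f"P(B follows A) = {probs[0, 0].item():.3f}")
```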

## BERT’s Input Processing

For the input sentence, you get three different embeddings: token embeddings, segment embeddings, and position embeddings.

Three embeddings

The token embeddings represent each token of the input sentence as an embedding vector.

Token embeddings detail

In this way, each word is transformed into a vector representation of a fixed dimension by summing its token, segment, and position embeddings, as shown in the BERT input embeddings diagram.

Vector transformation
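
The sketch below shows how the three embeddings combine, using placeholder PyTorch embedding tables of my own choosing; the sizes roughly match BERT Base (a ~30k WordPiece vocabulary, hidden size 768, maximum length 512), and the three lookups are simply summed element-wise.

```python
import torch
from torch import nn

# Illustrative sizes roughly matching BERT Base.
vocab_size, max_len, num_segments, hidden = 30522, 512, 2, 768

token_emb = nn.Embedding(vocab_size, hidden)      # one vector per vocabulary token
segment_emb = nn.Embedding(num_segments, hidden)  # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, hidden)      # one learned vector per position

# In practice the token ids come from BERT's WordPiece tokenizer; random here.
token_ids = torch.randint(0, vocab_size, (1, 6))
segment_ids = torch.zeros_like(token_ids)                    # all "sentence A"
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)  # 0, 1, 2, ...

# BERT's input representation is the element-wise sum of the three embeddings.
embeddings = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(embeddings.shape)   # torch.Size([1, 6, 768])
```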

## BERT for Text Classification

BERT can also solve NLP tasks that involve text classification. One example is classifying whether the two sentences ‘My dog is cute’ and ‘He likes play ##ing’ are semantically similar.

Text classification example

How does BERT distinguish the two inputs in a given pair? The answer is segment embeddings.

In addition, a special token, [SEP], separates the two segments of the input. Another problem is representing the order of the words in the sentence, and that is handled by the position embeddings.
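
To see the [SEP] token and the segment ids concretely, here is what BERT’s tokenizer produces for the sentence pair above; the Hugging Face transformers library and the bert-base-uncased checkpoint are again my choices for illustration.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# The pair is packed as: [CLS] my dog is cute [SEP] he likes playing [SEP]
encoded = tokenizer("My dog is cute", "He likes playing")

tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(tokens)                      # WordPiece tokens, with [CLS] first and [SEP] after each sentence
print(encoded["token_type_ids"])   # 0 for first-sentence tokens, 1 for second-sentence tokens
```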

As you know, BERT consists of a stack of Transformer encoders and is designed to process input sequences of up to 512 tokens.

BERT architecture

The order of the input sequence is captured by the position embeddings, which allow BERT to learn a vector representation for each position in the input sequence.

## BERT’s Versatility

BERT can be used for many different downstream tasks, for example: single-sentence classification, sentence-pair classification, question answering, and single-sentence tagging tasks such as named entity recognition on CoNLL-2003.

BERT applications
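
As one concrete example, a single-sentence classification head can be attached to BERT with a few lines of the Hugging Face transformers library; the checkpoint, label count, and toy input are placeholders of my own, and a real fine-tuning run would loop over a labelled dataset with an optimizer.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# A freshly initialized classification head is placed on top of the pre-trained encoder.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# A single toy labelled example (label 1 = "positive" in this made-up setup).
inputs = tokenizer("This movie was great!", return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**inputs, labels=labels)
print(outputs.loss.item())                    # cross-entropy loss to backpropagate
print(torch.softmax(outputs.logits, dim=-1))  # class probabilities for the sentence
```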

## Conclusion

The journey from Word2Vec to BERT represents one of the most significant advances in artificial intelligence. By solving the context problem through attention mechanisms and introducing bidirectional training, transformers and BERT have become the foundation for virtually every major breakthrough in modern NLP.

From powering Google Search to enabling the latest generative AI applications, these architectures continue to drive innovation across the field. Understanding their core principles – attention, parallelization, and contextual understanding – is essential for anyone working with modern AI systems.
