Natural Language Processing

Transformer

A neural network architecture that revolutionized AI by processing sequences with self-attention mechanisms


What is a Transformer?

The Transformer is a neural network architecture introduced in the landmark 2017 paper "Attention Is All You Need." It has largely replaced older sequence models such as RNNs and LSTMs and has become the backbone of virtually every modern large language model, including GPT, Claude, and Gemini.

Think of older models like reading a book one word at a time, always carrying a short summary in your head. The Transformer, by contrast, can look at every word on the page simultaneously and decide which words are most relevant to each other. This ability to process in parallel makes it both faster and better at capturing long-range relationships in text.
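The "look at every word simultaneously" idea can be made concrete with a minimal sketch of scaled dot-product self-attention, the core operation from the paper. All the names here (the projection matrices Wq, Wk, Wv, the toy sequence length and dimensions) are illustrative assumptions, not part of any real model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each token into query, key, and value vectors
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every token scores its relevance to every other token at once --
    # this is the parallel, all-pairs comparison described above
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # rows sum to 1
    # Each output is a relevance-weighted mix of all value vectors
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                  # toy sizes for illustration
X = rng.normal(size=(seq_len, d_model))  # 4 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated vector per token
```

Because the score matrix compares all token pairs in one matrix multiply, no step has to wait for the previous token, which is exactly what lets Transformers train in parallel and capture long-range relationships.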

How Does It Work?

A Transformer consists of two main components:

  • Encoder -- reads and understands the input sequence by applying layers of self-attention and feed-forward networks.
  • Decoder -- generates output one token at a time, attending both to the encoder's representation and to the tokens it has already produced.

Many modern LLMs use only the decoder half (GPT-style) or only the encoder half (BERT-style), depending on the task.

Why Does It Matter?

The Transformer's parallel processing and attention mechanism unlocked the ability to train on massive datasets efficiently, giving rise to the era of large language models and generative AI we see today.
