AI Infrastructure · Author: Trensee Editorial · Updated: 2026-03-26

[Road to AI 08] The Transformer Revolution: "Attention Is All You Need"

A single paper from Google in 2017 changed AI history. The transformer architecture that overcame the limits of RNN and LSTM, and its self-attention mechanism — an intuitive explanation of why ChatGPT, Claude, and Gemini exist today.

AI-assisted draft · Editorially reviewed

This blog content may use AI tools for drafting and structuring, and is published after editorial review by the Trensee Editorial Team.

Episode recap: In Episode 07, we examined how deep learning learns through backpropagation and gradient descent. This episode covers the transformer architecture — a fundamentally new application of deep learning to language processing. The 2017 Google research paper "Attention Is All You Need" is widely considered one of the most influential papers in AI history. We'll understand it intuitively, without any math.


The questions for this episode

In Episode 07, we explored the learning principles of deep learning — how neural networks reduce error through billions of computations via backpropagation and gradient descent.

This episode asks three core questions:

  1. Why did RNN and LSTM hit their limits? Why couldn't language models understand long sentences for so long?
  2. What is attention? What is the mathematical reality of AI "focusing on important words"?
  3. How did the transformer replace RNN? And why are ChatGPT, Claude, and Gemini all transformers?

1. The world before transformers: RNN and LSTM

What is a Recurrent Neural Network (RNN)?

While deep learning was revolutionizing image recognition, the language processing field had a different kind of problem.

Images have fixed-size input. Language doesn't. "AI advanced" is 2 tokens. "Artificial intelligence technology has advanced remarkably over the past ten years" is much longer. The length is variable.

RNN (Recurrent Neural Network) solved this problem.

The idea is simple: pass information from the previous state to the next step.

"I" → [RNN] → "love" → [RNN] → "AI"
       ↑____________↑___________↑
           (previous state passed forward)

When processing each word, it uses the "memory" passed forward from previous words. In theory, it remembers all prior words.
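To make the recurrence concrete, here is a minimal sketch of that state-passing loop. The weights and token values are made-up illustrative numbers, not a trained model:

```python
import math

# Toy RNN: a single scalar hidden state, updated once per token.
# W_h, W_x, b are invented illustrative weights, not learned values.
W_h, W_x, b = 0.5, 1.0, 0.0

def rnn_step(h_prev, x):
    # The new state mixes the previous state with the current input.
    return math.tanh(W_h * h_prev + W_x * x + b)

tokens = [0.2, -0.4, 0.9]    # stand-ins for "I", "love", "AI"
h = 0.0
states = []
for x in tokens:
    h = rnn_step(h, x)        # previous state passed forward
    states.append(h)
```

The only "memory" of earlier words is the single state `h` that each step hands to the next, and that bottleneck is exactly where the trouble starts.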

Why can't RNN remember long sentences?

In practice, RNNs showed a fatal limitation: the Long-Term Dependency Problem.

Consider this example:

"I was born in Korea. I loved programming from an early age, and studied computer science in college. So I speak ___ well."

Should the blank be "Korean" or "programming"? The most relevant information for the blank ("Korea," "programming") is far back in the sentence.

Because an RNN passes information forward one step at a time, the signal from distant words gets diluted. On the training side this shows up as the Vanishing Gradient problem: during backpropagation, the gradient flowing back to distant words shrinks at every step and approaches zero, so the network learns almost nothing from those words.
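A back-of-the-envelope illustration of the shrinking gradient (the per-step factor of 0.9 is an assumption, chosen only to show the shape of the problem):

```python
# Illustrative only: if each backprop step scales the gradient by ~0.9,
# the learning signal from a word 50 steps back is nearly gone.
factor = 0.9
for distance in [1, 10, 50]:
    print(distance, factor ** distance)
# At distance 50 the surviving signal is about 0.005 of the original.
```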

How did LSTM overcome RNN's limitations?

In 1997, Hochreiter and Schmidhuber partially solved this problem with LSTM (Long Short-Term Memory).

LSTM's core idea is Gates. Three types of gates control information flow:

Gate          Role
Forget gate   Decides what to discard from previous memory
Input gate    Decides what new information to add to memory
Output gate   Decides what to output from current memory

LSTM significantly improved on RNN. It was widely used in the early 2010s for Google Translate, speech recognition, and sentiment analysis.
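One step of this gated update can be sketched as follows. The scalar weights are made-up illustrative values, not a trained cell:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One step of a toy scalar LSTM cell; all weights are invented for illustration.
def lstm_step(c_prev, h_prev, x):
    f = sigmoid(0.5 * h_prev + 0.5 * x)   # forget gate: how much old memory to keep
    i = sigmoid(1.0 * h_prev + 1.0 * x)   # input gate: how much new info to admit
    o = sigmoid(0.8 * h_prev + 0.8 * x)   # output gate: how much memory to reveal
    c_tilde = math.tanh(h_prev + x)       # candidate new memory
    c = f * c_prev + i * c_tilde          # gated memory update
    h = o * math.tanh(c)                  # gated output
    return c, h

c, h = lstm_step(c_prev=0.0, h_prev=0.0, x=1.0)
```

Because the forget gate can stay near 1, the cell state `c` can carry information across many steps, which is why LSTM handles longer dependencies than a plain RNN.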

But LSTM had its own limits

Two problems remained unsolved:

  1. No parallelization: RNN and LSTM use sequential processing. The first word must be processed before the second can be. GPU parallel processing capability couldn't be utilized.

  2. Still limited for long-range dependencies: LSTM improved things, but performance still degraded over distances of hundreds or more tokens.

In 2017, the Google Brain team solved both problems in a completely different way.


2. What is attention?

The core idea of "Attention Is All You Need" can be summarized in one sentence:

"Instead of processing sequentially, every word looks at every other word simultaneously."

How does the attention mechanism understand relationships between words?

Take a translation task. When translating "I love AI" into another language, to translate "AI," the model references the entire sentence simultaneously.

The attention mechanism computes a "relevance score" for each pair of words:

How much does "AI" reference each word when being translated?
I    → 0.1 (low relevance)
love → 0.2 (medium relevance)
AI   → 0.7 (itself, most directly relevant)

Based on these scores, each word's representation is updated as a weighted average that reflects the full sentence context.
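Using the scores above and some hypothetical 2-dimensional word vectors (the vectors are invented purely for illustration), the update looks like this:

```python
# Hypothetical 2-d word vectors plus the relevance scores from the example.
vectors = {"I": [1.0, 0.0], "love": [0.0, 1.0], "AI": [0.5, 0.5]}
scores  = {"I": 0.1, "love": 0.2, "AI": 0.7}

# New representation of "AI": a weighted average over the whole sentence.
new_ai = [
    sum(scores[w] * vectors[w][d] for w in vectors)
    for d in range(2)
]
print(new_ai)
```

The result still mostly resembles the vector for "AI" (weight 0.7), but it now carries a trace of every other word in the sentence.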

Query, Key, Value: the three elements of attention

More precisely, attention uses three vectors. Think of a library search system:

  • Query: what you're looking for ("information related to AI")
  • Key: the title/tags of each book ("deep learning," "machine learning," "AI")
  • Value: the actual content of each book

Similarity between the Query and each Key produces a relevance score; those scores are then used to take a weighted average of the Values, so the output draws most heavily on the most relevant words.
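This is the scaled dot-product attention from the paper, softmax(QK^T / √d) · V. A minimal sketch with random toy matrices (the shapes and values are illustrative):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # similarity of each Query to each Key
    weights = softmax(scores)       # relevance scores, each row sums to 1
    return weights @ V, weights     # weighted average of Values

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))         # 3 tokens, dimension 4
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = attention(Q, K, V)
```

The √d divisor keeps the dot products from growing with dimension, which would otherwise push the softmax into extreme, hard-to-train values.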

How does self-attention differ from previous approaches?

What makes the transformer special is self-attention. Every word in the sentence uses every other word as Query, Key, and Value. Each word references the entire sentence simultaneously to update its own representation.

This is the decisive difference from LSTM. LSTM passed memory forward sequentially from the start. Self-attention computes all word-pair relationships in parallel, all at once.


3. The full transformer architecture

How does the transformer's encoder-decoder structure work?

The original architecture in "Attention Is All You Need" is an encoder-decoder structure for translation:

Input sentence → [Encoder stack] → representation vector → [Decoder stack] → Output sentence

Encoder: Responsible for understanding the input sentence. Generates contextual representations of each word using self-attention.

Decoder: Responsible for generating the output sentence. References both already-generated words and encoder output simultaneously.

Multi-head attention: multiple perspectives at once

The transformer uses Multi-Head Attention. Instead of a single attention operation, multiple attention heads run in parallel.

For example, in "She lost her book," one head can link the pronoun "her" back to "She," while another captures the verb-object relationship between "lost" and "book," at the same time. Multiple relational perspectives are captured at once.
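A simplified sketch of the split-attend-concatenate pattern (real models also apply learned projection matrices per head, which are omitted here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def one_head(Q, K, V):
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def multi_head(X, n_heads):
    # Split the model dimension across heads; each head attends to the
    # sentence independently, then the head outputs are concatenated.
    parts = np.split(X, n_heads, axis=-1)
    heads = [one_head(p, p, p) for p in parts]   # self-attention: Q = K = V
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))       # 5 tokens, model dimension 8
Y = multi_head(X, n_heads=2)      # two 4-dimensional heads
```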

Positional encoding: how the transformer remembers order

Since self-attention processes all words simultaneously, word order information is not automatically included. "I love AI" and "AI love I" could be treated as the same thing.

To solve this, position information for each token is added via Positional Encoding — a vector carrying position information is added to each token vector.
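The original paper uses sinusoidal positional encoding; here is a sketch of that construction:

```python
import numpy as np

def positional_encoding(n_tokens, d_model):
    # Sinusoidal encoding from the original paper:
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(n_tokens)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((n_tokens, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(10, 8)
# This matrix is simply added to the token embeddings, so identical
# words at different positions receive different input vectors.
```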


4. Why did the transformer change everything?

Why did the transformer fully unleash GPU potential?

LSTM's sequential processing couldn't leverage GPU parallel processing capability. The transformer's self-attention computes all word pairs in parallel.

If the GPU revolution explored in Episode 06 made deep learning possible, the transformer made it so those GPUs could be fully leveraged for language processing as well.

Scaling: the more you scale, the better it gets

The transformer's most remarkable characteristic is its Scaling Laws: as model size, training data, and compute grow, performance improves smoothly and predictably, following empirical power laws.

This property is what made it rational for big tech to invest hundreds of billions in GPUs. More compute + more data = better model. This simple law is the foundation of the current AI arms race.
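As a sketch of what "predictable" means here: empirical scaling studies fit loss curves of the form loss(N) = (Nc / N)^α. The constants below are of the rough magnitude reported by Kaplan et al. (2020) and are used only for illustration, not as a definitive model:

```python
# Illustrative power-law scaling curve: loss falls smoothly as the
# parameter count N grows. Constants are approximate, for illustration only.
alpha, Nc = 0.076, 8.8e13

def loss(n_params):
    return (Nc / n_params) ** alpha

# Bigger model -> predictably lower loss, with diminishing returns.
for n in [1e8, 1e9, 1e10]:
    print(f"{n:.0e} params -> loss {loss(n):.3f}")
```

The point is not the exact numbers but the shape: the curve is smooth enough that labs could forecast the return on the next order of magnitude of compute before spending it.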


5. From transformer to GPT, Claude, and Gemini

The GPT family: decoder only

OpenAI's GPT series uses only the decoder portion of the transformer. It's a structure specialized for text generation. From GPT-1 (2018) through GPT-3 (2020) to today's GPT-5, this principle remains unchanged.
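The decoder-only trick is causal (masked) self-attention: each token may attend only to itself and earlier tokens, so generation never peeks at the future. A minimal sketch with toy scores:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 4                                  # 4 tokens
rng = np.random.default_rng(2)
scores = rng.normal(size=(n, n))       # raw attention scores (illustrative)

# Causal mask: position i may only attend to positions <= i.
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                 # future positions get zero weight
weights = softmax(scores)
```

After the mask, the first token attends only to itself, and every row of `weights` is zero above the diagonal.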

BERT and encoder models: specialized for understanding

Google's BERT uses only the encoder portion. It's specialized not for generating text but for understanding it — used in search, classification, and question answering.

Modern LLMs: scaled transformers

Claude, Gemini, and GPT-4/5 are all variants of the transformer. The core ideas are unchanged from the 2017 paper, but dozens of detailed improvements have been layered on:

  • Flash Attention: memory efficiency improvements for attention computation
  • Rotary Positional Encoding (RoPE): positional encoding for handling longer contexts
  • MoE (Mixture of Experts): sparse activation for efficient scaling

All are descendants of "Attention Is All You Need."


What's next

In Episode 08, we saw how the transformer works. In Episode 09, we'll look at pre-training and fine-tuning — how transformers are trained to create LLMs that converse like ChatGPT.

In particular, we'll cover why RLHF (Reinforcement Learning from Human Feedback) transformed a model that "simply predicts text" into AI that converses like a person.


Key concepts summary

Concept              Before transformers           Transformer approach
Processing method    Sequential (RNN/LSTM)         Parallel (self-attention)
Memory mechanism     Sequential state forwarding   Direct reference to all tokens
GPU utilization      Limited                       Full utilization
Scaling behavior     Unstable, hard to predict     Predictable (power-law)
Current successors   Almost none                   All of ChatGPT, Claude, Gemini

FAQ

Q. Who created the transformer?

Eight researchers from Google Brain, Google Research, and the University of Toronto co-authored it in 2017 (the paper lists the authors as equal contributors). Ashish Vaswani and several co-authors went on to found or join OpenAI, Adept AI, and other organizations now leading the AI industry.

Q. Why is attention computation slow?

With N tokens, self-attention computes N×N pairs. Double the tokens, quadruple the computation. This is why context windows have limits, and why technologies like Flash Attention play a key role in addressing this problem.
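The quadratic growth is easy to see with a toy count (ignoring constants and the actual per-pair cost):

```python
# Self-attention scores every token against every other token,
# so the work grows with the square of the context length.
def attention_pairs(n_tokens):
    return n_tokens * n_tokens

print(attention_pairs(1000))   # 1,000,000 pairs
print(attention_pairs(2000))   # 4,000,000 pairs: 2x tokens -> 4x compute
```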

Q. Have RNN and LSTM completely disappeared?

In the frontier LLM space, they've been effectively replaced by transformers. However, LSTM is still used in constrained-memory and computing environments such as IoT and edge devices.

Q. Does having more attention heads always mean better performance?

Not necessarily. The appropriate number of heads tends to grow with model size, and adding unnecessarily many heads only increases computation cost without performance gains. Modern large models typically use dozens of heads.

Q. The paper was designed for translation — why does it work so well for language generation?

Translation is a specialized Seq2Seq task, but the transformer's core structure is general-purpose. It has been found applicable through the same principles to language generation (GPT family), understanding (BERT family), code generation, and even image processing.

Q. Haven't better architectures emerged since the transformer?

Alternatives like Mamba (state space models) and RWKV (RNN+transformer hybrid) have been proposed and show some efficiency gains, but they haven't fully replaced the transformer in the frontier LLM space. As of 2026, transformers remain the standard.

Q. How does understanding the transformer help in actual development?

You can understand at a principled level: what context window size means, why longer prompts cost more, and why prompt strategies that "put the most relevant information first" are effective. This helps with better prompt design and RAG architecture decisions.

Q. Can I read "Attention Is All You Need" directly?

It's freely available on arXiv. The math is dense, but reading it alongside Jay Alammar's "The Illustrated Transformer" makes it much easier to understand visually.



Update notes

  • First published: 2026-03-26
  • Data basis: Vaswani et al. 2017 original paper, GPT-3, GPT-4, Claude, Gemini technical documents
  • Next update: Episode 09 — Pre-training and RLHF (planned 2026-04-01)

References


Data Basis

  • Based on the original paper: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017). Joint research by Google Brain, Google Research, and the University of Toronto. 100,000+ Google Scholar citations as of 2026.
  • Continuity verification: compared against Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder" (2014) and Hochreiter & Schmidhuber, "Long Short-Term Memory" (1997).
  • Modern LLM connection: analysis of transformer architecture inheritance in GPT-3 (2020), GPT-4 (2023), Claude 3 (2024), and Gemini 1.5 (2024).
