[Road to AI 08] The Transformer Revolution: "Attention Is All You Need"
A single paper from Google in 2017 changed AI history. The transformer architecture that overcame the limits of RNN and LSTM, and its self-attention mechanism — an intuitive explanation of why ChatGPT, Claude, and Gemini exist today.
AI-assisted draft · Editorially reviewed. This blog content may use AI tools for drafting and structuring, and is published after editorial review by the Trensee Editorial Team.
Series overview (8 of 8)
- 1. [Road to AI 01] How Computers Were Born
- 2. [Road to AI 02] Transistors and ICs, the Origin of AI Cost Curves
- 3. [Road to AI 03] Why Operating Systems and Networks Still Decide AI Service Quality
- 4. [Road to AI 04] World Wide Web and the Democratization of Information, from Collective Intelligence to Artificial Intelligence
- 5. [Road to AI 05] The Infrastructure Revolution: How Distributed Computing Scaled the AI Brain
- 6. [Road to AI 06] The GPU Revolution: How NVIDIA's CUDA Made AI 1,000x Faster
- 7. [Road to AI 07] How Deep Learning Actually Works: Backpropagation, Gradient Descent, and How Neural Networks Learn
- 8. [Road to AI 08] The Transformer Revolution: "Attention Is All You Need"
Episode recap: In Episode 07, we examined how deep learning learns through backpropagation and gradient descent. This episode covers the transformer architecture — a fundamentally new application of deep learning to language processing. The 2017 Google research paper "Attention Is All You Need" is widely considered one of the most influential papers in AI history. We'll understand it intuitively, without any math.
The questions for this episode
In Episode 07, we explored the learning principles of deep learning — how neural networks reduce error through billions of computations via backpropagation and gradient descent.
This episode asks three core questions:
- Why did RNN and LSTM hit their limits? Why couldn't language models understand long sentences for so long?
- What is attention? What is the mathematical reality of AI "focusing on important words"?
- How did the transformer replace RNN? And why are ChatGPT, Claude, and Gemini all transformers?
1. The world before transformers: RNN and LSTM
What is a Recurrent Neural Network (RNN)?
While deep learning was revolutionizing image recognition, the language processing field had a different kind of problem.
Images have fixed-size input. Language doesn't. "AI advanced" is 2 tokens. "Artificial intelligence technology has advanced remarkably over the past ten years" is much longer. The length is variable.
RNN (Recurrent Neural Network) solved this problem.
The idea is simple: pass information from the previous state to the next step.
"I" → [RNN] → "love" → [RNN] → "AI"
↑____________↑___________↑
(previous state passed forward)
When processing each word, it uses the "memory" passed forward from previous words. In theory, it remembers all prior words.
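The recurrence can be sketched in a few lines of Python. This is a toy scalar version, not a real implementation: actual RNNs use learned weight matrices and vector-valued states, and the word encodings and weights below are made-up illustrative numbers.

```python
import math

def rnn_step(prev_state: float, word_value: float) -> float:
    # The new hidden state mixes the previous "memory" with the current word.
    # The 0.5 weights are arbitrary illustrative values, not learned ones.
    return math.tanh(0.5 * prev_state + 0.5 * word_value)

# Process a three-word sentence one step at a time:
# each step must wait for the previous one to finish.
state = 0.0
for word_value in [1.0, 0.5, -0.3]:  # made-up encodings of "I", "love", "AI"
    state = rnn_step(state, word_value)
```

The loop is the point: step 2 cannot start until step 1 has produced its state, which is exactly the sequential bottleneck discussed below.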
Why can't RNN remember long sentences?
In practice, RNNs showed a fatal limitation: the Long-Term Dependency Problem.
Consider this example:
"I was born in Korea. I loved programming from an early age, and studied computer science in college. So I speak ___ well."
Should the blank be "Korean" or "programming"? The most relevant information for the blank ("Korea," "programming") is far back in the sentence.
Because RNN passes information sequentially, information dilutes over distance. During training, this shows up as the Vanishing Gradient problem: in backpropagation, the gradient contribution from distant words shrinks toward zero, so the network learns almost nothing from those words.
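A quick numeric illustration of why the gradient vanishes: if each backpropagation step scales the gradient by a factor below 1 (the 0.5 here is an arbitrary illustrative value), the learning signal from a distant word shrinks exponentially with distance.

```python
factor = 0.5    # illustrative per-step gradient scaling; any value < 1 decays
gradient = 1.0
for _ in range(30):  # backpropagating across a 30-token distance
    gradient *= factor

# After 30 steps the gradient is ~9.3e-10: the distant word
# contributes essentially nothing to learning.
print(gradient)
```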
How did LSTM overcome RNN's limitations?
In 1997, Hochreiter and Schmidhuber partially solved this problem with LSTM (Long Short-Term Memory).
LSTM's core idea is Gates. Three types of gates control information flow:
| Gate | Role |
|---|---|
| Forget gate | Decides what to discard from previous memory |
| Input gate | Decides what new information to add to memory |
| Output gate | Decides what to output from current memory |
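The three gates can be sketched as a toy scalar LSTM cell. This is a simplification for intuition: real LSTMs use weight matrices with separate learned parameters per gate (so the gates produce different values), while this sketch feeds the same combined input to each gate.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(prev_cell: float, prev_hidden: float, x: float):
    z = x + prev_hidden          # simplified combined input (toy version)
    f = sigmoid(z)               # forget gate: how much old memory to keep
    i = sigmoid(z)               # input gate: how much new info to write
    o = sigmoid(z)               # output gate: how much memory to expose
    candidate = math.tanh(z)     # proposed new memory content
    cell = f * prev_cell + i * candidate   # updated long-term memory
    hidden = o * math.tanh(cell)           # short-term output for this step
    return cell, hidden

cell, hidden = lstm_step(prev_cell=0.2, prev_hidden=0.1, x=1.0)
```

The key design idea survives the simplification: `cell` is an additive memory path, so information can flow across many steps without being squashed at every one.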
LSTM significantly improved on RNN. It was widely used in the early 2010s for Google Translate, speech recognition, and sentiment analysis.
But LSTM had its own limits
Two problems remained unsolved:
No parallelization: RNN and LSTM process words sequentially. The first word must be processed before the second can be, so the GPU's parallel processing capability goes largely unused.
Still limited for long-range dependencies: LSTM improved things, but performance still degraded over distances of hundreds or more tokens.
In 2017, the Google Brain team solved both problems in a completely different way.
2. What is attention?
The core idea of "Attention Is All You Need" can be summarized in one sentence:
"Instead of processing sequentially, every word looks at every other word simultaneously."
How does the attention mechanism understand relationships between words?
Take a translation task. When translating "I love AI" into another language, to translate "AI," the model references the entire sentence simultaneously.
The attention mechanism computes a "relevance score" for each pair of words:
How much does "AI" reference each word when being translated?
I → 0.1 (low relevance)
love → 0.2 (medium relevance)
AI → 0.7 (itself, most directly relevant)
Based on these scores, each word's representation is updated as a weighted average that reflects the full sentence context.
Query, Key, Value: the three elements of attention
More precisely, attention uses three vectors. Think of a library search system:
- Query: what you're looking for ("information related to AI")
- Key: the title/tags of each book ("deep learning," "machine learning," "AI")
- Value: the actual content of each book
Similarity between Query and Key produces a relevance score, and that score is used to take a weighted average of the Values: the output draws more information from the most relevant words.
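Putting the three together gives scaled dot-product attention, the core operation of the paper. Below is a minimal list-based sketch for a single query; production implementations are batched matrix multiplications, but the arithmetic is the same.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    d = len(query)
    # Relevance score: dot product of the query with each key,
    # scaled by sqrt(d) as in the original paper.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Output: weighted average of the value vectors.
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

out = attention(query=[1.0, 0.0],
                keys=[[1.0, 0.0], [0.0, 1.0]],
                values=[[2.0, 0.0], [0.0, 2.0]])
# The first key matches the query better, so the output
# leans toward the first value vector.
```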
How does self-attention differ from previous approaches?
What makes the transformer special is self-attention. Every word in the sentence uses every other word as Query, Key, and Value. Each word references the entire sentence simultaneously to update its own representation.
This is the decisive difference from LSTM. LSTM passed memory forward sequentially from the start. Self-attention computes all word-pair relationships in parallel, all at once.
3. The full transformer architecture
How does the transformer's encoder-decoder structure work?
The original architecture in "Attention Is All You Need" is an encoder-decoder structure for translation:
Input sentence → [Encoder stack] → representation vector → [Decoder stack] → Output sentence
Encoder: Responsible for understanding the input sentence. Generates contextual representations of each word using self-attention.
Decoder: Responsible for generating the output sentence. References both already-generated words and encoder output simultaneously.
Multi-head attention: multiple perspectives at once
The transformer uses Multi-Head Attention. Instead of a single attention operation, multiple attention heads run in parallel.
For example, in "She lost her book," one head identifies which "she" the pronoun refers to, while another captures the verb-object relationship — simultaneously. Multiple relational perspectives are captured at once.
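Mechanically, multi-head attention splits each token's vector into slices, one per head; each head runs its own attention over its slice, and the results are concatenated back together. A sketch of just that split-and-merge bookkeeping, with made-up dimensions:

```python
# An 8-dimensional token vector and 2 heads (illustrative sizes;
# real models use thousands of dimensions and dozens of heads).
token = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
num_heads = 2
head_dim = len(token) // num_heads

# Split: each head sees its own 4-dimensional slice of the vector.
heads = [token[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

# ... each head would run its own attention over its slice here ...

# Merge: concatenate the per-head results back into one vector.
merged = [value for head in heads for value in head]
```

Because each head works on its own slice, adding heads does not multiply the total computation; it divides the same vector into more independent "perspectives."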
Positional encoding: how the transformer remembers order
Since self-attention processes all words simultaneously, word order information is not automatically included. "I love AI" and "AI love I" could be treated as the same thing.
To solve this, position information for each token is added via Positional Encoding — a vector carrying position information is added to each token vector.
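The original paper uses sinusoidal positional encodings: each position produces a distinct pattern of sine and cosine values at different frequencies, and that pattern is summed with the token vector. A minimal sketch:

```python
import math

def positional_encoding(position: int, d_model: int) -> list:
    # Sine on even dimensions, cosine on odd dimensions, with frequencies
    # decreasing geometrically -- the scheme from "Attention Is All You Need".
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]

# Different positions get different vectors, so order information
# survives even though self-attention itself is order-blind.
pe_first = positional_encoding(0, 4)
pe_second = positional_encoding(1, 4)
```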
4. Why did the transformer change everything?
Why did the transformer fully unleash GPU potential?
LSTM's sequential processing couldn't leverage GPU parallel processing capability. The transformer's self-attention computes all word pairs in parallel.
If the GPU revolution explored in Episode 06 made deep learning possible, the transformer made it so those GPUs could be fully leveraged for language processing as well.
Scaling: the more you scale, the better it gets
The transformer's most remarkable characteristic is how well it scales, a property later formalized as Scaling Laws. As model size and training data increase, performance improves smoothly and predictably.
This property is what made it rational for big tech to invest hundreds of billions in GPUs. More compute + more data = better model. This simple law is the foundation of the current AI arms race.
5. From transformer to GPT, Claude, and Gemini
The GPT family: decoder only
OpenAI's GPT series uses only the decoder portion of the transformer. It's a structure specialized for text generation. From GPT-1 (2018) through GPT-3 (2020) to today's GPT-5, this principle remains unchanged.
BERT and encoder models: specialized for understanding
Google's BERT uses only the encoder portion. It's specialized not for generating text but for understanding it — used in search, classification, and question answering.
Modern LLMs: scaled transformers
Claude, Gemini, and GPT-4/5 are all variants of the transformer. The core ideas are unchanged from the 2017 paper, but dozens of detailed improvements have been layered on:
- Flash Attention: memory efficiency improvements for attention computation
- Rotary Positional Encoding (RoPE): positional encoding for handling longer contexts
- MoE (Mixture of Experts): sparse activation for efficient scaling
All are descendants of "Attention Is All You Need."
What's next
In Episode 08, we saw how the transformer works. In Episode 09, we'll look at pre-training and fine-tuning — how transformers are trained to create LLMs that converse like ChatGPT.
In particular, we'll cover why RLHF (Reinforcement Learning from Human Feedback) transformed a model that "simply predicts text" into AI that converses like a person.
Key concepts summary
| Concept | Before transformers | Transformer approach |
|---|---|---|
| Processing method | Sequential (RNN/LSTM) | Parallel (self-attention) |
| Memory mechanism | Sequential state forwarding | Direct reference to all tokens |
| GPU utilization | Limited | Full utilization |
| Scaling behavior | Unpredictable, quickly saturating | Smooth, predictable |
| Current successors | Almost none | All of ChatGPT, Claude, Gemini |
FAQ
Q. Who created the transformer?
Eight researchers from Google Brain, Google Research, and the University of Toronto co-authored it in 2017. First author Ashish Vaswani and several co-authors went on to found or join organizations now leading the AI industry, including OpenAI and Adept AI.
Q. Why is attention computation slow?
With N tokens, self-attention computes N×N pairs. Double the tokens, quadruple the computation. This is why context windows have limits, and why technologies like Flash Attention play a key role in addressing this problem.
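The quadratic growth is easy to check directly (the token counts below are arbitrary examples):

```python
def attention_pairs(n_tokens: int) -> int:
    # Every token attends to every token, including itself.
    return n_tokens * n_tokens

# Doubling the context from 1,000 to 2,000 tokens quadruples the work.
print(attention_pairs(1000))   # 1,000,000 pairs
print(attention_pairs(2000))   # 4,000,000 pairs
```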
Q. Have RNN and LSTM completely disappeared?
In the frontier LLM space, they've been effectively replaced by transformers. However, LSTM is still used in constrained-memory and computing environments such as IoT and edge devices.
Q. Does having more attention heads always mean better performance?
Not necessarily. The useful number of heads roughly scales with model size, and adding heads beyond that point only increases computation cost without performance gains. Modern models use dozens to hundreds of heads.
Q. The paper was designed for translation — why does it work so well for language generation?
Translation is a specialized Seq2Seq task, but the transformer's core structure is general-purpose. It has been found applicable through the same principles to language generation (GPT family), understanding (BERT family), code generation, and even image processing.
Q. Haven't better architectures emerged since the transformer?
Alternatives like Mamba (state space models) and RWKV (RNN+transformer hybrid) have been proposed and show some efficiency gains, but they haven't fully replaced the transformer in the frontier LLM space. As of 2026, transformers remain the standard.
Q. How does understanding the transformer help in actual development?
You can understand at a principled level: what context window size means, why longer prompts cost more, and why prompt strategies that "put the most relevant information first" are effective. This helps with better prompt design and RAG architecture decisions.
Q. Can I read "Attention Is All You Need" directly?
It's freely available on arXiv. The math is dense, but reading it alongside Jay Alammar's "The Illustrated Transformer" makes it much easier to understand visually.
Further reading
- [Road to AI 07] Deep Learning Structure: Backpropagation and Gradient Descent
- Cursor's Dilemma: The Structural Crisis Facing a $3B AI Coding Startup
Update notes
- First published: 2026-03-26
- Data basis: Vaswani et al. 2017 original paper, GPT-3, GPT-4, Claude, Gemini technical documents
- Next update: Episode 09 — Pre-training and RLHF (planned 2026-04-01)
References
Data Basis
- Based on the original paper: Vaswani et al., "Attention Is All You Need" (NeurIPS 2017). Joint research by Google Brain, Google Research, and the University of Toronto. 100,000+ Google Scholar citations as of 2026.
- Continuity verification: compared against Cho et al., "Learning Phrase Representations using RNN Encoder-Decoder" (2014) and Hochreiter & Schmidhuber, "Long Short-Term Memory" (1997).
- Modern LLM connection: analysis of transformer architecture inheritance in GPT-3 (2020), GPT-4 (2023), Claude 3 (2024), and Gemini 1.5 (2024).
Key Claims and Sources
- Claim: The "Attention Is All You Need" paper (2017) achieved state-of-the-art Seq2Seq translation performance using only attention mechanisms, without any RNNs. Source: Vaswani et al., NeurIPS 2017
- Claim: LSTM (1997) addressed the long-term dependency problem but could not be parallelized due to its sequential processing structure, and performance degraded on long sequences. Source: Hochreiter & Schmidhuber, Neural Computation 1997
Related Posts
[Series][Road to AI 05] The Infrastructure Revolution: How Distributed Computing Scaled the AI Brain
Data is only useful if you can process it. Discover the history of distributed computing and the cloud revolution that laid the foundation for modern AI models.
[Series]Road to AI 03: Why Operating Systems and Networks Still Decide AI Service Quality
Even in the model era, service quality is determined by operating systems and network structure.
[Series]Road to AI 01: How Computers Were Born
Like people, computing has a life story. This kickoff post explains where it started and maps the next 12 weekly episodes.
[Series][Road to AI 07] How Deep Learning Actually Works: Backpropagation, Gradient Descent, and How Neural Networks Learn
Now that AI has an engine (the GPU), how does it actually learn? This episode breaks down backpropagation, gradient descent, and loss functions with zero math — just clear intuition.
[Series][Road to AI 06] The GPU Revolution: How NVIDIA's CUDA Made AI 1,000x Faster
Tracing how a gaming graphics chip became the backbone of modern AI — from the birth of CUDA in 2007 to the AlexNet moment in 2012 and today's GPU clusters powering billion-parameter LLMs.