[AI Evolution Chronicle #07] How Deep Learning Actually Works: Backpropagation, Gradient Descent, and How Neural Networks Learn
Now that AI has an engine (the GPU), how does it actually learn? This episode breaks down backpropagation, gradient descent, and loss functions with zero math — just clear intuition.
AI-assisted draft · Editorially reviewed. This blog content may use AI tools for drafting and structuring, and is published after editorial review by the Trensee Editorial Team.
Key takeaway: With the GPU engine in place, how does AI actually learn? This episode unpacks backpropagation and gradient descent — the two mechanisms by which a neural network reduces its own errors through trillions of calculations — with no equations, just clear intuition. These concepts are the reason ChatGPT, Claude, and Gemini exist today.
What This Episode Is About
In Episode 06, we explored the GPU revolution — how AlexNet (2012), trained on a pair of consumer graphics cards, rewrote the history of image recognition by accelerating deep learning computations more than 1,000-fold, and how the CUDA ecosystem became the foundation of modern AI infrastructure. The engine existed. But how does it drive learning?
This episode answers three questions:
- How does a neural network "learn"? What is the mathematical reality behind the idea that AI learns from experience?
- What is backpropagation? How does a network improve by studying its own mistakes?
- What is gradient descent? What is the principle behind the iterative process of minimizing loss?
1. What Is the Basic Structure of a Neural Network?
How are neurons and layers organized?
A neural network is, as the name implies, inspired by the biological brain's network of neurons. In practice, however, it operates differently from a biological brain — its essence is a connected chain of mathematical functions.
The basic unit is a neuron (also called a node). Each neuron does the following:
- Receives multiple input values from the previous layer.
- Multiplies each input by a weight and sums them all up.
- Adds a bias term.
- Passes the result through an activation function.
- Sends the output to the next layer.
That's it. Deceptively simple — yet when huge numbers of these neurons are stacked in layers, connected by millions to billions of weights, they develop the ability to recognize extraordinarily complex patterns.
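The five steps above can be sketched in a few lines of Python. The inputs, weights, and bias here are made-up illustrative values, and sigmoid stands in for whatever activation a real network might use:

```python
import math

def neuron(inputs, weights, bias):
    # Steps 1-3: multiply each input by its weight, sum, and add the bias
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 4: pass through an activation function (sigmoid squashes z into (0, 1))
    return 1 / (1 + math.exp(-z))

# Step 5: this output would be fed to neurons in the next layer
output = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, 0.1], bias=-0.2)
```

A real layer applies many such neurons to the same inputs in parallel, which is why the whole computation reduces to matrix multiplication, exactly the workload GPUs excel at (Episode 06).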
Why are weights so important?
Weights are where a neural network stores its "learned knowledge." GPT-3, for instance, has approximately 175 billion parameters (weights plus biases). Every bit of GPT-3's understanding of language lives in those 175 billion numbers.
At the start of training, all weights are initialized randomly. Learning is the process of adjusting those random numbers toward "correct" values using data.
2. What Is a Loss Function?
What does a loss function actually measure?
Before a neural network can learn, it needs a way to measure how wrong it is. That measurement is called a loss function (or cost function).
A simple example:
You show a neural network a photo of a cat and ask: "Is this a cat?" The network answers: "Cat: 30%, Dog: 70%." The correct answer is cat.
The loss here is high — the network was 70% wrong. If instead the network had answered "Cat: 95%," the loss would be low. The goal of training is to keep pushing this loss value down.
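One common choice, cross-entropy loss, captures exactly this intuition: the loss is the negative log of the probability the network assigned to the correct class. A minimal sketch (cross-entropy is one common loss function, not the only one):

```python
import math

def cross_entropy(prob_assigned_to_correct_class):
    # Low probability on the right answer -> large loss; high probability -> small loss
    return -math.log(prob_assigned_to_correct_class)

loss_bad = cross_entropy(0.30)   # the "Cat: 30%" answer from the example
loss_good = cross_entropy(0.95)  # the "Cat: 95%" answer
```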
Why shouldn't the loss be driven to zero?
Making loss exactly zero on the training set sounds like success — but it is dangerous. When a network memorizes every training example perfectly, it collapses on new data it has never seen. This is called overfitting.
A network should not memorize training data. It should understand patterns. The real goal of training is to maintain the ability to generalize — to perform well on unseen data — even if that means tolerating some training loss. This is why techniques like dropout, weight decay, and data augmentation exist.
3. How Does Gradient Descent Work?
What is gradient descent?
Gradient descent is the algorithm that iteratively adjusts a network's parameters to minimize the loss function.
Imagine you are blindfolded on a mountain and must find the lowest point. You feel the slope underfoot and take a step in the steepest downhill direction. You repeat this — one step at a time — until you reach a valley. That is the intuition behind gradient descent.
Mathematically, it works like this:
- Compute the gradient (slope) of the loss function at the current parameter values.
- Shift the parameters a small amount in the opposite direction of the gradient.
- Repeat this process tens of thousands to millions of times.
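The steps above can be watched in action on a toy one-parameter loss, loss(w) = (w − 3)², whose gradient is 2(w − 3) and whose minimum sits at w = 3. This is a deliberately tiny illustration, not a real network:

```python
w = 0.0              # start from an arbitrary parameter value
learning_rate = 0.1  # size of each downhill step

for _ in range(100):
    gradient = 2 * (w - 3)         # slope of the loss at the current w
    w -= learning_rate * gradient  # step in the opposite direction
```

Each step multiplies the remaining distance to the minimum by 0.8, so w converges rapidly toward 3.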
Why is the learning rate so critical?
The size of each step is controlled by the learning rate.
- Too large: The step overshoots the minimum, and the loss may actually increase — a phenomenon called divergence.
- Too small: Training becomes painfully slow, and the optimizer may settle into a shallow local minimum it could otherwise have stepped over.
Choosing the right learning rate is one of the most important skills in training neural networks. Modern deep learning largely mitigates this with adaptive optimizers like Adam and AdaGrad, which automatically scale the effective step size for each parameter during training.
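The effect of the learning rate is easy to demonstrate on the same kind of toy loss, (w − 3)². The exact numbers are illustrative, but the divergence behavior is real:

```python
def run_gradient_descent(learning_rate, steps=50):
    # Minimize loss(w) = (w - 3)^2 starting from w = 0 with a fixed learning rate.
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)
    return w

w_stable = run_gradient_descent(0.1)    # small steps: converges toward 3
w_diverged = run_gradient_descent(1.1)  # each step overshoots further: divergence
```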
4. What Is Backpropagation and How Does It Work?
What is backpropagation?
Backpropagation is simultaneously the most important and the most counterintuitive concept in neural networks.
Here is the problem it solves: suppose a network's loss is high and we need to update its weights. But there are millions of weights. How do we know which weight contributed how much to that final loss? We need to attribute blame, layer by layer, all the way back to the inputs.
Backpropagation does exactly this: it propagates the error backward from the output layer to the first layer, computing each weight's gradient — its contribution to the loss.
How does backpropagation actually work?
Backpropagation uses a mathematical principle called the chain rule from calculus.
Intuitively:
Imagine a team project that produced a bad outcome. To improve, you must trace backwards: did the final presenter explain it poorly? Did a mid-stage editor change the content? Did the initial researcher gather the wrong data? You work backwards, apportioning responsibility at each step.
Backpropagation does the same thing. Starting from the final loss, it works backward through each layer, computing how much each neuron contributed to the error. Those contributions are the gradients that tell gradient descent how much to adjust each weight and in which direction.
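Here is the chain rule at work on the smallest possible "network": two weights in a row, a single input, and a squared-error loss. All values are made up for illustration; real frameworks automate these exact steps:

```python
x, target = 2.0, 10.0
w1, w2 = 1.5, 2.0            # two weights chained one after the other

# Forward pass
h = w1 * x                   # intermediate value: 3.0
y = w2 * h                   # network output: 6.0
loss = (y - target) ** 2     # squared error: 16.0

# Backward pass: apply the chain rule from the loss back toward the input
dloss_dy = 2 * (y - target)  # how the loss changes with the output: -8.0
dloss_dw2 = dloss_dy * h     # w2's contribution to the error: -24.0
dloss_dh = dloss_dy * w2     # error signal passed back through w2: -16.0
dloss_dw1 = dloss_dh * x     # w1's contribution to the error: -32.0
```

The gradient for w1 is obtained by routing the output error backward through w2, which is exactly what "propagating the error backward layer by layer" means.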
What is the historical significance of backpropagation?
The mathematical foundations of backpropagation were established in the 1960s and 1970s. But its effective application to neural network training arrived in a landmark 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams, published in Nature.
This paper demonstrated that multi-layer neural networks could be trained effectively — a result many skeptics at the time considered out of reach. Backpropagation broke through that wall, laying the groundwork for everything that followed.
5. How Does One Complete Learning Cycle Work?
What are the four steps of a single training iteration?
A full learning cycle consists of four stages:
Step 1: Forward Pass
Input data (e.g., a photo of a cat) flows forward through each layer in sequence, from the first to the last. At each layer, computations are performed using the current weights, producing a final output (e.g., probabilities for "cat" vs. "dog").
Step 2: Loss Calculation
The network's output is compared against the correct label to compute the loss value.
Step 3: Backward Pass (Backpropagation)
The chain rule is applied in reverse, from the final layer back to the first, computing the gradient of the loss with respect to each weight.
Step 4: Weight Update
Gradient descent adjusts each weight in the direction opposite to its gradient, by an amount scaled by the learning rate.
This four-step cycle constitutes a single training iteration. Training GPT-3 involved running this cycle trillions of times.
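The four steps map directly onto a toy training loop. This sketch fits a single weight to the rule y = 2x; the data and hyperparameters are invented for illustration:

```python
# Toy dataset: the underlying rule is target = 2 * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = 0.0              # "randomly" initialized weight (here just zero)
learning_rate = 0.05

for epoch in range(200):
    for x, target in data:
        y = w * x                        # Step 1: forward pass
        loss = (y - target) ** 2         # Step 2: loss calculation
        gradient = 2 * (y - target) * x  # Step 3: backward pass (chain rule)
        w -= learning_rate * gradient    # Step 4: weight update
```

After a few hundred iterations, w settles near 2. In a framework like PyTorch, the same four steps appear as a model call, a loss function, `backward()`, and an optimizer step.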
6. Why Do We Use Mini-Batches Instead of the Full Dataset?
Why can't a network train on the entire dataset at once?
In theory, batch gradient descent — computing the gradient over the entire training dataset before updating weights — is the most accurate approach. In practice, it is infeasible at scale: GPT-3's training data runs to hundreds of gigabytes of text, far too much to process in a single update, let alone load into GPU memory at once.
How does stochastic gradient descent solve this?
The practical solution is Stochastic Gradient Descent (SGD): divide the data into small chunks called mini-batches, compute a gradient for each chunk, and update the weights after each chunk.
Mini-batch sizes typically range from 32 to 2,048 samples. Smaller batches are memory-efficient and introduce useful stochasticity that can help escape local minima, but updates are noisier. Larger batches produce more stable gradient estimates but require more GPU memory and can, in some settings, hurt generalization.
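A minimal sketch of how a dataset gets carved into mini-batches each epoch. Shuffling is what puts the "stochastic" in SGD; all names here are illustrative:

```python
import random

def minibatches(dataset, batch_size):
    # Shuffle indices each epoch, then yield consecutive fixed-size chunks
    order = list(range(len(dataset)))
    random.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [dataset[i] for i in order[start:start + batch_size]]

samples = list(range(1000))  # stand-in for 1,000 training examples
batches = list(minibatches(samples, batch_size=32))
# 31 full batches of 32 samples, plus one final smaller batch of 8
```

A real training run computes one gradient per batch and performs one weight update per batch, so smaller batches mean more (noisier) updates per epoch.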
7. Is Backpropagation Still Used to Train ChatGPT and Modern LLMs?
Does backpropagation power today's largest language models?
Yes. Backpropagation and gradient descent are the same core mechanisms used to train ChatGPT, Claude, Gemini, and every other large language model (LLM) today. The 1986 principles operate at the heart of 2020s state-of-the-art AI.
The difference is scale and complexity:
| Item | AlexNet (2012) | GPT-3 (2020) | Modern LLMs (2025+) |
|---|---|---|---|
| Parameter count | 60 million | 175 billion | Hundreds of billions to trillions |
| Training data | A few GB | Hundreds of GB | Several TB |
| Hardware | GTX 580 × 2 | A100 × thousands | H100 × tens of thousands |
| Training duration | 5–6 days | Several months | Months to years |
What does RLHF add on top of backpropagation?
Modern LLMs undergo a second training phase after basic pre-training: RLHF (Reinforcement Learning from Human Feedback). Human evaluators rate model outputs; those ratings become reward signals that guide additional fine-tuning toward more helpful, accurate, and safe responses.
This is why ChatGPT responds like a conversationalist rather than a raw text predictor. RLHF still relies on backpropagation, but the gradient signal comes from a learned reward model rather than labeled data.
What Comes Next: The Transformer — The Architecture That Changed Everything
Episode 08 covers the Transformer architecture — the true origin point of the modern AI revolution. We will examine how a single 2017 paper, "Attention Is All You Need," rewrote the history of NLP; why self-attention is so powerful; and how the Transformer became the shared foundation of GPT, BERT, and Claude.
Key Concepts Summary: Why Does Understanding Deep Learning Matter?
| Concept | Connection to Today's AI |
|---|---|
| Neural network structure (layers, weights) | The physical identity of LLM parameters |
| Loss function | How AI measures "how wrong it is" |
| Gradient descent | The core method by which GPT and Claude learned |
| Backpropagation (1986) | The algorithm that made deep learning possible |
| Mini-batch SGD | How trillions of parameters are trained in practice |
| RLHF | How ChatGPT and Claude are shaped to give good answers |
FAQ
Q1. Do I need to implement backpropagation myself to build AI?
Not anymore. Deep learning frameworks like PyTorch and TensorFlow include automatic differentiation (autograd), which handles backpropagation automatically. A developer only needs to define the model architecture and the loss function. That said, understanding how backpropagation works makes it much easier to diagnose why training is not converging.
Q2. Can a neural network always find the optimal weights?
There is no guarantee. Gradient descent can get trapped in a local minimum and may never reach the global minimum of the loss landscape. In practice, however, deep networks tend to find local minima that are "good enough" — empirically, the many local minima in high-dimensional spaces tend to have similar loss values to the global minimum.
Q3. Does more training data always produce a better model?
Generally yes, but with conditions. Biased or noisy data can hurt performance regardless of volume. In many real-world cases, data quality and diversity matter more than raw quantity.
Q4. What is the difference between deep learning and machine learning?
Machine learning is the broad field of methods that learn patterns from data. Deep learning is a subset of machine learning that specifically uses deep neural networks — networks with many stacked layers. The key distinction: deep learning can automatically extract features from raw data, whereas traditional machine learning typically requires humans to engineer features by hand.
Q5. Does a neural network truly "understand" anything?
This is one of the central debates in AI philosophy. A neural network is mathematically a function that maps inputs to outputs by approximating patterns. Whether that constitutes "understanding" in any human sense remains contested. What is clear is that neural networks can extract and generalize meaningful patterns in ways that are practically useful — and often surprising.
Q6. Why does having more parameters matter for LLMs?
More parameters give a model greater capacity — the ability to represent more complex patterns. However, more parameters also demand more training data, more GPU memory, more energy, and more time. Scale alone does not guarantee quality; the balance between model size, data quantity, and compute budget is what drives progress.
Q7. Why did deep learning only take off in the 2010s when the principles existed in the 1980s?
Three ingredients were missing: ① Data — before the internet, large labeled datasets did not exist at scale. ② Compute — the GPU revolution (covered in Episode 06) had to happen first. ③ Algorithmic improvements — practical techniques like ReLU activations, dropout, and batch normalization were developed throughout the 2000s and early 2010s. All three combined for the first time in AlexNet (2012).
Q8. Are modern LLMs trained using backpropagation alone?
The initial pre-training phase is entirely backpropagation-based: the model learns to predict the next token from massive text corpora. The subsequent RLHF phase introduces a reward model and a policy optimization algorithm (such as PPO). But even in RLHF, backpropagation and gradient descent remain the underlying update mechanism.
Q9. What is the vanishing gradient problem?
As backpropagation travels backward through many layers, gradient values can shrink exponentially, approaching zero. When gradients become too small, weights in early layers stop updating and learning effectively halts. Solutions include: ReLU activation functions (which do not saturate for positive inputs), residual connections (introduced by ResNet, allowing gradients to skip layers), and batch normalization (stabilizing the distribution of layer inputs during training).
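The shrinkage is easy to quantify. The sigmoid's derivative never exceeds 0.25, and backpropagation multiplies roughly one such factor per layer, so even in the best case the gradient decays geometrically with depth. A simplified model that ignores the weights themselves:

```python
import math

def sigmoid_derivative(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)  # peaks at 0.25 when z = 0

gradient = 1.0
for _ in range(20):                      # 20 sigmoid layers, best case (z = 0)
    gradient *= sigmoid_derivative(0.0)
# after 20 layers the signal has shrunk by a factor of 0.25 ** 20
```

ReLU's derivative is exactly 1 for positive inputs, which is why swapping sigmoid for ReLU largely removes this particular source of decay.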
Further Reading
- [AI Evolution Chronicle #06] The GPU Revolution: How NVIDIA's CUDA Gave AI a 1,000× Speed Boost
- [AI Evolution Chronicle #05] Distributed Computing and the Cloud Revolution
- When 90% of Code Is Written by AI: How Developers Stay Relevant
- How Collapsing Inference Costs Are Creating a New Market
Update Note
This article was written in March 2026 based on publicly available materials covering deep learning fundamentals. The core principles are well-established; however, the latest optimization techniques and LLM training methodologies continue to evolve rapidly.
Data Basis
- Series basis: Cross-analysis of original deep learning papers (Rumelhart et al. 1986, LeCun 1989) and standard textbooks (Deep Learning — Goodfellow et al.)
- Verification sources: Original backpropagation paper, literature on the evolution of gradient descent, connection points to modern LLM training methods
- Interpretation principle: Intuitive understanding prioritized over mathematical rigor, with explicit connections to modern AI practice
Key Claims and Sources
Claim: The backpropagation algorithm (1986) is the core learning mechanism by which neural networks propagate output errors in reverse to compute each parameter's contribution and adjust weights accordingly.
Source: Rumelhart, Hinton & Williams, Nature, 1986
Claim: Gradient descent is an optimization algorithm that iteratively adjusts parameters in the direction opposite to the loss function's gradient to minimize error.
Source: Deep Learning (Goodfellow, Bengio, Courville), MIT Press