
Compute-Optimal Scaling

A training strategy that balances model size and token count under a fixed compute budget to maximize quality per unit of compute


What Is Compute-Optimal Scaling?

Compute-optimal scaling is a training principle: with a fixed compute budget, the best model is not always the largest one.
Instead, quality is maximized by balancing parameter count (N) and training tokens (D).

This framing became widely known through DeepMind's Chinchilla work (2022).
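To make the trade-off concrete, here is a minimal Python sketch. It relies on the commonly used approximation that training a dense transformer costs roughly C ≈ 6 · N · D FLOPs; the budget and model sizes below are illustrative assumptions, not figures from the Chinchilla paper.

# Minimal sketch: at a fixed FLOP budget, the model size you pick determines
# how many tokens you can afford (using the common C ≈ 6 * N * D approximation).

def tokens_for_budget(compute_flops, n_params):
    """Tokens affordable for a given model size under a fixed FLOP budget."""
    return compute_flops / (6 * n_params)

budget = 1e21  # illustrative training budget in FLOPs
for n_params in (1e9, 10e9, 70e9):  # candidate model sizes (parameters)
    n_tokens = tokens_for_budget(budget, n_params)
    print(f"{n_params / 1e9:4.0f}B params -> {n_tokens / 1e9:7.1f}B tokens on the same budget")

The larger the model, the fewer tokens the same budget buys, which is why one-sided scaling carries risk.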

How Does It Work?

The core idea is to avoid one-sided scaling.

  • If you scale parameters too aggressively while under-scaling tokens, the model is often undertrained.
  • If you scale tokens too aggressively on a small model, you can hit model-capacity limits.
  • Better efficiency comes from balancing both axes under the same compute envelope.

A commonly cited rule of thumb from the Chinchilla analysis is roughly 20 training tokens per parameter, i.e. an approximate ratio of N:D ≈ 1:20.
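As a worked sketch of that heuristic (again assuming C ≈ 6 · N · D, with illustrative budgets), fixing the ratio pins down both axes once the budget is known:

# Minimal sketch: split a fixed FLOP budget at ~20 tokens per parameter.
# Solving C = 6 * N * D together with D = 20 * N gives N = sqrt(C / 120).
import math

def chinchilla_split(compute_flops, tokens_per_param=20.0):
    """Return (parameters, tokens) that spend the budget at the given ratio."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e22, 1e23):  # illustrative budgets in FLOPs
    n_params, n_tokens = chinchilla_split(budget)
    print(f"C = {budget:.0e} FLOPs -> ~{n_params / 1e9:.1f}B params, ~{n_tokens / 1e9:.0f}B tokens")

The ratio is an approximation that shifts with data quality and architecture, so it is best treated as a starting point rather than a fixed rule.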

Why Does It Matter?

Compute-optimal scaling changed model planning from "largest possible model" to "best quality under a fixed budget."

For practical teams, it improves:

  • training ROI (quality-per-dollar)
  • hardware planning and run scheduling
  • decisions on whether to spend on larger models or better data mixtures

In short, it turns scaling into an optimization problem, not a size race.
