AI Infrastructure·Author: Trensee Editorial Team·Updated: 2026-03-11

[AI to the Future 06] The GPU Revolution: How NVIDIA's CUDA Made AI 1,000x Faster

Tracing how a gaming graphics chip became the backbone of modern AI — from the birth of CUDA in 2007 to the AlexNet moment in 2012 and today's GPU clusters powering billion-parameter LLMs.

AI-assisted draft · Editorially reviewed

This blog content may use AI tools for drafting and structuring, and is published after editorial review by the Trensee Editorial Team.

Key Takeaway: One of the foundational reasons modern AI exists is the GPU. How did a chip designed for gaming graphics become the central compute engine for deep learning? This installment traces the historical transformation — from the emergence of NVIDIA CUDA to the AlexNet moment in 2012, and through to today's LLM training clusters.

The Questions This Installment Addresses

In Episode 05, we examined how the distributed computing and cloud revolution built an infrastructure foundation capable of processing tens of petabytes of data — an environment where tens of thousands of machines could move as one. But another wall remained.

Three core questions drive this installment.

  1. Why did CPUs hit a wall for AI training? Even with many high-performance CPUs, something was fundamentally missing.
  2. How did researchers discover the potential in graphics chips? How did hardware designed for gaming become the engine of AI?
  3. How did the GPU revolution shape today's AI market structure? Was NVIDIA's dominant position the result of technical choices — or something deeper?

1. The CPU Wall: Why a Single Chip Architecture Couldn't Train AI

Serial vs. Parallel: The Fundamental Difference Between Two Chips

The CPU (Central Processing Unit) is designed for generality. It is optimized to handle complex logical operations, conditional branching, and memory management rapidly. High-performance CPUs may have dozens of cores, but each core is powerful and handles complex operations sequentially. This is a serial processing architecture.

The GPU (Graphics Processing Unit) was designed from an entirely different philosophy. Because graphics processing requires calculating millions of pixels simultaneously for display, GPUs adopted a parallel architecture in which many simpler cores process operations at the same time. By 2006, a high-performance GPU already had dozens of times more cores than a comparable CPU — and modern GPUs have thousands.

Why Deep Learning's Core Nature Matches the GPU

The fundamental operation in deep learning neural networks is matrix multiplication. Multiplying and summing millions of parameters (weights) with input data — this operation repeats throughout the training process. It has two defining characteristics.

Simplicity: Each individual operation is not especially complex. It is multiplication and addition, repeated.

Scale: But the count is enormous. Training a model like GPT-3 requires trillions of matrix operations.

In other words, deep learning computation consists not of "a small number of very complex calculations" but of "a very large number of very simple calculations" — the latter being precisely where GPUs excel. This is the foundational reason GPUs became the principal hardware for AI training.
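The scale argument can be made concrete with a back-of-the-envelope count. The sketch below is illustrative only — `matmul_flops` is a hypothetical helper, not a standard function — and counts the multiply-add operations in a single dense-layer matrix multiplication:

```python
# Back-of-the-envelope FLOP count for one dense-layer matrix multiply:
# an (m x k) input batch times a (k x n) weight matrix costs roughly
# 2*m*k*n operations (one multiply plus one add per term).
def matmul_flops(m: int, k: int, n: int) -> int:
    return 2 * m * k * n

# One forward pass through a single hypothetical 4096x4096 layer
# for a batch of 512 rows:
flops_one_layer = matmul_flops(512, 4096, 4096)
print(f"{flops_one_layer:,} FLOPs")  # 17,179,869,184 FLOPs for one layer
```

Roughly 17 billion operations for one layer and one batch — each trivially simple, and exactly the kind of uniform workload thousands of parallel cores can share.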

Estimates from AI researchers in the early 2010s suggested that training a modern deep learning model on CPUs alone would, in some cases, require decades. Reports emerged of that timeline compressing to days or weeks upon switching to GPUs.


2. An Accidental Discovery: Why Researchers Turned to Graphics Chips for AI

The Context of Early Experiments, 2006–2007

Attempts to use GPUs for general-purpose computation actually predated CUDA. At the time, researchers found workarounds using GPU shader programming languages to perform matrix operations. The results were striking — computation speeds tens to hundreds of times faster than CPUs. But the method was deeply inconvenient. GPU instructions could only be issued through graphics APIs, so scientific computations had to be "disguised" as graphics rendering operations.

Laboratories including Geoffrey Hinton's at the University of Toronto and Yann LeCun's at New York University were already attempting GPU-based neural network training through these cumbersome methods. A more accessible path was urgently needed.

The Launch of CUDA (2007): What It Meant

In 2007, NVIDIA released CUDA (Compute Unified Device Architecture) — a programming platform designed specifically to enable direct use of the GPU for general-purpose parallel computation, rather than for graphics rendering alone. For the first time, developers could instruct GPU cores to perform arbitrary computations using a C-based language.

Before it was a technical innovation, this was an act of ecosystem design. NVIDIA sold GPU hardware while simultaneously providing the software layer needed to exploit that hardware to its fullest potential. This decision was the foundation that, a decade later, would make NVIDIA the central company in AI infrastructure.

After CUDA's release, researchers could for the first time use a GPU as a genuine general-purpose parallel computer. Writing neural network training code in C and executing it on a GPU became possible.
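As a rough illustration of the programming model CUDA introduced, the following pure-Python sketch simulates a kernel launch serially. The grid/block indexing mirrors CUDA's conventions, but `launch` and `saxpy_kernel` are hypothetical stand-ins, and on a real GPU every index would execute in parallel on a separate core:

```python
# CUDA expresses work as a "kernel": a function run once per thread,
# indexed by block and thread IDs. This pure-Python sketch simulates
# that execution model serially for SAXPY (y = a*x + y).
def saxpy_kernel(i, a, x, y):
    if i < len(x):             # bounds guard, just as in real CUDA kernels
        y[i] = a * x[i] + y[i]

def launch(kernel, grid_dim, block_dim, *args):
    # Serial stand-in for CUDA's kernel<<<grid_dim, block_dim>>>(...) launch.
    for block in range(grid_dim):
        for thread in range(block_dim):
            i = block * block_dim + thread  # global thread index
            kernel(i, *args)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 10.0, 10.0, 10.0, 10.0]
launch(saxpy_kernel, 2, 4, 2.0, x, y)  # 2 blocks x 4 threads = 8 "threads"
print(y)  # [12.0, 14.0, 16.0, 18.0, 20.0]
```

The key idea is that the programmer writes the per-element computation once and specifies how many threads should run it; the hardware handles the parallel execution.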


3. The AlexNet Moment: How 2012 Changed AI History

Why the 2012 ImageNet Competition Was Different

In October 2012, the results of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) — the world's largest image recognition competition at the time — were announced. The winning team was composed of Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton from the University of Toronto. Their model was called AlexNet.

The result was startling. The gap in top-5 error rate between first and second place was 10.9 percentage points: the winning model achieved 15.3%, while the second-place entry achieved 26.2%. In a competition where improvements of one to two percentage points were considered significant, this margin was unprecedented.

According to the AlexNet paper (NeurIPS 2012), the model was trained on two NVIDIA GTX 580 GPUs — consumer gaming graphics cards by the standards of the day. A gaming GPU had changed AI history.

What GPU Parallelism Made Possible for AlexNet

AlexNet differed from its predecessors not simply in size. By the standards of its time, it was a comparatively deep network — eight layers — with 60 million parameters. Training a model of this scale within a reasonable timeframe required GPU-based parallel computation.

Without GPUs, one of two constraints would have applied: either a smaller model (limiting performance), or a training process requiring years of computation. The training run that took approximately five to six days on two GPUs would have required many times longer using CPUs alone.

After AlexNet, the AI research community became convinced that the combination of "deeper neural networks + GPU parallelism" was the breakthrough path. From that year forward, GPUs became de facto essential equipment for deep learning research.


4. How NVIDIA Became the Central Company in AI Infrastructure

The CUDA Ecosystem Strategy: Market Share, Not Just Technology

NVIDIA's dominant position in AI infrastructure was not earned through GPU performance alone. The core factor was the CUDA ecosystem.

From CUDA's launch in 2007, NVIDIA offered CUDA libraries to the AI and scientific computing communities free of charge, encouraging researchers to write code targeting CUDA. cuDNN (a deep learning compute library) and cuBLAS (a linear algebra compute library) became essential components of this ecosystem.

As a result, major deep learning frameworks such as TensorFlow and PyTorch adopted CUDA as their default backend. Today, the vast majority of deep learning code is written to run on top of CUDA. Switching to a different GPU vendor requires rebuilding this entire ecosystem. This is NVIDIA's real moat.

From Tesla to H100: A Lineage of AI-Dedicated GPUs

NVIDIA has developed a dedicated data center GPU line for AI and high-performance computing alongside its consumer offerings.

Generation | Product | Year | Significance
1st | Tesla C1060 | 2008 | First data center GPU; CUDA-based
2nd | Fermi (GF100) | 2010 | Enhanced double-precision (FP64); HPC-targeted
3rd | Kepler (K20/K80) | 2012–2014 | Response to AI demand following the AlexNet boom
4th | Pascal (P100) | 2016 | NVLink introduced; AI training optimization begins in earnest
5th | Volta (V100) | 2017 | Tensor Cores introduced: dedicated circuitry for AI matrix operations
6th | Ampere (A100) | 2020 | Becomes the standard for large-scale AI training clusters
7th | Hopper (H100) | 2022 | Training and inference for GPT-4, Claude 2, and other large models
8th | Blackwell (B100/B200) | 2024– | Enhanced Transformer Engine; inference optimization focus

The Tensor Core, introduced in the Volta generation (2017), is a dedicated hardware circuit for AI matrix operations. Operations that had previously run on general-purpose CUDA cores could now be processed far more rapidly on purpose-built circuitry. This was the inflection point at which chip architecture began to be designed specifically for AI — not the other way around.


5. How Many GPUs Does LLM Training Require Today?

The Reality of Scale: LLM Training in Numbers

The scale of training for modern large language models (LLMs) is incomparable with the AlexNet era.

  • AlexNet (2012): 2 × GTX 580, approximately 5–6 days
  • GPT-2 (2019): Dozens of V100s, several weeks
  • GPT-3 (2020): Approximately 10,000 V100s, several months (estimated training cost: millions to tens of millions of dollars)
  • GPT-4 (2023): Tens of thousands of A100s, several months or more (exact figures not disclosed by OpenAI)
  • Llama 3 (2024): Tens of thousands of H100s, weeks to months

These figures do not merely mean "more" — they mean that GPU clusters themselves have become the core asset of AI research. The capacity to own and operate tens of thousands of GPUs is equivalent to the capacity to build state-of-the-art AI models.

The Evolution of GPU Usage Patterns

How GPUs are used has also evolved. The critical question is no longer simply how many GPUs to connect, but how to connect them and how to distribute work.

  • Data Parallelism: The same model is copied to multiple GPUs, each processing different data, with results aggregated.
  • Model Parallelism: When the model is too large for a single GPU, the model is partitioned across multiple GPUs.
  • Pipeline Parallelism: Consecutive groups of layers (stages) are assigned to different GPUs, which process micro-batches in pipeline fashion.
  • Tensor Parallelism: Matrix operations themselves are partitioned and processed simultaneously across multiple GPUs.

Training models at GPT-3 scale and beyond requires a hybrid parallelism strategy that combines all four approaches. Optimizing this combination has itself become a core engineering discipline.
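As a minimal sketch of one of these strategies, tensor parallelism can be illustrated with NumPy by splitting a layer's weight matrix column-wise across hypothetical "devices" — here a plain Python list stands in for GPUs, and the final gather step, which real systems perform with collectives such as NCCL all-gather, is just a concatenation:

```python
import numpy as np

# Tensor parallelism in miniature: split one layer's weight matrix
# column-wise across "devices", let each device multiply the same
# input by its own shard, then gather the partial outputs.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))    # input batch
W = rng.standard_normal((16, 32))   # full weight matrix of one layer

n_devices = 4
shards = np.split(W, n_devices, axis=1)        # one column block per device
partials = [x @ shard for shard in shards]     # each "device" works locally
y_parallel = np.concatenate(partials, axis=1)  # gather the outputs

# The sharded computation matches the single-device result exactly.
assert np.allclose(y_parallel, x @ W)
```

Each device holds only a quarter of the weights, which is the point: a matrix too large for any single GPU's memory can still be multiplied, at the cost of communication to gather the results.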


6. The New Power Structure Created by the GPU Revolution

NVIDIA's Market Dominance: Technology or Ecosystem?

As of 2024–2025, NVIDIA is estimated to hold over 80% of the market for AI training GPUs. This dominant position is not primarily a function of GPU performance. The lock-in effect of the CUDA ecosystem is the core mechanism.

The major deep learning frameworks (PyTorch, TensorFlow, JAX), key libraries (cuDNN, NCCL), and optimization compilers (TensorRT) are all optimized to run on top of CUDA. The years of CUDA-based expertise accumulated by researchers and engineers further raise switching costs.

How Realistic Are the Alternatives?

Challenges to NVIDIA's dominance do exist.

  • AMD ROCm: An open-source alternative to CUDA. Support in frameworks such as PyTorch continues to improve, but the prevailing assessment is that performance and compatibility gaps remain.
  • Google TPU (Tensor Processing Unit): A dedicated AI chip developed by Google. Used alongside the JAX framework on Google Cloud, it has demonstrated competitive performance in some large model training scenarios.
  • Meta MTIA (Meta Training and Inference Accelerator): A chip Meta is developing for its own AI workloads. Internal deployment began in 2023, but external access remains limited.
  • Emerging AI chip companies: Cerebras (wafer-scale chip), Groq (LPU), and SambaNova are competing on specialized performance claims.

The prevailing view is that these alternatives are unlikely to threaten NVIDIA's position in the near term. However, in the inference segment in particular, dedicated chips are building competitive capability rapidly.


The Connection to Today: Inference Cost Decline Starts With GPUs Too

The observed 97–99% decline in inference costs over two years (covered in detail in the companion deep-dive article) also has its roots in the GPU revolution. Inference optimization features strengthened from NVIDIA's Hopper (H100) generation, FP8 compute support in the Blackwell generation, and community-level inference optimization projects (vLLM, TensorRT-LLM, llama.cpp) have all progressed in tandem with GPU architecture advances.

At the same time, efforts to reduce GPU dependency are also contributing to lower inference costs. On-device AI, model lightweighting, and CPU-optimized inference have expanded the range of AI inference workloads that do not require expensive GPUs. This too is a product of the maturing ecosystem that the GPU revolution created.


Up Next: The Architecture of Deep Learning — How Neural Networks Actually "Learn"

In Episode 07, we turn to how AI, now equipped with the GPU engine, actually "learns." Backpropagation, gradient descent, loss functions — what do these concepts actually mean, and why did this approach outperform every other attempt? We will aim to unpack the mathematical substance of "AI learning from experience" as intuitively as possible.


Key Takeaway Summary: Why This History Matters for Understanding AI Today

Historical Event | Connection to Modern AI
GPU parallel architecture (1990s–) | The physical foundation of deep learning matrix operations
CUDA launch (2007) | The root of the PyTorch and TensorFlow ecosystems
AlexNet (2012) | The starting point of "deep learning = practical AI"
Tensor Core introduction (2017) | The opening of the dedicated AI hardware era
A100/H100 clusters | The physical reason GPT-4, Claude, and Gemini exist
CUDA ecosystem lock-in | The real cause of NVIDIA's dominance
Alternative chip competition | An additional driver of declining inference costs

For everyone who uses or builds AI services today, the history of the GPU revolution is more than background knowledge. It provides direct context for understanding why AI training costs are as high as they are, why NVIDIA holds such a powerful market position, and why AI inference costs are falling as rapidly as they are.


Frequently Asked Questions (FAQ)

Q1. Are GPUs and CPUs cooperative or competitive?

They are cooperative. In real AI systems, the CPU handles overall flow control, memory management, and I/O processing, while the GPU handles large-scale parallel operations such as matrix multiplication. The two chips work together, connected via PCIe bus or NVLink.

Q2. Does learning CUDA actually help in AI development?

Direct CUDA programming is primarily relevant for AI infrastructure engineers and optimization specialists. Most AI model developers never need to work with CUDA directly, since frameworks like PyTorch and TensorFlow leverage CUDA automatically. That said, an understanding of GPU memory architecture and parallel computation principles is genuinely useful for performance optimization work.

Q3. Can consumer-grade GPUs be used for AI training?

Yes. NVIDIA's RTX series (e.g., RTX 4090) supports CUDA and is widely used for small-scale model fine-tuning and personal research. However, VRAM capacity is significantly smaller than on data center GPUs (A100: 40–80 GB, H100: 80–94 GB), which limits full training of large models.

Q4. Can Apple Silicon (M-series) be used for AI training?

Yes. Apple's M-series chips integrate CPU, GPU, and Neural Engine in a unified SoC (System on Chip) architecture. The unified memory architecture makes them highly energy-efficient for training and inference of small to mid-sized models. PyTorch supports Apple's Metal Performance Shaders (MPS) backend. However, compatibility with the CUDA ecosystem is not complete, and some library constraints apply.

Q5. Is the correlation between NVIDIA's stock price and AI progress because of GPUs?

It is one of the core causes. Because NVIDIA GPUs are essential for AI model training and inference, demand for GPUs rises as AI investment grows. The fact that H100 GPU wait times extended to several months during the generative AI boom of 2023–2024 illustrates the scale of that demand.

Q6. Can AMD catch up with NVIDIA in AI?

The prevailing assessment is that this is unlikely in the near term (one to two years). The primary barrier is the software ecosystem rather than hardware performance. AMD's ROCm platform continues to improve, but fully replacing the CUDA-optimized libraries and tooling requires substantial time. In the inference segment, however, the MI300X series has demonstrated competitive performance in some workloads.

Q7. Why is GPU memory (VRAM) so critical for AI training?

Model parameters, intermediate computation results (activations), and optimizer states all need to reside in GPU memory simultaneously. For example, full fine-tuning of a 7-billion-parameter model at FP16 precision requires approximately 70 GB or more of VRAM. When memory is insufficient, teams must reduce batch sizes or resort to techniques such as gradient checkpointing and offloading, both of which reduce training speed.
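That figure can be reproduced with a rough estimator. The per-parameter byte counts below are rule-of-thumb assumptions rather than measured values, and activation memory — which depends on batch size and sequence length — is deliberately excluded:

```python
# Rough VRAM estimate for full fine-tuning with an Adam-style optimizer
# in FP16 mixed precision. Assumed per-parameter costs: FP16 weights
# (2 B), FP16 gradients (2 B), and two FP32 Adam moment tensors (8 B).
def training_vram_bytes(n_params: int,
                        weight_bytes: int = 2,
                        grad_bytes: int = 2,
                        optimizer_bytes: int = 8) -> int:
    return n_params * (weight_bytes + grad_bytes + optimizer_bytes)

gb = training_vram_bytes(7_000_000_000) / 1e9
print(f"~{gb:.0f} GB before activations")  # ~84 GB for a 7B model
```

Even under these conservative assumptions, a 7B-parameter model exceeds a single 80 GB accelerator before activations are counted, which is why techniques like gradient checkpointing and offloading exist.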

Q8. Will it become possible to train AI without GPUs in the future?

The direction of development is more likely toward "not solely dependent on GPUs" than "no GPUs at all." Google TPUs, dedicated AI chips (Cerebras, Groq), and neuromorphic chips (Intel Loihi) are all active research areas. However, the inertia of the CUDA ecosystem is powerful enough that the prevailing view is that GPUs are unlikely to lose their central role in AI training in the near term.

Q9. Does Korea have sufficient GPU infrastructure for AI training?

Analyses suggest a meaningful gap remains between Korea's AI computing infrastructure and that of the United States and China. Naver Cloud, KT, and Samsung SDS have built A100/H100 clusters, and government-level AI computing infrastructure investment is underway. However, GPU cluster capacity at the scale required to train state-of-the-art frontier models remains limited domestically.



Further Reading

Execution Summary

Item | Practical guideline
Core topic | [AI to the Future 06] The GPU Revolution: How NVIDIA's CUDA Made AI 1,000x Faster
Best fit | Prioritize for AI Infrastructure workflows
Primary action | Profile GPU utilization and memory bottlenecks before scaling horizontally
Risk check | Confirm cold-start latency, failover behavior, and cost-per-request at target scale
Next step | Set auto-scaling thresholds and prepare a runbook for capacity spikes

Data Basis

  • Series basis: cross-analysis of NVIDIA official historical materials, academic papers on GPU computing, and AI infrastructure history literature
  • Validation sources: original CUDA paper (2007), AlexNet paper (NeurIPS 2012), NVIDIA official blog and technical documentation
  • Interpretive focus: the historical transition of the GPU from graphics accelerator to general-purpose parallel compute device, and its connection to modern LLM training

Key Claims and Sources

  • Claim: AlexNet (2012) achieved a 41% relative reduction in error rate compared to previous methods at the ImageNet competition through GPU-based parallel computation, marking the beginning of the deep learning era
    Source: Krizhevsky et al., NeurIPS 2012 paper
  • Claim: NVIDIA CUDA (2007) is documented as the pivotal software platform that transformed the GPU from a graphics-only device into a general-purpose parallel compute unit
    Source: NVIDIA official CUDA documentation

