GPU (Graphics Processing Unit)
The core compute engine behind AI training and inference, specialized for massively parallel computation
What is a GPU?
A GPU (Graphics Processing Unit) was originally designed for graphics rendering, but it has become the core compute engine for AI training and inference. Thousands of small parallel cores execute large-scale linear algebra — especially matrix multiplication — tens to hundreds of times faster than CPUs.
NVIDIA's H100/A100 and AMD's MI300 are the representative AI GPUs, with Google's TPU as a comparable dedicated accelerator. Training a modern frontier LLM requires thousands to tens of thousands of GPUs running for weeks or months.
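A minimal PyTorch sketch of that CPU/GPU gap (this assumes PyTorch and a CUDA-capable GPU are installed; the matrix size, repeat count, and resulting speedup are illustrative and will vary by hardware):

```python
import time
import torch

def time_matmul(device: str, n: int = 4096, repeats: int = 10) -> float:
    """Average time for an n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)  # warm-up so one-time initialization doesn't skew the timing
    if device == "cuda":
        torch.cuda.synchronize()  # GPU kernel launches are asynchronous
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats

cpu_t = time_matmul("cpu")
print(f"CPU: {cpu_t * 1e3:.1f} ms per 4096x4096 matmul")
if torch.cuda.is_available():
    gpu_t = time_matmul("cuda")
    print(f"GPU: {gpu_t * 1e3:.1f} ms per 4096x4096 matmul ({cpu_t / gpu_t:.0f}x faster)")
```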
How Does a GPU Accelerate AI?
The overwhelming majority of AI compute is matrix multiplication (GEMM). Modern GPUs ship dedicated matrix units tuned for this operation (NVIDIA's Tensor Cores, AMD's Matrix Cores), delivering far higher throughput per watt and per square millimeter than general-purpose CPUs.
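A rough PyTorch sketch of how code reaches those matrix units (it assumes a CUDA GPU; the shapes are placeholders and the exact speedup depends on the part):

```python
import torch

# Placeholder shapes; any sufficiently large GEMM benefits from the matrix units.
a = torch.randn(8192, 8192, device="cuda")
b = torch.randn(8192, 8192, device="cuda")

# FP32 matmul: on recent NVIDIA GPUs PyTorch may already route this
# through TF32 Tensor Cores, depending on the allow_tf32 setting.
c_fp32 = a @ b

# Casting to half precision (FP16/BF16) targets the matrix units directly,
# roughly doubling throughput and halving memory traffic.
c_fp16 = a.half() @ b.half()

# Training code typically uses automatic mixed precision instead of manual
# casts: autocast picks a low-precision dtype per op where it is numerically safe.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c_amp = a @ b
```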
- Training: model parameters, gradients, and optimizer state (hundreds of GB to several TB for frontier models) live in GPU HBM while backpropagation runs in parallel
- Inference: user requests are batched together so that one GPU serves many concurrently (batching, throughput optimization; see the sketch after this list)
- Distributed training: ultra-fast interconnects like NVLink and InfiniBand turn hundreds to thousands of GPUs into a single giant cluster
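To make the batching point concrete, here is a toy PyTorch sketch (the model, sizes, and request count are placeholders, not a real serving stack):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in model; a real LLM is far larger, but the principle is the same.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
model.eval()

requests = [torch.randn(1024, device=device) for _ in range(64)]  # 64 user requests

with torch.no_grad():
    # Naive: one forward pass per request -> the GPU sits mostly idle.
    outputs_serial = [model(r.unsqueeze(0)) for r in requests]

    # Batched: stack the requests into one tensor and run a single forward pass,
    # so each kernel launch is amortized over all 64 requests.
    batch = torch.stack(requests)      # shape (64, 1024)
    outputs_batched = model(batch)     # shape (64, 1024)
```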
A single H100 reaches roughly 1 PFLOPS of dense FP16 Tensor Core throughput, with 80 GB of HBM3 at about 3 TB/s of bandwidth. The next generation (B200 and MI350) delivers roughly 2–3× those figures.
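A back-of-envelope calculation with those figures shows why bandwidth matters as much as raw FLOPS (the 70B-parameter model is a hypothetical workload, and the peak numbers are idealized):

```python
# Rough single-H100 figures from above.
peak_flops = 1e15          # ~1 PFLOPS of dense FP16 Tensor Core throughput
hbm_bytes_per_s = 3e12     # ~3 TB/s of HBM3 bandwidth
hbm_capacity_gb = 80

# Hypothetical workload: a 70B-parameter model stored in FP16 (2 bytes per weight).
params = 70e9
weight_bytes = params * 2
print(f"Weights: {weight_bytes / 1e9:.0f} GB "
      f"(~{weight_bytes / 1e9 / hbm_capacity_gb:.1f}x one H100's HBM)")

# Generating one token streams every weight once: a bandwidth-bound lower limit.
print(f"Min time per token (bandwidth-bound): {weight_bytes / hbm_bytes_per_s * 1e3:.0f} ms")

# The matching compute (~2 FLOPs per parameter per token) is far cheaper,
# which is one reason inference servers batch requests to keep the matrix units busy.
print(f"Compute time per token: {2 * params / peak_flops * 1e3:.2f} ms")
```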
Why Does It Matter?
GPU supply is both the bottleneck and the strategic asset of the AI industry. Training a large LLM presupposes thousands of GPUs and tens of millions of dollars; without GPU availability, power, and cooling, frontier model development is simply not possible. That supply chokehold is a major reason NVIDIA rose to the top of global market capitalization between 2023 and 2025. On the inference side, latency, cost, and energy efficiency depend directly on GPU generation — making GPUs a critical variable in product pricing, SLAs, and margin.