AI Infrastructure

LoRA (Low-Rank Adaptation)

An efficient fine-tuning technique that adapts large AI models using a small number of trainable parameters

#LoRA #Fine-tuning #Efficiency

What is LoRA?

LoRA, short for Low-Rank Adaptation, is a technique for customizing large AI models without the massive computational cost of retraining them from scratch. Imagine you have a talented employee who already knows general business skills. Instead of sending them back to university for four years, you give them a short, focused training course for a specific role. LoRA works the same way: it adds a small set of learnable adjustments to an existing model while keeping the original model weights frozen.

How Does It Work?

In a standard neural network, weight matrices can have millions or billions of parameters. LoRA freezes each original weight matrix W and learns two small matrices, B and A, whose product approximates the update that full fine-tuning would make. If W is d × k, the factors are only d × r and r × k for some small rank r (often in the single or low double digits), so the number of trainable parameters drops dramatically, often by 99% or more. During inference, the product BA can be merged back into W, adding no extra latency. You can also swap different LoRA adapters in and out, allowing one base model to serve many specialized tasks.
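The mechanics above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training recipe: the matrix sizes, rank, and `alpha` scaling value are arbitrary choices for demonstration, and the initialization (random A, zero B) follows the common convention so the adapted model starts out identical to the base model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen pretrained weight matrix (illustrative size; real layers are far larger).
d_out, d_in, r = 512, 512, 8            # r is the LoRA rank
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank factors. B starts at zero, so before any training
# the adapter contributes nothing and the model behaves like the base model.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))

alpha = 16                               # common scaling hyperparameter
scale = alpha / r

def forward(x):
    """Adapted forward pass: base output plus the scaled low-rank update."""
    return W @ x + scale * (B @ (A @ x))

# After training, the adapter can be merged for zero-latency inference:
W_merged = W + scale * (B @ A)

x = rng.standard_normal(d_in)
assert np.allclose(forward(x), W_merged @ x)

# Parameter savings: trainable adapter params vs. full fine-tuning.
full_params = W.size            # 512 * 512 = 262,144
lora_params = A.size + B.size   # 8*512 + 512*8 = 8,192
print(f"trainable fraction: {lora_params / full_params:.1%}")
```

With these toy dimensions the adapter holds about 3% of the layer's parameters; at realistic model sizes (thousands-wide matrices, rank 8 or 16) the trainable fraction falls well below 1%, which is where the dramatic cost savings come from.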

Why Does It Matter?

LoRA has democratized fine-tuning. Before LoRA, adapting a large language model required expensive GPU clusters and significant engineering effort. Now, researchers and hobbyists can fine-tune billion-parameter models on a single consumer GPU. This has fueled an explosion of specialized open-source models and made it practical for businesses to customize AI for their specific domain without enormous infrastructure budgets.
