Natural Language Processing

Attention

A mechanism that allows AI models to focus on the most relevant parts of the input when producing output

#Attention #Transformer #Deep Learning

What is Attention?

Attention is a mechanism in neural networks that lets a model decide which parts of its input are most important for the task at hand. Rather than treating every word or token equally, the model assigns different weights to different parts of the input, "paying attention" to the pieces that matter most.

Imagine you are at a crowded party and someone across the room says your name. Despite dozens of simultaneous conversations, your brain instantly focuses on that one voice. Attention in AI works on a similar principle: it highlights the relevant signal amidst a sea of information.

How Does It Work?

The most common form is self-attention, typically implemented as scaled dot-product attention in the Transformer architecture:

  1. Each token in a sequence is represented as three vectors: Query (Q), Key (K), and Value (V).
  2. The model computes a similarity score (a dot product, scaled by the square root of the key dimension) between every Query and every Key.
  3. These scores are normalized into weights using a softmax function.
  4. The final output for each token is a weighted sum of all Value vectors, where the weights reflect relevance.

This process allows every token to "look at" every other token and gather context from wherever it is most useful.
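The four steps above can be sketched in plain Python. This is a minimal illustration with toy 2-D vectors, not a production implementation; the function and variable names are chosen for this example.

```python
import math

def softmax(xs):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of token vectors.

    Q, K, V: one vector per token (all the same length here).
    Returns one output vector per Query token.
    """
    d_k = len(K[0])  # key dimension, used for scaling
    outputs = []
    for q in Q:
        # Step 2: similarity of this Query to every Key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        # Step 3: normalize scores into attention weights
        weights = softmax(scores)
        # Step 4: weighted sum of all Value vectors
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# Toy usage: two tokens, two dimensions. Each output row is a
# weighted blend of the Value rows, weighted by relevance.
Q = K = V = [[1.0, 0.0], [0.0, 1.0]]
print(attention(Q, K, V))
```

Because the weights come out of a softmax, each output vector is a convex combination of the Value vectors: a token that matches a Query strongly contributes more to that Query's output.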

Why Does It Matter?

Attention solved the long-range dependency problem that plagued earlier sequence models such as recurrent neural networks, which struggled to carry information across long distances in the input. It is the core innovation behind the Transformer and, by extension, all modern large language models. Without attention, today's generative AI breakthroughs would not have been possible.
