LLM-as-a-Judge
An evaluation methodology where a capable LLM scores another model's or agent's outputs against a predefined rubric
#LLM-as-a-Judge #LLM evaluator #AI evaluation #evals #rubric #automated evaluation
What is LLM-as-a-Judge?
LLM-as-a-Judge is an evaluation methodology where a powerful language model — such as GPT-4o or Claude — scores the outputs of another model or agent against a predefined rubric. Instead of human reviewers, the model serves as the evaluator.
Why is it used?
AI agent outputs often have no fixed correct answer, making simple keyword-matching approaches insufficient. LLM-as-a-Judge evaluates open-ended outputs against flexible criteria, making it well-suited for large-scale automated evaluation pipelines.
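To make the workflow concrete, here is a minimal sketch of one evaluation step: a rubric is turned into a grading prompt for the judge model, and the judge's reply is parsed into per-criterion scores. The judge call itself is left out (`reply` stands in for whatever a chat API would return); the rubric criteria and the 1-5 scale are illustrative assumptions, not a fixed standard.

```python
import json

# Illustrative rubric; real rubrics should be specific to the task being judged.
RUBRIC = {
    "accuracy": "Is the answer factually correct?",
    "relevance": "Does the answer address the question asked?",
    "clarity": "Is the answer easy to follow?",
}

def build_judge_prompt(question: str, answer: str, rubric: dict) -> str:
    """Assemble a grading prompt that asks the judge for JSON scores per criterion."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "You are an impartial evaluator. Score the answer on each criterion "
        "from 1 (poor) to 5 (excellent).\n"
        f"Criteria:\n{criteria}\n\n"
        f"Question: {question}\nAnswer: {answer}\n\n"
        'Reply with JSON only, e.g. {"accuracy": 4, "relevance": 5, "clarity": 3}.'
    )

def parse_scores(raw_reply: str, rubric: dict) -> dict:
    """Parse the judge's JSON reply and validate it against the rubric."""
    scores = json.loads(raw_reply)
    for name in rubric:
        value = scores.get(name)
        if not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {name!r}: {value!r}")
    return scores

# In a real pipeline, `reply` would come from sending the prompt to a judge model.
reply = '{"accuracy": 5, "relevance": 4, "clarity": 5}'
scores = parse_scores(reply, RUBRIC)
average = sum(scores.values()) / len(scores)
```

Requesting structured JSON and validating it before use keeps one malformed judge reply from silently corrupting aggregate metrics, which matters once this runs over thousands of outputs.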
Limitations to watch for
- Bias: The evaluating model may score outputs that resemble its own style more favorably.
- Rubric quality: Vague or poorly defined criteria produce unreliable scores.
- Cost: Every evaluation requires an LLM call, which compounds at scale.
Related Terms
operations
Evals (AI Evaluation)
A structured framework for measuring AI agent and model outputs against quantified criteria and detecting regressions
operations
Minimum Viable Agent (MVA)
The smallest possible agent design, which validates one core task first with single-input, single-output execution
operations
Verification Loop
An operational pattern that converges quality by repeatedly testing, reviewing, and retrying AI-generated outputs