
Glossary
AI Evaluation (Evals)
The practice of systematically measuring AI system quality.
Definition
AI evaluation is the practice of measuring whether an AI system produces correct, safe, and useful outputs against defined criteria. Evals range from automated tests (exact match, structured-output checks, LLM-as-judge) to human review and end-to-end production monitoring.
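As a rough illustration of the automated end of that range, the sketch below implements an exact-match check and a structured-output check in Python. The function names (`exact_match`, `has_required_keys`, `pass_rate`) and the sample cases are hypothetical, chosen only for this example; a real eval suite would likely load cases from a dataset and call the model under test.

```python
import json


def exact_match(output: str, expected: str) -> bool:
    """Pass when output equals expected, ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def has_required_keys(output: str, required_keys: set[str]) -> bool:
    """Pass when output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def pass_rate(results: list[bool]) -> float:
    """Fraction of checks that passed; 0.0 for an empty result list."""
    return sum(results) / len(results) if results else 0.0


# Hypothetical eval cases: (model output, expected answer).
cases = [
    ("Paris", "paris"),
    (" 42 ", "42"),
    ("Berlin", "Munich"),
]
results = [exact_match(output, expected) for output, expected in cases]
print(f"exact-match pass rate: {pass_rate(results):.2f}")
```

Checks like these are cheap and deterministic, which is why they are usually the first layer; LLM-as-judge and human review cover the cases where no single correct string exists.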
Context
Evals are the difference between a demo and a production AI system. Without them, you cannot tell whether a prompt change improved or degraded behavior, whether a model swap is safe, or whether the system is drifting in production. Strong evaluation pipelines are now table stakes for any enterprise AI deployment.
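One way to make the "improved or degraded" question concrete is to run the same eval cases before and after a change and list the cases that flipped from pass to fail. The sketch below assumes per-case pass/fail results are already available as dictionaries keyed by a case ID; the function name `find_regressions` and the sample IDs are hypothetical.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed under the baseline but do not pass under the candidate.

    A case missing from the candidate results is treated as a failure.
    """
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Hypothetical per-case results before and after a prompt change.
baseline_results = {"refund-policy": True, "date-parsing": True, "tone-check": False}
candidate_results = {"refund-policy": True, "date-parsing": False, "tone-check": True}

regressions = find_regressions(baseline_results, candidate_results)
if regressions:
    print("Regressed cases:", ", ".join(regressions))
else:
    print("No regressions against the baseline.")
```

Comparing case by case rather than only comparing aggregate pass rates matters here: in the sample data both runs pass two of three cases, yet one case has regressed.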