The core primitive of evaluation is scoring. Scoring lets you take an entire agent interaction (a trace) or a single turn (a span) and quantify it. Usually this means producing a number (for example, how factually grounded is the answer?), but a score can also be categorical (for example, what type of error is this?). The best teams run online evals on live traffic and use what they find there to decide what to test in dev and CI.
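To make the distinction between numeric and categorical scores concrete, here is a minimal sketch of two scorers applied to each span of a trace. Everything in it is an assumption made for illustration: the `Span` and `Score` shapes, the function names, and the word-overlap heuristic for groundedness are hypothetical and do not come from any particular eval library. A real groundedness scorer would more likely use an LLM judge or an entailment model.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Span:
    """One turn of an agent interaction (hypothetical shape)."""
    input: str
    output: str
    context: str = ""  # retrieved material the answer should be grounded in


@dataclass
class Score:
    name: str
    value: Union[float, str]  # a number or a category label


def groundedness(span: Span) -> Score:
    """Numeric score: fraction of output words that also appear in the context.

    This is a crude stand-in for a real groundedness check.
    """
    out_words = span.output.lower().split()
    if not out_words:
        return Score("groundedness", 0.0)
    ctx_words = set(span.context.lower().split())
    hits = sum(1 for word in out_words if word in ctx_words)
    return Score("groundedness", hits / len(out_words))


def error_type(span: Span) -> Score:
    """Categorical score: a coarse label for what went wrong, if anything."""
    if not span.output.strip():
        label = "empty_response"
    elif "i don't know" in span.output.lower():
        label = "refusal"
    else:
        label = "none"
    return Score("error_type", label)


def score_trace(
    trace: list[Span], scorers: list[Callable[[Span], Score]]
) -> list[list[Score]]:
    """Apply every scorer to every span; returns one list of scores per span."""
    return [[scorer(span) for scorer in scorers] for span in trace]


if __name__ == "__main__":
    trace = [
        Span("Capital of France?", "paris is the capital",
             "the capital of france is paris"),
        Span("And of Atlantis?", "", ""),
    ]
    for span_scores in score_trace(trace, [groundedness, error_type]):
        print(span_scores)
```

The same scorer functions can run on sampled production traces (online) and on a fixed dataset in CI (offline), which is one way findings from online evals can feed back into what gets tested before release.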
...