span

Backlinks

blog post - the three pillars of AI observability - 2025-11

The core primitive of evaluation is scoring. Scoring allows you to look at an entire agent interaction (trace) or turn (span) and quantify it. Usually, this means producing a number (for example, how factually grounded is the answer?), but it can also be categorical (for example, what type of error is this?). The best teams use online evals to help them discover what to test in dev and CI.
...
using online evals to help discover what to test in dev and CI
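
The scoring idea in the excerpt above can be sketched in a few lines of Python. This is a minimal illustration, not the blog post's implementation: the `Span` and `Trace` shapes, the scorer names, and the naive verbatim-match groundedness check are all assumptions made for the example. It shows one numeric scorer, one categorical scorer, and a trace-level score that aggregates span-level scores.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One turn of an agent interaction (hypothetical shape)."""
    name: str
    output: str
    error: Optional[str] = None


@dataclass
class Trace:
    """A whole agent interaction: an ordered list of spans (hypothetical shape)."""
    spans: list = field(default_factory=list)


def groundedness(span: Span, sources: list) -> float:
    """Numeric score in [0, 1]: share of output sentences found verbatim in the sources.

    Assumption: verbatim matching stands in for a real grounding check.
    """
    sentences = [s.strip() for s in span.output.split(".") if s.strip()]
    if not sentences:
        return 0.0
    corpus = " ".join(sources)
    supported = sum(1 for s in sentences if s in corpus)
    return supported / len(sentences)


def error_type(span: Span) -> str:
    """Categorical score: a coarse label for what went wrong, if anything."""
    if span.error is None:
        return "none"
    if "timeout" in span.error.lower():
        return "timeout"
    return "other"


def score_trace(trace: Trace, sources: list) -> dict:
    """Trace-level score: aggregate the span-level scores."""
    scores = [groundedness(s, sources) for s in trace.spans]
    return {
        "mean_groundedness": sum(scores) / len(scores) if scores else 0.0,
        "error_types": [error_type(s) for s in trace.spans],
    }
```

The same scorers could run on sampled production traces (online) or on a fixed dataset in dev and CI; only the source of the traces would differ.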

traces should provide reproducible context, not just spans
