In traditional observability, when you notice something is wrong, the next step is almost always to update the code and try again. In AI systems, however, incorrect behavior often requires input from an expert (a product manager, subject-matter expert, or even a user) who can clarify what the correct behavior should be. The best annotation workflows curate interesting examples that would benefit from expert input, flag them for review, and then use the annotated data in evals to improve performance.
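The curate → flag → annotate → evaluate loop above can be sketched as a small annotation queue. This is a minimal illustration, not any specific tool's API; the `Trace` and `AnnotationQueue` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trace:
    """One logged model interaction. Names here are illustrative."""
    trace_id: str
    user_input: str
    model_output: str
    annotation: Optional[str] = None  # expert-provided expected behavior

class AnnotationQueue:
    """Curate interesting traces, collect expert annotations, export evals."""

    def __init__(self) -> None:
        self._queue: list[Trace] = []

    def flag(self, trace: Trace) -> None:
        # Flag a trace whose behavior needs clarification from an expert.
        self._queue.append(trace)

    def annotate(self, trace_id: str, expected: str) -> None:
        # A domain expert records what the model should have done.
        for t in self._queue:
            if t.trace_id == trace_id:
                t.annotation = expected

    def to_eval_dataset(self) -> list[dict]:
        # Only annotated traces become eval cases; unreviewed ones stay queued.
        return [
            {"input": t.user_input, "expected": t.annotation, "actual": t.model_output}
            for t in self._queue
            if t.annotation is not None
        ]

queue = AnnotationQueue()
queue.flag(Trace("t1", "Cancel my order", "Order canceled."))
queue.flag(Trace("t2", "Refund me", "Refund issued."))  # left unannotated
queue.annotate("t1", "Confirm the order number before canceling.")
dataset = queue.to_eval_dataset()  # one eval case: the annotated trace
```

The key design point is the filter in `to_eval_dataset`: only expert-reviewed examples flow into evals, so the eval set grows as annotation proceeds rather than being polluted by unreviewed model outputs.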
...
annotation evals persona: domain expert curating interesting examples
...
utilizing annotations in evals to improve dimension: model performance