For example, if you discover that your agent is highly repetitive, you can write an evaluator to detect that case, then capture a handful of real-world examples to test on your laptop. While testing, use exactly the same tracing you run in production, and assess how your changes affect both the repetition score and the other performance indicators you track. Once you feel confident, ship the new iteration and see how it affects production eval scores.
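A minimal sketch of what such a repetition evaluator might look like, assuming captured traces are stored as JSON files with an `output` field holding the agent's final text (the file layout, the n-gram metric, and the threshold are all hypothetical choices, not anything prescribed by the article):

```python
import json
from pathlib import Path

def repetition_score(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that are duplicates; 0.0 means no repetition."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def evaluate_traces(trace_dir: str, threshold: float = 0.2) -> None:
    # Flag any captured trace whose repetition score exceeds the
    # (arbitrary) threshold, while printing the raw score for trending.
    for path in sorted(Path(trace_dir).glob("*.json")):
        output = json.loads(path.read_text())["output"]
        score = repetition_score(output)
        status = "FAIL" if score > threshold else "ok"
        print(f"{path.name}: repetition={score:.3f} [{status}]")

if __name__ == "__main__":
    evaluate_traces("captured_traces")
```

Thresholding the scaled score like this yields a binary pass/fail per trace, while the raw score stays available for tracking trends across iterations.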
...
[Figure: example eval score, repetition]
...
This seems to fly in the face of Hamel Husain's suggestion to prefer binary pass/fail evaluations over scaled ones, unless they are actually computing a binary score here.