For example, if you discover that your agent is highly repetitive, you can write an evaluator to detect that case, then capture a handful of real-world examples to test on your laptop. While testing, use exactly the same tracing you run in production, and assess how your changes affect both the repetition score and the other performance indicators you track. Once you feel confident, ship the new iteration and see how it affects production eval scores.
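A minimal sketch of what such a repetition evaluator might look like, assuming captured traces are stored as JSON files with an `output` field holding the agent's final text (the file layout, the n-gram metric, and the threshold are all hypothetical choices, not anything prescribed by the article):

```python
import json
from pathlib import Path

def repetition_score(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that are duplicates; 0.0 means no repetition."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def evaluate_traces(trace_dir: str, threshold: float = 0.2) -> None:
    # Flag any captured trace whose repetition score exceeds the
    # (arbitrary) threshold, while printing the raw score for trending.
    for path in sorted(Path(trace_dir).glob("*.json")):
        output = json.loads(path.read_text())["output"]
        score = repetition_score(output)
        status = "FAIL" if score > threshold else "ok"
        print(f"{path.name}: repetition={score:.3f} [{status}]")

if __name__ == "__main__":
    evaluate_traces("captured_traces")
```

Thresholding the scaled score like this yields a binary pass/fail per trace, while the raw score stays available for tracking trends across iterations.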
...
[Figure: example eval score, repetition]
...
This seems to fly in the face of Hamel Husain's suggestion to prefer binary pass/fail evaluations over scaled ones, unless they are actually computing a binary score here.