⇩ Markdown

having LLM judges check for discrete signals

These discrete signals could be specific structured failure modes

This feels like a focused task... like agent task scope should be narrow.

This is kind of the opposite of trying to grade based on generic evals, or even scaled evaluation like Likert scale.