
maybe some evals should fail in order to be useful?

In one of his posts, Hamel Husain encourages failing evals.

I'm less sure that's always applicable. I think there is utility in evals functioning like a regression test suite, for example to support ongoing changes to a system without introducing silent regressions.
...
edit: 2026-01-27 - In fact, at least one guide I've read explicitly calls out having different eval set types, with one of them serving exactly this regression-suite role.

However, if you hold an eval to a bar like perfect accuracy, then yes, you would expect failures, since models likely won't get things completely perfect. And there is value in that as well, since you can use those failures to find structured failure modes, which will help you improve your system.
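To make "structured failure modes" concrete, here's a minimal sketch of bucketing failures by category. The result shape (dicts with `passed` and `category` keys) is a made-up assumption for illustration, not any real eval framework's API:

```python
from collections import Counter

def failure_modes(results):
    """Group failing eval results by category to surface structured failure modes.

    `results` is a hypothetical list of dicts with "passed" and "category" keys.
    """
    return Counter(r["category"] for r in results if not r["passed"])

# Illustrative, made-up results from an eval run that is expected to fail sometimes.
results = [
    {"passed": True,  "category": "formatting"},
    {"passed": False, "category": "hallucination"},
    {"passed": False, "category": "hallucination"},
    {"passed": False, "category": "formatting"},
]

print(failure_modes(results).most_common())
# [('hallucination', 2), ('formatting', 1)]
```

The point isn't the code, it's the workflow: a failing regression test tells you to revert, while a failing eval tallied this way tells you where to aim your next improvement.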

But an automated suite of checks that continually fails is not very useful. So that brings up the question again: why give evals a different name than just tests? Are they actually different in some meaningful way? In the test-driven style of development, you write a failing test first, then improve your code to make it pass. Is a failing eval closely analogous to a failing test, or only roughly analogous?
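One candidate difference in the test/eval comparison is the pass criterion: a test asserts every case, while an eval often asserts an aggregate threshold, so some per-case failures are tolerated by design. A minimal sketch, where the per-case outcomes and the 0.8 threshold are arbitrary assumptions:

```python
def eval_pass_rate(case_results):
    """Aggregate per-case booleans into a pass rate, eval-style."""
    return sum(case_results) / len(case_results)

case_results = [True, True, True, False, True]  # made-up per-case outcomes

# A test-suite mindset would demand all(case_results); an eval-suite mindset
# tolerates some failures as long as the aggregate clears a threshold.
rate = eval_pass_rate(case_results)
assert rate >= 0.8, f"regression: pass rate fell to {rate:.0%}"
print(f"pass rate: {rate:.0%}")
```

Under that framing, a failing eval case is expected noise, while a falling pass rate is the actual red bar, which is roughly, but not exactly, the TDD red/green cycle.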