Why give evals a different name than just tests?
Evals are a bit like unit or system tests, so why not just call them "tests", as we have for every other software system we've been evaluating for decades? Maybe because of how they started? Evals started off simply evaluating models, and only over time have companies developed more complex compound systems, including agents, where evaluating the system looks much more like running a system test.
The "evals" name has stuck, though... so we're developing "evals" as opposed to "system tests". Let's lean into it and try to motivate the name by considering the distinguishing characteristics of these particular systems.
But first, let's revisit our definition: evals evaluate system behavior that is closely tied to AI functionality.
What is so special about the "closely tied to AI functionality" part of that?
How they are the same
The hard part of both evals and traditional tests is replicating the data and the environment.
How they are different
Maybe some evals should fail in order to be useful? By contrast, you typically don't want traditional tests to fail in your suite, because failures just add noise.
Differences - negative aspects of evals
In AI, running prompts can be extremely slow, so you typically want to run them with a high degree of parallelism so that your eval suite finishes in a reasonable amount of time (a thread-based sketch follows below).
...
dim - amount of parallelism -- very high
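A minimal sketch of what that fan-out can look like, assuming a Python harness; run_case and the case shape are hypothetical stand-ins for however your suite sends a prompt and scores the response.

```python
from concurrent.futures import ThreadPoolExecutor

def run_case(case: dict) -> dict:
    # Placeholder: a real harness would send case["prompt"] to the model
    # and score the completion; here we just echo a fake result.
    return {"id": case["id"], "passed": True}

def run_suite(cases: list[dict], workers: int = 32) -> list[dict]:
    # Each call mostly waits on the network, so a high worker count is cheap locally.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_case, cases))

if __name__ == "__main__":
    results = run_suite([{"id": i} for i in range(100)])
    print(sum(r["passed"] for r in results), "passed")
```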
In traditional software systems, you often mock your external dependencies, but in these systems, those dependencies are an integral part of what you are actually testing.
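To make the contrast concrete, here is a rough sketch (the summarize wrapper and all names are made up): the traditional unit test patches the model call away, while the eval lets the real call happen, because that call is the behavior being judged.

```python
from unittest.mock import patch

def call_model(prompt: str) -> str:
    # Stand-in for the real network call to the model.
    raise NotImplementedError("real model call goes here")

def summarize(text: str) -> str:
    return call_model(f"Summarize: {text}")

def test_summarize_plumbing():
    # Traditional unit test: the external dependency is patched away,
    # so only our own wiring around it gets exercised.
    with patch(f"{__name__}.call_model", return_value="stub summary"):
        assert summarize("long text") == "stub summary"

def eval_summarize(score) -> float:
    # Eval: the model call *is* the behavior under evaluation,
    # so it runs for real and the actual output gets judged.
    return score(summarize("long text"))
```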
The subtle ways in which things fail require more examples than typical tests do. You can only gain real confidence at scale.
...
non-determinism
Unlike with most traditional tests, you need to decide what level of exactness you care about. Often there is more than one acceptable answer. Or worse, the answers are not eval type -- verifiable, so you need to get LLM as a judge involved. (A sketch of graded comparison follows below.)
...
dim - comparison exactness
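One way to picture the exactness dial, using hypothetical comparison helpers: start strict, loosen as the task allows, and fall back to a judge when none of these fit.

```python
import re

def _normalize(s: str) -> str:
    # Lowercase, strip punctuation and surrounding whitespace.
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def exact_match(expected: str, actual: str) -> bool:
    # Strictest comparison: character-for-character equality.
    return expected == actual

def normalized_match(expected: str, actual: str) -> bool:
    # Looser: ignore case, punctuation, and stray whitespace.
    return _normalize(expected) == _normalize(actual)

def any_acceptable(acceptable: list[str], actual: str) -> bool:
    # Looser still: several answers count as correct.
    # Anything fuzzier than this usually needs an LLM judge.
    return any(normalized_match(e, actual) for e in acceptable)
```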
If you do use LLM as a judge, you need to do fitness to task optimization for the judge as well. For example, the judge model is often smaller than the task model because, otherwise, tweet - llm as a judge can be crazy expensive. So you need to experiment to see just how small that evaluation model can be while still producing evals you can trust. In traditional testing, by contrast, you primarily test your own code.
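A sketch of such a judge, assuming an OpenAI-style chat client; the model name, rubric wording, and PASS/FAIL protocol are placeholders, and how small the judge can get is exactly what you'd establish experimentally.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat API works similarly

client = OpenAI()
JUDGE_MODEL = "gpt-4o-mini"  # placeholder: deliberately smaller/cheaper than the task model

def judge(question: str, reference: str, answer: str) -> bool:
    # Ask the judge model for a strict verdict against a reference answer.
    prompt = (
        "You are grading an answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```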
Interesting constraints... primarily constraint - external call time and maybe constraint - cost, which are quite different from the standard constraint - compute time. This means you can actually achieve dim - amount of parallelism -- very high, potentially within a single thread, if you use non-blocking IO.
Although slow and costly, the work is done by machines on external infrastructure, so you can hit them with extreme parallelism without worrying about your own system's capacity.
...
dim - amount of parallelism -- very high
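A sketch of that single-threaded, non-blocking version, assuming an async client such as AsyncOpenAI; the semaphore bound, model name, and case shape are illustrative only.

```python
import asyncio
from openai import AsyncOpenAI  # assumed client; any non-blocking HTTP client works

client = AsyncOpenAI()

async def run_case(sem: asyncio.Semaphore, case: dict) -> dict:
    # The semaphore caps in-flight requests; everything still runs on one thread.
    async with sem:
        resp = await client.chat.completions.create(
            model="gpt-4o",  # placeholder task model
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        return {"id": case["id"], "output": resp.choices[0].message.content}

async def run_suite(cases: list[dict], concurrency: int = 200) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)
    return await asyncio.gather(*(run_case(sem, c) for c in cases))

# results = asyncio.run(run_suite(cases))
```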
Evals are more closely tied to experimentation than traditional tests are, or at least more like TDD than a regression test suite. For example, they can be directly tweaked by evals persona -- domain experts in an integrated prompting environment, and evals can be co-evolved with prompts. That need for tweaking makes it important that evals can be manually triggered in the app context.
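One simple way to keep eval runs deliberate rather than automatic, assuming a pytest-style harness; the RUN_EVALS switch is a made-up convention, not a standard flag.

```python
import os
import pytest

# Evals are opt-in: trigger them deliberately (say, after tweaking a prompt),
# rather than on every CI push.
run_evals = pytest.mark.skipif(
    os.environ.get("RUN_EVALS") != "1",
    reason="evals only run when explicitly requested",
)

@run_evals
def test_summary_eval():
    # Placeholder body: run the current prompt against the model and score the output.
    ...
```

Running RUN_EVALS=1 pytest then opts in, while ordinary test runs skip the slow, costly cases.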
Differences - positive aspects of evals
The thing you're testing, though it is closely tied to AI functionality, is likely a compound system, and the evals, if they're not simply pass-fail, may be differentiable and could act as a sort of loss function on that compound system.
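A rough sketch of the loss-function side of that idea (not the differentiable part): if each case yields a graded score rather than pass/fail, the suite collapses to a scalar you can minimize when comparing prompt or pipeline variants. All names here are hypothetical.

```python
from typing import Callable

def suite_loss(system: Callable[[str], str],
               cases: list[dict],
               grade: Callable[[dict, str], float]) -> float:
    # Treat the eval suite like a training loss: 1 - mean graded score, lower is better.
    scores = [grade(case, system(case["input"])) for case in cases]
    return 1.0 - sum(scores) / len(scores)

# An outer loop (grid search over prompts, or something smarter) can then
# pick the system variant that minimizes this loss, e.g.:
# best = min(candidate_systems, key=lambda s: suite_loss(s, cases, grade))
```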