evals are a bit like unit or system tests, so why not just call evals "tests" like we have for every other software system we've been evaluating for decades? Maybe because of how they started? Evals started off simply evaluating models and only over time have companies started to develop more complex compound systems, including agents, where evaluating the system is more like doing a system test.