If you do use LLM-as-a-judge, you need to do fitness-to-task optimization for the judge as well. For example, the judge model is often smaller than the task model, because otherwise LLM-as-a-judge can get crazy expensive. So you need to experiment to find just how small that evaluation model can be while still producing evals you can trust. In traditional testing, you primarily test the code.
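One way to run that experiment is to score a candidate small judge's verdicts against a reference you already trust (human labels, or a large judge model) and measure how often they agree. This is a minimal sketch: the verdict lists, the pass/fail labels, and the 0.9 agreement threshold are all hypothetical placeholders, not a standard.

```python
# Sketch: check whether a smaller judge model tracks a trusted reference
# (human labels or a large judge) closely enough to rely on its evals.
# All verdicts and the threshold below are hypothetical placeholders.

def agreement_rate(candidate, reference):
    """Fraction of examples where the candidate judge matches the reference."""
    if len(candidate) != len(reference):
        raise ValueError("verdict lists must be the same length")
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / len(reference)

# Hypothetical pass/fail verdicts on the same 10 eval examples.
large_judge = ["pass", "fail", "pass", "pass", "fail",
               "pass", "fail", "pass", "pass", "fail"]
small_judge = ["pass", "fail", "pass", "fail", "fail",
               "pass", "fail", "pass", "pass", "fail"]

rate = agreement_rate(small_judge, large_judge)
print(f"agreement: {rate:.0%}")  # 9 of 10 verdicts match -> 90%

# Keep the smaller (cheaper) judge only if it stays above your cutoff.
THRESHOLD = 0.9  # hypothetical; tune to your tolerance for judge error
trusted = rate >= THRESHOLD
```

In practice you would repeat this for progressively smaller judge models and stop at the smallest one that still clears your agreement threshold on a held-out labeled set.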