If you do use LLM-as-a-judge, you need to do fitness-to-task optimization for the judge as well. For example, the judge model is often smaller than the task model, because otherwise LLM-as-a-judge can get crazy expensive. So you need to experiment to find just how small that evaluation model can be while still producing evals you can trust. In traditional testing, you primarily test the code.
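One way to run that experiment is to score a candidate small judge's verdicts against a reference you already trust (human labels, or a large judge model) and measure how often they agree. This is a minimal sketch: the verdict lists, the pass/fail labels, and the 0.9 agreement threshold are all hypothetical placeholders, not a standard.

```python
# Sketch: check whether a smaller judge model tracks a trusted reference
# (human labels or a large judge) closely enough to rely on its evals.
# All verdicts and the threshold below are hypothetical placeholders.

def agreement_rate(candidate, reference):
    """Fraction of examples where the candidate judge matches the reference."""
    if len(candidate) != len(reference):
        raise ValueError("verdict lists must be the same length")
    matches = sum(c == r for c, r in zip(candidate, reference))
    return matches / len(reference)

# Hypothetical pass/fail verdicts on the same 10 eval examples.
large_judge = ["pass", "fail", "pass", "pass", "fail",
               "pass", "fail", "pass", "pass", "fail"]
small_judge = ["pass", "fail", "pass", "fail", "fail",
               "pass", "fail", "pass", "pass", "fail"]

rate = agreement_rate(small_judge, large_judge)
print(f"agreement: {rate:.0%}")  # 9 of 10 verdicts match -> 90%

# Keep the smaller (cheaper) judge only if it stays above your cutoff.
THRESHOLD = 0.9  # hypothetical; tune to your tolerance for judge error
trusted = rate >= THRESHOLD
```

In practice you would repeat this for progressively smaller judge models and stop at the smallest one that still clears your agreement threshold on a held-out labeled set.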