Unlike with most traditional tests, you need to decide what level of exactness you care about. Often there is more than one acceptable answer. Or worse, the answers are not verifiable by a programmatic check at all, so you need to get an LLM-as-a-judge involved.
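As a rough sketch of what "level of exactness" can mean in practice, the comparison can be layered from strict to loose. Everything here is an illustrative assumption, not a prescribed method: the function names (`exact_match`, `normalized_match`, `any_match`, `grade`), the normalization rules, and the `judge` callable, which stands in for whatever LLM-as-a-judge call you would actually make.

```python
import string
from typing import Callable, Optional, Sequence


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(answer: str, expected: str) -> bool:
    """Strictest level: the strings must be identical."""
    return answer == expected


def normalized_match(answer: str, expected: str) -> bool:
    """Looser level: ignore case, punctuation, and extra whitespace."""
    return normalize(answer) == normalize(expected)


def any_match(answer: str, acceptable: Sequence[str]) -> bool:
    """More than one acceptable answer: pass if any of them matches."""
    return any(normalized_match(answer, expected) for expected in acceptable)


def grade(
    answer: str,
    acceptable: Sequence[str],
    judge: Optional[Callable[[str, Sequence[str]], bool]] = None,
) -> bool:
    """Try the cheap checks first; fall back to a judge only if one is given.

    `judge` is a hypothetical hook: in a real setup it would wrap an LLM call
    that decides whether `answer` is equivalent to one of `acceptable`.
    """
    if any_match(answer, acceptable):
        return True
    if judge is not None:
        return judge(answer, acceptable)
    return False
```

For example, `grade("Paris.", ["paris"])` passes on the normalized check alone, while an answer like "City of Light" would only pass if a judge is supplied and accepts it. Running the cheap checks first keeps the judge for the cases that actually need it.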
...
dim - comparison exactness