offline eval

Another term for link not tracked... basically, evaluation run against saved datasets before a change ships, i.e. before online monitoring.

Backlinks

blog post - the three pillars of AI observability - 2025-11

Once you capture annotations, you should save them into datasets which are the basis for offline evals. Each time you want to make a change to your application, you should evaluate it against the datasets you've accumulated to approximate its impact. Although people often use the term golden dataset, we have seen a shift away from this concept in favor of a more fluid approach called "dataset reconciliation", where the goal shifts to incrementally and frequently updating datasets to represent real-world behaviors, rather than commissioning one up front.
...
saving annotations into your eval dataset
...
A move away from a golden set toward dataset reconciliation. See: diff - golden set vs. dataset reconciliation
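
A minimal sketch of the loop described in the excerpt, under stated assumptions: annotations are merged into a dataset incrementally (the "dataset reconciliation" idea), and each application change is scored against the accumulated dataset before it ships. The names `Example`, `reconcile`, and `offline_eval`, and the exact-match scoring rule, are hypothetical illustrations and do not come from the blog post; a real setup would likely use a richer scorer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Example:
    """One annotated case: an input and the output a reviewer accepted."""
    example_id: str
    input_text: str
    expected: str


def reconcile(dataset: dict[str, Example], annotations: list[Example]) -> dict[str, Example]:
    """Return a new dataset with fresh annotations merged in.

    An annotation with an existing id replaces the older entry, so the
    dataset tracks current real-world behavior instead of staying frozen.
    """
    merged = dict(dataset)
    for example in annotations:
        merged[example.example_id] = example
    return merged


def offline_eval(app: Callable[[str], str], dataset: dict[str, Example]) -> float:
    """Score an application version as the fraction of exact matches.

    Exact match is an assumption made to keep the sketch short.
    """
    if not dataset:
        raise ValueError("dataset is empty")
    passed = sum(app(ex.input_text).strip() == ex.expected for ex in dataset.values())
    return passed / len(dataset)


# First batch of annotations seeds the dataset.
dataset = reconcile({}, [
    Example("a1", "2+2", "4"),
    Example("a2", "capital of France", "Paris"),
])

# A later batch revises one expectation and adds a new case.
dataset = reconcile(dataset, [
    Example("a2", "capital of France", "Paris, France"),
    Example("a3", "3*3", "9"),
])


def baseline_app(text: str) -> str:
    """Stand-in for the current application version."""
    return {"2+2": "4", "capital of France": "Paris"}.get(text, "")


def candidate_app(text: str) -> str:
    """Stand-in for the proposed change."""
    answers = {"2+2": "4", "capital of France": "Paris, France", "3*3": "9"}
    return answers.get(text, "")


baseline_score = offline_eval(baseline_app, dataset)
candidate_score = offline_eval(candidate_app, dataset)
print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f}")
```

Comparing the two scores on the same reconciled dataset approximates the impact of the change; because the dataset keeps absorbing annotations, the comparison should be rerun whenever either the application or the dataset changes.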