Once you capture annotations, save them into datasets, which form the basis for offline evals. Each time you want to change your application, evaluate the change against the datasets you've accumulated to approximate its impact. Although people often use the term "golden dataset", we have seen a shift away from this concept in favor of a more fluid approach called "dataset reconciliation": rather than commissioning one dataset up front, the goal is to update datasets incrementally and frequently so that they keep representing real-world behavior.
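As a rough illustration of the idea above, the sketch below merges a batch of new annotations into an existing dataset and then scores an application against it. Everything here is an assumption made for the example: the `Annotation` fields, the `reconcile` and `pass_rate` helpers, and the exact-match scoring are hypothetical and not tied to any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Annotation:
    # Hypothetical schema: these field names are assumptions for illustration.
    example_id: str
    input: str
    expected: str
    annotated_at: int  # e.g. a Unix timestamp


def reconcile(
    dataset: dict[str, Annotation], new: list[Annotation]
) -> dict[str, Annotation]:
    """Merge new annotations into the dataset, keeping the latest per example."""
    merged = dict(dataset)  # copy so the caller's dataset is not mutated
    for ann in new:
        current = merged.get(ann.example_id)
        if current is None or ann.annotated_at >= current.annotated_at:
            merged[ann.example_id] = ann
    return merged


def pass_rate(dataset: dict[str, Annotation], app: Callable[[str], str]) -> float:
    """Fraction of examples where the app's output exactly matches `expected`."""
    if not dataset:
        return 0.0
    hits = sum(app(ann.input) == ann.expected for ann in dataset.values())
    return hits / len(dataset)


# Usage: fold a fresh batch of annotations into the accumulated dataset.
dataset = {"q1": Annotation("q1", "2+2", "4", 100)}
batch = [
    Annotation("q1", "2+2", "four", 200),  # newer annotation replaces the old one
    Annotation("q2", "capital of France", "Paris", 200),  # new example is added
]
dataset = reconcile(dataset, batch)
```

Running `pass_rate` before and after an application change gives a coarse signal of its impact. In practice the exact-match comparison would likely be swapped for whatever scoring your evals already use.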
...
saving annotations into your eval dataset
...
A move away from the golden set toward dataset reconciliation. Diff: golden set vs. dataset reconciliation