

evaluation

Backlinks

blog post - the three pillars of AI observability - 2025-11

You can't stare at a prompt and know what's going to happen. AI systems are inherently non-deterministic, and therefore you must measure their behavior to know how they perform. This process is called "evaluation", and you can do it both in production ("online") and in dev and CI ("offline").
...
The process is called evaluation. Related: online monitoring; diff - pre-deploy evals vs. online monitoring
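
The excerpt's point, that a non-deterministic system has to be measured rather than inspected, can be illustrated with a small offline eval. This is only a sketch under stated assumptions: `EvalCase`, `pass_rate`, and `make_flaky_model` are hypothetical names invented here, the "model" is a seeded random stub rather than a real AI system, and exact-match scoring stands in for whatever scorer a real evaluation would use.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One offline test case: a prompt and the answer we expect (hypothetical type)."""
    prompt: str
    expected: str


def pass_rate(model: Callable[[str], str],
              cases: Sequence[EvalCase],
              trials: int = 20) -> float:
    """Call the model `trials` times per case and return the fraction of exact matches.

    Repeating each case matters because a single call to a
    non-deterministic system says little about its typical behavior.
    """
    if not cases or trials <= 0:
        raise ValueError("need at least one case and one trial")
    passed = 0
    for case in cases:
        for _ in range(trials):
            if model(case.prompt).strip() == case.expected:
                passed += 1
    return passed / (len(cases) * trials)


def make_flaky_model(seed: int, error_rate: float) -> Callable[[str], str]:
    """Stand-in for an AI system (assumption): answers correctly except at `error_rate`."""
    rng = random.Random(seed)
    answers = {"2+2": "4", "capital of France": "Paris"}

    def model(prompt: str) -> str:
        if rng.random() < error_rate:
            return "unsure"
        return answers.get(prompt, "unsure")

    return model


if __name__ == "__main__":
    cases = [EvalCase("2+2", "4"), EvalCase("capital of France", "Paris")]
    rate = pass_rate(make_flaky_model(seed=0, error_rate=0.2), cases, trials=50)
    print(f"offline pass rate: {rate:.2%}")
```

The same `pass_rate` function could in principle be run in CI against a fixed case set (offline) or over a sample of logged production requests (online); only the source of the cases changes.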
