review - Open AI harness blog post - 2026-02-22

Notes:


I will re-read blog post - OpenAI harness engineering 2026-02 and make notes about what I marked link not tracked in my earlier free recall - Open AI harness blog post - 2026-02 session. Also, I'll make note of any other areas that I think I may have missed.


Gaps

Detailed Reading Notes

One noteworthy thing is that this is internal users only (and external alpha testers) but not real users.

To do that, we needed to understand what changes when a software engineering team’s primary job is no longer to write code, but to design environments, specify intent, and build feedback loops that allow Codex agents to do reliable work.
...
Emphasis on link not tracked and building link not trackeds

Even the initial AGENTS.md file that directs agents how to work in the repository was itself written by Codex.

This translates to an average throughput of 3.5 PRs per engineer per day, and surprisingly the throughput has increased as the team has grown to now seven engineers. Importantly, this wasn’t output for output’s sake: the product has been used by hundreds of users internally, including daily internal power users.
...
The team velocity increased over time and as more developers were added. Basically the opposite of the link not tracked.

The lack of hands-on human coding introduced a different kind of engineering work, focused on systems, scaffolding, and leverage.
...
I feel like leverage and link not trackeds are related

In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test, etc), prompting the agent to construct those blocks, and using them to unlock more complex tasks. When something failed, the fix was almost never “try harder.” Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: “what capability is missing, and how do we make it both legible and enforceable for the agent?”
...
link not tracked: I didn't quite get this notion of working depth first, but I can see how it would fall out of their principle of no human -- coding.
...
link not tracked: I caught the idea of making dim - legibility -- higher, but "enforceability" kinda eluded me. The only thing I can think of, as I read this again, is linting, but I know there's more to it. I'll keep my eyes peeled as I read further.
...
breaking down larger goals into smaller building blocks (design, code, review, test, etc), prompting the agent to construct those blocks
...
SDLC building block -- design, SDLC building block -- code, SDLC building block -- review, link not tracked

To drive a PR to completion, we instruct Codex to review its own changes locally, request additional specific agent reviews both locally and in the cloud, respond to any human or agent given feedback, and iterate in a loop until all agent reviewers are satisfied (effectively this is a ralph wiggum loop). Codex uses our standard development tools directly (gh, local scripts, and link not tracked) to gather context without humans copying and pasting into the CLI.
...
practice - specialized agent -- code review or eval
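
To pin down the "iterate until all agent reviewers are satisfied" idea for myself, here is a minimal sketch of that loop. It is my own illustration, not anything from the post: `drive_to_completion`, the reviewer functions, and `revise` are all hypothetical names, and a "change" is just a string standing in for a PR.

```python
def drive_to_completion(change, reviewers, revise, max_rounds=10):
    """Re-run every reviewer until none of them returns feedback.

    Hypothetical sketch: each reviewer is a callable returning a list of
    feedback notes (empty list means "satisfied"), and revise() stands in
    for the agent updating its change in response to that feedback.
    """
    for round_number in range(1, max_rounds + 1):
        feedback = [note for review in reviewers for note in review(change)]
        if not feedback:
            return change, round_number
        change = revise(change, feedback)
    raise RuntimeError("reviewers still unsatisfied after max_rounds")


# Toy reviewers (assumptions for the demo only).
def needs_tests(change):
    return [] if "tests" in change else ["tests"]


def needs_docs(change):
    return [] if "docs" in change else ["docs"]


def revise(change, feedback):
    # Pretend the agent addresses every note by appending it to the change.
    return change + "+" + "+".join(feedback)


final_change, rounds = drive_to_completion("code", [needs_tests, needs_docs], revise)
```

The `max_rounds` cap is my addition; the post does not say whether their loop has a bound.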
...
"both locally and in the cloud". Hmm... what's the advantage here?

In practice, this meant working depth-first: breaking down larger goals into smaller building blocks (design, code, review, test, etc), prompting the agent to construct those blocks, and using them to unlock more complex tasks
...
It sounds like they are saying the agent isn't good enough to produce these building blocks on its own when asked for the end result directly, but it can produce them when explicitly driven to build each one.

We also wired the Chrome DevTools Protocol into the agent runtime and created skills for working with DOM snapshots, screenshots, and navigation. This enabled Codex to reproduce bugs, validate fixes, and reason about UI behavior directly.
...
Hmm... I wonder what "wired the Chrome DevTools Protocol into the agent runtime" means? Sounds more direct than a model context protocol - MCP or agent skill. They did create skills, but those are separate: link not tracked, link not tracked, link not tracked

Codex works on a fully isolated version of that app—including its logs and metrics, which get torn down once that task is complete. Agents can query logs with LogQL and metrics with PromQL. With this context available, prompts like “ensure service startup completes in under 800ms” or “no span in these four critical user journeys exceeds two seconds” become tractable.
...
The part I missed was that the agent has specific tools that allow it to interact with the logs and metrics better than just reading the logs and metrics
...
Their emphasis on logs and metrics makes me think about link not tracked and link not tracked. It also makes me think about one of the specialized tools I wrote for meadow that is essentially a Terminal Ui - TUI that ensures that everything gets cleanly committed as I use the app. Like link not tracked
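
To make the "no span in these four critical user journeys exceeds two seconds" prompt concrete, here is a minimal sketch of the check an agent could run once it has span data in hand. This is my own illustration: the `Span` shape, the journey names, and `spans_over_budget` are assumptions, and it deliberately skips the LogQL/PromQL querying step that the post describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    journey: str        # hypothetical: user journey the span belongs to
    name: str           # hypothetical: operation name
    duration_ms: float


def spans_over_budget(spans, journeys, budget_ms=2000.0):
    """Return the spans in the watched journeys that exceed the budget."""
    watched = set(journeys)
    return [s for s in spans if s.journey in watched and s.duration_ms > budget_ms]


# Made-up sample data, only to exercise the check.
spans = [
    Span("checkout", "load_cart", 350.0),
    Span("checkout", "submit_payment", 2400.0),
    Span("search", "query_index", 2100.0),
]
violations = spans_over_budget(spans, journeys=["checkout"])
```

The point for me is that once the telemetry is queryable, the performance requirement becomes a pass/fail check the agent can iterate against.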

Documentation

Note: in this section they describe their documentation. I've been referring to this type of documentation as a documentation context graph

Most of the way through there is an epic section about why a monolithic document is problematic and a graph is much better
...
link not tracked

There is a short map of context, essentially, that is directly injected into AGENTS.md.
...
Makes me think about how skills are sometimes ignored and directly injecting information about them into the agent file is often more effective

In that section there is a quote about how when everything is important, nothing is important. There is also something about how documentation skew from the code is inevitable in a large document, because it's impossible to do things like cross-check links, look at metadata around how recently something was updated, etc.

We regularly see single Codex runs work on a single task for upwards of six hours (often while the humans are sleeping).
...
dim - task horizon length -- longer

Design documentation is catalogued and indexed, including verification status and a set of core beliefs that define agent-first operating principles. link not tracked provides a top-level map of domains and package layering. A quality document grades each product domain and architectural layer, tracking gaps over time.
...
The architecture.md file makes me think about link not tracked and how link not tracked.

Plans are treated as first-class artifacts. Ephemeral lightweight plans are used for small changes, while complex work is captured in link not tracked with progress and decision logs that are checked into the repository. Active plans, completed plans, and known technical debt are all versioned and co-located, allowing agents to operate without relying on external context.
...
I wasn't sure I had gotten this in my free recall, but it looks like I did. link not tracked makes it seem like they don't rely on something like link not tracked or Beads, but they do, also... just at a higher level.

We enforce this mechanically. Dedicated linters and CI jobs validate that the knowledge base is up to date, cross-linked, and structured correctly. A recurring “doc-gardening” agent scans for stale or obsolete documentation that does not reflect the real code behavior and opens fix-up pull requests.
...
Basically validation scripts... some of them are custom linting rules. link not tracked
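
As an example of what one of these validation scripts might look like, here is a minimal sketch of a "cross-linked" check that finds relative Markdown links pointing at files that don't exist. It is my own guess at the shape of such a linter, not the post's implementation; `broken_links` and the file names in the demo are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Matches inline Markdown links like [text](target); deliberately simple.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(docs_root):
    """Return (file, target) pairs for relative links whose target is missing."""
    root = Path(docs_root)
    problems = []
    for md in sorted(root.rglob("*.md")):
        for target in LINK_RE.findall(md.read_text(encoding="utf-8")):
            # Skip external links and same-page anchors.
            if target.startswith(("http://", "https://", "mailto:", "#")):
                continue
            path_part = target.split("#", 1)[0]
            if not (md.parent / path_part).exists():
                problems.append((md.relative_to(root).as_posix(), target))
    return problems


# Demo on a throwaway docs directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp)
    (demo_root / "index.md").write_text(
        "[arch](architecture.md) [gone](missing.md)", encoding="utf-8"
    )
    (demo_root / "architecture.md").write_text("# Architecture", encoding="utf-8")
    demo_problems = broken_links(demo_root)
```

A CI job could fail whenever a check like this returns a non-empty list; the staleness side ("does not reflect the real code behavior") seems like the part that needs an agent rather than a script.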

Agent Legibility

Giving Codex more context means organizing and exposing the right information so the agent can reason over it, rather than overwhelming it with ad-hoc instructions. In the same way you would onboard a new teammate on product principles, engineering norms, and team culture (emoji preferences included), giving the agent this information leads to better-aligned output.
...
link not tracked
...
kind of the opposite of link not tracked, where it helps you work with all those other sources. In this case, the sources are pulled into the repo (and organized).

This framing clarified many tradeoffs. We favored dependencies and abstractions that could be fully internalized and reasoned about in-repo. Technologies often described as “boring” tend to be easier for agents to model due to composability, api stability, and representation in the training set. In some cases, it was cheaper to have the agent reimplement subsets of functionality than to work around opaque upstream behavior from public libraries. For example, rather than pulling in a generic p-limit-style package, we implemented our own map-with-concurrency helper: it’s tightly integrated with our OpenTelemetry instrumentation, has 100% test coverage, and behaves exactly the way our runtime expects.
...
link not tracked
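
To get a feel for how small a map-with-concurrency helper can be, here is a minimal Python sketch using asyncio. This is my own illustration and not their helper: theirs is described as integrated with their OpenTelemetry instrumentation, which this leaves out entirely, and the name `map_with_concurrency` and its signature are my assumptions.

```python
import asyncio


async def map_with_concurrency(func, items, limit):
    """Apply async func to every item, running at most `limit` calls at once.

    Results come back in the same order as the input items.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await func(item)

    return await asyncio.gather(*(run_one(item) for item in items))


async def _demo():
    active = 0
    peak = 0

    async def double(x):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)   # track the highest observed concurrency
        await asyncio.sleep(0.01)
        active -= 1
        return x * 2

    results = await map_with_concurrency(double, range(6), limit=2)
    return results, peak


demo_results, demo_peak = asyncio.run(_demo())
```

Seeing it at this size makes the tradeoff in the quote more plausible: a helper this small is easy to test fully and easy for an agent to reason about in-repo.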

Enforcing architecture and taste

Documentation alone doesn’t keep a fully agent-generated codebase coherent. By enforcing invariants, not micromanaging implementations, we let agents ship fast without undermining the foundation. For example, we require Codex to link not tracked, but are not prescriptive on how that happens (the model seems to like Zod, but we didn’t specify that specific library).
...
link not tracked
...
link not tracked

todo: I don't think I got through this section fully

Throughput changes the merge philosophy

todo: read this section and fill this out

Hmm... merge philosophy
...
Perhaps they touch on this: specialized agent -- code merge