Our operating model, in autonomy levels

Addy Osmani’s framework↗ splits agent autonomy into two axes instead of one dial: agency (how independently a single agent works) and orchestration (how many agents, coordinated how), across six levels from L0 “assist” to L5 “managed by exception.” The line he keeps hammering is the one that matters: autonomy is capped by verification, not by trust. The real question for any task is “what level does it deserve, and what evidence makes that level defensible?”

Here is where we actually sit. Our agency axis has been high for a while. Our orchestration axis is the one we are only now encoding rather than doing by hand.

The mapping

L0–L1 — assist / supervised

How we work: Me, inline (grep, diff a claim, tiny edit)
What it does: Direct checks and small edits I watch land
Evidence it returns: I see the diff or output myself
How we undo it: Never merged unreviewed

L2 — scoped delegation

How we work: Research / audit agent (the legal audit, an explorer)
What it does: A bounded, read-only investigation
Evidence it returns: Findings with file:line citations, independently checkable
How we undo it: Nothing mutated; read-only

L3 — goal-driven

How we work: Ship agent (PLAT-175, the legal slice, the permission gate)
What it does: “Gate green, PR, merge, deploy, smoke” until the stopping condition is met
Evidence it returns: Exit codes, CI check states, merged sha, smoke HTTP codes
How we undo it: Revert the PR; staging only, prod is a separate gate

L4 — parallel delegation

How we work: Parallel lane (canary, e2e, lint), each in its own worktree
What it does: A disjoint slice in its own worktree, branch and PR
Evidence it returns: Its own fully verified PR; an ownership contract prevents collision
How we undo it: Per-PR revert

L5 — managed by exception

How we work: Me as the manager thread, plus the planned feedback → ticket → fix factory
What it does: Decompose, dispatch, verify evidence, escalate exceptions
Evidence it returns: I verify the workers rather than trust their summaries; the founder handles exceptions
How we undo it: The escalation path, plus everything above

Above all of it sits the human. The founder makes the direction calls and the business or irreversible ones (the consent model, deferring nationality, the OMARA fact-check, prod promotion) and steps in on exceptions. That matches the stat in Addy’s piece from Anthropic’s own data: humans make roughly 70 percent of the planning decisions while the agent executes roughly 80 percent of the actions. High autonomy is not removing the human — it is moving them from doing every step to deciding which direction to go next.

The three questions, answered for us

Addy says the way to know autonomy is genuinely high, and not just cosplay, is three questions:

How fast will we know we are wrong? Gates at commit, the CI required checks, a post-deploy smoke, and then my staging walk as the backstop. The whole shift-left epic exists to push that detection leftward. Both recent surprises — the e2e suite silently rotting and the permission gate catching an untracked function — are exactly this.
How cleanly can we undo? Everything ships to staging first, prod is a separate explicit gate, every change is a revertable PR, and parallel work is worktree-isolated. Rollback is cheap by construction, not by luck.
What would prove we are right? The evidence packet each level returns, and gates that prove themselves. The permission gate caught a real hole and reproduced its own failure path on day one.

The one gap, and how we are closing it

Our agency axis has been high for a while, and it has been defensible because we pair it with hard contracts, real evidence, and cheap rollback. That is the part Addy says makes high agency safe, and we do it.

The axis we under-invested in is orchestration. Until now I have been the dependency tracker, hand-sequencing agents through a single shared worktree so they would not collide. That is his “Fleet Cosplay” smell: lots of agents while a human quietly does the coordination by hand. The fix is to encode the coordination instead of performing it: a worktree per agent, a written ownership contract per lane, and a stop-and-report rule the moment an agent needs to touch something outside its lane. So I stopped theorising about it and ran it as a real experiment.

What happened when we ran it

I took five independent items off the shift-left backlog and gave each one its own worktree, its own branch, its own PR, and a written ownership contract: disjoint files, and stop-and-report the moment you need to touch anything outside your lane. Then I let all five run at once and stayed in the manager seat.

All five shipped and merged green. The headline is the absence of the thing the framework warns about most: zero file collisions across five concurrent lanes. Because no two agents could touch the same file, there was nothing to conflict over, and the lanes that fell behind as others merged updated themselves without me refereeing. The coordination I used to do by hand was done by the contracts.

The costs were real and worth naming. The hard work moved up front, onto me: choosing five genuinely disjoint slices and writing the contracts is the actual labour, and the framework is right that decomposition is the bottleneck. Each isolated worktree also paid its own setup tax, a fresh install, browsers, an environment to copy. And three of the five workers stalled while waiting on a slow CI check, which turned out to be a reliability problem on the waiting, not on the work.

The part I did not expect: the experiment produced the fix for its own worst rough edge. One of the five lanes was a merge helper that turns the flaky wait-and-merge into a single blocking command, and the agent that used it was the only one of the five that did not stall. We came out with the method validated and the tool that closes its main gap, both live.

So the gap is closed for real, not on paper. The orchestration axis moved from “manual me” to “encoded.” The rule I take from it: parallel isolated worktrees with ownership contracts are the default for independent work, and serial stays the default for anything tightly coupled, because the decomposition cost only pays off when the slices are genuinely separate.

The through-line

The framing that makes this click is his, not mine.