Loops, levels, and building a platform that drives itself
First, loops
Everyone throws around “agent loops” and “loop engineering,” and it took me a beat to separate the tool from the idea.
A loop is the thing that makes an agent autonomous at all. It is a cycle: take an action, observe what happened, decide the next action from that, repeat until a condition is met. A one-shot prompt answers once. A loop keeps going — “run the tests, read the failure, fix it, run them again” — until the tests are green or it hits a wall. That cycle is the engine under every level of autonomy above plain assistance.
“Loop engineering” gets used two ways. The narrow one is a feature: a way to run a task on a cadence or let it self-pace, so an agent keeps working across time without me re-prompting it. The broad one is the craft of building a good loop, and that is the part that actually matters. A bad loop spins forever or drifts off a cliff. A good loop has four things:
- a stopping condition you can measure (“until the suite is green,” “until two rounds find nothing new”), so it knows when it is done or stuck;
- verification every iteration, so it never builds on a broken step;
- a budget, a hard cap on attempts or tokens, so it cannot run away;
- an escalation path, so when it is stuck it surfaces to a human instead of grinding.
Notice the theme: the hard part of a loop is not the doing, it is knowing it did the right thing. Which is the same thing that governs autonomy in general.
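To make those four properties concrete, here is a minimal sketch of a loop that has all of them. It is an illustration under stated assumptions, not the code we actually run: `run_loop`, `LoopResult`, and the `act`, `verify`, and `escalate` callables are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class LoopResult:
    status: str    # "done" or "escalated"
    attempts: int  # how many iterations were used
    detail: str    # last message returned by verify()


def run_loop(
    act: Callable[[str], None],
    verify: Callable[[], Tuple[bool, str]],
    max_attempts: int = 5,
    escalate: Callable[[str], None] = print,
) -> LoopResult:
    """Hypothetical agent loop: act, verify, stop on success or budget."""
    last_failure = ""
    for attempt in range(1, max_attempts + 1):   # budget: hard cap on attempts
        act(last_failure)                        # act on the last observation
        ok, detail = verify()                    # verification every iteration
        if ok:                                   # measurable stopping condition
            return LoopResult("done", attempt, detail)
        last_failure = detail
    # Budget exhausted: surface to a human instead of grinding on.
    escalate(f"stuck after {max_attempts} attempts: {last_failure}")
    return LoopResult("escalated", max_attempts, last_failure)


if __name__ == "__main__":
    # Toy stand-in for "fix it, run the tests again".
    state = {"failing": 2}

    def fix(last_failure: str) -> None:
        state["failing"] = max(0, state["failing"] - 1)

    def run_tests() -> Tuple[bool, str]:
        return state["failing"] == 0, f"{state['failing']} failing"

    print(run_loop(fix, run_tests))
```

The loop never decides for itself that it is finished. Only `verify` can end it early, and only the budget can end it otherwise, which is what keeps it from spinning or drifting.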
Where we are: level four, tipping into five
There is a clean ladder for this, and it comes from Addy Osmani’s writeup on agentic autonomy levels, which is what got me mapping this out in the first place. I put our whole operating model onto it in a separate note: L0 assist, up to L5 managed-by-exception. We sit at L4, parallel delegation, with the L5 manager loop starting to form.
Concretely: work fans out to several worker agents at once, each in its own isolated copy of the codebase under a written ownership contract, and a manager thread dispatches them, checks their evidence, and only escalates the genuine decisions to me. I proved this was not theory in a live experiment this week — five independent pieces of work running at the same time, zero file collisions, each shipping its own verified change. The old way, me hand-sequencing one agent at a time, was leaving most of that speed on the table.
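One way to picture the ownership contract is as a set of paths per worker, checked for overlap before anything is dispatched. The sketch below assumes contracts are plain sets of exact file paths. The function name, worker names, and paths are all hypothetical, and a real contract might use directories or globs instead.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def find_collisions(contracts: Dict[str, Set[str]]) -> List[Tuple[str, str, str]]:
    """Return (worker_a, worker_b, path) for every path two workers both claim."""
    collisions = []
    # Compare every pair of workers once, in a stable order.
    for a, b in combinations(sorted(contracts), 2):
        for path in sorted(contracts[a] & contracts[b]):
            collisions.append((a, b, path))
    return collisions


if __name__ == "__main__":
    # Hypothetical contracts for three parallel workers.
    contracts = {
        "worker-billing": {"app/billing.py", "tests/test_billing.py"},
        "worker-search": {"app/search.py"},
        "worker-reports": {"app/reports.py", "app/billing.py"},
    }
    for a, b, path in find_collisions(contracts):
        print(f"{a} and {b} both claim {path}")
```

An empty result means the work can fan out safely. Anything else is a shared file, which is the kind of case that comes back to a human to reconcile.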
How we got here
The path was not a plan, it was a series of realisations.
It started with me building everything by hand. Then I began handing execution to a “central nervous system” — an orchestrator that decomposes work and delegates it, while I keep the reasoning. The turning point was noticing that I had quietly become the QA department: I was catching bugs by eye on a staging walk that a machine should have caught first. That is backwards. My attention is the most expensive checkpoint in the company. So the work became building the checks, and once the checks existed it became safe to run more agents in parallel, because a bad change could not get far.
The mechanisms holding this level up
Autonomy is capped by verification, not by trust. So the reason we can run at L4 is a stack of concrete gates, each one a place a mistake dies automatically:
- An enforced merge gate. One script is the only way anything merges, and it refuses until every required check is green. No agent, and no rushed script of mine, can merge red (I know, because I tried by accident and it caught me).
- A permission gate that rebuilds the database from its migration history and fails the build if the paid dataset ever becomes readable by a stranger. It caught a real, untracked piece of drift on its very first run.
- Lint gates that make whole classes of bug impossible to reintroduce: a dropped Slack ping, a hardcoded fake number, a copy-pasted route string.
- A post-deploy canary that screenshots and click-crawls the site after every deploy, so the walk I used to do by hand starts happening automatically.
- A restored end-to-end suite (it had silently rotted, and we found and fixed that too).
- Isolated worktrees and ownership contracts, so parallel agents physically cannot step on each other, and the one shared file comes back to me to reconcile.
- Ticket-first and discover-first, so nothing gets built without a record, or without first checking it does not already exist.
Every one of those is a loop or a gate that lets an agent move fast without me watching.
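The merge gate is the simplest of these to sketch. The version below is a guess at the core decision only, not our actual script: it assumes each required check reports a status string, and the check names and the `merge_decision` function are hypothetical.

```python
from typing import List, Mapping, Sequence, Tuple


def merge_decision(
    required: Sequence[str], results: Mapping[str, str]
) -> Tuple[bool, List[str]]:
    """Allow a merge only if every required check reported "green".

    A check that is missing from results counts as blocking, so a check
    that never ran cannot be mistaken for a check that passed.
    """
    blocking = [name for name in required if results.get(name) != "green"]
    return (not blocking, blocking)


if __name__ == "__main__":
    required = ["lint", "permissions", "e2e"]  # hypothetical check names
    results = {"lint": "green", "permissions": "green"}  # e2e never reported
    allowed, blocking = merge_decision(required, results)
    print("merge allowed" if allowed else f"refusing to merge, blocked by: {blocking}")
```

The design choice worth copying is the default: anything that is not explicitly green blocks, including silence.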
Where we are headed: level five, managed by exception
The target is a factory. Work enters as an issue and leaves as a shipped, verified change, and I only ever get pulled in on exceptions — the things that genuinely need a human call. I stop operating the machine and start setting its direction.
What it takes to get there
This is the honest roadmap, and it is all mechanism, not magic:
- Finish the safety machine. A synthetic user that runs signup, pre-register and report generation on a schedule and screams if a side effect silently fails. A design lint that fails on anything off-brand. The end-to-end suite promoted to a required check so it cannot rot again. The slow pre-push hook trimmed so it stops getting bypassed.
- Build the factory loop. The one that turns a staging-walk catch into an auto-filed ticket, a scoped fix agent, a verified PR, and a closed loop, with a human only on the exception. That is the feedback-to-fix loop, and it is the single biggest lever for handing off more.
- Stand up the recurring loops. A backlog-grind loop that works a well-scoped list to done. A health loop that exercises the app on a cadence and self-heals or alerts. This is where “loop engineering” earns its name: standing, self-verifying cycles that keep the platform in a good state without anyone driving.
- Set the autonomy dial. The one call that stays mine: how much the factory is allowed to ship without asking — file the ticket only, or spin the fix and wait for a nod, or auto-ship the low-risk ones. That dial is what actually moves us from L4 to L5, and it is deliberately a human decision.
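The autonomy dial can be sketched as a three-position setting that decides how far a catch is allowed to travel on its own. This is an illustration only: the enum, the step names, and the idea of passing risk in as a single boolean are assumptions made for the example.

```python
from enum import Enum
from typing import List


class Dial(Enum):
    TICKET_ONLY = 1         # file the ticket only
    FIX_AND_WAIT = 2        # spin the fix and wait for a nod
    AUTO_SHIP_LOW_RISK = 3  # auto-ship the low-risk ones


def plan(dial: Dial, low_risk: bool) -> List[str]:
    """Return the steps the factory may take for one catch (hypothetical names)."""
    steps = ["file_ticket"]
    if dial is Dial.TICKET_ONLY:
        return steps
    steps.append("open_verified_fix_pr")
    if dial is Dial.AUTO_SHIP_LOW_RISK and low_risk:
        steps.append("ship")
    else:
        # Everything else stops here and waits for a human.
        steps.append("wait_for_human_approval")
    return steps


if __name__ == "__main__":
    for dial in Dial:
        print(dial.name, plan(dial, low_risk=True))
```

Even at the highest setting, a change that is not low-risk still ends at `wait_for_human_approval`, so turning the dial up widens what ships unattended without removing the exception path.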
The through-line
A platform that drives itself is not one with fewer checks. It is one with more checks, running automatically, so the human moves from doing the work to pointing at what matters and catching the rare thing the machine cannot.