PRESENTS A FIELD GUIDE

THE FIELD GUIDE/14 STEPS/3 TIERS

From prompter to loop designer.

9 out of 10 builders have never written a single loop that prompts the agent for them. No automation, no state file, no verifier, no schedule. The leverage point has moved — from typing prompts to designing the systems that prompt.

Sourced from Anthropic's engineering docs · long-form writing on loop engineering · recent measurement studies.

↻ THE LOOP you design it once · it runs unattended

01 Find work SCAN · TRIGGER

→

02 Hand off PROMPT AGENT

→

03 Verify THE GATE

→

04 Record STATE FILE

→

05 Decide next CONTINUE · STOP

↺ repeats on a schedule — no human in the chair

The agent was a tool you held the entire time. That part is ending.

Stop prompting. Start designing.


The roadmap at a glance

Three tiers. Figure out if you need a loop → learn the building blocks → build the smallest one that works.

TIER 01 STEPS 01–04

The Why & The Test

Decide whether you actually need a loop before you spend a token building one.

01 Replacing yourself as prompter · 02 The 4-condition test · 03 Who wins, who loses · 04 The 30-second loop check

TIER 02 STEPS 05–09

The 5 Building Blocks

The five primitives every working loop is assembled from — in the two tools that matter.

05 Automations: the heartbeat · 06 Worktrees: parallel without chaos · 07 Skills: write knowledge once · 08 Connectors: touch real tools · 09 Sub-agents: maker vs checker

TIER 03 STEPS 10–14

Build It Right or Don't

Assemble the smallest loop that works — and the failure modes that turn loops into money pits.

10 The state file · 11 The minimum viable loop · 12 The Ralph Wiggum loop · 13 Comprehension debt · 14 The security tax

PART ONE · THE WHY & THE TEST

Do you actually need a loop?

Most developers don't — not yet. The honest version of this guide is that a loop earns its cost only under specific conditions. Miss one and it bleeds tokens. This tier is the decision, before a single line of setup.

01 Replace yourself · 02 The 4-condition test · 03 Who wins, who loses · 04 The 30-second check

THE SHIFT

Loop engineering is replacing yourself as the prompter.

For two years the way you got something out of a coding agent was: write a prompt, share context, read what came back, write the next prompt. The agent was a tool and you held it the entire time.

Loop engineering is building a small system that finds the work, hands it to the agent, checks the result, records what happened, and decides the next move — on its own. You design that system once. It prompts the agent from then on.
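The five steps can be sketched as one small driver. This is a minimal sketch under stated assumptions, not any tool's actual API: `find_work`, `run_agent`, and `verify` are hypothetical callables you would supply, `LoopState` is an invented shape, and `max_iterations` stands in for whatever hard stop you choose.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopState:
    """What survives between iterations (hypothetical shape)."""
    done: list = field(default_factory=list)
    failed: list = field(default_factory=list)


def run_loop(
    find_work: Callable[[LoopState], Optional[str]],
    run_agent: Callable[[str], str],
    verify: Callable[[str], bool],
    max_iterations: int = 10,
) -> LoopState:
    """Find work, hand off, verify, record, decide next."""
    state = LoopState()
    for _ in range(max_iterations):      # 05 decide next: hard iteration cap
        task = find_work(state)          # 01 find work
        if task is None:                 # nothing left to do, so stop
            break
        result = run_agent(task)         # 02 hand off to the agent
        if verify(result):               # 03 verify: the gate
            state.done.append(task)      # 04 record success
        else:
            state.failed.append(task)    # 04 record failure
    return state
```

In a real loop the state would be written to disk after each iteration rather than held in memory.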

The leverage moved from typing prompts to designing the loop that prompts.

WHERE YOU SIT

BEFORE — you are the loop

YOU ⇄ AGENT · type · wait · read · repeat

AFTER — the system is the loop

YOU → LOOP → AGENT

you design once · the loop prompts from then on

8×

Anthropic engineers now merge roughly eight times as much code per day as in 2024 — a figure Anthropic itself calls "almost certainly an overstatement of the true productivity gain." The number is debated. The mechanism isn't.

THE STRATEGIC TEST

Run the 4-condition test before you build anything.

A loop earns its cost only when all four of these hold at once. They are an AND, not a menu. Miss one and the loop costs more than it returns.

CONDITION 01

The task repeats

A loop amortizes setup across many runs. If the work doesn't recur weekly, you don't have a loop — you have a script you ran once.

CONDITION 02

Verification is automated

A test, type checker, linter, or build that can fail the work without you in the room. No automated check → you're back in the chair reading every diff.

CONDITION 03

Budget absorbs the waste

Loops re-read context, retry, and explore, burning tokens whether or not a run ships. That is an easy call on unmetered tokens and reckless on a metered plan.

CONDITION 04

The agent has senior-engineer tools

Logs, a reproduction environment, the ability to run the code it writes and see what breaks. Without that, the loop iterates blind.

ALL FOUR GREEN → build the loop. ANY ONE RED → keep it a one-shot prompt.
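The AND can be written down literally. A minimal sketch: the class and field names below are invented for illustration, and the answers are still yours to give honestly.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class LoopTest:
    """The four conditions as booleans (hypothetical field names)."""
    task_repeats: bool
    verification_automated: bool
    budget_absorbs_waste: bool
    agent_has_tools: bool

    def failed(self) -> list:
        """Names of the conditions that are red."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def should_build(self) -> bool:
        """True only when all four conditions hold at once."""
        return not self.failed()
```

Calling `failed()` on a red result tells you which condition to fix before revisiting the decision.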

THE ECONOMICS

Who wins, who loses. Loops favor whoever can spend.

The economics are not universal. People who call loop engineering obvious tend to have unmetered tokens. People for whom it is reckless are usually on a $20 consumer plan, trying to run heavy verification loops without hitting plan limits or receiving a surprise invoice.

✓ Who benefits today

Repetitive, machine-checkable work + budget

Continuous test triage, dependency bumps, lint-and-fix passes, issue-to-PR drafts on a codebase with strong coverage.

Codebases with strong test suites

If a junior could do the task from a checklist and a suite would catch their mistakes, a loop fits.

Async-first teams already running multi-agent

For these teams, routines are the missing orchestration layer.

✕ Who should skip it

Solo builders on consumer plans

The token bill arrives before the productivity gain does.

Code with no automated verification

A loop with no real check is the agent agreeing with itself on repeat.

Teams bottlenecked on review, not typing

A loop generates more code; if review was already the constraint, it just lengthens the queue.

The honest version of this guide: loop engineering is real — and most developers don't need it yet.

THE TACTICAL CHECK · 30 SECONDS

The 30-second loop check.

Step 02 was the strategic decision. This is the tactical check you run on a specific task before turning it into a loop. Miss one box → keep it a manual prompt.

☑ THE FIVE BOXES — ALL MUST CHECK

✓

The task happens at least weekly.

Less than weekly → setup cost never amortizes.

✓

A test, type check, build, or linter can reject bad output.

No automated gate → the agent grades its own homework.

✓

The agent can run the code it changes.

No reproduction environment → iteration is blind.

✓

The loop has a hard stop.

Token budget, iteration count, or time limit — or it runs until someone notices the bill.

✓

A human reviews before merge, deploy, or dependency changes.

Anything irreversible needs a human approval gate before action.
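The hard-stop box is the easiest one to turn into code. A minimal sketch, assuming your loop can report tokens used per iteration; `HardStop` and its limits are hypothetical names, not a feature of either tool.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HardStop:
    """Stops a loop on iteration count, token budget, or elapsed time."""
    max_iterations: int
    max_tokens: int
    max_seconds: float
    iterations: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def record(self, tokens_used: int) -> None:
        """Call once per iteration with the tokens that iteration consumed."""
        self.iterations += 1
        self.tokens += tokens_used

    def reason(self) -> Optional[str]:
        """Return why the loop must stop, or None if it may continue."""
        if self.iterations >= self.max_iterations:
            return "iteration limit"
        if self.tokens >= self.max_tokens:
            return "token budget"
        if time.monotonic() - self.started >= self.max_seconds:
            return "time limit"
        return None
```

Check `reason()` before every iteration, and write the returned reason to the state file so the next run knows why the last one ended.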

✓ GOOD FIRST LOOPS

CI failure triage — nightly: scan failures, classify causes, draft fix PRs for the easy ones.

Dependency bump PRs — weekly: scan updates, test compatibility, open PRs.

Lint-and-fix passes — on every PR open, apply style fixes automatically.

Flaky test reproduction — loop until a theory survives the test.

Issue-to-PR drafts — on code with strong tests that reject bad output.

✕ BAD FIRST LOOPS — HUMAN IN THE CHAIR

✕ Architecture rewrites

✕ Auth or payments code

✕ Production deploys

✕ Vague product work

✕ Anything where "done" is a judgment call

PART TWO · THE 5 BUILDING BLOCKS

Everything is built from five primitives.

Every working loop is assembled from the same five parts. Here's each one, and exactly how it shows up in the two tools that matter — Codex and Claude Code.

05 Automations · 06 Worktrees · 07 Skills · 08 Connectors · 09 Sub-agents

BLOCK 01 · THE HEARTBEAT

Automations — what makes a loop a loop.

Automations fire on a schedule, an event, or a trigger condition. They're the heartbeat — everything else in the loop hangs off them. Without one, you don't have a loop; you have a run you did once.

fires every 30m · on event · on trigger →

CODEX

The Automations tab — pick a project, set a prompt, set a cadence, choose local checkout or background worktree. Runs that find something land in a Triage inbox; runs that find nothing archive themselves.

CLAUDE CODE

Three primitives that compose into the same shape: /loop for session cadence, Desktop scheduled tasks for restart-survival, Routines for laptop-off cloud runs. Pair with hooks for lifecycle events.

/loop

Re-runs on a cadence. Use it when you want regular checks regardless of state.

/goal

Keeps going until a condition you wrote is actually true — checked by a separate small model, so the agent that wrote the code isn't the one grading it. The maker-vs-checker split, applied to the stop condition itself.

auth-quality-loop · claude code

> /loop 30m /goal All tests in test/auth pass and lint is clean.
  Scan src/auth for new failures, propose fixes in claude/auth-fixes,
  open draft PR when goal condition holds.

▲ Claude
  CronCreate(*/30 * * * * : auth quality loop)
  Stop condition: tests pass + lint clean (verified by checker)
✓ Scheduled. Will continue past intermediate completions
  until /goal condition is met by independent checker.

BLOCK 02 · ISOLATION

Worktrees — parallel without chaos.

The second you run more than one agent, files start colliding. Two agents writing the same file is the same headache as two engineers committing to the same lines without talking first.

A git worktree fixes it — a separate working directory on its own branch sharing the same repo history, so one agent's edits literally cannot touch the other's checkout.

Worktrees take away the mechanical collision — but you're still the ceiling. Your review bandwidth decides how many parallel agents you can run, not the tool.

ONE REPO · SHARED HISTORY

● main · git history

agent A → worktree/auth-fixes

agent B → worktree/dep-bumps

agent C → worktree/lint-pass

✓ no agent's edits can touch another agent's checkout

CODEX

Builds worktree support in — several threads hit the same repo at once without bumping into each other.

CLAUDE CODE

Exposes git worktree directly, a --worktree flag to open a session in its own checkout, and an isolation: worktree setting on subagents — each helper gets a fresh checkout that cleans itself up after.
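If you script the isolation yourself rather than relying on the built-in support, it comes down to one git command per agent. A minimal sketch: `git worktree add -b` is standard git, while the `agent/<name>` branch prefix and the sibling-directory layout are conventions assumed here for illustration.

```python
import subprocess
from pathlib import Path


def worktree_add_cmd(repo: Path, name: str) -> list:
    """Build the git command that creates a worktree on its own new branch."""
    # Assumed layout: the worktree lives next to the repo, e.g. app-auth-fixes.
    path = repo.parent / f"{repo.name}-{name}"
    return ["git", "-C", str(repo), "worktree", "add", "-b", f"agent/{name}", str(path)]


def create_worktree(repo: Path, name: str) -> None:
    """Run the command; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_add_cmd(repo, name), check=True)
```

Separating the command builder from the runner keeps the layout convention inspectable without touching a repository.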

BLOCK 03 · MEMORY

Skills — write project knowledge once, read it on every run.

A Skill is how you stop re-explaining the same project context every session like a goldfish. Both tools use the same format: a folder with a SKILL.md inside — instructions and metadata, plus optional scripts, references, and assets.

A loop without skills re-derives your whole project context from zero every cycle. With skills, intent compounds — conventions, build steps, "we don't do it like this because of that one incident," written once on the outside and read by every run.

NO SKILL re-derive context every cycle · WITH SKILL resume with intent intact

📄 .claude/skills/ci-triage/SKILL.md

---
name: ci-triage
description: Classify CI failures by root cause (env,
  flake, real bug, dependency, infra), draft fixes for the
  easy ones, escalate the rest. Trigger on workflow failure
  or the morning triage loop.
---

# CI triage skill

## Classification rules
- env: missing secret / infra not provisioned. # human
- flake: passes on retry, no code change. # retry, file
- bug: deterministic, tied to recent commit. # draft fix
- dependency: tied to a version bump. # draft rollback
- infra: timeout, OOM, runner issue. # escalate

## Never do
- Disable failing tests — file an escalation instead
- Modify CI config without human approval
- Touch src/payments/ or src/billing/

## State
Update STATE.md after each run: paths checked,
classifications, PRs opened, items escalated.
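Before a loop depends on a skill, it is worth checking that the file is well-formed. A minimal sketch, assuming the frontmatter sits between two `---` lines and uses flat `key: value` pairs with indented continuation lines; this is deliberately not a full YAML parser, and both function names are hypothetical.

```python
def parse_frontmatter(text: str) -> dict:
    """Parse flat `key: value` frontmatter delimited by two `---` lines.

    Indented lines are treated as continuations of the previous value.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")
    meta = {}
    key = None
    for line in lines[1:]:
        if line.strip() == "---":            # closing delimiter: done
            return meta
        if line[:1].isspace() and key is not None:
            meta[key] += " " + line.strip()  # continuation of previous value
        elif ":" in line:
            key, _, value = line.partition(":")
            key = key.strip()
            meta[key] = value.strip()
    raise ValueError("missing closing '---'")


def check_skill(text: str) -> list:
    """Return the required keys that are missing or empty."""
    meta = parse_frontmatter(text)
    return [k for k in ("name", "description") if not meta.get(k)]
```

An empty list from `check_skill` means both `name` and `description` are present, which are the two fields the example above relies on.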

BLOCK 04 · REACH · VIA MCP

Connectors — the loop touches your real tools.

A loop that can only see the filesystem is a tiny loop. Connectors, built on the Model Context Protocol (MCP), let the agent read your issue tracker, query a database, hit a staging API, drop a message in Slack. Codex and Claude Code both speak MCP — so a connector you wrote for one usually just works in the other. This is the difference between an agent that says "here is the fix" and a loop that opens the PR, links the ticket, and pings the channel once CI is green.

YOUR LOOP → MCP → GitHub Linear Slack Sentry

PAY BACK FASTEST — IN ORDER

GitHub

Read repos, create branches, open PRs, comment on issues, react to webhook events. The single biggest day-one win for any code loop.

Linear or Jira

Update tickets as the loop progresses, link PRs back to issues, close items automatically when verification passes.

Slack

Post triage results, ping humans on escalations, summarize overnight runs in the morning.

Sentry / your error tracker

Let the loop investigate live alerts and draft fixes for the high-frequency ones.

BLOCK 05 · THE SPLIT

Sub-agents — keep the maker away from the checker.

The most useful structural thing in a loop, by far, is splitting the agent that writes from the agent that checks. Osmani's framing is exact: the model that wrote the code is "way too nice grading its own homework." A second agent — different instructions, sometimes a different model — catches the stuff the first one talked itself into.

MAKER

Generates

Writes the code. Optimizes toward the task.

⇄ REPEAT

CHECKER

Critiques

Different instructions, no exposure to the maker's reasoning.

the evaluator-optimizer pattern · documented by Anthropic, Dec 2024 · viral 18 months later

CODEX

Spawns subagents only when you ask, runs them at the same time, folds results into one answer. Define agents as TOML in .codex/agents/ — name, description, instructions, optional model and reasoning effort. Security reviewer = strong model on high effort; explorer = fast read-only.

CLAUDE CODE

Same, with subagents in .claude/agents/ and agent teams that pass work between them. The usual split: one explores, one implements, one verifies against the spec.

The loop runs while you're not watching — a verifier you actually trust is the only reason you can walk away. Sub-agents burn more tokens; spend them where a second opinion is worth paying for.
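The split can be sketched independently of either tool. This is an illustration of the evaluator-optimizer shape, not the Codex or Claude Code subagent API: `maker` and `checker` are hypothetical callables, and the checker receives only the task and the draft, never the maker's reasoning.

```python
from typing import Callable, Optional


def maker_checker(
    task: str,
    maker: Callable[[str, Optional[str]], str],
    checker: Callable[[str, str], Optional[str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first draft the checker accepts, or None if none passes."""
    feedback = None
    for _ in range(max_rounds):
        draft = maker(task, feedback)      # maker sees the task and prior critique
        feedback = checker(task, draft)    # checker sees task and draft only
        if feedback is None:               # None means no objections
            return draft
    return None                            # escalate to a human instead of shipping
```

Capping the rounds is the token-spend control: each extra round pays for both agents again.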

PART THREE · BUILD IT RIGHT — OR DON'T BUILD IT

The smallest loop that works.

Persistent state, four parts, and the failure modes that turn loops into money pits. Build small, in order — or pay for a system no one understands.

10 State file · 11 Minimum viable loop · 12 Ralph Wiggum · 13 Comprehension debt · 14 Security tax

THE SPINE

The state file. The agent forgets — the file does not.

It sounds too dumb to matter, yet it is the spine of every working loop. A markdown file, a Linear board, a JSON file: anything that lives outside the single conversation and holds what's done and what's next.

Agents have short memory by default. What they learn this session is gone tomorrow unless you write it down.

The agent forgets, the repo does not. A loop without state restarts every run; a loop with state resumes.

WHERE STATE LIVES — TWO PATTERNS

Markdown in the repo

STATE.md at root or in .claude/. Version-controlled, diff-readable. Best for solo or small-team work.

External system

Linear, GitHub Issues, a database. Survives across repos, queryable, team-wide visibility. Best for production loops multiple humans watch.

▌ STATE.md · committed to repo

# Loop state · ci-triage

## Last run
2026-06-09 03:30 UTC · 7 classified, 3 drafted, 4 escalated

## In progress
- fix-auth-token-refresh — passing locally, awaiting CI
- fix-flaky-payment-webhook — retry applied, monitoring

## Completed today
- bump-axios-1.7.4 → merged (CI green, deps verified)
- lint-fix-pass-june-9 → merged

## Escalated to humans
- src/billing/refund.ts — failing 3 ways, unclear
- ci/staging-runner — infra timeouts, not code

## Lessons learned (here, not in chat)
- 06-08: PowerShell hits TLS 1.2 on this runner. Use bash.
- 06-07: e2e/checkout needs Stripe secret. Skip if missing.

## Stop conditions met
- /goal "tests pass + lint clean" ✓ commit 3a7b8c1

For long-running loops that drift, pair the state file with a standing high-level spec — VISION.md or AGENTS.md — reread each run. State tells the agent where it is. The spec tells it where to go.
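For the JSON variant, resuming instead of restarting takes two functions. A minimal sketch, assuming a single writer and a state shape invented here (`done`, `in_progress`, `lessons`); the write goes through a temporary file so an interrupted run cannot leave a half-written state behind.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return saved state, or an empty state on the first run."""
    if not path.exists():
        return {"done": [], "in_progress": [], "lessons": []}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)   # atomic replacement on the same filesystem
```

Load at the start of each run and save after every recorded outcome, not only at the end.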

THE BUILD · FOUR PARTS, NO SWARM

The minimum viable loop.

Passed the 4-condition test? Build the smallest loop that works before anything fancy. Four parts assemble into one loop.

ONE

Automation

Scheduled run that fires on cadence, stops on a clear condition.

TWO

Skill

One SKILL.md storing context it would re-derive every run.

THREE

State file

Records done/next so tomorrow resumes, not restarts.

FOUR

Gate

The test/build that fails bad work — decides if the loop helps or just spends.

▼ ASSEMBLE INTO ▼

↻ one working loop

ORDER MATTERS — SKIPPING AHEAD IS HOW LOOPS FAIL IN PRODUCTION

Get one manual run reliable

→

Turn it into a skill

→

Wrap it in a loop

→

Then schedule it

THE ONLY METRIC THAT MATTERS

Cost per accepted change

Not tokens spent, not tasks attempted, not loops scheduled. Below a 50% accepted-change rate, you're doing the review work the loop was supposed to save you — and the loop is losing.
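The metric is simple arithmetic. A minimal sketch, assuming you can count proposed and accepted changes and total spend over a period; the function names are hypothetical and the default threshold is the 50% line stated above.

```python
def cost_per_accepted_change(total_cost: float, accepted: int) -> float:
    """Total spend divided by changes that were actually accepted."""
    if accepted == 0:
        return float("inf")   # money spent, nothing accepted
    return total_cost / accepted


def acceptance_rate(accepted: int, proposed: int) -> float:
    """Share of proposed changes that were accepted."""
    return accepted / proposed if proposed else 0.0


def loop_is_winning(accepted: int, proposed: int, threshold: float = 0.5) -> bool:
    """True when the accepted-change rate meets the threshold."""
    return acceptance_rate(accepted, proposed) >= threshold
```

With made-up numbers: $42 spent for 6 accepted changes out of 20 proposed costs $7 per accepted change at a 30% rate, which falls on the losing side of the line.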

50%

MINIMUM ACCEPTED-CHANGE RATE

losing · winning

FAILURE MODE · NAMED BY GEOFFREY HUNTLEY

The Ralph Wiggum loop. Loops that fail quietly.

An agent meant to emit a completion token only when finished emits it early — and the loop exits on a half-done job. Without a hard gate, loops fail quietly and keep spending.

⚠ COMPLETION TOKEN EMITTED EARLY

"done" ✓

actually done →

loop exits at 54% · ships a half-finished job · nobody's watching

IT HAPPENS WHEN —

No real verifier

Just a second agent asked to "review," no objective signal. Two optimists agreeing.

Soft completion conditions

"Done" defined by the agent's judgment, not by a test, build, or type check.

No hard stops

Loop runs until something external kills it — a rate limit, you noticing — rather than until success is verified.

THE FIX — THE GATE FROM STEP 11

Something objective that can fail the work. A test that passes or fails. A build that compiles or doesn't. A linter that returns zero or non-zero. Not a verifier that has an opinion.
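An objective gate reduces to exit codes. A minimal sketch: `gate` is a hypothetical helper, and the commands you pass in (test runner, build, linter) are your project's own.

```python
import subprocess
import sys
from typing import Sequence


def gate(commands: Sequence[Sequence[str]], timeout: float = 600) -> bool:
    """Return True only if every command exits with status 0."""
    for cmd in commands:
        try:
            result = subprocess.run(cmd, capture_output=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            return False      # a check that cannot run counts as a failed check
        if result.returncode != 0:
            return False
    return True


if __name__ == "__main__":
    # Stand-in check: a Python process that exits 0.
    print(gate([[sys.executable, "-c", "raise SystemExit(0)"]]))
```

Treating a missing or timed-out command as a failure is a deliberate choice: a gate that silently passes when its check cannot run is no gate.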

OTHER MEASURED FAILURE MODES

Goal drift

Each summarization is lossy; "don't do X" disappears by turn 47.

→ standing VISION.md / AGENTS.md, reread each run

Self-preferential bias

The agent that wrote the code is too nice grading its own homework.

→ separate verifier, no exposure to the maker's reasoning

Agentic laziness

The loop declares "done enough" at partial completion.

→ /goal with an objective stop condition, checked by a fresh model

THE RISK THAT SHARPENS AS THE LOOP IMPROVES

Comprehension debt and cognitive surrender.

RISK 01

Comprehension debt

The faster the loop ships code you didn't write, the larger the distance between what the repository contains and what you understand.

code in repo

you understand

↑ the gap is the debt

The bill that hurts isn't the token bill — it's the day you debug a system no one on the team has read.

RISK 02

Cognitive surrender

The pull to stop forming an opinion and accept whatever the loop returns.

WITH JUDGMENT · designing the loop is the cure

TO AVOID THINKING · it's the accelerant

same action · opposite result

THE MITIGATIONS ARE NOT TECHNICAL

Read the diffs

Don't read what the loop ships → you accrue comprehension debt at compound interest.

Spot-check the gate

Verify a few PRs the loop opened — does the approving test actually catch the failure you care about? Gates rot.

Block architecture work

Keep it on small, machine-checkable changes. Judgment calls accelerate the debt.

Pair-design loops

A second pair of eyes catches blind spots the loop would otherwise exploit forever.

THE SECURITY TAX

An unattended loop is an unattended attack surface.

The threat model your loop has to defend against — four ways an unattended loop quietly becomes a liability.

Generated code shipping unreviewed

The loop opens PRs faster than a human can read them. Without a gate that includes security checks, insecure code merges automatically.

→ add SAST, dependency audit, secret scanning to the gate

Skills as injection vectors

A loop that auto-installs skills inherits every prompt injection hiding in their descriptions.

→ audit skill sources before installing

Credentials in logs

Debug logging during a long-running loop scatters secrets across logs you don't monitor.

→ disable verbose logging in production; sanitize what is logged

Permission scope creep

A loop tested with read-only permissions gets "just one" write permission added for convenience — then never re-audited.

→ re-audit permissions every 30 days
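Of the four, log sanitizing is the quickest to sketch. This assumes the loop logs through Python's standard `logging` module; the two patterns below are illustrative only and will not catch every credential format, so treat this as a backstop, not a replacement for secret scanning.

```python
import logging
import re

# Hypothetical patterns; extend them for the credential formats you actually use.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[=:]\s*)\S+"),
    re.compile(r"(?i)\b(bearer)(\s+)[A-Za-z0-9._~+/-]+=*"),
]


def redact(text: str) -> str:
    """Replace the value part of anything that looks like a credential."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(r"\1\2[REDACTED]", text)
    return text


class RedactSecrets(logging.Filter):
    """Logging filter that redacts the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())  # format first, then redact
        record.args = None                        # args are already folded in
        return True                               # never drop the record
```

Attach it to a handler with `handler.addFilter(RedactSecrets())` so every record passing through that handler is cleaned.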

§ THE LEDGER OF REGRET

The mistakes that turn loops into money pits.

10 WAYS TO LOSE

No 4-condition test

Step 02 exists for a reason. Most developers fail at least one condition.

No objective gate

A second agent asked to "review" without a test or build is just a second optimist.

One agent writes and verifies

Self-preferential bias. The maker grades its own homework — always "A+."

No state file

Tomorrow's run restarts from zero instead of resuming.

Vague stop conditions

"Done when it looks good" never holds. Use a test, a type pass, or a passing build.

No token budget cap

Loops re-read context and retry. Without a cap, they burn 5–10× the tokens you expected.

Heavy loops on a consumer plan

Token bill or rate limit — one of them gets you.

Auto-installing community skills

520 of 17,022 audited skills leak credentials. Read the source before installing.

Loops on judgment-call work

Architecture, auth, payments, vague product. Keep the loop on lint-and-fix, not strategy.

Not reading the diffs

Comprehension debt at compound interest. The day you debug a system no one read costs more than the tokens ever did.

CONCLUSION

The leverage moved. Your job did too.

For two years, the leverage in working with coding agents was at the prompt — better prompts, better context, better one-shot output. That phase is ending. The agents got good enough that the next leverage point is one floor up: the system that decides what they work on, when, with what gate, and what state survives between runs.

But the honest version of this story isn't that everyone should rush to build loops. Most developers don't need one yet — not until the task repeats, verification is automated, the budget can absorb the waste, and the agent has senior-engineer tools. Miss one condition and the loop costs more than it returns.

IF YOU PASS THE TEST — BUILD SMALL

One automation One skill One state file One gate

Get a manual run reliable. Turn it into a skill. Wrap it in a loop. Then schedule it. Order matters — skip ahead and you're paying for a system no one understands.

Cherny's point isn't that the work got easier. It's that the leverage point moved.

A LOOP ENGINEERING FIELD GUIDE

Build the loop. Stay the engineer.

14 STEPS · 3 TIERS · STOP PROMPTING, START DESIGNING

SOURCES: ANTHROPIC ENGINEERING · ADDY OSMANI · MEASUREMENT STUDIES