growthcab PRESENTS A FIELD GUIDE
THE FIELD GUIDE/14 STEPS/3 TIERS

From prompter to loop designer.

9 out of 10 builders have never written a single loop that prompts the agent for them. No automation, no state file, no verifier, no schedule. The leverage point has moved — from typing prompts to designing the systems that prompt.

Sourced from Anthropic's engineering docs · long-form writing on loop engineering · recent measurement studies.

↻ THE LOOP you design it once · it runs unattended
01 Find work SCAN · TRIGGER
02 Hand off PROMPT AGENT
03 Verify THE GATE
04 Record STATE FILE
05 Decide next CONTINUE · STOP
repeats on a schedule — no human in the chair

The agent was a tool you held the entire time. That part is ending.

Stop prompting. Start designing.

The roadmap at a glance

Three tiers. Figure out if you need a loop → learn the building blocks → build the smallest one that works.

TIER 01 STEPS 01–04

The Why & The Test

Decide whether you actually need a loop before you spend a token building one.

01  Replacing yourself as prompter
02  The 4-condition test
03  Who wins, who loses
04  The 30-second loop check
TIER 02 STEPS 05–09

The 5 Building Blocks

The five primitives every working loop is assembled from — in the two tools that matter.

05  Automations: the heartbeat
06  Worktrees: parallel without chaos
07  Skills: write knowledge once
08  Connectors: touch real tools
09  Sub-agents: maker vs checker
TIER 03 STEPS 10–14

Build It Right or Don't

Assemble the smallest loop that works, and learn the failure modes that turn loops into money pits.

10  The state file
11  The minimum viable loop
12  The Ralph Wiggum loop
13  Comprehension debt
14  The security tax
PART ONE · THE WHY & THE TEST

Do you actually need a loop?

Most developers don't — not yet. The honest version of this guide is that a loop earns its cost only under specific conditions. Miss one and it bleeds tokens. This tier is the decision, before a single line of setup.

01
THE SHIFT

Loop engineering is replacing yourself as the prompter.

For two years the way you got something out of a coding agent was: write a prompt, share context, read what came back, write the next prompt. The agent was a tool and you held it the entire time.

Loop engineering is building a small system that finds the work, hands it to the agent, checks the result, records what happened, and decides the next move — on its own. You design that system once. It prompts the agent from then on.

The leverage moved from typing prompts to designing the loop that prompts.
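The five moves described above (find work, hand off, verify, record, decide) can be sketched as a small driver function. This is a minimal illustration, not any tool's actual implementation: `run_loop`, `LoopState`, and the three injected callables are hypothetical names, and a real loop would call a coding agent where `prompt_agent` is a stub here.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopState:
    """What survives between iterations (a stand-in for a state file)."""
    done: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)


def run_loop(
    find_work: Callable[[LoopState], Optional[str]],
    prompt_agent: Callable[[str], str],
    verify: Callable[[str], bool],
    state: LoopState,
    max_iterations: int = 10,
) -> LoopState:
    """Run find -> hand off -> verify -> record -> decide until work runs out."""
    for _ in range(max_iterations):      # hard stop: never loop forever
        task = find_work(state)          # 01 find work
        if task is None:                 # 05 decide next: nothing left, stop
            break
        result = prompt_agent(task)      # 02 hand off to the agent
        if verify(result):               # 03 the gate
            state.done.append(task)      # 04 record success
        else:
            state.failed.append(task)    # 04 record failure for a human
    return state
```

The human appears nowhere inside the loop body; their work is choosing `find_work`, `verify`, and the stop limit up front.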

WHERE YOU SIT
BEFORE: you are the loop
YOU → AGENT · type · wait · read · repeat
AFTER: the system is the loop
YOU → LOOP → AGENT
you design once · the loop prompts from then on

Anthropic engineers now merge roughly eight times as much code per day as in 2024 — a figure Anthropic itself calls "almost certainly an overstatement of the true productivity gain." The number is debated. The mechanism isn't.

02
THE STRATEGIC TEST

Run the 4-condition test before you build anything.

A loop earns its cost only when all four of these hold at once. They are an AND, not a menu. Miss one and the loop costs more than it returns.

CONDITION 01

The task repeats

A loop amortizes setup across many runs. If the work doesn't recur weekly, you don't have a loop — you have a script you ran once.

CONDITION 02

Verification is automated

A test, type checker, linter, or build that can fail the work without you in the room. No automated check → you're back in the chair reading every diff.

CONDITION 03

Budget absorbs the waste

Loops re-read context, retry, and explore, burning tokens whether or not a run ships. That is an easy call on unmetered tokens and reckless on a tightly metered plan.

CONDITION 04

The agent has senior-engineer tools

Logs, a reproduction environment, the ability to run the code it writes and see what breaks. Without that, the loop iterates blind.

ALL FOUR GREEN → build the loop. ANY ONE RED → keep it a one-shot prompt.
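Because the four conditions are an AND, the test reduces to a few lines. The sketch below is one way to write it down so a team can record why a candidate was rejected; `LoopCandidate` and its field names are illustrative, not part of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopCandidate:
    """Answers to the four conditions for one candidate task."""
    task_repeats: bool
    verification_automated: bool
    budget_absorbs_waste: bool
    senior_engineer_tools: bool


def failed_conditions(candidate: LoopCandidate) -> list[str]:
    """Return the names of every condition that is red."""
    checks = {
        "task repeats": candidate.task_repeats,
        "verification is automated": candidate.verification_automated,
        "budget absorbs the waste": candidate.budget_absorbs_waste,
        "senior-engineer tools": candidate.senior_engineer_tools,
    }
    return [name for name, ok in checks.items() if not ok]


def should_build_loop(candidate: LoopCandidate) -> bool:
    """All four green: build the loop. Any one red: keep it a one-shot prompt."""
    return not failed_conditions(candidate)
```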
03
THE ECONOMICS

Who wins, who loses. Loops favor whoever can spend.

The economics are not universal. People who call loop engineering obvious tend to have unmetered tokens. People for whom it's reckless are usually on a $20 consumer plan trying to run heavy verification loops without hitting limits — or a surprise invoice.

Who benefits today
Repetitive, machine-checkable work + budget

Continuous test triage, dependency bumps, lint-and-fix passes, issue-to-PR drafts on a codebase with strong coverage.

Codebases with strong test suites

If a junior could do the task from a checklist and a suite would catch their mistakes, a loop fits.

Async-first teams already running multi-agent

For these teams, routines are the missing orchestration layer.

Who should skip it
Solo builders on consumer plans

The token bill arrives before the productivity gain does.

Code with no automated verification

A loop with no real check is the agent agreeing with itself on repeat.

Teams bottlenecked on review, not typing

A loop generates more code; if review was already the constraint, it just lengthens the queue.

The honest version of this guide: loop engineering is real — and most developers don't need it yet.

04
THE TACTICAL CHECK · 30 SECONDS

The 30-second loop check.

Step 02 was the strategic decision. This is the one you run on a specific task before turning it into a loop. Miss one box → keep it a manual prompt.

☑ THE FIVE BOXES — ALL MUST CHECK
The task happens at least weekly.
Less than weekly → setup cost never amortizes.
A test, type check, build, or linter can reject bad output.
No automated gate → the agent grades its own homework.
The agent can run the code it changes.
No reproduction environment → iteration is blind.
The loop has a hard stop.
Token budget, iteration count, or time limit — or it runs until someone notices the bill.
A human reviews before merge, deploy, or dependency changes.
Anything irreversible needs a human approval gate before action.
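The fourth box, a hard stop, is the easiest one to leave implicit. A minimal sketch, assuming the loop can report tokens used per iteration; the `HardStop` class and its limits are hypothetical, and the numbers in any real loop depend on your plan.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HardStop:
    """Stops a loop on whichever limit is hit first."""
    max_tokens: int
    max_iterations: int
    max_seconds: float
    tokens_used: int = 0
    iterations: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def record(self, tokens: int) -> None:
        """Call once per finished iteration with the tokens it consumed."""
        self.tokens_used += tokens
        self.iterations += 1

    def reason_to_stop(self, now: Optional[float] = None) -> Optional[str]:
        """Return why the loop must stop, or None if it may continue."""
        now = time.monotonic() if now is None else now
        if self.tokens_used >= self.max_tokens:
            return "token budget exhausted"
        if self.iterations >= self.max_iterations:
            return "iteration limit reached"
        if now - self.started_at >= self.max_seconds:
            return "time limit reached"
        return None
```

Checking `reason_to_stop()` before each iteration means the loop ends on a limit you chose, not when someone notices the bill.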
✓ GOOD FIRST LOOPS
CI failure triage — nightly: scan failures, classify causes, draft fix PRs for the easy ones.
Dependency bump PRs — weekly: scan updates, test compatibility, open PRs.
Lint-and-fix passes — on every PR open, apply style fixes automatically.
Flaky test reproduction — loop until a theory survives the test.
Issue-to-PR drafts — on code with strong tests that reject bad output.
✕ BAD FIRST LOOPS — HUMAN IN THE CHAIR
Architecture rewrites
Auth or payments code
Production deploys
Vague product work
Anything where "done" is a judgment call
PART TWO · THE 5 BUILDING BLOCKS

Everything is built from five primitives.

Every working loop is assembled from the same five parts. Here's each one, and exactly how it shows up in the two tools that matter — Codex and Claude Code.

05
BLOCK 01 · THE HEARTBEAT

Automations — what makes a loop a loop.

Automations fire on a schedule, an event, or a trigger condition. They're the heartbeat — everything else in the loop hangs off them. Without one, you don't have a loop; you have a run you did once.

fires every 30m · on event · on trigger →
CODEX

The Automations tab — pick a project, set a prompt, set a cadence, choose local checkout or background worktree. Runs that find something land in a Triage inbox; runs that find nothing archive themselves.

CLAUDE CODE

Three primitives that compose into the same shape: /loop for session cadence, Desktop scheduled tasks for restart-survival, Routines for laptop-off cloud runs. Pair with hooks for lifecycle events.

/loop

Re-runs on a cadence. Use it when you want regular checks regardless of state.

/goal

Keeps going until a condition you wrote is actually true — checked by a separate small model, so the agent that wrote the code isn't the one grading it. The maker-vs-checker split, applied to the stop condition itself.

auth-quality-loop · claude code
> /loop 30m /goal All tests in test/auth pass and lint is clean.
  Scan src/auth for new failures, propose fixes in claude/auth-fixes,
  open draft PR when goal condition holds.

▲ Claude
  CronCreate(*/30 * * * * : auth quality loop)
  Stop condition: tests pass + lint clean (verified by checker)
 Scheduled. Will continue past intermediate completions
  until /goal condition is met by independent checker.
06
BLOCK 02 · ISOLATION

Worktrees — parallel without chaos.

The second you run more than one agent, files start colliding. Two agents writing the same file is the same headache as two engineers committing to the same lines without talking first.

A git worktree fixes it — a separate working directory on its own branch sharing the same repo history, so one agent's edits literally cannot touch the other's checkout.

Worktrees take away the mechanical collision — but you're still the ceiling. Your review bandwidth decides how many parallel agents you can run, not the tool.

ONE REPO · SHARED HISTORY
main · git history
agent A → worktree/auth-fixes
agent B → worktree/dep-bumps
agent C → worktree/lint-pass
✓ each agent edits its own checkout, so working files never collide
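If you script the isolation yourself instead of relying on the tools' built-in support, the underlying command is `git worktree add -b <branch> <path> <base>`. The sketch below builds that command for one agent; the `agent/` branch prefix and the sibling `<repo>-worktrees` directory are layout assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, name: str, base: str = "main") -> list[str]:
    """Build the git command that gives one agent its own checkout and branch."""
    checkout = repo.parent / f"{repo.name}-worktrees" / name  # assumed layout
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"agent/{name}",   # new branch for this agent (assumed prefix)
        str(checkout),           # separate working directory
        base,                    # commit-ish the branch starts from
    ]


def create_worktree(repo: Path, name: str, base: str = "main") -> Path:
    """Run the command and return the new checkout path. Raises if git fails."""
    cmd = worktree_add_command(repo, name, base)
    subprocess.run(cmd, check=True)
    return Path(cmd[-2])
```

Calling `create_worktree(repo, "auth-fixes")` and `create_worktree(repo, "dep-bumps")` yields two directories that share history but not working files.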
CODEX

Builds worktree support in — several threads hit the same repo at once without bumping into each other.

CLAUDE CODE

Exposes git worktree directly, adds a --worktree flag to open a session in its own checkout, and supports an isolation: worktree setting on subagents, so each helper gets a fresh checkout that cleans itself up after.

07
BLOCK 03 · MEMORY

Skills — write project knowledge once, read it on every run.

A Skill is how you stop re-explaining the same project context every session like a goldfish. Both tools use the same format: a folder with a SKILL.md inside — instructions and metadata, plus optional scripts, references, and assets.

A loop without skills re-derives your whole project context from zero every cycle. With skills, intent compounds — conventions, build steps, "we don't do it like this because of that one incident," written once on the outside and read by every run.

NO SKILL re-derive context every cycle · WITH SKILL resume with intent intact
📄 .claude/skills/ci-triage/SKILL.md
---
name: ci-triage
description: Classify CI failures by root cause (env,
  flake, real bug, dependency, infra), draft fixes for the
  easy ones, escalate the rest. Trigger on workflow failure
  or the morning triage loop.
---

# CI triage skill

## Classification rules
- env: missing secret / infra not provisioned. # human
- flake: passes on retry, no code change. # retry, file
- bug: deterministic, tied to recent commit. # draft fix
- dependency: tied to a version bump. # draft rollback
- infra: timeout, OOM, runner issue. # escalate

## Never do
- Disable failing tests — file an escalation instead
- Modify CI config without human approval
- Touch src/payments/ or src/billing/

## State
Update STATE.md after each run: paths checked,
classifications, PRs opened, items escalated.
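To see what "instructions and metadata" means mechanically, here is a rough sketch of splitting a SKILL.md into its front matter and body. It assumes the metadata sits between two `---` lines and handles only simple `key: value` pairs with indented continuation lines; it is not a full YAML parser and not how either tool actually loads skills.

```python
def parse_skill(text: str) -> tuple[dict[str, str], str]:
    """Split SKILL.md text into (metadata, instructions body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text                      # no front matter at all
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"), None
    )
    if end is None:
        return {}, text                      # front matter never closed
    meta: dict[str, str] = {}
    key = None
    for line in lines[1:end]:
        if line[:1].isspace() and key is not None:
            meta[key] += " " + line.strip()  # indented line continues the value
        elif ":" in line:
            key, value = line.split(":", 1)
            key = key.strip()
            meta[key] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body
```

The short `description` is what lets a run decide whether the skill applies before it reads the full instructions.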
08
BLOCK 04 · REACH · VIA MCP

Connectors — the loop touches your real tools.

A loop that can only see the filesystem is a tiny loop. Connectors, built on the Model Context Protocol (MCP), let the agent read your issue tracker, query a database, hit a staging API, drop a message in Slack. Codex and Claude Code both speak MCP — so a connector you wrote for one usually just works in the other. This is the difference between an agent that says "here is the fix" and a loop that opens the PR, links the ticket, and pings the channel once CI is green.

YOUR LOOP → MCP → GitHub · Linear · Slack · Sentry
PAY BACK FASTEST — IN ORDER
1
GitHub

Read repos, create branches, open PRs, comment on issues, react to webhook events. The single biggest day-one win for any code loop.

2
Linear or Jira

Update tickets as the loop progresses, link PRs back to issues, close items automatically when verification passes.

3
Slack

Post triage results, ping humans on escalations, summarize overnight runs in the morning.

4
Sentry / your error tracker

Let the loop investigate live alerts and draft fixes for the high-frequency ones.

09
BLOCK 05 · THE SPLIT

Sub-agents — keep the maker away from the checker.

The most useful structural thing in a loop, by far, is splitting the agent that writes from the agent that checks. Addy Osmani's framing is exact: the model that wrote the code is "way too nice grading its own homework." A second agent (different instructions, sometimes a different model) catches the stuff the first one talked itself into.

MAKER
Generates

Writes the code. Optimizes toward the task.

REPEAT
CHECKER
Critiques

Different instructions, no exposure to the maker's reasoning.

the evaluator-optimizer pattern · documented by Anthropic, Dec 2024 · viral 18 months later
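The maker and checker split can be sketched as a short iteration in which the checker receives only the task and the draft, never the maker's reasoning. The names `make_and_check`, `Maker`, and `Checker` are hypothetical; in practice each callable would wrap a separate sub-agent with its own instructions.

```python
from typing import Callable, Optional

# (task, feedback from the last round or None) -> draft
Maker = Callable[[str, Optional[str]], str]
# (task, draft) -> None if accepted, otherwise a critique
Checker = Callable[[str, str], Optional[str]]


def make_and_check(
    task: str, maker: Maker, checker: Checker, max_rounds: int = 3
) -> Optional[str]:
    """Alternate maker and checker until the checker accepts or rounds run out."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        draft = maker(task, feedback)
        feedback = checker(task, draft)  # checker sees the draft, not the reasoning
        if feedback is None:
            return draft                 # accepted
    return None                          # never accepted: escalate to a human
```

Returning `None` instead of the last draft is a deliberate choice in this sketch: work the checker never accepted should not flow onward as if it had passed.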
CODEX

Spawns subagents only when you ask, runs them at the same time, folds results into one answer. Define agents as TOML in .codex/agents/ — name, description, instructions, optional model and reasoning effort. Security reviewer = strong model on high effort; explorer = fast read-only.

CLAUDE CODE

Same, with subagents in .claude/agents/ and agent teams that pass work between them. The usual split: one explores, one implements, one verifies against the spec.

The loop runs while you're not watching — a verifier you actually trust is the only reason you can walk away. Sub-agents burn more tokens; spend them where a second opinion is worth paying for.

PART THREE · BUILD IT RIGHT — OR DON'T BUILD IT

The smallest loop that works.

Persistent state, four parts, and the failure modes that turn loops into money pits. Build small, in order — or pay for a system no one understands.

10
THE SPINE

The state file. The agent forgets — the file does not.

It sounds too dumb to matter, yet it is the spine of every working loop. A markdown file, a Linear board, a JSON state: anything that lives outside the single conversation and holds what's done and what's next.

Agents have short memory by default. What they learn this session is gone tomorrow unless you write it down.

The agent forgets, the repo does not. A loop without state restarts every run; a loop with state resumes.

WHERE STATE LIVES — TWO PATTERNS
Markdown in the repo

STATE.md at root or in .claude/. Version-controlled, diff-readable. Best for solo or small-team work.

External system

Linear, GitHub Issues, a database. Survives across repos, queryable, team-wide visibility. Best for production loops multiple humans watch.

STATE.md · committed to repo
# Loop state · ci-triage

## Last run
2026-06-09 03:30 UTC · 7 classified, 3 drafted, 4 escalated

## In progress
- fix-auth-token-refresh — passing locally, awaiting CI
- fix-flaky-payment-webhook — retry applied, monitoring

## Completed today
- bump-axios-1.7.4 → merged (CI green, deps verified)
- lint-fix-pass-june-9 → merged

## Escalated to humans
- src/billing/refund.ts — failing 3 ways, unclear
- ci/staging-runner — infra timeouts, not code

## Lessons learned (here, not in chat)
- 06-08: PowerShell hits TLS 1.2 on this runner. Use bash.
- 06-07: e2e/checkout needs Stripe secret. Skip if missing.

## Stop conditions met
- /goal "tests pass + lint clean" · commit 3a7b8c1
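The difference between restarting and resuming can be shown with the JSON variant of a state file. This is a sketch under assumptions: the key names (`completed`, `escalated`, and so on) and the helper functions are invented for illustration, and a markdown STATE.md would carry the same information in prose.

```python
import json
from pathlib import Path
from typing import Optional


def load_state(path: Path) -> dict:
    """Resume from the file if it exists, otherwise start with empty state."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"completed": [], "in_progress": [], "escalated": [], "lessons": []}


def save_state(path: Path, state: dict) -> None:
    """Write to a temp file, then rename, so a crash never leaves half a file."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def next_task(candidates: list[str], state: dict) -> Optional[str]:
    """Pick the first candidate not already completed or handed to a human."""
    seen = set(state["completed"]) | set(state["escalated"])
    return next((task for task in candidates if task not in seen), None)
```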

For long-running loops that drift, pair the state file with a standing high-level spec — VISION.md or AGENTS.md — reread each run. State tells the agent where it is. The spec tells it where to go.

11
THE BUILD · FOUR PARTS, NO SWARM

The minimum viable loop.

Passed the 4-condition test? Build the smallest loop that works before anything fancy. Four parts assemble into one loop.

ONE
Automation

Scheduled run that fires on cadence, stops on a clear condition.

+
TWO
Skill

One SKILL.md storing context it would re-derive every run.

+
THREE
State file

Records done/next so tomorrow resumes, not restarts.

+
FOUR
Gate

The test/build that fails bad work — decides if the loop helps or just spends.

▼   ASSEMBLE INTO   ▼
↻ one working loop
ORDER MATTERS — SKIPPING AHEAD IS HOW LOOPS FAIL IN PRODUCTION
01
Get one manual run reliable
02
Turn it into a skill
03
Wrap it in a loop
04
Then schedule it
THE ONLY METRIC THAT MATTERS
Cost per accepted change

Not tokens spent, not tasks attempted, not loops scheduled. Below a 50% accepted-change rate, you're doing the review work the loop was supposed to save you — and the loop is losing.

50% · MINIMUM ACCEPTED-CHANGE RATE
below 50%: losing · above: winning
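The metric is simple arithmetic, sketched here with made-up figures. The function names are illustrative, and the 50% threshold is the one stated above.

```python
def cost_per_accepted_change(total_cost: float, accepted: int) -> float:
    """Total spend divided by changes that were actually accepted."""
    if accepted == 0:
        return float("inf")  # spent money, shipped nothing
    return total_cost / accepted


def acceptance_rate(accepted: int, proposed: int) -> float:
    """Share of proposed changes that were accepted."""
    return accepted / proposed if proposed else 0.0


def loop_is_winning(accepted: int, proposed: int, threshold: float = 0.5) -> bool:
    """True when the accepted-change rate is at or above the threshold."""
    return acceptance_rate(accepted, proposed) >= threshold
```

For example, a week that costs 42.00 in tokens, proposes 20 changes, and gets 12 accepted works out to 3.50 per accepted change at a 60% acceptance rate.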
12
FAILURE MODE · NAMED BY GEOFFREY HUNTLEY

The Ralph Wiggum loop. Loops that fail quietly.

An agent meant to emit a completion token only when finished emits it early — and the loop exits on a half-done job. Without a hard gate, loops fail quietly and keep spending.

⚠ COMPLETION TOKEN EMITTED EARLY
agent emits "done" ✓ at 54% · actually done is further on
loop exits · ships a half-finished job · nobody's watching
IT HAPPENS WHEN —
No real verifier

Just a second agent asked to "review," no objective signal. Two optimists agreeing.

Soft completion conditions

"Done" defined by the agent's judgment, not by a test, build, or type check.

No hard stops

Loop runs until something external kills it — a rate limit, you noticing — rather than until success is verified.

THE FIX — THE GATE FROM STEP 11

Something objective that can fail the work. A test that passes or fails. A build that compiles or doesn't. A linter that returns zero or non-zero. Not a verifier that has an opinion.
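An objective gate in this sense is just a list of commands whose exit codes decide the outcome. A minimal sketch follows; `gate_passes` is a hypothetical helper, and the commands you pass (a test runner, a linter, a build) depend on your project.

```python
import subprocess
from typing import Optional, Sequence


def gate_passes(
    commands: Sequence[Sequence[str]],
    cwd: Optional[str] = None,
    timeout: float = 600,
) -> bool:
    """Run each command in order; pass only if every one exits with code 0."""
    for cmd in commands:
        try:
            result = subprocess.run(list(cmd), cwd=cwd, timeout=timeout)
        except (subprocess.TimeoutExpired, OSError):
            return False  # a hang or a missing tool counts as failure
        if result.returncode != 0:
            return False  # first non-zero exit fails the gate
    return True
```

The gate fails closed: a timeout or a missing binary is treated the same as a failing test, so the loop cannot pass because a check did not run.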

OTHER MEASURED FAILURE MODES
Goal drift

Each summarization is lossy; "don't do X" disappears by turn 47.

→ standing VISION.md / AGENTS.md, reread each run
Self-preferential bias

The agent that wrote the code is too nice grading its own homework.

→ separate verifier, no exposure to the maker's reasoning
Agentic laziness

The loop declares "done enough" at partial completion.

→ /goal with an objective stop condition, checked by a fresh model
13
THE RISK THAT SHARPENS AS THE LOOP IMPROVES

Comprehension debt and cognitive surrender.

RISK 01

Comprehension debt

The faster the loop ships code you didn't write, the larger the distance between what the repository contains and what you understand.

code in repo
you understand
↑ the gap is the debt

The bill that hurts isn't the token bill — it's the day you debug a system no one on the team has read.

RISK 02

Cognitive surrender

The pull to stop forming an opinion and accept whatever the loop returns.

WITH JUDGMENT · designing the loop is the cure
TO AVOID THINKING · it's the accelerant
same action · opposite result
THE MITIGATIONS ARE NOT TECHNICAL
Read the diffs

Don't read what the loop ships → you take on comprehension debt at compound interest.

Spot-check the gate

Verify a few PRs the loop opened — does the approving test actually catch the failure you care about? Gates rot.

Block architecture work

Keep it on small, machine-checkable changes. Judgment calls accelerate the debt.

Pair-design loops

A second pair of eyes catches blind spots the loop would otherwise exploit forever.

14
THE SECURITY TAX

An unattended loop is an unattended attack surface.

The threat model your loop has to defend against — four ways an unattended loop quietly becomes a liability.

01
Generated code shipping unreviewed

The loop opens PRs faster than a human can read them. Without a gate that includes security checks, insecure code merges automatically.

→ add SAST, dependency audit, secret scanning to the gate
02
Skills as injection vectors

A loop that auto-installs skills inherits every prompt injection hiding in their descriptions.

→ audit skill sources before installing
03
Credentials in logs

Debug logging during a long-running loop scatters secrets across logs you don't monitor.

→ disable verbose logging in production; sanitize what is logged
04
Permission scope creep

A loop tested with read-only permissions gets "just one" write permission added for convenience — then never re-audited.

→ re-audit permissions every 30 days
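For item 03, sanitizing what is logged can start with a filter on the standard `logging` module. The patterns below are illustrative assumptions, not a complete secret detector: one catches `key=value` style pairs for a few common key names, the other catches strings shaped like GitHub tokens.

```python
import logging
import re

# Illustrative patterns only; extend them for the secrets your loop handles.
KEY_VALUE = re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\b(\s*[=:]\s*)(\S+)")
GITHUB_TOKEN = re.compile(r"\bgh[pousr]_[A-Za-z0-9]{20,}\b")


def redact(text: str) -> str:
    """Replace anything that looks like a credential with a placeholder."""
    text = KEY_VALUE.sub(r"\1\2[REDACTED]", text)
    return GITHUB_TOKEN.sub("[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Logging filter that scrubs the formatted message before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())  # format first, then scrub
        record.args = ()                          # args are already merged in
        return True                               # never drop the record
```

Attach it with `handler.addFilter(RedactingFilter())` on each handler, since a filter placed on one logger is not applied to records that propagate up from its child loggers.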
§ THE LEDGER OF REGRET

The mistakes that turn loops into money pits.

10 WAYS TO LOSE
01
No 4-condition test

Step 02 exists for a reason. Most developers fail at least one condition.

02
No objective gate

A second agent asked to "review" without a test or build is just a second optimist.

03
One agent writes and verifies

Self-preferential bias. The maker grades its own homework — always "A+."

04
No state file

Tomorrow's run restarts from zero instead of resuming.

05
Vague stop conditions

"Done when it looks good" never holds. Use a test, a type pass, or a passing build.

06
No token budget cap

Loops re-read context and retry. Without a cap, they burn 5–10× the tokens you expected.

07
Heavy loops on a consumer plan

Token bill or rate limit — one of them gets you.

08
Auto-installing community skills

520 of 17,022 audited skills leak credentials. Read the source before installing.

09
Loops on judgment-call work

Architecture, auth, payments, vague product. Keep the loop on lint-and-fix, not strategy.

10
Not reading the diffs

Comprehension debt at compound interest. The day you debug a system no one read costs more than the tokens ever did.

CONCLUSION

The leverage moved. Your job did too.

For two years, the leverage in working with coding agents was at the prompt — better prompts, better context, better one-shot output. That phase is ending. The agents got good enough that the next leverage point is one floor up: the system that decides what they work on, when, with what gate, and what state survives between runs.

But the honest version of this story isn't that everyone should rush to build loops. Most developers don't need one yet — not until the task repeats, verification is automated, the budget can absorb the waste, and the agent has senior-engineer tools. Miss one condition and the loop costs more than it returns.

IF YOU PASS THE TEST — BUILD SMALL
One automation · One skill · One state file · One gate

Get a manual run reliable. Turn it into a skill. Wrap it in a loop. Then schedule it. Order matters — skip ahead and you're paying for a system no one understands.

Boris Cherny's point isn't that the work got easier. It's that the leverage point moved.

A LOOP ENGINEERING FIELD GUIDE
Build the loop. Stay the engineer.
SOURCES: ANTHROPIC ENGINEERING · ADDY OSMANI · MEASUREMENT STUDIES