Agent Loops: Autonomous AI Coding Workflows
An agent loop is a system where an AI coding agent repeatedly works toward a specified goal without waiting for human approval between each step — you define the trigger and the stopping condition, and the agent runs autonomously.
Why This Matters
The dominant mental model for using AI coding agents in 2025 was the chat loop: you type a prompt, the agent works for a few seconds, produces output, and waits. You review, refine, and prompt again. Each turn requires your attention. You are the bottleneck.
"The last human in the coding-agent loop is a bottleneck pretending to be a checkpoint." — Reddit, r/AI_Agents (2026)
Loop engineering inverts this. Instead of being the person who prompts the agent, you become the person who designs the system that prompts the agent. The term was popularized by Google engineer Addy Osmani and echoes remarks by Boris Cherny (Anthropic's Claude Code lead). The shift looks like this:
| Era | What you optimize | Unit of work |
|---|---|---|
| Prompt engineering | How you phrase a single instruction | One turn typed by hand |
| Context engineering | What goes in the context window (docs, history, tool defs) | The conditions around one answer |
| Loop engineering | The system that decides what to prompt, when, and whether the result is acceptable | A self-running cycle across many turns |
This matters because the highest-leverage activity has shifted from writing individual prompts to designing autonomous workflows that multiply your time. A single well-designed loop can work for minutes, hours, or days — testing, fixing, optimizing, and documenting your codebase while you sleep.
Prerequisites
This article assumes you're familiar with AI coding agents — tools like Claude Code, OpenAI Codex, or Gemini CLI that can read, write, and execute code in your project. Also helpful: understanding LLM function calling — the mechanism by which agents invoke tools (run commands, edit files, call APIs), covered in the Function Calling article. Basic familiarity with git (branches, pull requests) and CI/CD concepts rounds out the prerequisites.
Core Idea
Every agent loop has exactly two ingredients:
Loop = TRIGGER + GOAL
The Trigger — What Kicks It Off
There are three types:
- Manual — You explicitly tell the agent to start (e.g. paste a prompt and hit Go). Simple, reliable, still requires your attention once.
- Schedule — The loop runs at a certain time of day, or on a recurring interval (every night at 2 AM, every weekday morning, every 30 minutes).
- Action — The loop fires in response to an event: a PR being opened, a CI failure, a new GitHub issue, a Slack message.
A fully autonomous system uses schedule or action triggers exclusively — no human needed to start the work. Manual triggers are useful for one-off or high-judgment tasks.
The Goal — When It Stops
The goal is the stopping condition. It comes in two flavours:
| Type | Description | Example |
|---|---|---|
| Verifiable | A concrete, testable, deterministic condition | "Every page loads in under 50 ms" — measurable with a benchmark |
| LLM-as-Judge | The model itself decides when the goal is met | "Refactor until you are happy with the architecture" — satisfaction is subjective |
When to use each
Verifiable goals are more reliable and produce more predictable loops. Use LLM-as-Judge for tasks where "good enough" is inherently subjective (code readability, documentation completeness, architecture quality). The trade-off: verifiable is deterministic, LLM-as-Judge is flexible but can be brittle — the model's taste may drift across iterations.
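The two goal types can be pictured as two kinds of predicate over the loop's current state. The sketch below is illustrative only: `Goal`, `verifiable_goal`, `judged_goal`, and `stand_in_judge` are hypothetical names, not part of any agent tool, and the judge is a placeholder where a real loop would call a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical state: page name -> measured load time in milliseconds.
State = Dict[str, float]


@dataclass
class Goal:
    """A stopping condition: a description plus a predicate over the state."""
    description: str
    is_met: Callable[[State], bool]


def verifiable_goal(threshold_ms: float) -> Goal:
    # Deterministic: the same measurements always give the same verdict.
    return Goal(
        description=f"Every page loads in under {threshold_ms} ms",
        is_met=lambda state: all(ms < threshold_ms for ms in state.values()),
    )


def judged_goal(description: str, judge: Callable[[str, State], bool]) -> Goal:
    # Subjective: the verdict is delegated to a judge (in practice, a model call).
    return Goal(description=description, is_met=lambda state: judge(description, state))


def stand_in_judge(description: str, state: State) -> bool:
    # Placeholder for an LLM call; an assumption made for illustration only.
    return len(state) > 0


timings = {"/home": 30.0, "/reports": 320.0}
speed_goal = verifiable_goal(50.0)
print(speed_goal.is_met(timings))          # False: /reports is too slow
print(speed_goal.is_met({"/home": 30.0}))  # True
```

Both goal types share one interface, so the loop that consumes them does not need to know which kind it is checking. Only the reliability of the verdict differs.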
How It Actually Works
The Inner Loop: ReAct Pattern
Most coding agents already run an internal loop called ReAct (Reason + Act), originating from a 2022 Google Research paper. Inside a single agent invocation, it alternates between:
Thought → Action → Observation → Thought → Action → Observation → ... → Done
A concrete example:
Thought: The test is failing because `parse_date` returns None when input is empty.
Action: Edit lines 42-47 in utils.py to add a null check before parsing.
Observation: File updated successfully.
Thought: I should re-run the test suite to confirm the fix.
Action: Run `pytest tests/test_utils.py`
Observation: 14 passed, 0 failed.
Thought: All tests pass. Goal condition met.
The outer loop (our topic) wraps this inner cycle: it keeps re-invoking the agent (including its internal ReAct loop) until the high-level goal is reached.
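A minimal sketch of that wrapping, assuming the agent invocation and the goal check are both plain callables. `run_outer_loop`, `fake_agent`, and the iteration cap are hypothetical and stand in for whatever a real tool does internally.

```python
from typing import Callable


def run_outer_loop(
    run_agent: Callable[[int], str],
    goal_met: Callable[[], bool],
    max_iterations: int = 10,
) -> int:
    """Re-invoke the agent until the goal check passes or the cap is hit.

    Returns the number of agent invocations that were made.
    """
    for iteration in range(1, max_iterations + 1):
        run_agent(iteration)  # one full invocation, inner ReAct loop included
        if goal_met():        # independent check after every invocation
            return iteration
    return max_iterations


# Toy stand-ins: each "agent run" fixes one of three failing tests.
failing_tests = ["test_a", "test_b", "test_c"]


def fake_agent(iteration: int) -> str:
    fixed = failing_tests.pop()
    return f"iteration {iteration}: fixed {fixed}"


runs = run_outer_loop(fake_agent, goal_met=lambda: not failing_tests)
print(runs)  # 3
```

The iteration cap is a design choice worth keeping even with a verifiable goal: it bounds the cost of a loop whose goal turns out to be unreachable.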
Outer Loop Anatomy
┌──────────────────────────────────────────────────────┐
│                      OUTER LOOP                      │
│                                                      │
│  ┌─────────┐    ┌──────────┐    ┌────────────────┐   │
│  │ Trigger │───▶│  Agent   │───▶│   Goal Check   │   │
│  │ (wakes) │    │ (ReAct   │    │  (verified or  │   │
│  │         │    │  inner   │    │  LLM-judged)   │   │
│  └─────────┘    │  loop)   │    └───────┬────────┘   │
│                 └──────────┘            │            │
│                      ◄──────────────────┘            │
│                      (not met → repeat)              │
│                                                      │
│                 ┌──────────────────┐                 │
│                 │ GOAL MET → STOP  │                 │
│                 └──────────────────┘                 │
└──────────────────────────────────────────────────────┘
The Five Building Blocks (plus Memory)
According to the broader loop-engineering community, a production-quality loop sits on top of these foundations:
- Automations: The scheduler that wakes the loop. Both Claude Code (via `/loop`) and OpenAI Codex (via the Automations tab) provide built-in cron-like scheduling.
- Worktrees: Isolated git working directories so parallel agents don't collide. Each agent gets its own branch in its own directory, sharing the repo's history. Both Codex (built-in) and Claude Code (`--worktree` flag) support this.
- Skills: Reusable project-knowledge files (`.claude/skills/` or `.codex/skills/`) that encode conventions, build steps, architecture rules. The agent loads these automatically so it doesn't guess your preferences each run. This is not the same as the agent having memory: skills encode durable knowledge, not changing state.
- Plugins & Connectors: MCP (Model Context Protocol) links to external tools such as issue trackers, databases, Slack, and CI systems. These let the loop affect the real world by opening PRs, posting results, and reading logs.
- Sub-agents (Maker-Checker Split): Separate the agent that writes code from the agent that reviews it. As Boris Cherny noted: "The model that wrote the code is far too generous grading its own homework." Using a stronger model for the verifier (e.g. Claude Opus) and a cheaper one for the maker (e.g. Claude Haiku) is a common pattern.
- Memory: Durable state that persists outside the context window. A markdown file, a Linear board, or a GitHub issue that records what's been done, what's pending, and what failed. The agent forgets everything between turns; memory is the only thing that survives.
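The memory block is the easiest of these to picture in code. The sketch below assumes a JSON file as the durable store. The file name, the `done`/`pending`/`failed` layout, and the helper names are all hypothetical, and a markdown file or an issue tracker would serve the same purpose.

```python
import json
import tempfile
from pathlib import Path


def load_memory(path: Path) -> dict:
    """Read the loop's durable state, or start fresh if none exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"done": [], "pending": [], "failed": []}


def save_memory(path: Path, memory: dict) -> None:
    path.write_text(json.dumps(memory, indent=2), encoding="utf-8")


def record_result(memory: dict, task: str, succeeded: bool) -> None:
    """Move a task out of 'pending' and into 'done' or 'failed'."""
    if task in memory["pending"]:
        memory["pending"].remove(task)
    memory["done" if succeeded else "failed"].append(task)


# Simulate two separate runs that share nothing except the file.
memory_file = Path(tempfile.mkdtemp()) / "loop_memory.json"

first_run = load_memory(memory_file)
first_run["pending"] = ["fix flaky test", "update README"]
record_result(first_run, "fix flaky test", succeeded=True)
save_memory(memory_file, first_run)

second_run = load_memory(memory_file)  # a fresh invocation would start here
print(second_run["done"])     # ['fix flaky test']
print(second_run["pending"])  # ['update README']
```

The second run knows what the first one finished only because the file says so, which is the point: nothing in the agent's context carries over.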
Implementation: Claude Code vs. OpenAI Codex
| Feature | Claude Code | OpenAI Codex |
|---|---|---|
| Run-until-done | /goal "condition" — keeps working until the condition is met | codex /goal "condition" (CLI 0.128.0+) |
| Recurring schedule | /loop "prompt" --schedule "cron" | Automations tab — pick project, prompt, cadence |
| Sub-agents | .claude/agents/ — markdown files with name, model, isolation settings | .codex/agents/ — TOML files with name, description, instructions |
| Worktree isolation | --worktree flag, isolation: worktree setting | Built-in worktree support |
| Skills | .claude/skills/ directory | .codex/skills/ directory, $skillname syntax |
How `/goal` works internally
When you issue `/goal` in Claude Code, the system sets a session-scoped "stop hook condition." After every agent turn, a separate, smaller evaluation model checks whether the condition is met. If not, it appends a system prompt to trigger the next turn automatically, with no user input required. The condition must be both concrete enough for the evaluator to verify and loose enough that honest failure modes (external blocker, resource unavailable) are accepted gracefully.
Worked Example: The Loop Library
The Loop Library (curated collection of ready-to-use loop prompts) demonstrates the pattern with concrete tasks:
| # | Loop Name | Trigger | Goal Type | What It Does |
|---|---|---|---|---|
| 1 | Sub-50ms Page Load | Manual | Verifiable | Optimizes every page in the app until all load under 50ms |
| 2 | Overnight Docs Sweep | Schedule (nightly) | LLM-as-Judge | Reviews codebase, updates docs, opens a PR |
| 3 | Architecture Satisfaction | Manual or nightly | LLM-as-Judge | Refactors code until the LLM is "happy" with the architecture |
| 4 | Production Error Sweep | Schedule (nightly) | Verifiable | Reviews logs, traces root causes, fixes them, opens PRs |
| 5 | 100% Test Coverage | Manual | Verifiable | Adds tests until coverage reports hit 100% |
| 6 | SEO/GEO Visibility | Schedule (weekly) | Verifiable | Runs crawl + audit, fixes highest-leverage issue, repeats |
| 7 | Logging Coverage | Manual | LLM-as-Judge | Adds logging until every important path produces useful logs |
| 8 | Full Product Evaluation | Manual (long-running) | LLM-as-Judge | Creates scenarios, runs them, fixes failures, reruns |
| 9 | Quality Streak | Manual | Verifiable | Runs scenarios until N pass in a row |
| 10 | Nightly Changelog | Schedule (nightly) | LLM-as-Judge | Reviews recent changes and updates the changelog |
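The Quality Streak row has the least obvious stopping condition in the table, so here is a small sketch of one way to express "N passes in a row". `runs_until_streak` and the sample history are hypothetical; a real loop would produce each result by running a scenario rather than reading a list.

```python
from typing import Iterable


def runs_until_streak(results: Iterable[bool], required_streak: int) -> int:
    """Return how many runs it took to see `required_streak` passes in a row.

    Returns -1 if the results run out before the streak is reached.
    """
    streak = 0
    for run_number, passed in enumerate(results, start=1):
        streak = streak + 1 if passed else 0  # any failure resets the streak
        if streak >= required_streak:
            return run_number
    return -1


history = [True, True, False, True, True, True]
print(runs_until_streak(history, required_streak=3))  # 6
print(runs_until_streak(history, required_streak=4))  # -1
```

Resetting to zero on any failure is what makes this a streak rather than a pass count: two early passes do not count toward the goal once a failure intervenes.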
A Worked Example: The Sub-50ms Page Load Loop
Setup: You have a web app with 12 pages, 3 modals, and 2 admin panels. Some load in 30ms, one takes 320ms.
- Trigger: You copy the prompt, append `/goal "Every page loads in under 50ms"`, hit Go.
- Iteration 1: Agent measures all pages. Finds the 320ms page has an unoptimized database query in the API route.
- Action: Adds eager loading, adds an index, caches the response.
- Measure: That page now loads in 45ms. ✅
- Iteration 2: Agent moves to the next slowest page (180ms)...
- Continues until every single page and modal is under 50ms.
- Stops automatically when the condition is met. Reports results.
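The steps above amount to a slowest-first loop with a measurable exit. The sketch below models only that control flow, under the assumption that measuring and optimizing can be represented by a dictionary of timings and a callable. `optimize_until_fast` and the toy optimizer are hypothetical, and the "fix" is a stand-in for the agent's actual work.

```python
from typing import Callable, Dict, List

THRESHOLD_MS = 50.0


def optimize_until_fast(
    timings: Dict[str, float],
    optimize: Callable[[str, float], float],
    max_iterations: int = 20,
) -> List[str]:
    """Repeatedly fix the slowest page until every page is under the threshold.

    Returns the pages worked on, in order.
    """
    worked_on: List[str] = []
    if not timings:
        return worked_on
    for _ in range(max_iterations):
        slowest = max(timings, key=timings.get)
        if timings[slowest] < THRESHOLD_MS:
            break  # goal met: even the slowest page is fast enough
        timings[slowest] = optimize(slowest, timings[slowest])  # fix, then re-measure
        worked_on.append(slowest)
    return worked_on


# Toy optimizer: pretend each fix brings a page down to 45 ms.
pages = {"/home": 30.0, "/reports": 320.0, "/settings": 180.0}
order = optimize_until_fast(pages, optimize=lambda page, ms: 45.0)
print(order)  # ['/reports', '/settings']
```

Checking only the slowest page is enough to verify the whole goal, since every other page is at least as fast.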
Combining Loops
One of the most powerful patterns is chaining loops. For example:
- Logging Coverage Loop (run once manually) → ensures every path produces logs
- Production Error Sweep (run nightly) → reads those logs and fixes errors automatically
The output of one loop creates the conditions that make another loop effective. As your library of loops grows, you can compose them into an increasingly autonomous development pipeline.
Common Misconceptions
"Loops are just like a test suite"
A test suite is deterministic: it passes or fails the same way every time. Loops are fundamentally different because the agent improves the code between iterations. The Full Product Evaluation Loop, for instance, doesn't just run tests — it fixes the underlying causes of failures, reruns affected scenarios, and repeats until everything passes. The agent is both player and referee.
"You can loop-ify feature development"
Building new features from scratch with loops is the hardest use case. One developer tried to clone Excel to feature parity using a loop; it ran for days before he stopped it. The problem: there is no clear stopping condition for "what features should exist" or "when is it done." Features require human judgment about what to build, not just how well it's built. Loops excel at optimization, refactoring, testing, and maintenance, not greenfield feature work.
"Loops are free / cheap"
Loops are the opposite of cheap. They churn through tokens autonomously until the goal is met. A `/goal` session can run for 10 minutes or 10 hours. One user ran a 9-hour `/goal` session that produced 45 commits and 14,259 lines of code, and consumed 4.16 million rows of data. Each iteration costs tokens for both the main agent and the verifier model. Start with short timeouts before letting them run unattended.
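One way to start conservatively is to give every loop a hard budget that is checked before each iteration. The sketch below is a generic guard, not a feature of any particular tool. The `Budget` class and the numbers are hypothetical, and the per-iteration token count is a stand-in for whatever usage figure your agent reports.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Budget:
    """Hard limits for one loop run; the numbers below are placeholders."""
    max_tokens: int
    max_seconds: float
    tokens_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int) -> None:
        self.tokens_used += tokens

    def exhausted(self) -> bool:
        out_of_tokens = self.tokens_used >= self.max_tokens
        out_of_time = time.monotonic() - self.started_at >= self.max_seconds
        return out_of_tokens or out_of_time


budget = Budget(max_tokens=100_000, max_seconds=600)
iterations = 0
while not budget.exhausted():
    budget.charge(30_000)  # stand-in for one iteration's reported token usage
    iterations += 1

print(iterations)          # 4
print(budget.tokens_used)  # 120000
```

Because the check runs before each iteration, the loop can overshoot the token limit by up to one iteration's worth, so set the limit with that margin in mind.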
"The agent's self-assessment is reliable"
When the agent says "Task complete," it may be optimistic. The model that wrote the code is often too generous grading its own homework. Always use either a verifiable stopping condition (metric, test pass, coverage number) or a separate verifier model (the LLM-as-Judge pattern implemented by `/goal` internally). Never trust a single agent turn's "I think I'm done" without an independent check.
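The independent check can be sketched as a maker-checker exchange in which only the checker is allowed to accept the work. Everything here is hypothetical: `maker_checker`, `toy_maker`, and `toy_checker` stand in for two separate model calls, and the string test inside the checker is a deliberately trivial substitute for a real review.

```python
from typing import Callable, Tuple


def maker_checker(
    make: Callable[[str], str],
    check: Callable[[str], Tuple[bool, str]],
    task: str,
    max_rounds: int = 3,
) -> Tuple[bool, str]:
    """Alternate between a maker and an independent checker.

    The maker's own "done" is never trusted; only the checker can accept.
    Returns (accepted, last_attempt).
    """
    feedback = ""
    attempt = ""
    for _ in range(max_rounds):
        attempt = make(task + feedback)
        accepted, reason = check(attempt)
        if accepted:
            return True, attempt
        feedback = f"\nReviewer feedback: {reason}"
    return False, attempt


def toy_maker(prompt: str) -> str:
    # First attempt is wrong; it is corrected only once feedback arrives.
    if "feedback" in prompt:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_checker(code: str) -> Tuple[bool, str]:
    ok = "a + b" in code
    return ok, ("" if ok else "add should return the sum")


accepted, final_code = maker_checker(toy_maker, toy_checker, "Write add(a, b).")
print(accepted)    # True
print(final_code)  # def add(a, b): return a + b
```

The checker's reason is fed back into the next maker prompt, so a rejection carries information instead of just triggering a blind retry.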
Key Takeaways
- An agent loop = Trigger + Goal. The trigger (manual, schedule, action) wakes the loop; the goal (verifiable or LLM-judged) stops it.
- Loop engineering is the new leverage. The highest-value skill is no longer writing good prompts — it's designing systems that prompt agents autonomously.
- Use verifiable goals when you can, LLM-as-Judge when you must. Verifiable goals produce more predictable, cheaper loops.
- Build with the five blocks: automations, worktrees, skills, connectors/MCP, and sub-agents. Add durable memory (file or board) that survives between runs.
- Don't loop feature development. Loops excel at optimization, refactoring, testing, documentation, and monitoring — not greenfield feature work.
- Costs are real and can be large. A loop that runs for hours can consume thousands of tokens per iteration. Monitor, budget, and start conservatively.
- The Loop Library is a starting point. Copy, adapt, combine. The most powerful loops are the ones you chain together: logging coverage → error sweep → changelog update, all running overnight.
Open Questions
Where the evidence is thin
- Multi-agent loop coordination is still nascent. How do you prevent two scheduled loops from conflicting (e.g., one refactoring while another runs tests)?
- Long-running loop degradation is poorly documented. Context windows fill up, agents may "forget" early decisions, and the quality of LLM-as-Judge evaluations can drift over hundreds of iterations.
- Cost benchmarks are anecdotal. The loop library shares inspiring successes, but there's no systematic study of average token burn per loop type.
- The optimal verifier model for LLM-as-Judge loops isn't established. Should you always use a frontier model? Can a small fine-tuned model do the job for specific loop types?
References
- Berman, M. "Loops are emerging as the single biggest unlock..." (YouTube video transcript, ~June 2026). Source material for this article.
- Loop Library — signals.forwardfuture.ai/loop-library. 15 copy-ready loop prompts with attribution and verify/stop conditions.
- LushBinary. "Loop Engineering: The Guide for AI Agents" — June 9, 2026. Detailed guide on the five building blocks.
- MindStudio. "What Is an Agentic Loop? The New Meta for AI Coding Agents" — Covers ReAct pattern, anatomy of a coding agent loop, common mistakes.
- Oracle Developers. "What Is the AI Agent Loop?" — 2026. Production considerations on cost (4x more tokens than standard chat, up to 15x in multi-agent) and observability.
- Addy Osmani. "Loop Engineering" — June 7, 2026. Original blog post establishing the term, the five pieces, and the shift from prompting to system design.
- Griffiths, B. D. "Forget Prompts: 'Loop Engineering' Is All the Rage Now" — Business Insider, June 2026. Practitioner perspectives from Cherny, Steinberger, and Claire Vo on cost, adoption, and caveats.
- DataScienceDojo. "Agentic Loops: From ReAct to Loop Engineering (2026 Guide)" — Identifies four root causes of loop failure in production.
Related
Browser Use: Making Websites Accessible to AI Agents
Browser Use is a family of techniques and tools — led by the open-source Python library browser-use — that lets AI agents control a real web browser the way a human would, all described in natural language.
LLM Function Calling: Giving Language Models a Way to Act
Function calling (also called tool use) is a capability that lets an LLM output structured commands — like get_weather(location='Cairo') — which your own code then executes, bridging the gap between what the model says and what it can do.