Agent Loops: Autonomous AI Coding Workflows

Why This Matters

The dominant mental model for using AI coding agents in 2025 was the chat loop: you type a prompt, the agent works for a few seconds, produces output, and waits. You review, refine, and prompt again. Each turn requires your attention. You are the bottleneck.

"The last human in the coding-agent loop is a bottleneck pretending to be a checkpoint." — Reddit, r/AI_Agents (2026)

Loop engineering inverts this. Instead of being the person who prompts the agent, you become the person who designs the system that prompts the agent. The term was popularized by Google engineer Addy Osmani and echoes remarks by Boris Cherny, who leads Claude Code at Anthropic. The shift looks like this:

Era	What you optimize	Unit of work
Prompt engineering	How you phrase a single instruction	One turn typed by hand
Context engineering	What goes in the context window (docs, history, tool defs)	The conditions around one answer
Loop engineering	The system that decides what to prompt, when, and whether the result is acceptable	A self-running cycle across many turns

This matters because the highest-leverage activity has shifted from writing individual prompts to designing autonomous workflows that multiply your time. A single well-designed loop can work for minutes, hours, or days — testing, fixing, optimizing, and documenting your codebase while you sleep.

Prerequisites

This article assumes you're familiar with AI coding agents — tools like Claude Code, OpenAI Codex, or Gemini CLI that can read, write, and execute code in your project. Also helpful: understanding LLM function calling — the mechanism by which agents invoke tools (run commands, edit files, call APIs), covered in the Function Calling article. Basic familiarity with git (branches, pull requests) and CI/CD concepts rounds out the prerequisites.

Core Idea

Every agent loop has exactly two ingredients:

Loop = TRIGGER + GOAL

The Trigger — What Kicks It Off

There are three types:

Manual — You explicitly tell the agent to start (e.g. paste a prompt and hit Go). Simple, reliable, still requires your attention once.
Schedule — The loop runs at a certain time of day, or on a recurring interval (every night at 2 AM, every weekday morning, every 30 minutes).
Action — The loop fires in response to an event: a PR being opened, a CI failure, a new GitHub issue, a Slack message.

A fully autonomous system uses schedule or action triggers exclusively — no human needed to start the work. Manual triggers are useful for one-off or high-judgment tasks.
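
The three trigger types can be modelled as a small data structure. The sketch below is illustrative only: `TriggerKind`, `Trigger`, and `should_fire` are hypothetical names rather than part of any agent tool, and the schedule is simplified to a fixed interval instead of a cron expression.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TriggerKind(Enum):
    MANUAL = "manual"      # a human starts the run
    SCHEDULE = "schedule"  # a clock starts the run
    ACTION = "action"      # an external event starts the run


@dataclass
class Trigger:
    kind: TriggerKind
    interval_s: Optional[float] = None  # SCHEDULE only: seconds between runs
    event: Optional[str] = None         # ACTION only: event name, e.g. "ci_failed"


def should_fire(trigger: Trigger, *, now: float, last_run: float,
                event: Optional[str] = None, requested: bool = False) -> bool:
    """Decide whether the loop should wake up (hypothetical helper)."""
    if trigger.kind is TriggerKind.MANUAL:
        return requested
    if trigger.kind is TriggerKind.SCHEDULE:
        return trigger.interval_s is not None and now - last_run >= trigger.interval_s
    return event is not None and event == trigger.event


nightly = Trigger(TriggerKind.SCHEDULE, interval_s=24 * 3600)
on_ci_failure = Trigger(TriggerKind.ACTION, event="ci_failed")
```

Only the manual branch depends on `requested`, which is the one place a human has to be present.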

The Goal — When It Stops

The goal is the stopping condition. It comes in two flavours:

Type	Description	Example
Verifiable	A concrete, testable, deterministic condition	"Every page loads in under 50 ms" — measurable with a benchmark
LLM-as-Judge	The model itself decides when the goal is met	"Refactor until you are happy with the architecture" — satisfaction is subjective

When to use each
Verifiable goals are more reliable and produce more predictable loops. Use LLM-as-Judge for tasks where "good enough" is inherently subjective (code readability, documentation completeness, architecture quality). The trade-off: a verifiable goal is deterministic, while an LLM-as-Judge goal is flexible but can be brittle, because the model's taste may drift across iterations.
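
Both goal types reduce to a function that answers "may the loop stop now?". The following is a minimal sketch under that assumption: `verifiable_goal` and `judge_goal` are hypothetical helpers, the latency numbers are made up, and `judge` is a stand-in for a model call, not a real API.

```python
from typing import Callable, Dict

Goal = Callable[[], bool]  # returns True when the loop may stop


def verifiable_goal(measure: Callable[[], Dict[str, float]], limit_ms: float) -> Goal:
    """Deterministic check: every measured page is under the limit."""
    def check() -> bool:
        return all(ms < limit_ms for ms in measure().values())
    return check


def judge_goal(judge: Callable[[str], str], question: str) -> Goal:
    """Subjective check: a model (stubbed here as `judge`) answers YES or NO."""
    def check() -> bool:
        return judge(question).strip().upper().startswith("YES")
    return check


# Stand-ins for a real benchmark and a real model call.
fast_enough = verifiable_goal(lambda: {"/home": 31.0, "/search": 48.5}, limit_ms=50)
happy = judge_goal(lambda q: "No, the modules are still tangled.",
                   "Is the architecture clean?")
```

The verifiable check returns the same answer for the same measurements, whereas the judged check is only as stable as the model behind `judge`.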

How It Actually Works

The Inner Loop: ReAct Pattern

Most coding agents already run an internal loop called ReAct (Reason + Act), introduced in a 2022 paper by researchers at Princeton and Google Research. Inside a single invocation, the agent alternates between:

Thought → Action → Observation → Thought → Action → Observation → ... → Done

A concrete example:

Thought: The test is failing because `parse_date` returns None when input is empty.
Action: Edit lines 42-47 in utils.py to add a null check before parsing.
Observation: File updated successfully.
Thought: I should re-run the test suite to confirm the fix.
Action: Run `pytest tests/test_utils.py`
Observation: 14 passed, 0 failed.
Thought: All tests pass. Goal condition met.

The outer loop (our topic) wraps this inner cycle: it keeps re-invoking the agent (including its internal ReAct loop) until the high-level goal is reached.
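
The Thought, Action, Observation cycle can be sketched in a few lines. This is a toy under stated assumptions: `react_loop` is a hypothetical driver, `scripted_model` replays the example above instead of calling an LLM, and `FAKE_TOOLS` returns canned observations rather than editing files or running pytest.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One model step: (thought, tool name or None when finished, tool argument)
Step = Tuple[str, Optional[str], str]


def react_loop(model: Callable[[List[str]], Step],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 10) -> List[str]:
    """Alternate Thought -> Action -> Observation until the model stops acting."""
    transcript: List[str] = []
    for _ in range(max_steps):
        thought, tool, arg = model(transcript)  # the model sees everything so far
        transcript.append(f"Thought: {thought}")
        if tool is None:  # no action requested: the model considers itself done
            break
        transcript.append(f"Action: {tool}({arg})")
        transcript.append(f"Observation: {tools[tool](arg)}")
    return transcript


def scripted_model(transcript: List[str]) -> Step:
    """Stand-in for an LLM: picks the next scripted step from progress so far."""
    script: List[Step] = [
        ("parse_date returns None on empty input.", "edit", "utils.py"),
        ("Re-run the tests to confirm the fix.", "run", "pytest tests/test_utils.py"),
        ("All tests pass. Goal condition met.", None, ""),
    ]
    done = sum(line.startswith("Observation:") for line in transcript)
    return script[done]


FAKE_TOOLS = {
    "edit": lambda path: "File updated successfully.",
    "run": lambda cmd: "14 passed, 0 failed.",
}

transcript = react_loop(scripted_model, FAKE_TOOLS)
```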

Outer Loop Anatomy

 ┌─────────────────────────────────────────────────┐
 │                   OUTER LOOP                    │
 │                                                 │
 │  ┌───────┐   ┌──────────┐    ┌────────────────┐ │
 │  │Trigger│──▶│  Agent   │───▶│  Goal Check    │ │
 │  │(wakes)│   │ (ReAct   │    │ (verified or   │ │
 │  │       │   │  inner   │    │  LLM-judged)   │ │
 │  └───────┘   │  loop)   │    └───────┬────────┘ │
 │              └──────────┘            │          │
 │                       ◄──────────────┘          │
 │                       (not met → repeat)        │
 │                                                 │
 │              ┌──────────────────┐               │
 │              │  GOAL MET → STOP │               │
 │              └──────────────────┘               │
 └─────────────────────────────────────────────────┘
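
In code, the diagram amounts to "run the agent, check the goal, repeat". The sketch below assumes hypothetical names (`outer_loop`, `fake_agent`, `state`) and adds an iteration cap that the diagram does not show, as a guard against goals that can never be satisfied.

```python
from typing import Callable


def outer_loop(run_agent: Callable[[int], None], goal_met: Callable[[], bool],
               max_iterations: int = 20) -> int:
    """Re-invoke the agent until the goal check passes.

    Returns the number of agent runs used and raises if the cap is hit.
    """
    for iteration in range(1, max_iterations + 1):
        run_agent(iteration)  # one full agent invocation (inner ReAct loop included)
        if goal_met():
            return iteration
    raise RuntimeError(f"goal not met after {max_iterations} iterations")


# Toy stand-in: each agent run fixes one of three failing tests.
state = {"failing": 3}


def fake_agent(iteration: int) -> None:
    state["failing"] = max(0, state["failing"] - 1)


runs = outer_loop(fake_agent, lambda: state["failing"] == 0)
```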

The Five Building Blocks (plus Memory)

Following Osmani's post and the community guides that build on it, a production-quality loop sits on five foundations, with durable memory as a sixth:

Automations — The scheduler that wakes the loop. Both Claude Code (via /loop) and OpenAI Codex (via the Automations tab) provide built-in cron-like scheduling.
Worktrees — Isolated git working directories so parallel agents don't collide. Each agent gets its own branch in its own directory, sharing the repo's history. Both Codex (built-in) and Claude Code (--worktree flag) support this.
Skills — Reusable project-knowledge files (.claude/skills/ or .codex/skills/) that encode conventions, build steps, architecture rules. The agent loads these automatically so it doesn't guess your preferences each run. This is not the same as the agent having memory — skills encode durable knowledge, not changing state.
Plugins & Connectors — MCP (Model Context Protocol) links to external tools: issue trackers, databases, Slack, CI systems. These let the loop affect the real world — opening PRs, posting results, reading logs.
Sub-agents (Maker-Checker Split) — Separate the agent that writes code from the agent that reviews it. As Boris Cherny noted: "The model that wrote the code is far too generous grading its own homework." Using a stronger model for the verifier (e.g. Claude Opus) and a cheaper one for the maker (e.g. Claude Haiku) is a common pattern.
Memory — Durable state that persists outside the context window. A markdown file, a Linear board, or a GitHub issue that records what's been done, what's pending, and what failed. The agent starts each run with no recollection of the previous one; memory is the only thing that survives between runs.
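
The memory block is the easiest of these to prototype. A minimal sketch, assuming a plain markdown file as the store: `record`, `build_prompt`, and the file name `LOOP_MEMORY.md` are hypothetical, and a real loop would more likely let the agent read and update the file itself.

```python
import tempfile
from pathlib import Path


def record(memory: Path, status: str, note: str) -> None:
    """Append one line of state, e.g. '- [done] optimized /reports query'."""
    with memory.open("a", encoding="utf-8") as fh:
        fh.write(f"- [{status}] {note}\n")


def build_prompt(memory: Path, task: str) -> str:
    """Prepend everything recorded so far to the next run's prompt."""
    history = memory.read_text(encoding="utf-8") if memory.exists() else "(no prior runs)\n"
    return f"Progress so far:\n{history}\nTask: {task}"


memory_file = Path(tempfile.mkdtemp()) / "LOOP_MEMORY.md"  # hypothetical file name
record(memory_file, "done", "optimized /reports query")
record(memory_file, "failed", "cache for /search broke auth tests")
prompt = build_prompt(memory_file, "Continue with the next slowest page.")
```

Recording failures as well as successes keeps the next run from retrying an approach that already broke something.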

Implementation: Claude Code vs. OpenAI Codex

Feature	Claude Code	OpenAI Codex
Run-until-done	`/goal "condition"` — keeps working until the condition is met	`codex /goal "condition"` (CLI 0.128.0+)
Recurring schedule	`/loop "prompt" --schedule "cron"`	Automations tab — pick project, prompt, cadence
Sub-agents	`.claude/agents/` — markdown files with name, model, isolation settings	`.codex/agents/` — TOML files with name, description, instructions
Worktree isolation	`--worktree` flag, `isolation: worktree` setting	Built-in worktree support
Skills	`.claude/skills/` directory	`.codex/skills/` directory, `$skillname` syntax

How /goal works internally
When you issue /goal in Claude Code, the system sets a session-scoped "stop hook condition." After every agent turn, a separate, smaller evaluation model checks whether the condition is met. If not, it appends a system prompt to trigger the next turn automatically — no user input required. The condition must be both concrete enough for the evaluator to verify and loose enough that honest failure modes (external blocker, resource unavailable) are accepted gracefully.
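
The mechanism described above can be approximated as follows. This is a sketch of the idea, not Claude Code's implementation: `Verdict`, `run_with_stop_condition`, and the continuation message are invented for illustration, and the evaluator is a plain function where the real system would call a model.

```python
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    MET = "met"
    NOT_MET = "not_met"
    BLOCKED = "blocked"  # honest failure, e.g. an external service is down


def run_with_stop_condition(agent_turn: Callable[[List[str]], str],
                            evaluate: Callable[[str, List[str]], Verdict],
                            condition: str, max_turns: int = 50) -> Verdict:
    """After every turn, ask the evaluator; continue only on NOT_MET."""
    messages: List[str] = [f"Goal: {condition}"]
    for _ in range(max_turns):
        messages.append(agent_turn(messages))
        verdict = evaluate(condition, messages)
        if verdict is not Verdict.NOT_MET:
            return verdict  # MET and BLOCKED both end the session
        # No user input: the harness itself injects the next instruction.
        messages.append(f"Condition not yet met: {condition}. Continue.")
    return Verdict.NOT_MET


turns = iter(["ran tests: 2 failed", "ran tests: 0 failed"])
result = run_with_stop_condition(
    agent_turn=lambda msgs: next(turns),
    evaluate=lambda cond, msgs: (Verdict.MET if msgs[-1].endswith(": 0 failed")
                                 else Verdict.NOT_MET),
    condition="all tests pass",
)
```

The `BLOCKED` verdict is what lets an honest failure end the session instead of forcing the agent to keep trying.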

Worked Example: The Loop Library

The Loop Library (curated collection of ready-to-use loop prompts) demonstrates the pattern with concrete tasks:

#	Loop Name	Trigger	Goal Type	What It Does
1	Sub-50ms Page Load	Manual	Verifiable	Optimizes every page in the app until all load under 50ms
2	Overnight Docs Sweep	Schedule (nightly)	LLM-as-Judge	Reviews codebase, updates docs, opens a PR
3	Architecture Satisfaction	Manual or nightly	LLM-as-Judge	Refactors code until the LLM is "happy" with the architecture
4	Production Error Sweep	Schedule (nightly)	Verifiable	Reviews logs, traces root causes, fixes them, opens PRs
5	100% Test Coverage	Manual	Verifiable	Adds tests until coverage reports hit 100%
6	SEO/GEO Visibility	Schedule (weekly)	Verifiable	Runs crawl + audit, fixes highest-leverage issue, repeats
7	Logging Coverage	Manual	LLM-as-Judge	Adds logging until every important path produces useful logs
8	Full Product Evaluation	Manual (long-running)	LLM-as-Judge	Creates scenarios, runs them, fixes failures, reruns
9	Quality Streak	Manual	Verifiable	Runs scenarios until N pass in a row
10	Nightly Changelog	Schedule (nightly)	LLM-as-Judge	Reviews recent changes and updates the changelog

A Worked Example: The Sub-50ms Page Load Loop
Setup: You have a web app with 12 pages, 3 modals, and 2 admin panels. Some load in 30ms, one takes 320ms.

Trigger: You copy the prompt, append /goal "Every page loads in under 50ms", hit Go.

Iteration 1: Agent measures all pages. Finds the 320ms page has an unoptimized database query in the API route.

Action: Adds eager loading, adds an index, caches the response.

Measure: That page now loads in 45ms. ✅

Iteration 2: Agent moves to the next slowest page (180ms)...

Continues until every single page and modal is under 50ms.

Stops automatically when the condition is met. Reports results.
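
The control flow of this loop can be simulated without a real app. In the sketch below the page names and latencies are invented apart from the 320 ms, 180 ms, and 30 ms figures above, and `optimize` is a stand-in that simply reports 45 ms after every fix; a real loop would re-run the benchmark.

```python
from typing import Callable, Dict, List

LIMIT_MS = 50.0


def page_load_loop(latencies: Dict[str, float],
                   optimize: Callable[[str, float], float],
                   max_iterations: int = 100) -> List[str]:
    """Fix the slowest page first; stop once every page is under LIMIT_MS."""
    fixed: List[str] = []
    for _ in range(max_iterations):
        slow = {page: ms for page, ms in latencies.items() if ms >= LIMIT_MS}
        if not slow:  # verifiable goal met
            return fixed
        worst = max(slow, key=slow.get)
        latencies[worst] = optimize(worst, latencies[worst])  # new measurement
        fixed.append(worst)
    raise RuntimeError("iteration cap reached before all pages were under the limit")


# Simulated measurements; the page names are hypothetical.
pages = {"/dashboard": 320.0, "/reports": 180.0, "/home": 30.0}
order = page_load_loop(pages, optimize=lambda page, ms: 45.0)
```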

Combining Loops

One of the most powerful patterns is chaining loops. For example:

Logging Coverage Loop (run once manually) → ensures every path produces logs
Production Error Sweep (run nightly) → reads those logs and fixes errors automatically

The output of one loop creates the conditions that make another loop effective. As your library of loops grows, you can compose them into an increasingly autonomous development pipeline.
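
The dependency between the two loops can be shown with a toy pipeline. Everything here is simulated and the names (`Workspace`, `run_chain`, the two loop functions) are hypothetical: the point is only that the error sweep has nothing to act on until the logging loop has run.

```python
from typing import Callable, Dict, List

Workspace = Dict[str, List[str]]


def logging_coverage_loop(ws: Workspace) -> Workspace:
    """Run once: make every broken path emit a log line (simulated)."""
    ws["logs"] = [f"ERROR in {path}" for path in ws["broken_paths"]]
    return ws


def error_sweep_loop(ws: Workspace) -> Workspace:
    """Run nightly: fix whatever the logs point at (simulated)."""
    for line in ws.get("logs", []):
        ws["fixed"].append(line.split(" in ", 1)[1])
    ws["logs"] = []
    return ws


def run_chain(ws: Workspace, loops: List[Callable[[Workspace], Workspace]]) -> Workspace:
    """Feed the output of each loop into the next."""
    for loop in loops:
        ws = loop(ws)
    return ws


without_logs = error_sweep_loop({"broken_paths": ["checkout", "login"], "fixed": []})
with_logs = run_chain({"broken_paths": ["checkout", "login"], "fixed": []},
                      [logging_coverage_loop, error_sweep_loop])
```

Run on its own, the sweep fixes nothing; run after the logging loop, it fixes both paths.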

Common Misconceptions

"Loops are just like a test suite"
A test suite is deterministic: it passes or fails the same way every time. Loops are fundamentally different because the agent improves the code between iterations. The Full Product Evaluation Loop, for instance, doesn't just run tests — it fixes the underlying causes of failures, reruns affected scenarios, and repeats until everything passes. The agent is both player and referee.

"You can loop-ify feature development"
Building new features from scratch with loops is the hardest use case. One developer tried to clone Excel to feature parity using a loop; it ran for days before he stopped it. The problem: there is no clear stopping condition for "what features should exist" or "when is it done." Features require human judgment about what to build, not just how well it is built. Loops excel at optimization, refactoring, testing, and maintenance, not greenfield feature work.

"Loops are free / cheap"
Loops are the opposite of cheap. They churn through tokens autonomously until the goal is met. A /goal session can run for 10 minutes or 10 hours. One user ran a 9-hour /goal session that produced 45 commits, 14,259 lines of code, and consumed 4.16 million rows of data. Each iteration costs tokens for both the main agent and the verifier model. Start with short timeouts before letting them run unattended.
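
One way to start conservatively is to wrap the loop in an explicit budget. The guard below is a sketch with hypothetical names (`run_with_budget`, `BudgetExceeded`); it assumes each iteration can report how many tokens it consumed, which depends on the tool you are driving.

```python
import time
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when the loop runs out of tokens or time before the goal is met."""


def run_with_budget(step: Callable[[], int], goal_met: Callable[[], bool],
                    max_tokens: int, max_seconds: float,
                    clock: Callable[[], float] = time.monotonic) -> int:
    """Run `step` until the goal is met; abort when tokens or time run out.

    `step` performs one iteration and returns the tokens it consumed
    (maker plus verifier). Returns the total tokens spent.
    """
    spent = 0
    deadline = clock() + max_seconds
    while not goal_met():
        if spent >= max_tokens:
            raise BudgetExceeded(f"token budget exhausted: {spent}")
        if clock() >= deadline:
            raise BudgetExceeded("time limit reached")
        spent += step()
    return spent


remaining = {"n": 3}


def toy_step() -> int:
    remaining["n"] -= 1
    return 40_000  # made-up token cost per iteration


total = run_with_budget(toy_step, lambda: remaining["n"] == 0,
                        max_tokens=200_000, max_seconds=60)
```

The budget is checked before each iteration, so the final iteration can overshoot `max_tokens` by at most one step's worth.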

"The agent's self-assessment is reliable"
When the agent says "Task complete," it may be optimistic. The model that wrote the code is often too generous grading its own homework. Always use either a verifiable stopping condition (metric, test pass, coverage number) or a separate verifier model (the LLM-as-Judge pattern implemented by /goal internally). Never trust a single agent turn's "I think I'm done" without an independent check.
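
The independent check can be made structural by never letting the maker decide when it is done. A minimal sketch, assuming hypothetical `make` and `check` callables that would each wrap a separate model in practice; the toy versions here use string matching purely for illustration.

```python
from typing import Callable, Optional, Tuple


def maker_checker(make: Callable[[Optional[str]], str],
                  check: Callable[[str], Tuple[bool, str]],
                  max_rounds: int = 5) -> Tuple[str, int]:
    """The maker proposes; an independent checker accepts or returns feedback."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = make(feedback)  # the maker's own "done" is never trusted
        ok, feedback = check(candidate)
        if ok:
            return candidate, round_no
    raise RuntimeError("checker never accepted the maker's output")


def toy_maker(feedback: Optional[str]) -> str:
    if feedback is None:
        return "def parse_date(s): return parse(s)"
    return "def parse_date(s): return parse(s) if s else None"


def toy_checker(code: str) -> Tuple[bool, str]:
    ok = "if s else None" in code
    return ok, ("" if ok else "empty input is not handled")


accepted, rounds = maker_checker(toy_maker, toy_checker)
```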

Key Takeaways

An agent loop = Trigger + Goal. The trigger (manual, schedule, action) wakes the loop; the goal (verifiable or LLM-judged) stops it.
Loop engineering is the new leverage. The highest-value skill is no longer writing good prompts — it's designing systems that prompt agents autonomously.
Use verifiable goals when you can, LLM-as-Judge when you must. Verifiable goals produce more predictable, cheaper loops.
Build with the five blocks: automations, worktrees, skills, connectors/MCP, and sub-agents. Add durable memory (file or board) that survives between runs.
Don't loop feature development. Loops excel at optimization, refactoring, testing, documentation, and monitoring — not greenfield feature work.
Costs are real and can be large. Every iteration spends tokens on both the main agent and the verifier, and a loop that runs for hours repeats that cost many times over. Monitor, budget, and start conservatively.
The Loop Library is a starting point. Copy, adapt, combine. The most powerful loops are the ones you chain together: a one-time logging coverage pass, followed by an error sweep and a changelog update that run every night.

Open Questions

Where the evidence is thin

Multi-agent loop coordination is still nascent. How do you prevent two scheduled loops from conflicting (e.g., one refactoring while another runs tests)?

Long-running loop degradation is poorly documented. Context windows fill up, agents may "forget" early decisions, and the quality of LLM-as-Judge evaluations can drift over hundreds of iterations.

Cost benchmarks are anecdotal. The Loop Library shares inspiring successes, but there's no systematic study of average token burn per loop type.

The optimal verifier model for LLM-as-Judge loops isn't established. Should you always use a frontier model? Can a small fine-tuned model do the job for specific loop types?

References

Berman, M. "Loops are emerging as the single biggest unlock..." (YouTube video transcript, ~June 2026). Source material for this article.
Loop Library — signals.forwardfuture.ai/loop-library. 15 copy-ready loop prompts with attribution and verify/stop conditions.
LushBinary. "Loop Engineering: The Guide for AI Agents" — June 9, 2026. Detailed guide on the five building blocks.
MindStudio. "What Is an Agentic Loop? The New Meta for AI Coding Agents" — Covers ReAct pattern, anatomy of a coding agent loop, common mistakes.
Oracle Developers. "What Is the AI Agent Loop?" — 2026. Production considerations on cost (4x more tokens than standard chat, up to 15x in multi-agent) and observability.
Addy Osmani. "Loop Engineering" — June 7, 2026. Original blog post establishing the term, the five pieces, and the shift from prompting to system design.
Griffiths, B. D. "Forget Prompts: 'Loop Engineering' Is All the Rage Now" — Business Insider, June 2026. Practitioner perspectives from Cherny, Steinberger, and Claire Vo on cost, adoption, and caveats.
DataScienceDojo. "Agentic Loops: From ReAct to Loop Engineering (2026 Guide)" — Identifies four root causes of loop failure in production.

Related

Browser Use: Making Websites Accessible to AI Agents

LLM Function Calling: Giving Language Models a Way to Act