这一页按 Gavin 的要求分成两部分:上半部分是中文精翻译文,下半部分保留 Addy Osmani 原文,方便对照阅读。
PART 01

第一部分:中文精翻译文

所谓 loop engineering,本质上是:你不再亲自扮演那个一轮轮提示 agent 的人,而是去设计一套替你完成这件事的系统。这里的 loop,可以理解为一种递归目标:你先定义目的,再让 AI 不断迭代,直到任务完成。它大致由五块积木构成,而 Claude Code 和 Codex 现在都已经具备这五块。

我认为,这很可能会成为未来人们与 coding agent 协作的主流方式。不过,现在还很早,我依然保持怀疑,而且你绝对要小心 token 成本(token 宽裕的用户和 token 紧张的用户,使用模式会完全不一样)。你也仍然需要某种办法确保质量不会滑坡,对“AI 糊活(slop)”的担心并不是杞人忧天。话虽如此,这件事仍然值得认真拆开看看。

@steipete 最近说过一句很有代表性的话:“你已经不该再直接 prompt coding agent 了,你该设计的是会去 prompt agent 的 loop。” 类似地,Anthropic 的 Claude Code 负责人 @bcherny 也说过:“我已经不再直接 prompt Claude。我让 loop 去 prompt Claude,并让它自己判断接下来该做什么。我的工作,是写 loop。”

那么,这些话到底意味着什么?

过去两年里,你从 coding agent 身上获得结果的方式,基本都是:写一个足够好的 prompt,补足足够多的上下文;你敲一句,它回一句;你再补下一句,它再继续。agent 是工具,而你始终握着它,一轮一轮地推动它往前走。这个阶段并没有彻底结束,但至少已经有人认为,它正在结束。

现在更像是:你先搭一个小系统,让它自己去发现工作、分派工作、检查结果、记录完成项、决定下一步,然后由这套系统代替你去“戳”那些 agent。我之前写过它的近亲 —— agent harness engineering,它关注的是单个 agent 所处的工作环境;也写过 factory model,它更像一套造软件的系统。Loop engineering 则位于 harness 之上:它仍然是 harness,但它会按节拍运行、会拉起小助手、会自己喂自己。

最让我意外的一点是:这件事已经不再只是“某个工具的小技巧”了。一年前,如果你想搞 loop,多半得自己堆一大坨 bash,然后长期维护那一坨脚本,而且那往往只属于你自己。现在不同了,这些组件已经开始直接内置进产品里。Steinberger 列出的那套能力,几乎和 Codex App 一一对应;换到 Claude Code,也几乎能一一映射。一旦你意识到这些产品在结构上其实是同构的,你就不会再纠结“到底站哪边”;你会开始去设计一套不依赖单一工具、换到哪边都还能跑的 loop。

五件套,以及外置记忆

一套 loop 至少需要五个组件,再加上一个“记忆位”。先把清单列出来,再逐项展开:

  1. Automations:能按计划自动触发,自己做发现、筛选、分诊。
  2. Worktrees:让多个 agent 并行干活时,彼此不会踩坏对方的工作区。
  3. Skills:把项目知识写下来,别让 agent 每次都靠猜。
  4. Plugins / Connectors:把 agent 接到你真实在用的工具链里。
  5. Sub-agents:让一个负责提出方案,另一个负责审查方案。

第六样东西,就是 memory。它可以是一个 markdown 文件,也可以是一块 Linear 看板,总之必须是活在“单次对话之外”的状态载体,用来记已经做了什么、接下来还要做什么。听起来很笨,但恰恰是所有长时运行 agent 都依赖的那一招:模型在多次运行之间会遗忘,所以记忆必须落到磁盘上,而不是只留在上下文窗口里。agent 会忘,repo 不会忘。

而现在,这五件套加上外置记忆,两个产品其实都已经具备了。

Loop engineering capability map

名字在不同工具里会稍有差异,但能力本身已经是同一类东西。接下来值得逐项讲清楚,因为 loop 最终能不能站得住,往往就决定在这些细节上。

Automations:让循环真正循环起来

Automations 是 loop 成为“真正循环”的关键。没有它,很多所谓 loop 其实只是你某天手动跑过一次的脚本。在 Codex App 里,你可以在 Automations 面板里新建一条自动任务:选项目、写 prompt、设频率,再决定它跑在本地 checkout 还是后台 worktree 上。发现问题的运行结果会进 Triage inbox;什么也没发现的运行会自动归档,这点很不错。OpenAI 内部已经拿它做很多“很无聊但必须有人做”的事:每日 issue 分诊、CI 失败摘要、commit 简报、查上周新引入的 bug。更重要的是,一条 automation 还能调用 skill —— 于是重复性任务终于可以被维护。你不再往定时任务里粘一堵永远没人更新的大 prompt,而是直接触发一个有名字、可复用的 skill。

Claude Code 走向同一能力的路径是 scheduling 和 hooks。你可以用 /loop 按节奏重复跑 prompt 或命令,可以创建 cron 任务,也可以在 agent 生命周期中的特定节点触发 shell hook;如果你想合上电脑后它还继续干活,还能把整件事扔给 GitHub Actions。底层逻辑完全一样:定义一个 autonomous task,给它一个 cadence,让发现结果来找你,而不是你自己来回巡检。

还有一个更贴近本文核心的会话内原语值得单拎出来:/loop 是按时间节奏重跑,/goal 则是“只要目标条件还没满足,就继续干”。每一轮结束后,还会由一个独立的小模型判断任务是否已经完成,所以写代码的 agent 不是给自己打分的那个 agent。你可以把条件写成“test/auth 下的测试全部通过且 lint 干净”,然后离开。Codex 里也有同名的 /goal:它会跨多轮持续工作,直到一个可验证的停止条件成立,并支持 pause / resume / clear。相同的原语,同时出现在两边,这几乎就是整篇文章想说明的模式。

这一层负责的是“把工作浮到水面上”。loop 的剩余部分,才负责对这些工作采取行动。

Worktrees:并行不等于混乱

一旦你让两个以上的 agent 同时工作,最先暴露出来的失败模式往往不是模型能力,而是文件冲突。两个 agent 同时改同一份文件,本质上和两个工程师同时往同一行代码上提交、却没有先沟通,是同一种头痛。git worktree 解决的正是这件事:它给你一个独立工作目录、独立分支,但共享同一个 repo 历史,因此一个 agent 的修改在物理层面上就碰不到另一个 agent 的 checkout。

Codex 把 worktree 直接内建进并行工作流里,所以多个线程可以同时打同一个 repo 而不互相撞车。Claude Code 也提供同等隔离:原生的 git worktree、可以在独立 checkout 中开启 session 的 --worktree 选项,以及可以贴到 subagent 上的 isolation: worktree 设置,让每个 helper 自动拿到一个干净 worktree,并在结束后自己清理。我之前在 the orchestration tax 里写过人这一面的代价:worktree 只解决了机械层面的碰撞,但真正的上限依然是你。能同时跑多少 agent,最终取决于你的 review 带宽,而不是工具宣传页写了多少并发。

Skills:别再每次都重新解释项目

Skill 的意义,是让你别再像金鱼一样每次开新 session 都重新解释一遍项目。两边现在基本都采用同一种格式:一个目录,里面有一份存放指令与元数据的 SKILL.md,外加可选的 scripts、references、assets。Codex 里,你可以用 $ 或 /skills 主动调用 skill,或者在任务描述命中 skill 时让它自动触发,这也是为什么一个紧凑、无聊、边界清楚的描述,通常比聪明但模糊的描述更好用。Claude Code 的模式几乎一样,我在 agent skills 里专门写过这一点。

Skill 还是“意图债”不再反复付费的地方。我在 the intent debt 里说过:agent 每次开场都是冷启动,任何你没写清楚的意图空洞,它都会用一个自信的猜测去补。Skill 就是把这些意图外置:团队约定、构建步骤、以及“我们之所以不这么做,是因为曾经出过那次事故” —— 这些内容只要写一次,agent 之后每轮都会读。没有 skill,loop 每次都得从零重新推断你的项目;有了 skill,它才开始具备真正的复利。

还有一点要分清:skill 是内容与指令的作者格式,plugin 则是它的分发方式。当你想把 skill 跨 repo 共享,或者把多个 skill 打包一起发出去时,你会把它们做成 plugin。这在 Codex 成立,在 Claude Code 里也成立。

Plugins & Connectors:让 loop 接上真实工具链

一个只能看本地文件系统的 loop,本质上只是个很小的 loop。Connectors —— 今天大多建立在 MCP 之上 —— 让 agent 可以读 issue tracker、查数据库、请求 staging API、往 Slack 丢消息。Codex 和 Claude Code 都会说 MCP,所以你为一边写的 connector,通常也能直接在另一边用。再进一步,plugin 还可以把 connector 和 skill 打包在一起,让你的同事“一次安装,整套上身”,而不是再凭记忆手工重建整套环境。

这就是“agent 告诉你它会怎么修”和“loop 自己开 PR、关联 Linear ticket、CI 绿了再发消息”的差别。Connector 的存在,决定了 loop 能否进入你的真实工作环境里动作,而不只是像一个演讲者那样告诉你:如果它有权限,它本来会怎么做。

Sub-agents:把执行者和检查者拆开

在 loop 里,最有价值的结构设计,几乎就是把“写的人”和“查的人”分开。写代码的那个模型,天生就太容易对自己的作业宽容;换一个拥有不同指令、甚至不同模型的第二个 agent,往往更容易抓出第一个 agent 自己说服自己的那些问题。

Codex 只有在你明确要求时才会拉起 subagent,它们可以并行跑,最后再把结果折叠回一个答案。你甚至可以在 .codex/agents/ 里用 TOML 自定义 agent:名字、描述、指令,外加可选的 model 与 reasoning effort。于是你的安全审查员可以用高推理成本的大模型,而你的 explorer 只是个快速、只读的侦察员。Claude Code 也是同理:在 .claude/agents/ 里定义 subagent,再通过 agent team 在它们之间传递任务。最常见的分工,仍然是一个 agent 做探索,一个做实现,一个按 spec 做验证。

我之前已经从两个角度论证过这件事:一次是在 the code agent orchestra,一次是在 adversarial code review。它在 loop 里尤其重要,因为 loop 往往是在你不盯着屏幕时运行的;如果没有一个你真敢信的 verifier,你根本不可能放心离开。当然,subagent 也会更烧 token,因为每个 agent 都要独立走一遍模型推理和工具调用。所以第二意见要花在真正值得的环节上。Claude Code 的 /goal 其实也在底层做了同一件事:由一个新的模型来决定 loop 是否该停,而不是让做事的模型自己宣布完工 —— 连 stop condition 本身,都贯彻了 maker / checker 分离。

一个 loop 长什么样

把前面的积木拼起来后,一条普通线程就会变成一个小型控制面板。下面是我自己一直在用的一种形状:

每天早上,一条 automation 在 repo 上自动跑起来。它的 prompt 会调用一个 triage skill:去读昨天的 CI 失败、当前未关闭 issue、最近提交记录,再把发现写进一个 markdown 文件或者一块 Linear 看板。对于每个真正值得处理的发现,这条线程会开一个隔离 worktree,派一个 subagent 去起草修复,再派第二个 subagent 按项目 skills 和现有测试去审这份草稿。

Connector 让 loop 自己去开 PR、更新 ticket。任何它自己处理不了的情况,才会进入 triage inbox 交给我。那份状态文件是整套系统的脊柱:它记得已经试过什么、什么已经通过、什么还未关闭,于是第二天早上再次运行时,这套系统能从昨天停下来的地方继续,而不是从零开始。

注意你在这里真正做的事:你只设计了一次系统本身,而不是手动 prompt 它执行每一步。这正是 Steinberger 那句话落到地面的样子。更重要的是,这套 loop 在 Codex 和 Claude Code 里都成立,因为底层积木已经越来越像。

loop 仍然不能替你做什么

Loop 改变的是工作方式,但它不会把你从工作里删除。 而且,loop 越顺滑,有三个问题往往不是变轻,而是变得更尖锐。

第一,验证责任仍然在你。 一个无人值守的 loop,同样也是一个会在无人值守状态下持续犯错的 loop。你之所以要把 verifier 和 maker 分开,就是为了让“它说自己完成了”这件事稍微更可信一点;即便如此,“完成”也依然只是一个声明,不是证明。我一直重复那句在 AI 时代 code review 里同样适用的话:你的工作不是让 AI 产出代码,而是只发布那些你确认过真的能工作的代码。

第二,如果你放任不管,你对系统的理解还是会继续腐烂。 loop 越快地替你交付那些你并没亲手写出的代码,“系统实际是什么”和“你以为系统是什么”之间的差距就越大。我把这叫作 comprehension debt。一个平滑的 loop 并不会自动消灭它,反而可能让它长得更快,除非你真的去读 loop 产出的东西。

第三,最舒服的姿势,往往也是最危险的姿势。 当 loop 已经会自己跑起来时,人最容易犯的错误,就是停止形成判断,只接收它吐回来的结论。我把这叫作 cognitive surrender。如果你带着判断去设计 loop,它会放大你的工程能力;如果你设计 loop 只是为了不再思考,它也会更快地把你推进坑里。动作一样,结果完全相反。

Build the loop. Stay the engineer.

我认为,这大概率是我们工作方式将会演化到的样子。但也必须承认:如果我不亲自 review 代码,或者完全依赖自动 loop 去修东西,我的产品质量一定会往下掉,甚至会掉进一个越修越坏、越跑越深的下行螺旋。

所以,当然可以去搭你的 loops,但也别忘了:直接 prompt 你的 agents 依然是有效的。关键不是“全面替换”,而是找到合适的平衡点。

Loop 也会因为“你是谁”而产生完全不同的结果。两个人可以搭出几乎一模一样的 loop,但最后走向两个相反的方向:一个人拿它去加速自己深刻理解的工作;另一个人拿它去逃避对工作的理解。Loop 自己看不出区别,能看出区别的是你。

这也是为什么 loop design 比 prompt engineering 更难,而不是更简单。Cherny 想表达的,不是工作变容易了,而是“杠杆点变了”。

去 build the loop。但请用一种“我仍然打算当工程师”的方式去 build,而不是只当那个按下开始按钮的人。

PART 02

第二部分:英文原文

Below is the original English text by Addy Osmani, preserved for side-by-side reading.

Loop engineering is replacing yourself as the person who prompts the agent. You design the system that does it instead. A loop here can be thought of as a recursive goal where you define a purpose and the AI iterates until complete. It's roughly five building blocks, and Claude Code and Codex both have all five now.

I believe this may be the future of how we work with coding agents. However, it's still early, I'm skeptical, and you absolutely have to be careful about token costs (usage patterns can vary wildly depending on whether you are token rich or token poor). You also still need some way to ensure quality doesn't drop, and concerns about slop are valid. That said, let's explore what this is all about.

@steipete recently said: “You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents.” Similarly, @bcherny, head of Claude Code at Anthropic, said: “I don't prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops.”

Okay, so what does any of that mean?

For like two years the way you got something out of a coding agent was you wrote a good prompt and shared enough context. You type a thing, you read what came back, you type the next thing. The agent is a tool and you are holding it the entire time, one turn after the other. That part is kind of over, or at least some think it's going to be.

Now you build a small system that finds the work, hands it out, checks it, writes down what is done and then decides the next thing, and you let that system poke the agents instead of you. I wrote before about the cousins of this: agent harness engineering, which is about building the environment a single agent runs inside, and the factory model, the system that builds the software. Loop engineering sits one floor above the harness. It is still the harness, but it runs on a timer, it spawns little helpers, and it feeds itself.

The thing that surprised me is this is not really a tool thing anymore. A year ago if you wanted a loop you wrote a pile of bash and you maintained that pile forever and it was yours and only yours. Now the pieces just ship inside the products. Steinberger's list maps almost exactly onto the Codex app, and then almost the same onto Claude Code. And once you notice the shape is the same you stop arguing about which tool, you just design a loop that still works no matter which one you happen to be sitting in.

The five pieces, and then notes

A loop needs five things and then one place to remember stuff. Let me list it first and then map it.

  1. Automations that go off on a schedule and do discovery and triage by themselves.
  2. Worktrees so two agents working in parallel don't step on each other.
  3. Skills to write down the project knowledge the agent would otherwise just guess.
  4. Plugins and connectors to plug the agent into the tools you already use.
  5. Sub-agents so one of them has the idea and a different one checks it.

Then the sixth thing, the memory. A markdown file, or a Linear board, anything that lives outside the single conversation and holds what's done and what is next. Sounds too dumb to matter. But it's the same trick every long-running agent depends on, and I went into it in long-running agents: the model forgets everything between runs, so the memory has to be on disk and not in the context. The agent forgets, the repo doesn't.
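To make the "memory on disk" idea concrete, here is a minimal sketch of a state file that survives between runs. It assumes a JSON file rather than the markdown file or Linear board mentioned above, and the file name and the `done` / `next` keys are hypothetical choices for illustration, not anything either product prescribes.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read the loop's memory from disk; start empty on the very first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"done": [], "next": []}


def record_done(state: dict, item: str) -> dict:
    """Return a new state with `item` moved from 'next' to 'done'."""
    remaining = [i for i in state["next"] if i != item]
    done = state["done"] + ([item] if item not in state["done"] else [])
    return {"done": done, "next": remaining}


def save_state(path: Path, state: dict) -> None:
    """Persist the state so the next run does not start from zero."""
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
```

Each run would call `load_state` first and `save_state` last, so whatever the model forgets between sessions is still sitting in the repo.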

Both products have all five now.

Loop engineering capability map

The names are a bit different here and there but the capability is the same thing. Let me go one by one because honestly the details are where a loop either holds together or quietly leaks everywhere.

Automations, this is the heartbeat

Automations are what make a loop an actual loop and not just one run you did once. In the Codex app you make one in the Automations tab and you pick the project, the prompt it will run, how often, and if it runs on your local checkout or on a background worktree. The runs that find something go to a Triage inbox, and the runs that find nothing just archive themselves which is nice. OpenAI uses them internally for boring stuff like daily issue triage, summarizing CI failures, writing commit briefings, hunting bugs somebody added last week. And an automation can call a skill, so the recurring thing stays maintainable: you fire $skill-name instead of pasting a giant wall of instructions into a schedule that nobody will ever update.

Claude Code gets to the same place but through scheduling and hooks. You can run a prompt or a command on an interval with /loop, you can schedule a cron task, you can fire shell commands at certain points in the agent lifecycle with hooks, or you push the whole thing to GitHub Actions if you want it to keep running after you close the laptop. Same idea exactly: you define an autonomous task, you give it a cadence, and the findings come to you so you are not the one going around checking.

There is a second in-session primitive worth knowing, and it's the one closer to what this whole post is about. /loop re-runs on a cadence. /goal keeps going until a condition you wrote is actually true, and after every turn a separate small model checks whether you are done, so the agent that wrote the code isn't the one grading it. You give it something like "all tests in test/auth pass and lint is clean" and walk away. Codex has the same thing, also called /goal: it keeps working across turns until a verifiable stopping condition holds, with pause and resume and clear. Same primitive, both tools, which is kind of the pattern for this whole article.
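The shape of a goal-driven loop can be sketched in a few lines. This is not how /goal is implemented in either tool; it is a hypothetical illustration of the idea, where `work` stands in for the agent doing a turn, `is_done` stands in for the separate checker, and `max_turns` is an assumed safety cap to keep token spend bounded.

```python
from typing import Callable


def run_until_goal(
    work: Callable[[int], str],
    is_done: Callable[[str], bool],
    max_turns: int = 10,
) -> tuple[bool, int]:
    """Keep taking turns until the checker accepts the result or the cap is hit.

    Returns (goal_met, turns_used). The worker never grades its own output:
    only `is_done`, a separate callable, decides when to stop.
    """
    for turn in range(1, max_turns + 1):
        result = work(turn)
        if is_done(result):
            return True, turn
    return False, max_turns
```

The point of passing the checker in separately is that you can swap it for something stricter (a test run, a second model) without touching the worker.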

So this is the part that surfaces the work. The rest of the loop is what acts on it.

Worktrees so parallel doesn't turn into chaos

The second you run more than one agent the files start colliding, and that becomes the failure. Two agents writing the same file is the exact same headache as two engineers committing to the same lines and nobody talked to each other first. A git worktree fixes it: it's a separate working directory on its own branch sharing the same repo history, so one agent's edits literally cannot touch the other one's checkout.
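As a rough sketch of what that isolation looks like if you script it yourself, the helper below builds a `git worktree add` command that gives each agent its own directory and branch. The `agent/<name>` branch naming and the sibling-directory layout are assumptions made for this example; the products handle this for you in their own way.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, agent: str) -> list[str]:
    """Build the git command that creates an isolated checkout for one agent."""
    branch = f"agent/{agent}"                       # hypothetical naming scheme
    checkout = repo.parent / f"{repo.name}-{agent}"  # sibling dir next to the repo
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(checkout)]


def create_worktree(repo: Path, agent: str) -> None:
    """Run the command; raises CalledProcessError if git refuses."""
    subprocess.run(worktree_add_command(repo, agent), check=True)
```

Calling `create_worktree(Path("myrepo"), "fixer")` inside a real repository would create a new branch and a separate working directory that shares the same history.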

Codex builds the worktree support right in so several threads hit the same repo at once and don't bump into each other. Claude Code gives you the same isolation with git worktree, a --worktree flag to open a session in its own checkout, and an isolation: worktree setting you stick on a subagent so each helper gets a fresh checkout that cleans itself up after. I wrote about the human side of all this in the orchestration tax: the worktrees take away the mechanical collision but YOU are still the ceiling, and your review bandwidth decides how many you can actually run, not the tool.

Skills, so you stop explaining your project every single time

A skill is how you stop re-explaining the same project context every session like a goldfish. Both tools use the same format, a folder with a SKILL.md inside holding instructions and metadata, and then optional scripts, references, assets. Codex runs a skill when you call it with $ or /skills, or by itself when your task matches the skill description, which is the reason a tight boring description beats a clever one. Claude Code does it the same way and I wrote the pattern up in agent skills.
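For a feel of the folder layout, here is a small sketch that scaffolds a skill directory. The text above only says SKILL.md holds instructions and metadata; the `name` and `description` front-matter fields used here are an assumption for illustration, so check each tool's documentation for the exact metadata it expects.

```python
from pathlib import Path


def scaffold_skill(root: Path, name: str, description: str, instructions: str) -> Path:
    """Create <root>/<name>/ with a SKILL.md and the optional subfolders."""
    skill_dir = root / name
    for sub in ("scripts", "references", "assets"):
        (skill_dir / sub).mkdir(parents=True, exist_ok=True)

    # Assumed metadata block: a tight, boring description helps auto-matching.
    body = (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "---\n\n"
        f"{instructions.strip()}\n"
    )
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(body, encoding="utf-8")
    return skill_file
```

A call such as `scaffold_skill(Path("skills"), "ci-triage", "Summarize yesterday's CI failures.", "1. Read the CI logs...")` would produce one folder you can commit alongside the code.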

Skills are also where intent stops costing you over and over. I argued in the intent debt that an agent starts every session cold and it will fill any hole in your intent with a confident guess. A skill is that intent written down on the outside, the conventions, the build steps, the “we don't do it like this because of that one incident”, written one time where the agent reads it every run. Without skills the loop re-derives your whole project from zero every cycle, with skills it kind of compounds.

One thing to keep straight, the skill is the authoring format and a plugin is how you ship it. When you want to share a skill across repos or bundle a few together you package them as a plugin. True in Codex, true in Claude Code.

Plugins and connectors, the loop touches your real tools

A loop that can only see the filesystem is a tiny loop. Connectors, which are built on MCP, let the agent read your issue tracker, query a database, hit a staging API, drop a message in Slack. Codex and Claude Code both speak MCP so the connector you wrote for one usually just works in the other. And plugins bundle connectors and skills together so your teammate installs your setup in one go instead of rebuilding the whole thing from memory.

This is the difference between an agent that says “here is the fix” and a loop that opens the PR, links the Linear ticket and pings the channel once CI is green by itself. The connectors are the reason the loop can act inside your actual environment instead of just telling you what it would do if it could.

Sub-agents, keep the maker away from the checker

The most useful structural thing in a loop, by far, is splitting the one who writes from the one who checks. The model that wrote the code is way too nice grading its own homework. A second agent with different instructions and sometimes a different model catches the stuff the first one talked itself into.

Codex only spawns subagents when you ask, runs them at the same time and then folds the results back into one answer. You define your own agents as TOML files in .codex/agents/, each with a name, a description, instructions and optional model and reasoning effort, so your security reviewer can be a strong model on high effort while your explorer is some fast read-only thing. Claude Code does the same with subagents in .claude/agents/ and agent teams that pass work between them. The usual split in both is one agent explores, one implements, one verifies against the spec.

I made this case twice already, once as the code agent orchestra and once as adversarial code review. The reason it matters specifically inside a loop is the loop runs while you are not watching, so a verifier you actually trust is the only reason you can walk away. Subagents do burn more tokens since each one does its own model and tool work, so spend them where a second opinion is worth paying for. This is also basically what Claude Code's /goal does under the hood, a fresh model decides if the loop is done instead of the one that did the work, the maker and checker split applied to the stop condition itself.

What one loop looks like

Stick it together and a single thread turns into a little control panel. Here is one shape I keep using.

An automation runs every morning on the repo. Its prompt calls a triage skill that reads yesterday's CI failures, the open issues, the recent commits, and writes the findings into a markdown file or a Linear board. For each finding that is worth doing the thread opens an isolated worktree and sends a sub-agent to draft the fix, and a second sub-agent reviews that draft against the project skills and the existing tests.

Connectors let the loop open the PR and update the ticket. Anything the loop cannot handle lands in the triage inbox for me. The state file is the spine of the whole thing: it remembers what got tried, what passed, what is still open, so tomorrow morning the run picks up where today stopped.
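The control flow of that morning run can be sketched with plain functions standing in for the agents. Everything here is hypothetical scaffolding: `draft_fix` plays the maker sub-agent, `review` plays the checker, and `LoopState` plays the state file; none of it calls a real Codex or Claude Code API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopState:
    """What the loop remembers between mornings."""
    pr_opened: list[str] = field(default_factory=list)
    needs_human: list[str] = field(default_factory=list)


def run_daily_loop(
    findings: list[str],
    draft_fix: Callable[[str], str],
    review: Callable[[str, str], bool],
    state: LoopState,
) -> LoopState:
    """Draft a fix per finding, have a separate checker review it, record the outcome."""
    for finding in findings:
        if finding in state.pr_opened:
            continue  # already handled on an earlier run
        draft = draft_fix(finding)
        if review(finding, draft):
            state.pr_opened.append(finding)
        elif finding not in state.needs_human:
            state.needs_human.append(finding)  # lands in the triage inbox
    return state
```

Because the state is passed in and returned, tomorrow's run skips what already has a PR and only escalates what the checker rejected.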

And look at what you actually did there. You designed it one time. You did not prompt any of those steps. That's Steinberger's whole point made real, and it's the same loop in Codex or in Claude Code because the pieces are the same pieces.

What the loop still does not do for you

The loop changes the work, it does not delete you from it. And three problems actually get sharper as the loop gets better, not easier.

Verification is still on you. A loop running unattended is also a loop making mistakes unattended. The whole reason you split the verifier sub-agent from the maker is to make the loop's “it's done” mean something, and even then “done” is a claim and not a proof. I keep saying the same line from code review in the age of AI: your job is to ship code you confirmed works.

Your understanding still rots if you allow it. The faster the loop ships code you did not write, the bigger the gap between what exists and what you actually understand. That's comprehension debt, and a smooth loop just makes it grow faster unless you read what the loop made.

And yeah, the comfortable posture is probably the risky one. When the loop runs itself it's very tempting to stop having an opinion and just take whatever it gives back. I called that cognitive surrender. Designing the loop is the cure when you do it with judgement and the accelerant when you do it to avoid thinking: same action, opposite result.

Build the loop. Stay the engineer.

I think this is a preview of how our work is going to evolve. That said, if I weren't reviewing the code myself, or if I relied entirely on automated loops to fix it, my product's quality would suffer. I'd likely end up stuck in a downward spiral, continuously digging myself into a deeper hole.

That said, go ahead and set up your loops, but don't forget that prompting your agents directly is still effective. It's all about finding the right balance.

Loops can also result in different outcomes depending on you. Two people can build the exact same loop and get completely opposite results. One uses it to move faster on work they understand deeply. The other uses it to avoid understanding the work at all. The loop doesn't know the difference. You do.

That's what makes loop design harder than prompt engineering, not easier. Cherny's point isn’t that the work got easier. It's that the leverage point moved.

Build the loop. But build it like someone who intends to stay the engineer, not just the person who presses go.

Appendix

参考与延伸

本页结构

第一部分为中文精翻译文,第二部分保留原文,便于核对概念与语气。

正文内共保留 2 张配图(中英文各一处)。
