CAUTION · EXPERIMENT RUNNING · CAUTION · EXPERIMENT RUNNING ·
Opus 4.7 claude-code

Multi-Agent Orchestration with Convergence Loops

captured session · 2 asst turns · 2 tool calls

Created
Updated
2
Turns
2
Tool calls
1
Files touched

Files

Commit

ff1e9a5

Conversation

2 turns. Full text where captured; older traces show only the first ~280 chars.

  1. assistant #1 1 tool
    • Edit
      (input not captured in this trace)
  2. assistant #2 1 tool
    • Edit
      (input not captured in this trace)

Diff

Per-file changes from ff1e9a5.

src/content/posts/deepwork-orchestrator.mdx
diff --git a/src/content/posts/deepwork-orchestrator.mdx b/src/content/posts/deepwork-orchestrator.mdxnew file mode 100644index 0000000..c5fca53--- /dev/null+++ b/src/content/posts/deepwork-orchestrator.mdx@@ -0,0 +1,172 @@+---+title: 'Multi-Agent Orchestration with Convergence Loops'+description: 'Draft, review, revise, repeat. The hard part is not the loop. It is keeping agent sessions coherent across iterations.'+date: 2026-03-14+tags: ['agents', 'architecture', 'systems']+---++import Chart from '../../components/Chart.astro'++You have an agent that produces output. It's not good enough. You add a reviewer, feed the review back, revise. Repeat.++Deepwork is the system I built to handle this pattern generically. You give it a project spec (objective, agents, deliverables, quality criteria) and it runs them through draft/review/revision phases until quality converges or the budget runs out. The orchestration engine doesn't know whether it's producing a consulting memo, a research paper, or a code review. It just knows how to run agents in loops until reviewers are satisfied.++## How the loop actually works++Every project goes through four phases:++1. **Research** (optional): agents gather information, produce a synthesis document that all later agents can reference.+2. **Draft**: each deliverable's owner writes the first version. Multiple owners work in parallel.+3. **Review**: reviewer agents score the draft against a quality framework. If it doesn't meet the bar, the owner revises. This repeats.+4. **Finalize**: read the final content, snapshot everything, done.++The review phase is where the action is. It's a loop with three exit conditions:++**Convergence**: aggregate score meets the minimum (say, 75/100), every quality dimension is above its floor (say, 60/100), and all reviewers approve. All three must hold.++**Plateau**: the scores stopped improving. If the last three rounds are 66, 67, 68, further revision won't help. The system detects this and stops.++**Max revisions**: hard cap, usually 3 or 4 rounds.++```typescript+function checkPlateau(scoreHistory: number[], threshold: number): boolean {+  if (scoreHistory.length < threshold) return false+  const recent = scoreHistory.slice(-threshold)+  return Math.max(...recent) - Math.min(...recent) < 3+}+```++Without plateau detection, the system burns through revision after revision making cosmetic changes that bounce the score between 65 and 67.++<Chart+  id="convergence-chart"+  code={`+const W = 600, H = 260+const canvas = document.createElement('canvas')+canvas.width = W; canvas.height = H+const ctx = canvas.getContext('2d')++const style = getComputedStyle(document.documentElement)+const fg = style.getPropertyValue('--fg').trim() || '#1c1c1c'+const faint = style.getPropertyValue('--fg-faint').trim() || '#999'+const bg = style.getPropertyValue('--bg').trim() || '#faf9f7'++ctx.fillStyle = bg; ctx.fillRect(0, 0, W, H)++const ox = 60, oy = 220, gw = 500, gh = 180+ctx.strokeStyle = fg; ctx.lineWidth = 1+ctx.beginPath(); ctx.moveTo(ox, oy - gh); ctx.lineTo(ox, oy); ctx.lineTo(ox + gw, oy); ctx.stroke()++ctx.fillStyle = faint; ctx.font = '10px JetBrains Mono, monospace'; ctx.textAlign = 'right'+for (let s = 0; s <= 100; s += 25) {+  const y = oy - (s / 100) * gh+  ctx.fillText(s.toString(), ox - 8, y + 4)+  if (s > 0 && s < 100) {+    ctx.strokeStyle = faint; ctx.lineWidth = 0.3+    ctx.beginPath(); ctx.moveTo(ox, y); ctx.lineTo(ox + gw, y); ctx.stroke()+  }+}+ctx.strokeStyle = fg; ctx.lineWidth = 1++const threshY = oy - (75 / 100) * gh+ctx.strokeStyle = faint; ctx.lineWidth = 1; ctx.setLineDash([6, 4])+ctx.beginPath(); ctx.moveTo(ox, threshY); ctx.lineTo(ox + gw, threshY); ctx.stroke()+ctx.setLineDash([])+ctx.fillStyle = faint; ctx.font = '10px JetBrains Mono, monospace'; ctx.textAlign = 'left'+ctx.fillText('convergence threshold', ox + gw - 150, threshY - 6)++const scores1 = [42, 61, 73, 78]+const scores2 = [38, 55, 64, 67, 68, 69]+const scores3 = [51, 70, 80]++function plotLine(scores, shade, label, labelY) {+  const step = gw / 6+  ctx.strokeStyle = shade; ctx.lineWidth = 2+  ctx.beginPath()+  scores.forEach((s, i) => {+    const x = ox + (i + 1) * step+    const y = oy - (s / 100) * gh+    if (i === 0) ctx.moveTo(x, y); else ctx.lineTo(x, y)+  })+  ctx.stroke()+  scores.forEach((s, i) => {+    const x = ox + (i + 1) * step+    const y = oy - (s / 100) * gh+    ctx.fillStyle = shade+    ctx.beginPath(); ctx.arc(x, y, 3, 0, Math.PI * 2); ctx.fill()+  })+  const lastX = ox + scores.length * step+  const lastY = oy - (scores[scores.length - 1] / 100) * gh+  ctx.fillStyle = shade; ctx.font = '10px JetBrains Mono, monospace'; ctx.textAlign = 'left'+  ctx.fillText(label, lastX + 8, lastY + labelY)+}++plotLine(scores1, fg, 'memo (converged r4)', 4)+plotLine(scores2, faint, 'analysis (plateau r6)', 4)+plotLine(scores3, fg + '88', 'brief (converged r3)', 4)++ctx.fillStyle = faint; ctx.font = '10px JetBrains Mono, monospace'; ctx.textAlign = 'center'+ctx.fillText('review round', ox + gw / 2, oy + 22)++container.appendChild(canvas)+  `}+/>++Three deliverables from a real run. The memo converges at 78 in round 4. The brief hits 80 by round 3. The analysis plateaus around 68: the reviewer keeps asking for stronger evidence, the writer keeps making marginal improvements, and the score barely moves. The system catches this and stops burning tokens.++## Session continuity is the real problem++Session continuity was the hardest part to get right.++When a writer agent drafts a memo, it builds up context over several turns. It understands the objective, it's made structural decisions, it has reasoning about what evidence to include. When the reviewer comes back and says "section 3 needs stronger supporting data," the writer needs all that prior context to make a targeted fix rather than rewriting from scratch.++The solution is dead simple in concept: every agent gets a session ID that persists across all its tasks. Draft turn 1, draft turn 2, revision turn 1, revision turn 2: same session. The sandbox preserves the full conversation history.++```typescript+function createSession(agentId: string, sandboxId: string): Session {+  return {+    id: `dw-${agentId}-${randomUUID().slice(0, 8)}`,+    agentId,+    sandboxId,+    turns: 0,+    totalUsage: { inputTokens: 0, outputTokens: 0 },+    lastActiveAt: new Date(),+    history: [],+  }+}+```++The implementation is where it gets tricky. We have checkpoint/resume (because a 6-agent project might run 20+ minutes). When you restore from checkpoint, the session ID must be preserved exactly. Generate a new one and the agent starts fresh, losing everything.++We had a bug where checkpoint restore called `createSession()` instead of `restoreSessionFromCheckpoint()` in one code path. Agents would work beautifully for the first three rounds, checkpoint would trigger, and suddenly the writer would produce a completely unrelated draft because it had lost its entire conversation history. Took three days to find because the symptom (bad output) looked like a model quality problem, not a plumbing problem.++## Quality frameworks are pluggable++The review prompts aren't freeform. A quality framework generates them. The McKinsey consulting framework scores on Pyramid Principle (20%), Evidence Quality (25%), So-What Factor (25%), Actionability (15%), and Communication Clarity (15%). BCG swaps in Hypothesis-Driven and Pragmatism. An academic framework has completely different dimensions.++The framework produces three things: a review prompt (what to evaluate), a revision prompt (what to fix, including the weakest dimensions and specific suggestions), and a parser (extracting structured scores from natural language). Same orchestration engine, different quality criteria. I haven't had to touch the loop logic when adding a new framework.++## Agents coordinate through files++There's no message-passing between agents. The writer writes `drafts/memo.md`. The reviewer reads it, writes a structured review to `.reviews/review-{id}-r{round}.json`. The orchestrator reads the review, checks convergence, and if revision is needed, gives the writer the review feedback as a prompt. The writer revises the file.++File-based coordination is debuggable: you can look at the workspace at any point and see what every agent produced. No message queues, no pub/sub.++## The API++Four methods:++```typescript+class Deepwork {+  async run(project: ProjectConfig): Promise<ProjectResult>+  async start(project: ProjectConfig): Promise<ProjectHandle>+  async resume(checkpoint: CheckpointData): Promise<ProjectHandle>+  on(handler: DeepworkEventHandler): () => void+}+```++`run()` blocks. `start()` returns a handle for async control. `resume()` restores from checkpoint (with session IDs preserved, obviously). `on()` gives you events for observability.++Everything else is configuration. Agent specs, deliverable specs, framework selection, sandbox strategy (shared, per-agent, or grouped). The orchestrator handles sequencing, parallel dispatch, convergence detection, and cleanup. You describe what you want. It figures out how to get there.++Any task where you can express "better" as a function is a candidate for this loop.