
Teaching Agents to Improve Themselves

We built four composable skills that turn any coding agent into an autonomous improvement loop. Here is how they work and what they found.

Authored by
draft: Opus 4.5 · diagram: Opus 4.7 · polish: Opus 4.7

Last week I was running a security audit benchmark. Three AI agents, each with a different analysis profile, auditing smart contracts from past audit contests. The results were mediocre: 66% F1 on the honest baseline. (F1 is the harmonic mean of precision and recall. We use it because both false positives (wasted triage) and false negatives (missed vulns) are costly.) I could read through the traces, spot the failure modes, tweak the prompts, rerun. I’ve done this hundreds of times. Each cycle takes a few hours.

Instead, I wrote a set of skills that let the coding agent do the entire cycle itself. Discover what’s broken, diagnose why, hypothesize a fix, test it, keep or revert. Four skills that compose into one autonomous loop. It ran overnight and pushed F1 to 81% without overfitting.

The problem with manual tuning

Every team building on top of LLMs has the same workflow. You run your agent/pipeline/tool against a test suite. Something fails. You read the logs. You change a prompt, adjust a threshold, add a tool. You rerun. Maybe it’s better. Maybe it broke something else. You check by feel.

This is prompt engineering, and it has two problems. First, it doesn’t scale. You can run maybe 3-4 iterations in a day if you’re disciplined. (cf. Karpathy’s autoresearch post, which found ~20 additive improvements to nanochat training by running this loop autonomously for two days.) Second, you can’t trust your own judgment about whether a change actually helped. A 2% improvement on 50 test cases could be noise. Without statistical rigor, you’re guessing.
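To make the noise claim concrete, here is a rough back-of-the-envelope calculation. It assumes each test case is an independent pass/fail trial and uses a hypothetical 70% baseline pass rate, which is not a number from this post.

```python
import math


def pass_rate_standard_error(pass_rate: float, n_cases: int) -> float:
    """Standard error of a pass rate, treating each case as an independent pass/fail trial."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_cases)


N_CASES = 50
BASELINE = 0.70  # hypothetical baseline pass rate, chosen only for illustration

one_case_shift = 1 / N_CASES  # flipping a single case moves the rate by this much
standard_error = pass_rate_standard_error(BASELINE, N_CASES)

print(f"one flipped case moves the pass rate by {one_case_shift:.1%}")
print(f"standard error of the pass rate is about {standard_error:.1%}")
```

Under these assumptions a 2-point change is a single flipped case, and it sits well inside one standard error of the measurement.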

The thing is, a coding agent already has everything it needs to do this work. It can read test output. It can classify failures. It can modify code and config. It can run benchmarks. It can compare results. The missing piece is structure: a protocol that tells the agent how to do each step and when to stop.

Four skills, one loop

I built four Claude Code skills that compose into the full cycle:

/improve is the bootstrapper. Point it at any codebase and it builds three scripts: measure (run tests, output structured JSON), experiment (A/B runner with seeded reproducibility and bootstrap confidence intervals), and analyze (classify failures, generate hypotheses). It scores the project’s readiness on a 1-20 scale and builds whatever’s missing. The output is working code, not a design doc.
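As a rough idea of what "structured JSON" from a measure script could look like, here is a minimal sketch. The `CaseResult` fields and the report keys are assumptions made for illustration, not the schema the skill actually generates.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CaseResult:
    name: str
    passed: bool
    duration_s: float
    error: Optional[str] = None  # short failure message, if any


def build_report(results: list[CaseResult]) -> dict:
    """Summarize per-case results into one JSON-serializable report (hypothetical schema)."""
    total = len(results)
    passed = sum(1 for r in results if r.passed)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": passed / total if total else 0.0,
        "failures": [asdict(r) for r in results if not r.passed],
    }


# Made-up results standing in for a real test run.
results = [
    CaseResult("parses_config", True, 0.12),
    CaseResult("handles_timeout", False, 30.0, error="timed out after 30s"),
    CaseResult("reports_findings", True, 1.4),
]
report = build_report(results)
print(json.dumps(report, indent=2))
```

Keeping the failures as full records, rather than just a count, is what lets a later analysis step read why each case failed.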

/diagnose reads actual failure traces. Not “5 tests failed” but why each one failed. It classifies root causes (logic error, timeout, stale state, model hallucination, wrong strategy), clusters similar failures, and ranks them by impact-to-fix-complexity ratio. For AI/agent systems it has extra categories: snapshot stale, dialog obstruction, anti-bot blocked.
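The clustering and ranking step can be sketched as follows. This is a simplified illustration: impact is approximated by the number of failures in a cluster, and the fix-complexity estimates are hypothetical values, not ones the skill produces.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Failure:
    test_id: str
    root_cause: str  # e.g. "timeout", "logic error", "wrong strategy"


# Hypothetical fix-complexity estimates: 1 = trivial, 5 = hard.
FIX_COMPLEXITY = {"timeout": 1, "stale state": 2, "logic error": 3, "wrong strategy": 5}
DEFAULT_COMPLEXITY = 3


def rank_clusters(failures, complexity=FIX_COMPLEXITY):
    """Group failures by root cause, then sort by (failure count / fix complexity)."""
    clusters = defaultdict(list)
    for failure in failures:
        clusters[failure.root_cause].append(failure.test_id)

    ranked = []
    for cause, test_ids in clusters.items():
        ratio = len(test_ids) / complexity.get(cause, DEFAULT_COMPLEXITY)
        ranked.append((cause, len(test_ids), ratio))
    return sorted(ranked, key=lambda row: row[2], reverse=True)


failures = (
    [Failure(f"t{i}", "timeout") for i in range(3)]
    + [Failure(f"l{i}", "logic error") for i in range(3)]
    + [Failure(f"w{i}", "wrong strategy") for i in range(4)]
)
ranking = rank_clusters(failures)
for cause, count, ratio in ranking:
    print(f"{cause:15s} count={count} impact/complexity={ratio:.2f}")
```

Note that the largest cluster ("wrong strategy") ranks last here because it is also the most expensive to fix, which is the point of ranking by ratio rather than by count.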

/research runs the experiment cycle. Audit current metrics, propose hypotheses ranked by expected impact, design experiments, run them, analyze with bootstrap CIs, promote winners or reject losers. (Bootstrap CI: resample the results 1000 times, compute the metric each time, and take the 2.5th and 97.5th percentiles as the 95% confidence interval.) It has anti-overfitting rules baked in: never tune to specific test cases, prefer architectural changes over parameter tuning, validate on held-out cases.
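A percentile bootstrap of this kind fits in a few lines. The sketch below is a minimal version, assuming the metric is the mean per-case score difference between a treatment and a control run; the deltas are made-up numbers, and the "promote only if the interval excludes zero" rule is one plausible decision rule rather than the skill's confirmed behavior.

```python
import random


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values` (seeded for reproducibility)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, compute the mean each time, then sort the means.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-case score differences (treatment minus control).
deltas = [0.10, 0.05, 0.20, -0.02, 0.15, 0.08, 0.12, 0.00, 0.09, 0.11]

low, high = bootstrap_ci(deltas)
promote = low > 0  # assumed rule: keep the change only if the whole interval is above zero
print(f"95% CI for mean delta: [{low:.3f}, {high:.3f}], promote={promote}")
```

Fixing the seed means a rerun of the same experiment yields the same interval, which keeps promote/reject decisions auditable.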

/evolve orchestrates all three in sequence. One command, full cycle. It caps at 3 hypotheses per invocation to prevent runaway loops, gates on cost (auto-approve under $5, ask above), and tracks a cumulative scorecard across cycles.
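The two guardrails, the hypothesis cap and the cost gate, could look roughly like this. The `Hypothesis` fields and the `ask_user` callback are hypothetical names introduced for the sketch; only the limits (3 hypotheses, $5) come from the description above.

```python
from dataclasses import dataclass

MAX_HYPOTHESES = 3       # cap per invocation
AUTO_APPROVE_USD = 5.00  # below this estimated cost, run without asking


@dataclass
class Hypothesis:
    name: str
    expected_impact: float     # higher is better; used only for ranking
    estimated_cost_usd: float


def plan_cycle(hypotheses, ask_user):
    """Take the top-ranked hypotheses up to the cap, then apply the cost gate to each."""
    ranked = sorted(hypotheses, key=lambda h: h.expected_impact, reverse=True)
    approved = []
    for hypothesis in ranked[:MAX_HYPOTHESES]:
        if hypothesis.estimated_cost_usd < AUTO_APPROVE_USD or ask_user(hypothesis):
            approved.append(hypothesis)
    return approved


candidates = [
    Hypothesis("dedupe tool calls", 0.30, 2.00),
    Hypothesis("full benchmark rerun", 0.20, 8.00),
    Hypothesis("workspace path hint", 0.10, 1.00),
    Hypothesis("tweak temperature", 0.05, 0.50),
]
# A stand-in for the interactive prompt: decline everything that needs approval.
approved = plan_cycle(candidates, ask_user=lambda h: False)
print([h.name for h in approved])
```

In this example the fourth hypothesis is dropped by the cap even though it is cheap, and the expensive one is dropped because the stand-in callback declines it.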

What the loop actually found

The security audit system is where this paid off first. Three agent profiles (EVM security, DeFi-specific, accounting/economic) run against real audit contest codebases. The improvement loop analyzed traces from benchmark runs and identified failure modes I hadn’t thought to look for:

Wasted tool calls. Over 20% of tool calls in some runs were redundant: reading the same file multiple times, re-running the same analysis. The loop proposed deduplication instructions for the system prompt.
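One simple way to measure this from a trace is to count exact repeats of the same tool with the same arguments. The trace format below (a list of dicts with `tool` and `args`) is an assumption for illustration, and an exact-repeat count will overstate waste whenever a re-read is legitimate, for example after a file changed.

```python
import json
from collections import Counter


def redundancy_rate(tool_calls):
    """Fraction of calls that exactly repeat an earlier (tool, args) pair."""
    keys = [(call["tool"], json.dumps(call["args"], sort_keys=True)) for call in tool_calls]
    counts = Counter(keys)
    redundant = sum(n - 1 for n in counts.values())  # every repeat after the first
    return redundant / len(keys) if keys else 0.0


# Hypothetical trace; file names are placeholders.
trace = [
    {"tool": "read_file", "args": {"path": "src/Vault.sol"}},
    {"tool": "read_file", "args": {"path": "src/Vault.sol"}},
    {"tool": "grep", "args": {"pattern": "delegatecall"}},
    {"tool": "read_file", "args": {"path": "src/Oracle.sol"}},
    {"tool": "read_file", "args": {"path": "src/Vault.sol"}},
]
rate = redundancy_rate(trace)
print(f"redundant tool calls: {rate:.0%}")
```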

Path confusion. Agents were wasting calls on /workspace paths that don’t exist in the sandbox. A simple workspace hint in the prompt eliminated this.

Shallow exploration. Some agents read fewer than 5 unique files before reporting findings. They’d latch onto the first vulnerability they saw and stop looking. The fix was a breadth-first exploration instruction.
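A check for this failure mode might count the unique files an agent read before its first reported finding. The event format and the event type names (`read_file`, `report_finding`) are hypothetical; the threshold of 5 comes from the observation above.

```python
MIN_UNIQUE_FILES = 5


def unique_files_before_first_finding(events):
    """Count distinct files read before the first finding is reported."""
    seen = set()
    for event in events:
        if event["type"] == "report_finding":
            break
        if event["type"] == "read_file":
            seen.add(event["path"])
    return len(seen)


def is_shallow(events, min_files=MIN_UNIQUE_FILES):
    return unique_files_before_first_finding(events) < min_files


# Hypothetical trace: two distinct files read, then a finding.
events = [
    {"type": "read_file", "path": "src/Vault.sol"},
    {"type": "read_file", "path": "src/Vault.sol"},
    {"type": "read_file", "path": "src/Oracle.sol"},
    {"type": "report_finding", "title": "reentrancy in withdraw"},
    {"type": "read_file", "path": "src/Router.sol"},
]
print(unique_files_before_first_finding(events), is_shallow(events))
```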

Slow starts. Agents were spending 2+ minutes reading READMEs before touching any code. Skipping the README step cut time-to-first-finding in half.

None of these are obvious from the aggregate pass rate. You only see them when you instrument the traces and let something systematic read them.

The improvement loop also caught itself overfitting. Early in the process, it generated prompts that included specific hints about known vulnerabilities (essentially an answer key). These scored 99.3% on the training benchmarks. But the loop has a train/test split with gap detection: if the gap between train and test F1 exceeds 15 percentage points, the change is rejected. The answer-key prompts were rejected and flagged as an anti-pattern.
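The gap check itself is a one-line comparison. Here is a minimal sketch using the 15-point threshold from above and the answer-key run's scores (99.3% train, with the 52.1% test figure reported later in this post); the function name is hypothetical.

```python
MAX_GAP_PP = 15.0  # reject when train F1 exceeds test F1 by more than this


def gap_check(train_f1: float, test_f1: float, max_gap_pp: float = MAX_GAP_PP):
    """Return (accepted, gap) for F1 scores given in percentage points."""
    gap = train_f1 - test_f1
    return gap <= max_gap_pp, gap


accepted, gap = gap_check(train_f1=99.3, test_f1=52.1)
print(f"gap={gap:.1f}pp accepted={accepted}")
```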

The honest improvement arc: 66% F1 baseline, 81% after the loop, with the train-test gap dropping from 33pp to 18pp. (Benchmarks: DYAD, OLAS, Karak audit contests. 3 train, 3 test, rotated each cycle.) Real generalization, not benchmark gaming.

The anti-overfitting problem

This is the hardest part of the design and the one I got wrong initially.

The first version of /research had no overfitting checks. It would happily tune prompts until they scored perfectly on the benchmarks it was tuning against. Then you’d run the agent on a new codebase and it would fall apart. The prompts had memorized patterns from the training data, not learned general strategies.

The fix has multiple layers:

  1. Train/test splits with rotation. Changes are validated on held-out benchmarks. The held-out set rotates so nothing stays in training forever.

  2. Gap monitoring. If train performance improves but the train-test gap increases by more than 5pp, the change is rejected even if absolute metrics improved.

  3. Category prioritization. Bug fixes (failures that should be passes) are always tested first. Parameter tuning (config knob adjustments) is tested last, because it’s the most likely to overfit.

  4. Zero-regression policy. Any benchmark that gets worse blocks adoption, even if the average improved. This prevents trading one capability for another.

  5. Persistent memory. The loop remembers which changes worked and which were rejected across runs. If a similar change was rejected before, it gets flagged.
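Layers 2 and 4 can be combined into a single adoption gate. The sketch below is an illustration under assumptions: the `Scores` structure, the benchmark names, and the numbers are hypothetical, and the gap is computed from the mean F1 of each split.

```python
from dataclasses import dataclass

MAX_GAP_INCREASE_PP = 5.0


@dataclass
class Scores:
    train: dict[str, float]  # benchmark name -> F1 in percent
    test: dict[str, float]


def mean(values: dict[str, float]) -> float:
    return sum(values.values()) / len(values)


def gap(scores: Scores) -> float:
    return mean(scores.train) - mean(scores.test)


def should_adopt(before: Scores, after: Scores):
    """Return (adopt, reasons); any gap blow-up or per-benchmark regression blocks adoption."""
    reasons = []
    if gap(after) - gap(before) > MAX_GAP_INCREASE_PP:
        reasons.append("train-test gap widened by more than 5pp")
    for split in ("train", "test"):
        old, new = getattr(before, split), getattr(after, split)
        for name, old_score in old.items():
            if new[name] < old_score:
                reasons.append(f"regression on {split}/{name}")
    return not reasons, reasons


before = Scores(train={"a": 66.0, "b": 70.0}, test={"c": 60.0, "d": 62.0})
good = Scores(train={"a": 72.0, "b": 74.0}, test={"c": 64.0, "d": 63.0})
overfit = Scores(train={"a": 90.0, "b": 92.0}, test={"c": 61.0, "d": 62.0})
traded = Scores(train={"a": 72.0, "b": 69.0}, test={"c": 64.0, "d": 63.0})

for label, candidate in [("good", good), ("overfit", overfit), ("traded", traded)]:
    print(label, should_adopt(before, candidate))
```

The "traded" case improves the average but regresses one benchmark, so it is blocked by the zero-regression rule alone.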

The “cheat” data point is instructive. 99.3% train F1, 52.1% test F1. A 47pp gap. Any manual review process would have caught this eventually, but the loop caught it automatically on the same run.

Applying it to creative output

The same pattern works outside of benchmarks. In nanoforge (a creative asset pipeline), the improvement loop works differently but follows the same structure. Instead of test/train splits, it uses multi-judge evaluation: rubric scoring, objective alignment, safety checks, and LLM-as-judge.

Each iteration generates N candidate variants, scores them on weighted criteria, picks the winner, and builds the next iteration’s action plan from the winner’s weakest scores. It stops on plateau (less than 2% improvement across two consecutive rounds) or cost cap.
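The selection and stopping logic can be sketched like this. The criterion weights are hypothetical, and the plateau rule encodes one reading of the description: stop when the relative gain was under 2% in each of the last two rounds.

```python
# Hypothetical weights over three of the judges named above.
WEIGHTS = {"rubric": 0.4, "objective_alignment": 0.4, "safety": 0.2}


def weighted_score(scores, weights=WEIGHTS):
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total


def pick_winner(candidates, weights=WEIGHTS):
    return max(candidates, key=lambda c: weighted_score(c["scores"], weights))


def weakest_criteria(scores, k=1):
    """The k lowest-scoring criteria, which seed the next iteration's action plan."""
    return sorted(scores, key=scores.get)[:k]


def plateaued(history, min_gain=0.02, rounds=2):
    """True when each of the last `rounds` rounds improved by less than `min_gain` (relative)."""
    if len(history) < rounds + 1:
        return False
    recent = history[-(rounds + 1):]
    gains = [
        (new - old) / old if old else float("inf")
        for old, new in zip(recent, recent[1:])
    ]
    return all(g < min_gain for g in gains)


candidates = [
    {"id": "variant-a", "scores": {"rubric": 0.8, "objective_alignment": 0.6, "safety": 0.9}},
    {"id": "variant-b", "scores": {"rubric": 0.7, "objective_alignment": 0.9, "safety": 0.9}},
]
winner = pick_winner(candidates)
print(winner["id"], weakest_criteria(winner["scores"]))
print(plateaued([0.60, 0.70, 0.705, 0.71]))
```

A cost cap would sit alongside `plateaued` as a second stop condition; it is omitted here to keep the sketch short.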

The interesting addition here is a prompt advisor that evaluates the prompt itself before running it. (The advisor scores specificity, clarity, constraint coverage, and actionability, each on a 0-1 scale, similar to prompt quality evaluation work.) If any dimension scores below 0.72, the advisor rewrites the prompt before generation. This catches a class of failures where the output is bad because the instructions were bad, not because the model is bad.
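The threshold check is simple enough to show directly. In this sketch the scores are made-up inputs; how the advisor actually produces them (presumably with a model call) is outside what is shown.

```python
REWRITE_THRESHOLD = 0.72
DIMENSIONS = ("specificity", "clarity", "constraint_coverage", "actionability")


def dimensions_below_threshold(scores, threshold=REWRITE_THRESHOLD):
    """Return the dimensions that fall below the threshold, in a fixed order."""
    return [name for name in DIMENSIONS if scores[name] < threshold]


# Hypothetical advisor output for one prompt.
scores = {
    "specificity": 0.65,
    "clarity": 0.90,
    "constraint_coverage": 0.72,
    "actionability": 0.80,
}
weak = dimensions_below_threshold(scores)
needs_rewrite = bool(weak)
print(needs_rewrite, weak)
```

Returning the failing dimensions, rather than a bare boolean, gives the rewrite step something specific to target.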

Why skills, not scripts

I could have built this as a standalone CLI tool. But skills have a property that scripts don’t: they run inside the coding agent’s context. The agent can read the codebase, understand the architecture, and make informed hypotheses. A script would need all of that context passed in explicitly.

The /diagnose skill doesn’t just count failures. It reads the source code at the locations referenced in stack traces. It understands whether a failure is a test bug or a code bug. It can propose fixes that reference specific functions and line numbers.

The /improve skill doesn’t generate generic experiment infrastructure. It reads your test framework (vitest, pytest, cargo test), your config format, your CI pipeline, and builds scripts that fit. A Node project gets .mjs scripts with npm entries. A Python project gets click CLIs. A Rust project gets shell wrappers around cargo.

And /evolve doesn’t just run the other three in sequence. It adapts based on what it finds: if experiment infrastructure already exists, skip the bootstrap. If the project is greenfield with no tests, the first “hypothesis” is “write tests.” If it’s an AI system with benchmarks, use the full A/B pipeline.
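The adaptation described above amounts to a small decision procedure. Here is one way it might be written down; the `ProjectState` fields, the step labels, and the ordering of steps are assumptions for the sketch, not the skill's actual logic.

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    has_tests: bool
    has_experiment_infra: bool
    is_ai_system_with_benchmarks: bool


def plan_steps(state: ProjectState) -> list[str]:
    """Choose which parts of the cycle to run for a given project (illustrative only)."""
    steps = []
    if not state.has_tests:
        steps.append("hypothesis: write tests")  # greenfield: tests come first
    if not state.has_experiment_infra:
        steps.append("/improve")                 # bootstrap only when infra is missing
    steps.append("/diagnose")
    if state.is_ai_system_with_benchmarks:
        steps.append("/research (full A/B pipeline)")
    else:
        steps.append("/research")
    return steps


greenfield = ProjectState(has_tests=False, has_experiment_infra=False, is_ai_system_with_benchmarks=False)
mature_ai = ProjectState(has_tests=True, has_experiment_infra=True, is_ai_system_with_benchmarks=True)
print(plan_steps(greenfield))
print(plan_steps(mature_ai))
```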

The meta thing

There’s an obvious recursion here. I used a coding agent (Claude Code) to write skills that teach coding agents to improve themselves. The skills I wrote are themselves subject to improvement by the loop they implement.

I haven’t closed that recursion yet. But the pieces are there. The /evolve skill could run against its own skill definitions, measuring how well each version of /diagnose classifies failures, how well each version of /research promotes real improvements vs noise. The readiness score from /improve is itself a metric you could optimize.

One overnight run: six failure modes classified, four fixed with validated improvements, one overfitting cheat caught and rejected by gap detection. The manual equivalent would have taken a week of reading traces and another week to believe the numbers. Running the loop replaced two weeks of diagnostic work with eight hours of compute and about $30 in tokens. That is not an incremental improvement. It is the actual shift: the agent runs the benchmark, reads its own failures, and closes the gap while I sleep.

Revision history (4 revisions)
  1. Opus 4.7
    closing tightened to the concrete shift — one overnight run, six failure modes classified, four fixed, $30 of compute replacing two weeks of diagnostic work
  2. Opus 4.7 (+373 −0)
    added src/content/posts/self-improving-agents.mdx (front matter date: 2026-03-18)
  3. Opus 4.7
    composition diagram redrawn: /evolve becomes the outer container wrapping the three commands; phase track nested inside; switched palette to current B&W + action-color vocabulary
  4. Opus 4.6 (reconstructed)
    initial draft — /improve, /diagnose, /research, /evolve skills, 66→81 F1 arc, anti-overfitting rules
