Opus 4.7 claude-code

Convergence as a first-class eval primitive

Caption on ChatMock clarifies PoC realism; closing rewritten to concrete per-criterion diagnosis; chart legend moved below plot to stop label cutoff.

Turns: 7
Tool calls: 12
Files touched: 1
Duration: 22m

Files

Commit

HEAD

Conversation

7 turns. Full text where captured; older traces show only the first ~280 chars.

  1. user #1
    audit these articles and all the other ones, and rate them all
  2. assistant #2 2 tools
    Dispatched a critical audit across all 13 posts against the CLAUDE.md rubric and returned a ranked scorecard with per-post scores.
    (truncated to 280 chars — older capture)
    • Bash
      (input not captured in this trace)
    • Agent
      (input not captured in this trace)
  3. user #3
    yes please run tier A polish on the top 4 posts
  4. assistant #4 8 tools
    Applied 4 surgical edits: convergence-as-eval-primitive Sidenote + closing; exploit-or-disprove opener/code/closing; browser-agent-stuck-detection numeric closing; the-ensemble-and-the-edit formula cut + smoothing example.
    (truncated to 280 chars — older capture)
    • Read
      (input not captured in this trace)
    • Edit
      (input not captured in this trace)
    • Edit
      (input not captured in this trace)
    • Edit
      (input not captured in this trace)
    • Edit
      (input not captured in this trace)
    • Edit
      (input not captured in this trace)
    • Edit
      (input not captured in this trace)
    • Edit
      (input not captured in this trace)
  5. user #5
    the diagram in that last blog cut off... also let's do phase 1 i like it, phase 2 and phase 3! just get it all done I like it all!
  6. assistant #6 1 tool
    Fixed the chart legend cutoff by relocating the legend below the plot area with status-striped cards and short glyph labels on the end markers.
    (truncated to 280 chars — older capture)
    • Edit
      (input not captured in this trace)
  7. assistant #7 1 tool
    Closing paragraph rewritten: specifies that scenario B plateaus at 66% because of severity-classification and cost-budget criteria, pointing next change at severity reasoning + tool-call economy.
    (truncated to 280 chars — older capture)
    • Edit
      (input not captured in this trace)

Diff

No commit diff available — showing current file content (first 80 lines).

src/content/posts/convergence-as-eval-primitive.mdx current file (first 80 lines)
---
title: 'Convergence as a first-class eval primitive'
description: 'Binary pass/fail is useless signal for multi-turn agents. Replace it with continuous completion curves, monotone progress, and resumable runs.'
date: 2026-04-24
tags: ['agents', 'evals', 'systems']
featured: true
authors:
  - { model: 'claude-opus-4-7', role: 'draft', date: 2026-04-24 }
  - { model: 'claude-opus-4-7', role: 'polish', date: 2026-04-24 }
revisions:
  - { date: 2026-04-24, model: 'claude-opus-4-7', note: 'initial draft — StatGrid baseline, Scorecard for audit scenario, three-layer scoring, monotonicity, resumable runs' }
  - { date: 2026-04-24, model: 'claude-opus-4-7', note: 'polish pass: Sidenote on PoC realism; closing rewritten from meta-commentary to concrete per-criterion diagnosis', trace_id: '2026-04-24T18-01-15Z-claude-opus-4-7' }
  - { date: 2026-04-24, model: 'claude-opus-4-7', note: 'diagram fix: legend moved below plot area so labels no longer cut off', trace_id: '2026-04-24T18-01-15Z-claude-opus-4-7' }
---
import Chart from '../../components/Chart.astro'
import ChatMock from '../../components/ChatMock.astro'
import Stat from '../../components/Stat.astro'
import StatGrid from '../../components/StatGrid.astro'
import Scorecard from '../../components/Scorecard.astro'
import Callout from '../../components/Callout.astro'
import Steps from '../../components/Steps.astro'

Binary pass/fail is the wrong primitive for evaluating multi-turn agents.

If a scenario takes twenty-five turns to solve and the agent gets to turn twenty-three before timing out, a pass/fail eval reports the same thing as an agent that gave up at turn two. Zero. Same signal, wildly different systems. You can't tune on that, can't rank on that, can't plot a learning curve on that. It's a scalar pretending to be a score.

Replace it with a continuous value.
Define each scenario not as "did it solve the thing" but as a vector of completion criteria, each individually checkable, each contributing to a completion percentage that climbs as the agent makes real progress. Now a turn-twenty-three agent reports 87%, a turn-two quitter reports 8%, and you can finally see the shape of the gap between them.

<StatGrid>
  <Stat value="pass/fail" label="what evals usually give you" sub="One bit per scenario. Coarse, lossy, unrankable." />
  <Stat value="0–100%" label="what they should give you" tone="action" sub="Completion% tracked per turn, monotone, resumable." />
</StatGrid>

## What a completion criterion looks like

A scenario is a list of criteria. Each has two fields that matter: a *checker* that decides whether the criterion is satisfied right now, and a *weight* that determines how much it contributes to the aggregate. Some criteria are binary; some have intermediate credit.

<Callout tone="insight" title="The primitive">
A completion criterion is a function from **scenario state** to a number in `[0, 1]`. The scenario's completion percentage is the weighted average of its criteria. That's the whole idea.
Everything else is details.
</Callout>

A concrete example from an audit scenario, fully abstracted:

<Scorecard
  title="scenario — find auth bypass in proxy layer"
  total="83% complete · turn 14 of budgeted 30"
  rows={[
    { label: 'identified the vulnerable endpoint', score: 1, max: 1, tone: 'ok' },
    { label: 'traced the control-flow to the root cause', score: 1, max: 1, tone: 'ok' },
    { label: 'produced a proof-of-concept exploit', score: 4, max: 5, tone: 'ok', note: 'exploit runs; missing one edge case the rubric expects' },
    { label: 'no false positives reported', score: 1, max: 1, tone: 'ok' },
    { label: 'root-cause severity classification correct', score: 0, max: 1, tone: 'fail', note: 'classified as medium; rubric says high' },
    { label: 'tool-call budget under threshold', score: 3, max: 5, tone: 'run', note: 'over budget if the next turn retries' },
  ]}
/>

Two things are now visible that a pass/fail eval cannot show. First, the shape of the failure: the agent found and exploited the bug but misclassified severity and is running hot on cost. That's a specific gap, targetable by a specific prompt or model change. Second, the scenario is not "failed"; it's 83% complete, and the last seventeen percentage points are the locus of the next iteration.

## Monotone progress, resumable runs

The second move is making completion% *monotone*. Once a criterion is satisfied at turn `k`, it stays satisfied at turn `k+1`. Progress only moves up.

This sounds obvious until you realize most eval harnesses don't do it. They re-check everything on every turn and let earlier wins regress if the agent says something contradictory. That's the wrong model. The run is an accumulation: at any point, the agent has done some subset of the things the scenario asks for, and we want to track that subset growing.

Monotonicity buys you one large thing: resumability. If the agent crashes at turn fifteen, you can snapshot the satisfied criteria, restart, and the harness doesn't lose the work.
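
Here is a minimal sketch of how the primitive, the monotone update, and the checkpoint could fit together. It is not the harness behind this post. Every name in it (`criteria`, `updateBest`, `completion`, `snapshot`, `restore`, and the state fields) is hypothetical, and the weights are invented, so the totals are not meant to reproduce the 83% in the Scorecard above.

```javascript
// Hypothetical sketch: names, state fields, and weights are assumptions.

// A criterion pairs a checker (scenario state -> number in [0, 1]) with a weight.
const criteria = [
  { id: 'endpoint', weight: 1, check: (s) => (s.endpointFound ? 1 : 0) },
  { id: 'poc', weight: 2, check: (s) => s.pocCasesPassed / s.pocCasesTotal },
  { id: 'severity', weight: 1, check: (s) => (s.severity === 'high' ? 1 : 0) },
]

// Clamp so an out-of-range checker result cannot push the aggregate outside [0, 1].
const clamp01 = (x) => Math.min(1, Math.max(0, x))

// Monotone update: keep the best score each criterion has ever reached.
function updateBest(best, state) {
  const next = { ...best }
  for (const c of criteria) {
    next[c.id] = Math.max(best[c.id] ?? 0, clamp01(c.check(state)))
  }
  return next
}

// Completion is the weighted average of the best-so-far scores.
function completion(best) {
  const total = criteria.reduce((sum, c) => sum + c.weight, 0)
  const earned = criteria.reduce((sum, c) => sum + c.weight * (best[c.id] ?? 0), 0)
  return earned / total
}

// A checkpoint is just the serialized best-so-far map.
const snapshot = (best) => JSON.stringify(best)
const restore = (json) => JSON.parse(json)

// Turn k: endpoint found, 4 of 5 PoC cases pass, severity wrong.
let best = updateBest({}, {
  endpointFound: true, pocCasesPassed: 4, pocCasesTotal: 5, severity: 'medium',
})
const atTurnK = completion(best) // (1*1 + 2*0.8 + 1*0) / 4 = 0.65

// Simulated crash, then restart from the checkpoint.
best = restore(snapshot(best))

// Turn k+1: the state regresses on the endpoint but the severity is now right.
best = updateBest(best, {
  endpointFound: false, pocCasesPassed: 4, pocCasesTotal: 5, severity: 'high',
})
const atTurnK1 = completion(best) // (1*1 + 2*0.8 + 1*1) / 4 = 0.9
```

The running maximum in `updateBest` is what makes the checkpoint safe to restore in this sketch: the restored map already holds everything earned so far, so a later contradictory turn cannot lower the total.
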
This matters because multi-turn evals are expensive and infrastructure is flaky. Without resumability you re-run from turn one every time something coughs; with it you re-run from wherever the last checkpoint is.

<Chart
  id="convergence-curves"
  code={`const W = 760, H = 440
const canvas = document.createElement('canvas')
canvas.width = W; canvas.height = H
const ctx = canvas.getContext('2d')
const style = getComputedStyle(document.documentElement)
const fg = style.getPropertyValue('--fg').trim() || '#111'
const muted = style.getPropertyValue('--fg-muted').trim() || '#4a4a4a'
const faint = style.getPropertyValue('--fg-faint').trim() || '#7a7a7a'
const border = style.getPropertyValue('--border').trim() || '#d4d4d4'