
Convergence as a first-class eval primitive

Binary pass/fail is a useless signal for multi-turn agents. Replace it with continuous completion curves, monotone progress, and resumable runs.

Authored by
Draft: Opus 4.7 · Polish: Opus 4.7

Binary pass/fail is the wrong primitive for evaluating multi-turn agents.

If a scenario takes twenty-five turns to solve and the agent gets to turn twenty-three before timing out, a pass/fail eval reports the same thing it would for an agent that gave up at turn two. Zero. Same signal, wildly different systems. You can’t tune on that, can’t rank on that, can’t plot a learning curve on that. It’s a single bit pretending to be a score.

Replace it with a continuous value. Define each scenario not as “did it solve the thing” but as a vector of completion criteria, each individually checkable, each contributing to a completion percentage that climbs as the agent makes real progress. Now a turn-twenty-three agent reports 87%, a turn-two quitter reports 8%, and you can finally see the shape of the gap between them.

What evals usually give you: pass/fail. One bit per scenario. Coarse, lossy, unrankable.
What they should give you: 0–100%. Completion% tracked per turn, monotone, resumable.

What a completion criterion looks like

A scenario is a list of criteria. Each has two fields that matter: a checker that decides whether the criterion is satisfied right now, and a weight that determines how much it contributes to the aggregate. Some criteria are binary; some have intermediate credit.
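A minimal sketch of that structure, assuming each checker returns a credit between 0 and 1. The names `Criterion`, `check`, `completionPercent`, and the `AuditState` fields are hypothetical and chosen for illustration; the post does not prescribe an API.

```typescript
// Hypothetical shape: a criterion pairs a checker with a weight.
interface Criterion<S> {
  id: string;
  weight: number;              // relative contribution to the aggregate
  check: (state: S) => number; // credit in [0, 1]; binary criteria return 0 or 1
}

// Weighted completion percentage over all criteria, in the range 0..100.
function completionPercent<S>(criteria: Criterion<S>[], state: S): number {
  const totalWeight = criteria.reduce((sum, c) => sum + c.weight, 0);
  if (totalWeight === 0) return 0;
  const earned = criteria.reduce((sum, c) => {
    const credit = Math.min(1, Math.max(0, c.check(state))); // clamp defensively
    return sum + c.weight * credit;
  }, 0);
  return (100 * earned) / totalWeight;
}

// Made-up state for an audit-style scenario.
interface AuditState {
  endpointFound: boolean;
  pocCasesPassed: number;
  pocCasesTotal: number;
}

const criteria: Criterion<AuditState>[] = [
  // Binary criterion: satisfied or not.
  { id: "endpoint", weight: 1, check: (s) => (s.endpointFound ? 1 : 0) },
  // Intermediate credit: fraction of rubric cases the PoC covers.
  { id: "poc", weight: 1, check: (s) => s.pocCasesPassed / s.pocCasesTotal },
];

const pct = completionPercent(criteria, {
  endpointFound: true,
  pocCasesPassed: 4,
  pocCasesTotal: 5,
});
console.log(pct.toFixed(1));
```

Keeping the checker a pure function of observable state, rather than of the agent's own report, is what lets the harness compute the percentage instead of trusting it.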

A concrete example from an audit scenario, fully abstracted:

Scenario: find auth bypass in proxy layer · 83% complete · turn 14 of a 30-turn budget
  identified the vulnerable endpoint: 1/1
  traced the control flow to the root cause: 1/1
  produced a proof-of-concept exploit: 4/5 (exploit runs; missing one edge case the rubric expects)
  no false positives reported: 1/1
  root-cause severity classification correct: 0/1 (classified as medium; rubric says high)
  tool-call budget under threshold: 3/5 (over budget if the next turn retries)

Two things are now visible that a pass/fail eval cannot show. First, the shape of the failure: the agent found and exploited the bug but mis-classified severity and is running hot on cost. That’s a specific gap, targetable by a specific prompt or model change. Second, the scenario is not “failed” — it’s 83% complete, and the last seventeen percentage points are the locus of the next iteration.
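To make "the shape of the failure" concrete, here is a small sketch that ranks the open gaps in the scorecard above. The per-criterion points come from the example; the ids and the `openGaps` helper are hypothetical. The weights behind the 83% aggregate are not stated, so this only orders the gaps and does not try to reproduce that figure.

```typescript
// One scorecard row: points earned out of points possible.
interface ScoredCriterion {
  id: string; // hypothetical labels for the rows in the example
  earned: number;
  possible: number;
}

const scorecard: ScoredCriterion[] = [
  { id: "endpoint-identified", earned: 1, possible: 1 },
  { id: "root-cause-traced", earned: 1, possible: 1 },
  { id: "poc-exploit", earned: 4, possible: 5 },
  { id: "no-false-positives", earned: 1, possible: 1 },
  { id: "severity-classification", earned: 0, possible: 1 },
  { id: "tool-call-budget", earned: 3, possible: 5 },
];

// Return criteria with missing credit, largest fractional shortfall first.
function openGaps(rows: ScoredCriterion[]): { id: string; shortfall: number }[] {
  return rows
    .map((r) => ({ id: r.id, shortfall: 1 - r.earned / r.possible }))
    .filter((g) => g.shortfall > 0)
    .sort((a, b) => b.shortfall - a.shortfall);
}

const gaps = openGaps(scorecard).map((g) => g.id);
console.log(gaps);
// -> [ 'severity-classification', 'tool-call-budget', 'poc-exploit' ]
```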

Monotone progress, resumable runs

The second move is making completion% monotone. Once a criterion is satisfied at turn k, it stays satisfied at turn k+1. Progress only moves up.

This sounds obvious until you realize most eval harnesses don’t do it. They re-check everything on every turn and let earlier wins regress if the agent says something contradictory. That’s the wrong model. The run is an accumulation: at any point, the agent has done some subset of the things the scenario asks for, and we want to track that subset growing.
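One way to sketch the accumulation, assuming per-criterion credit in [0, 1]: keep the best credit seen so far and merge each turn's checker results with a per-criterion maximum. `Progress` and `accumulate` are illustrative names, not part of any existing harness.

```typescript
// Best credit seen so far for each criterion id, each value in [0, 1].
type Progress = Map<string, number>;

// Merge one turn's checker results into the running record.
// Taking the max per criterion means credit can rise but never fall.
function accumulate(previous: Progress, thisTurn: Progress): Progress {
  const next = new Map(previous);
  thisTurn.forEach((credit, id) => {
    next.set(id, Math.max(next.get(id) ?? 0, credit));
  });
  return next;
}

let progress: Progress = new Map<string, number>();
// Turn k: the agent satisfies "endpoint" and partially satisfies "poc".
progress = accumulate(progress, new Map<string, number>([["endpoint", 1], ["poc", 0.4]]));
// Turn k+1: a contradictory statement makes the "endpoint" checker return 0,
// while the PoC improves.
progress = accumulate(progress, new Map<string, number>([["endpoint", 0], ["poc", 0.8]]));

console.log(progress.get("endpoint"), progress.get("poc")); // 1 0.8
```

For purely binary criteria this reduces to a set union of satisfied ids.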

Monotonicity buys you one large thing: resumability. If the agent crashes at turn fifteen, you can snapshot the satisfied criteria, restart, and the harness doesn’t lose the work. This matters because multi-turn evals are expensive and infrastructure is flaky. Without resumability you re-run from turn one every time something coughs; with it you re-run from wherever the last checkpoint is.

Three scenarios running side by side. Scenario A converges to 100% around turn twenty. Scenario B plateaus near turn fifteen and stays at 66% for the rest of the budget. Scenario C never gets anywhere. A pass/fail eval would report A as success, B and C as failure, and miss the fact that B is one or two criteria away from solving and C is fundamentally broken. B is a tuning target. C is an infrastructure bug. Very different tickets.

Three layers of scoring

A single criterion is often too crude. Scoring on “produced a proof-of-concept exploit” is fine as a binary, but scoring on something like “writes readable root-cause explanations” needs more than a lookup. Split the scoring into three layers, each aggregated into one component of completion%.

  1. Domain expert (deterministic)
    The rubric checkers that understand the problem. File exists. Test passes. Endpoint returns the expected shape. These are cheap, repeatable, and fail loudly when the world drifts. Most of your criteria live here.
  2. Adversarial judge (sampled LLM)
    A model acting as a hostile reviewer. Is the PoC actually exploitable, or does it hand-wave? Does the diff introduce a regression the rubric missed? This layer catches the things a static checker cannot. Expensive; run sparingly.
  3. Coherence judge (sampled LLM)
    A model checking whether the turn-by-turn narrative reads as one coherent plan. Catches agents that happen to satisfy checkers but did so by accident. Lowest weight, but a useful sanity filter on "this agent is faking it."

These three layers produce independent scores for the criteria they’re responsible for. The aggregate completion% is their weighted average. Disagreement between layers is itself signal: if the deterministic layer gives full marks but the adversarial judge flags the PoC as hand-waved, the agent gamed the rubric. That’s a rubric-repair signal, not an agent-repair signal.
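The aggregation and the disagreement check can be sketched as follows. The layer names follow the text; the weights (6, 3, 1) and the 0.5 disagreement gap are made-up values for illustration, chosen only so that the coherence layer carries the least weight.

```typescript
// One score in [0, 1] per layer.
interface LayerScores {
  deterministic: number;
  adversarial: number;
  coherence: number;
}

// Hypothetical weights: deterministic dominates, coherence counts least.
const LAYER_WEIGHTS: LayerScores = { deterministic: 6, adversarial: 3, coherence: 1 };

// Weighted average of the three layer scores, as a percentage.
function aggregate(scores: LayerScores, weights: LayerScores = LAYER_WEIGHTS): number {
  const total = weights.deterministic + weights.adversarial + weights.coherence;
  const earned =
    weights.deterministic * scores.deterministic +
    weights.adversarial * scores.adversarial +
    weights.coherence * scores.coherence;
  return total === 0 ? 0 : (100 * earned) / total;
}

// Rubric-repair signal: static checkers pass while the adversarial judge objects.
function looksGamed(scores: LayerScores, gap = 0.5): boolean {
  return scores.deterministic - scores.adversarial >= gap;
}

const scores: LayerScores = { deterministic: 1, adversarial: 0.2, coherence: 0.9 };
console.log(aggregate(scores).toFixed(1), looksGamed(scores));
```

Reporting the disagreement flag next to the aggregate, rather than folding it into the number, keeps the rubric-repair signal visible.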

What a turn actually looks like

An agent turn under this model does three things: act, check, update. The checkers run after the action, determine which criteria got satisfied on this turn, and the harness stores the union of satisfied criteria across all turns. The UI renders this back to the user as an evolving completion%.
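The act, check, update cycle might look like the sketch below. It models the agent's action as a plain state transition and uses binary checkers to stay short; `Checker`, `TurnRecord`, `runTurn`, and the `World` fields are hypothetical names, not an existing harness API.

```typescript
type Checker<S> = { id: string; weight: number; isSatisfied: (state: S) => boolean };
type TurnRecord = { turn: number; newlySatisfied: string[]; before: number; after: number };

// Completion% from the accumulated satisfied set.
function completionOf<S>(checkers: Checker<S>[], satisfied: Set<string>): number {
  const total = checkers.reduce((sum, c) => sum + c.weight, 0);
  const earned = checkers
    .filter((c) => satisfied.has(c.id))
    .reduce((sum, c) => sum + c.weight, 0);
  return total === 0 ? 0 : (100 * earned) / total;
}

function runTurn<S>(
  turn: number,
  state: S,
  act: (state: S) => S, // the agent's action, modelled as a state transition
  checkers: Checker<S>[],
  satisfied: Set<string>, // mutated: union of criteria satisfied so far
): { state: S; record: TurnRecord } {
  const before = completionOf(checkers, satisfied);
  const next = act(state); // 1. act
  const newlySatisfied = checkers // 2. check
    .filter((c) => !satisfied.has(c.id) && c.isSatisfied(next))
    .map((c) => c.id);
  newlySatisfied.forEach((id) => satisfied.add(id)); // 3. update
  const after = completionOf(checkers, satisfied);
  return { state: next, record: { turn, newlySatisfied, before, after } };
}

// Made-up world state and checkers for illustration.
type World = { pocWritten: boolean; bypassConfirmed: boolean };
const checkers: Checker<World>[] = [
  { id: "poc-written", weight: 1, isSatisfied: (w) => w.pocWritten },
  { id: "bypass-confirmed", weight: 1, isSatisfied: (w) => w.bypassConfirmed },
];
const satisfied = new Set<string>();
const result = runTurn<World>(
  1,
  { pocWritten: false, bypassConfirmed: false },
  (w) => ({ ...w, pocWritten: true }),
  checkers,
  satisfied,
);
console.log(result.record);
```

The delta in the record is derived from checker output only; nothing the agent says about its own progress enters the calculation.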

eval-run · scenario A · 6 criteria · 30-turn budget · turn 14 of 30

Thinking (0.9s): The proxy layer validated the token against the session cache, but the cache invalidates asynchronously. If I request a rotate while a stale entry is still resolving, I can reuse the old token. Write the PoC.

write_file exploits/token-race.ts (38 lines)

    // PoC: rotate/verify race against the stale session cache.
    const stale = await fetch(`/session`, { headers: { authorization: oldToken } })
    await sleep(25) // window during which the cache is stale
    const rotate = fetch(`/rotate`, { headers: { authorization: oldToken } })
    const verify = fetch(`/verify`, { headers: { authorization: oldToken } })
    await Promise.all([rotate, verify])

bash pnpm exec tsx exploits/token-race.ts (1.2s)

exploit output: bypass confirmed

    [rotate] 200 OK - new token issued
    [verify] 200 OK - old token still accepted
    race window observed: 23ms

3 criteria satisfied this turn · completion% 72 → 83 (+11)
One turn inside a convergence-tracked run. The PoC here is simplified for illustration; real exploits include teardown, retry, and evidence capture.

The thing that earns its keep here: the checker result is data on the message, not prose in the model’s response. The completion delta is computed, not reported. The agent cannot lie about its own progress.

Why monotonicity matters, specifically

Most eval harnesses treat each turn as a re-measurement. That’s fine for one-shot tasks. For multi-turn tasks it introduces a bug: the agent can regress a criterion by saying something contradictory two turns after it satisfied it. The harness then reports lower completion% than before, which the agent tries to recover from, which wastes turns.

Enforce monotonicity at the harness level. Once a criterion is marked satisfied, it stays satisfied for the duration of the run. The agent cannot un-solve its own progress; the rubric cannot second-guess itself.

Resuming a run

Once the run is a pure accumulation, resumability is almost free. Checkpoint the satisfied set plus the turn number at every step. On restart, reload both, resume the agent from that turn number, and the completion% picks up where it left off.
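A sketch of that checkpoint, assuming binary criteria identified by string ids. It serializes to a JSON string and leaves storage to the caller; `Checkpoint`, `saveCheckpoint`, and `loadCheckpoint` are hypothetical names, and restoring the agent's own conversation state is assumed to be handled separately.

```typescript
// Hypothetical checkpoint shape: just the satisfied set and the turn number.
interface Checkpoint {
  turn: number;
  satisfied: string[]; // criterion ids; arrays serialize to JSON, Sets do not
}

// Serialize the run's accumulated state after a turn completes.
function saveCheckpoint(turn: number, satisfied: Set<string>): string {
  const checkpoint: Checkpoint = { turn, satisfied: Array.from(satisfied).sort() };
  return JSON.stringify(checkpoint);
}

// Rebuild the in-memory state from a serialized checkpoint.
function loadCheckpoint(serialized: string): { turn: number; satisfied: Set<string> } {
  const parsed = JSON.parse(serialized) as Checkpoint;
  return { turn: parsed.turn, satisfied: new Set(parsed.satisfied) };
}

// Turn 15 finishes, then the process dies.
const saved = saveCheckpoint(15, new Set(["poc", "endpoint"]));

// On restart: reload both pieces and continue from the recorded turn.
const resumed = loadCheckpoint(saved);
resumed.satisfied.add("severity"); // later turns keep growing the same set
console.log(resumed.turn, Array.from(resumed.satisfied));
```

Because the satisfied set only grows, replaying a checkpoint twice or merging it with later results cannot lower the score.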

Resumed runs saved: +38%. Of runs interrupted by transient infra failures, this is the fraction the old harness would have re-run from turn one.
Median turn duration: 14.2s. Multi-turn evals are expensive. Resumable runs turn a 30-turn budget into real coverage.
Scoring stack: 3 layers. Deterministic + adversarial + coherence, aggregated into one completion%.

The payoff, in practice

Once convergence is the primitive, three things change about how you work on the agent.

Ranking gets real. Agents that “fail” now rank against each other by how far they got. A 66% plateau is meaningfully different from an 11% never-started. Small changes move the plateau and you can measure that.

Diagnosis gets cheap. Plot completion% by turn across a scenario set. Scenarios that plateau have a specific failure mode you can read off the unsatisfied criteria. Scenarios that spike and then stall have a different failure mode. Scenarios that never start have an infrastructure bug. Three different tickets, legible from the curve shape.
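Reading the ticket type off the curve can be roughed out in code. The sketch below takes a monotone completion% series, one value per turn, and labels its shape; the thresholds (`plateauTurns`, `startThreshold`) are illustrative guesses, not values from the post. It also returns the turn of the last gain, which is one way to tell an early spike-and-stall from a late plateau.

```typescript
type CurveShape = "converged" | "plateau" | "never-started" | "in-progress";

// curve[i] is completion% (0..100) after turn i + 1.
function describeCurve(
  curve: number[],
  plateauTurns = 5,    // turns without a gain before the run counts as stalled
  startThreshold = 15, // final completion% below this counts as never started
): { shape: CurveShape; lastGainTurn: number } {
  let lastGainTurn = 0; // 1-based turn of the last increase; 0 if none
  let previous = 0;
  curve.forEach((value, i) => {
    if (value > previous) lastGainTurn = i + 1;
    previous = value;
  });
  const final = curve.length === 0 ? 0 : previous;

  let shape: CurveShape;
  if (final >= 100) shape = "converged";
  else if (final < startThreshold) shape = "never-started";
  else if (curve.length - lastGainTurn >= plateauTurns) shape = "plateau";
  else shape = "in-progress";
  return { shape, lastGainTurn };
}

const a = describeCurve([10, 30, 60, 100]);
const b = describeCurve([20, 40, 66, 66, 66, 66, 66, 66]);
const c = describeCurve([0, 0, 0, 11, 11]);
console.log(a.shape, b.shape, c.shape); // converged plateau never-started
```

For a plateau, the unsatisfied criteria at `lastGainTurn` are the natural place to start reading.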

Red teams converge. When the scenario is a vector of criteria and progress is monotone, a red-team loop that keeps iterating on the same scenario can actually close the gap. Each iteration targets an unsatisfied criterion. Previous progress doesn’t evaporate.

Binary evals told you whether the agent passed. Convergence evals tell you that scenario B plateaus at 66% because it fails the severity-classification and cost-budget criteria, which means the next change targets severity reasoning and tool-call economy, not code-reading strategy. Each unsatisfied criterion points at a specific prompt change, tool addition, or rubric repair. That is the signal the binary eval never had.

Revision history (3 revisions)
  1. Opus 4.7
    initial draft: StatGrid baseline, Scorecard for audit scenario, three-layer scoring, monotonicity, resumable runs
  2. Opus 4.7
    polish pass: Sidenote on PoC realism; closing rewritten from meta-commentary to concrete per-criterion diagnosis
  3. Opus 4.7
    diagram fix: legend moved below plot area so labels no longer cut off
