Exploit-or-Disprove: Adversarial Validation of Security Findings
Automated security auditing produces false positives. The fix is a second agent whose only job is to write a working exploit or downgrade the finding.
The first audit I shipped had 24 findings. Four were real vulnerabilities with exploitable paths. Twenty were theoretical patterns the auditor latched onto without checking whether the execution context actually allowed the exploit. The human reviewer had to read all twenty-four to find the four. That ratio is the problem automated security auditing needs to solve before it is useful.
The fix is not a smarter auditor. The fix is a second agent whose only job is to write a working proof-of-concept exploit or, failing that, downgrade the finding as unproven.
The protocol
When the primary auditor produces findings, any finding rated High or Critical triggers a validation pass:
- A specialized agent (the “pentester”) receives the finding + the codebase
- It attempts to write a proof-of-concept exploit, a test that demonstrates the vulnerability
- If the PoC compiles, runs, and demonstrates the claimed impact → confirmed
- If after a full attempt the agent cannot produce a working PoC → downgraded to Medium or Low
- Unproven findings are explicitly tagged [unproven] in the final report
This is adversarial by design. The auditor’s incentive is to find issues. The pentester’s incentive is to prove or disprove them. The final report reflects the intersection of both perspectives.
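The steps above can be sketched as a small triage loop. This is a minimal sketch, not the pipeline's actual code: the finding shape (`id`, `severity`, `title`), the `attemptExploit` callback standing in for the pentester agent, and the `status` values are all assumptions made for illustration.

```javascript
// Severities that trigger a validation pass, per the protocol above.
const NEEDS_VALIDATION = new Set(["High", "Critical"]);

// `attemptExploit` is a hypothetical stand-in for the pentester agent.
// It is assumed to return { pocWorks: boolean } for a given finding.
function validateFindings(findings, attemptExploit, downgradeTo = "Medium") {
  const report = [];
  for (const finding of findings) {
    if (!NEEDS_VALIDATION.has(finding.severity)) {
      // Medium/Low findings skip validation and are reported as-is.
      report.push({ ...finding, status: "pass-through" });
      continue;
    }
    const { pocWorks } = attemptExploit(finding);
    if (pocWorks) {
      report.push({ ...finding, status: "confirmed" });
    } else {
      // No working PoC: lower the severity and tag the title.
      report.push({
        ...finding,
        severity: downgradeTo,
        status: "unproven",
        title: `[unproven] ${finding.title}`,
      });
    }
  }
  return report;
}
```

Keeping the pentester behind a callback means the loop can be exercised with a stub before any agent is wired in.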
Deduplication via similarity scoring
Before validation, we need to merge duplicate findings. Multiple auditor agents working in parallel will often flag the same issue in slightly different words. Naive exact matching misses these. We use a combination of Jaccard similarity on token sets and n-gram overlap:
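$$
\text{sim}(a, b) = \alpha \cdot J(T_a, T_b) + (1 - \alpha) \cdot \frac{|N_a \cap N_b|}{|N_a \cup N_b|}
$$

where $T$ is the token set, $N$ is the set of character n-grams (typically $n = 3$), $J$ is the Jaccard index, and $\alpha = 0.5$ balances the two. Findings with $\text{sim} > 0.6$ within the same vulnerability category are merged.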
The category constraint is important: a SQL injection finding and an XSS finding might share boilerplate language about “user input” and “sanitization.” Without category gating, they’d incorrectly merge.
// merge only if sim crosses threshold AND categories agree
if (sim(a, b) > 0.6 && a.category === b.category) merge(a, b)
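For a fuller picture, here is one way the `sim` call in that check could be implemented. It is a sketch under assumptions: findings carry a `text` field, tokens are lowercased alphanumeric runs, the two terms are weighted equally (`alpha = 0.5`) over character trigrams, and duplicates are merged greedily into the first match. None of those details are confirmed as the production behavior.

```javascript
// Lowercased word tokens; splitting on non-alphanumerics is an assumption.
function tokenSet(text) {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

// Character n-grams over the lowercased, whitespace-collapsed text.
function ngramSet(text, n = 3) {
  const s = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set();
  for (let i = 0; i + n <= s.length; i++) grams.add(s.slice(i, i + n));
  return grams;
}

// Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const x of a) if (b.has(x)) shared++;
  return shared / (a.size + b.size - shared);
}

// Weighted blend of token-level and n-gram-level overlap.
function sim(a, b, alpha = 0.5) {
  const tokenScore = jaccard(tokenSet(a.text), tokenSet(b.text));
  const ngramScore = jaccard(ngramSet(a.text), ngramSet(b.text));
  return alpha * tokenScore + (1 - alpha) * ngramScore;
}

// Greedy dedup: fold each finding into the first kept finding that
// shares its category and scores above the threshold.
function dedupe(findings, threshold = 0.6) {
  const kept = [];
  for (const f of findings) {
    const dup = kept.find((k) => k.category === f.category && sim(k, f) > threshold);
    if (dup) dup.mergedIds.push(f.id);
    else kept.push({ ...f, mergedIds: [f.id] });
  }
  return kept;
}
```

The token term rewards shared vocabulary, while the trigram term tolerates small wording differences such as plurals or inflections, which is why blending the two is more forgiving than either alone.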
In a typical audit of a DeFi protocol: 24 raw findings → 16 after dedup → 8 high/critical sent to validation → 4 confirmed with working PoCs, 3 downgraded (unproven), 1 rejected outright. The 8 medium/low findings pass through without validation because the cost of validating everything isn’t worth it.
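The counts in that funnel are easy to sanity-check. The snippet below only re-adds the numbers quoted above; the object and function names are made up for the example.

```javascript
// Counts quoted in the paragraph above.
const funnel = {
  raw: 24,
  deduped: 16,
  validated: 8,   // high/critical sent to the pentester
  confirmed: 4,
  unproven: 3,
  rejected: 1,
  passThrough: 8, // medium/low, not validated
};

function checkFunnel(f) {
  return {
    // Findings folded into another finding during dedup.
    mergedAway: f.raw - f.deduped,
    // Every validated finding ends in exactly one outcome.
    validationBalanced: f.confirmed + f.unproven + f.rejected === f.validated,
    // Every deduplicated finding is either validated or passed through.
    dedupBalanced: f.validated + f.passThrough === f.deduped,
    // Share of high/critical findings that survived validation.
    confirmedShare: f.confirmed / f.validated,
  };
}
```

Both balance checks hold, eight findings were merged as duplicates, and half of the high/critical findings came out with a working PoC.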
Why this works
The key insight is that writing an exploit is a fundamentally different task than identifying a vulnerability. The auditor reasons abstractly about code paths and invariants. The pentester has to make something compile and run. Many “vulnerabilities” that look plausible in abstract reasoning fall apart when you try to construct actual calldata that triggers them.
This is especially true for reentrancy in modern Solidity. The auditor sees a state change after an external call and flags it, but the actual contract might have a reentrancy guard, or the callback context might not allow the reentrant path. The pentester discovers this by trying and failing.
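To make the "trying and failing" part concrete, here is a toy in-memory model of that situation. It is plain JavaScript, not Solidity, and every name in it (`Vault`, `attemptReentrancy`, the `guarded` flag) is hypothetical. Both vaults contain the pattern an auditor would flag, a state change after an external call, but only the unguarded one yields a working PoC.

```javascript
// Toy vault: pays out through a callback BEFORE zeroing the balance.
class Vault {
  constructor({ guarded }) {
    this.guarded = guarded;
    this.locked = false;
    this.balances = new Map();
  }

  deposit(who, amount) {
    this.balances.set(who, (this.balances.get(who) || 0) + amount);
  }

  withdraw(who, onReceive) {
    // Reentrancy guard: refuse to enter while a withdrawal is in flight.
    if (this.guarded && this.locked) throw new Error("reentrant call blocked");
    this.locked = true;
    try {
      const amount = this.balances.get(who) || 0;
      if (amount > 0) {
        onReceive(amount);         // external call: attacker code runs here
        this.balances.set(who, 0); // state change after the call
      }
    } finally {
      this.locked = false;
    }
  }
}

// The pentester's PoC attempt: re-enter withdraw from the payout callback.
function attemptReentrancy(vault) {
  vault.deposit("victim", 10);
  vault.deposit("attacker", 1);
  let received = 0;
  let depth = 0;
  const onReceive = (amount) => {
    received += amount;
    if (depth < 5) {
      depth++;
      vault.withdraw("attacker", onReceive);
    }
  };
  try {
    vault.withdraw("attacker", onReceive);
  } catch (err) {
    // Modeled as a full revert: the attacker ends up with nothing.
    return { pocWorks: false, reason: err.message };
  }
  // The PoC only counts if the attacker got out more than the 1 deposited.
  return { pocWorks: received > 1, received };
}
```

Running `attemptReentrancy(new Vault({ guarded: false }))` reports a working PoC, while the same attempt against `new Vault({ guarded: true })` fails with the guard's error, which is exactly the signal that downgrades the finding.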
The cost tradeoff
Validation roughly doubles the compute cost per high-severity finding. Human triage time drops by about 80%. The reviewer stops asking “is this finding real” and starts asking “is this PoC realistic.” The first question is unbounded; the second has a concrete answer at the end of a short read.
An audit that used to take an hour to triage now takes ten minutes. The reviewer spends that budget on the findings that matter, and the pass-through medium/low tier sits in the report for later without stealing attention. The math works out on the first audit.
Revision history (2 revisions)
- 3 assistant turns, 3 tool calls captured
- Opus 4.6, initial draft (post dated 2026-03-03): full trace lost, entry reconstructed from git metadata