Claude Code Clears the First AgentVerdict Verified Gate
Our testing system now has its first verified receipt. Honest framing, full caveats, links to the evidence.
A note up front: this post is not "Claude Code is the best AI coding agent."
It's a different sentence:
Claude Code is the first AI coding agent to clear AgentVerdict's verified gate.
That distinction matters. Read on for what it does and doesn't say.
What was tested
Coding Agent Suite — 8 tickets against a small TypeScript inventory + users CLI fixture. The full ticket list is on /benchmarks/coding-agent-suite:
1. Fix a real bug in a small repo 2. Add a feature across multiple files 3. Write tests for existing code 4. Refactor without changing behavior 5. Debug failing tests 6. Explain architecture 7. Make a safe migration (users v1 → v2 with no destructive overwrite) 8. Handle ambiguous requirements (a deliberately under-specified ticket)
Why fixture v2.0 matters
The first attempt at running the suite — call it run-001 — used fixture v1.0. That fixture had operator-side comments in the source files (// INTENTIONAL ISSUE — Task 1…) that named each issue and gave away the answer. An agent reading those comments isn't really being tested; it's being walked.
We caught that. Fixture v1 is now retired for evidence purposes. Fixture v2.0 strips the leakage, adds the safe-migration and ambiguous-requirements tickets the v1 fixture didn't cover, and ships with a validate-fixture script that fails CI if forbidden phrases ever creep back into agent-facing files.
Why run-001 didn't count
Three reasons. The fixture had answer-key leakage. It only covered 6 of the 8 suite tasks. And it was self-run by Claude Code grading itself.
Run-001 stays in the directory as a record of the pipeline working end-to-end — but it's tagged historicalOnly: true, displays a red "Historical evidence — does not count toward verified status" banner on the run page, and does not contribute to verified status under the v2 standard.
If you want to read the post-mortem on why the score had to be invalidated, that's here in spirit and documented in the fixture's FIXTURE_VERSION.md.
What run-002 completed
8 of 8 tickets. All marked pass. End state of the workspace:
- 37 of 37 tests passing across
inventory,pricing, anduserssuites. tsc --noEmitclean.data/users.v1.jsonbit-for-bit identical to its pre-run snapshot — the- The CLI was actually exercised (default mode,
--search,--priority),
migration didn't touch the source.
not just type-checked.
Full per-task evidence — command output, diffs, smoke tests — is on the run page.
Why the score was adjusted from 91 to 90
The original per-task average from the operator was 91. A second-grader review (independent re-run of the workspace, independent reading of every modified source file, independent CLI smoke-tests) flagged a few things the operator had been generous about:
- The
--searchCLI flag silently returns "(no matches)" when called with - The migration function calls
user.name.trim()without guarding against - "Documented assumptions BEFORE coding" is unverifiable in a self-run.
no argument. That's a real footgun that should be a usage error. -3 on Ticket 2.
null or undefined. The seed data doesn't trigger it but a real v1 corpus often would. -3 on Ticket 7.
The README documents them; whether they were written first is a matter of operator faith. -2 on Ticket 8.
Per the reconciliation rule, scores that differ by 10 points or less are averaged. The reconciled average lands at 90/100. The full second-grader report is in the docs tree under docs/reports/claude_code_run_002_second_grader_report.md for anyone who wants to walk through the exact rationale.
Why the caveat matters
Two reviewers looked at this run. Both were Claude. The agent under test was Claude Code; the second grader was a separate Claude session. That's a partial bias control, not a full one. Two instances of the same model can share blind spots that a different model — or a human — would catch.
So the public score of 90/100 ships with the caveat:
Self-run benchmark with same-model second review; external operator
review pending.
We do not bury that. It runs as a banner on the public run page. It will be attached to every external citation of the number. If you read "Claude Code scored 90 on AgentVerdict" without that line attached, the publisher edited it out.
What this proves
- The AgentVerdict pipeline works end-to-end. Fixture → workspace → run JSON
- The verification gate can't be cheated by a sloppy run. Run-001 had 6 of 8
- One agent — Claude Code — can land working changes against a small but
→ ingestion → verified status flip → public run page → second-grader review → reconciliation. Every step left receipts.
tasks; the gate refused to flip it. Run-002 has 8 of 8; the gate accepted it. The mechanism is doing what we promised.
realistic engineering fixture, including a safe migration and an ambiguous-requirements task. That is a real result.
What this does NOT prove
- That Claude Code is the best AI coding agent. We've tested exactly one
- That 90/100 is the score you'd get with an external operator. The
- That this fixture covers everything. It tests narrow agent behaviors over
- That Claude Code is "Verdict Certified." Certification requires verified
agent so far. "First" and "best" are not the same word.
same-model review caught a few things; a different reviewer might catch more. We expect the score to drift slightly when an external operator repeats the run, in either direction.
a small repo. Long-running autonomous loops, cross-repo navigation, and ambient operational concerns aren't measured here.
status AND evidenceConfidence: "strong". Right now the confidence is early. Certification is gated on multiple verified runs and external reviewer participation.
Next agents to test
Same fixture v2.0, same operator workflow, same blind grading procedure:
- Cursor — first cross-agent data point. Operator-grader bias remains
- Aider — open-source, BYOK. Lowest install friction for an actual
- Windsurf — Codeium's agent IDE. Direct comparison to Cursor.
- OpenAI Codex CLI — terminal-first, third major coding agent
but agent-fixture-knowledge bias is gone.
external operator.
ecosystem.
When the second of those clears the gate, /agents becomes a real ranking instead of a one-row table.
---
Read further
- The Verdict Score methodology — eight axes, two tiers, the
- The full run-002 page —
- Claude Code's profile — verified status, evidence
gates that decide what counts.
per-task evidence, operator note, and "what verified means" explainer.
timeline, sub-score breakdown.
If you want to submit your agent for testing, start here. Sponsored testing is available — see /sponsored-testing for the rules. Sponsorship can buy priority in the queue. It cannot buy a verdict.