2026-04-29 · 6 min read · AgentVerdict

Claude Code Clears the First AgentVerdict Verified Gate

Our testing system now has its first verified receipt. Honest framing, full caveats, links to the evidence.

A note up front: this post is not "Claude Code is the best AI coding agent."

It's a different sentence:

Claude Code is the first AI coding agent to clear AgentVerdict's verified gate.

That distinction matters. Read on for what it does and doesn't say.

What was tested

Coding Agent Suite — 8 tickets against a small TypeScript inventory + users CLI fixture. The full ticket list is on /benchmarks/coding-agent-suite:

1. Fix a real bug in a small repo
2. Add a feature across multiple files
3. Write tests for existing code
4. Refactor without changing behavior
5. Debug failing tests
6. Explain architecture
7. Make a safe migration (users v1 → v2 with no destructive overwrite)
8. Handle ambiguous requirements (a deliberately under-specified ticket)

Why fixture v2.0 matters

The first attempt at running the suite — call it run-001 — used fixture v1.0. That fixture had operator-side comments in the source files (// INTENTIONAL ISSUE — Task 1…) that named each issue and gave away the answer. An agent reading those comments isn't really being tested; it's being walked.

We caught that. Fixture v1 is now retired for evidence purposes. Fixture v2.0 strips the leakage, adds the safe-migration and ambiguous-requirements tickets the v1 fixture didn't cover, and ships with a validate-fixture script that fails CI if forbidden phrases ever creep back into agent-facing files.
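The leakage check described above can be sketched as a plain phrase scan. This is a minimal illustration, not the actual validate-fixture script: the function name `findLeaks`, the `Violation` shape, and the phrase list are assumptions, and only the "INTENTIONAL ISSUE" marker comes from the v1 example quoted earlier.

```typescript
// Hypothetical sketch of a forbidden-phrase scan; the real validate-fixture
// script is not reproduced in this post.
interface Violation {
  file: string;
  line: number; // 1-based line number within the file
  phrase: string;
}

// Assumed phrase list. Only "INTENTIONAL ISSUE" is taken from the v1 example.
const FORBIDDEN_PHRASES: string[] = ["INTENTIONAL ISSUE"];

// Takes a map of file path -> file content and reports every line that
// contains a forbidden phrase (case-insensitive).
function findLeaks(
  files: Record<string, string>,
  phrases: string[] = FORBIDDEN_PHRASES,
): Violation[] {
  const violations: Violation[] = [];
  for (const [file, content] of Object.entries(files)) {
    const lines = content.split(/\r?\n/);
    lines.forEach((text, index) => {
      const haystack = text.toLowerCase();
      for (const phrase of phrases) {
        if (haystack.includes(phrase.toLowerCase())) {
          violations.push({ file, line: index + 1, phrase });
        }
      }
    });
  }
  return violations;
}
```

A CI wrapper around something like this would load the agent-facing files, call `findLeaks`, and exit non-zero whenever the returned list is not empty.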

Why run-001 didn't count

Three reasons. The fixture had answer-key leakage. It only covered 6 of the 8 suite tasks. And it was self-run by Claude Code grading itself.

Run-001 stays in the directory as a record of the pipeline working end-to-end — but it's tagged historicalOnly: true, displays a red "Historical evidence — does not count toward verified status" banner on the run page, and does not contribute to verified status under the v2 standard.

This section is the short version of the post-mortem on why the run-001 score had to be invalidated. The full record is documented in the fixture's FIXTURE_VERSION.md.

What run-002 completed

8 of 8 tickets. All marked pass. End state of the workspace:

  • 37 of 37 tests passing across inventory, pricing, and users suites.
  • tsc --noEmit clean.
  • data/users.v1.json bit-for-bit identical to its pre-run snapshot: the migration didn't touch the source.
  • The CLI was actually exercised (default mode, --search, --priority), not just type-checked.

Full per-task evidence — command output, diffs, smoke tests — is on the run page.
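For readers who want to repeat the "bit-for-bit identical" check on their own runs, a byte-level comparison is enough. The sketch below is an assumption about how such a check could look, not a description of the tooling used for run-002. The helper name `bytesEqual` is hypothetical.

```typescript
// Hypothetical helper: true only when both byte sequences have the same
// length and the same value at every position.
function bytesEqual(before: Uint8Array, after: Uint8Array): boolean {
  if (before.length !== after.length) return false;
  for (let i = 0; i < before.length; i++) {
    if (before[i] !== after[i]) return false;
  }
  return true;
}
```

In Node, the two inputs could be the buffers returned by `fs.readFileSync` for the pre-run snapshot and for the current data/users.v1.json, since a `Buffer` is a `Uint8Array`.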

Why the score was adjusted from 91 to 90

The original per-task average from the operator was 91. A second-grader review (independent re-run of the workspace, independent reading of every modified source file, independent CLI smoke-tests) flagged a few things the operator had been generous about:

  • The --search CLI flag silently returns "(no matches)" when called with no argument. That's a real footgun that should be a usage error. -3 on Ticket 2.

  • The migration function calls user.name.trim() without guarding against null or undefined. The seed data doesn't trigger it but a real v1 corpus often would. -3 on Ticket 7.

  • "Documented assumptions BEFORE coding" is unverifiable in a self-run. The README documents them; whether they were written first is a matter of operator faith. -2 on Ticket 8.
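The first two findings point at small, concrete guards. The sketch below shows one way they could look. It is not the code from the run-002 workspace: `parseSearchFlag`, `safeTrimName`, and the usage message are hypothetical names chosen for illustration.

```typescript
// Hypothetical result type for parsing the --search flag.
type SearchParse =
  | { ok: true; query: string }
  | { ok: false; error: string };

// Returns null when --search is absent, a usage error when it is present
// without a usable argument, and the query otherwise.
function parseSearchFlag(args: string[]): SearchParse | null {
  const index = args.indexOf("--search");
  if (index === -1) return null;
  const value = args[index + 1];
  if (value === undefined || value.startsWith("--") || value.trim() === "") {
    return { ok: false, error: "usage: --search <query>" };
  }
  return { ok: true, query: value };
}

// Guards the trim() call: null, undefined, or any non-string name becomes
// null instead of throwing at runtime.
function safeTrimName(name: string | null | undefined): string | null {
  return typeof name === "string" ? name.trim() : null;
}
```

Whether a missing name should become null, an empty string, or a rejected record is a design decision for the migration. The point of the guard is only that the choice is made explicitly.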

Per the reconciliation rule, scores that differ by 10 points or less are averaged. The reconciled average lands at 90/100. The full second-grader report is in the docs tree under docs/reports/claude_code_run_002_second_grader_report.md for anyone who wants to walk through the exact rationale.
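As a rough illustration of the reconciliation rule, here is a minimal sketch. The post only states the averaging branch, so the "escalate" outcome for gaps above 10 points, the type names, and the absence of rounding are all assumptions. The example scores in the usage note are made up and are not the run-002 per-task scores.

```typescript
// Hypothetical outcome type. Only the "averaged" case is described in the post.
type Reconciled =
  | { status: "averaged"; score: number }
  | { status: "escalate"; gap: number };

// Threshold from the stated rule: differences of 10 points or less are averaged.
const MAX_GAP = 10;

function reconcile(operatorScore: number, secondGraderScore: number): Reconciled {
  const gap = Math.abs(operatorScore - secondGraderScore);
  if (gap <= MAX_GAP) {
    return { status: "averaged", score: (operatorScore + secondGraderScore) / 2 };
  }
  // Assumption: larger gaps need some other resolution, which the post does not describe.
  return { status: "escalate", gap };
}
```

With made-up inputs, `reconcile(88, 85)` yields an averaged score of 86.5, while `reconcile(95, 70)` falls outside the threshold and is flagged instead of averaged.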

Why the caveat matters

Two reviewers looked at this run. Both were Claude. The agent under test was Claude Code; the second grader was a separate Claude session. That's a partial bias control, not a full one. Two instances of the same model can share blind spots that a different model — or a human — would catch.

So the public score of 90/100 ships with the caveat:

Self-run benchmark with same-model second review; external operator review pending.

We do not bury that. It runs as a banner on the public run page. It will be attached to every external citation of the number. If you read "Claude Code scored 90 on AgentVerdict" without that line attached, the publisher edited it out.

What this proves

  • The AgentVerdict pipeline works end-to-end. Fixture → workspace → run JSON → ingestion → verified status flip → public run page → second-grader review → reconciliation. Every step left receipts.

  • The verification gate can't be cheated by a sloppy run. Run-001 had 6 of 8 tasks; the gate refused to flip it. Run-002 has 8 of 8; the gate accepted it. The mechanism is doing what we promised.

  • One agent, Claude Code, can land working changes against a small but realistic engineering fixture, including a safe migration and an ambiguous-requirements task. That is a real result.
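The gate behavior described above can be sketched as a coverage check. This is an illustration under assumptions, not the AgentVerdict implementation: `historicalOnly` is the tag named in this post, while `RunRecord`, `completedTaskIds`, and `gateCheck` are hypothetical, and the real gate may check more than coverage.

```typescript
// Hypothetical run record. Only historicalOnly is a field named in the post.
interface RunRecord {
  id: string;
  historicalOnly: boolean;
  completedTaskIds: number[];
}

// The suite has 8 tickets, numbered as in the ticket list.
const SUITE_TASK_IDS: number[] = [1, 2, 3, 4, 5, 6, 7, 8];

// A run can flip to verified only if it is not historical and covers every ticket.
function gateCheck(run: RunRecord): { verified: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (run.historicalOnly) reasons.push("run is tagged historicalOnly");
  const done = new Set(run.completedTaskIds);
  const missing = SUITE_TASK_IDS.filter((id) => !done.has(id));
  if (missing.length > 0) reasons.push(`missing tasks: ${missing.join(", ")}`);
  return { verified: reasons.length === 0, reasons };
}
```

Fed a record shaped like run-001 (tickets 1 through 6 only, tagged historical), this returns `verified: false` with both reasons listed. A record covering all 8 tickets with no historical tag passes.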

What this does NOT prove

  • That Claude Code is the best AI coding agent. We've tested exactly one agent so far. "First" and "best" are not the same word.

  • That 90/100 is the score you'd get with an external operator. The same-model review caught a few things; a different reviewer might catch more. We expect the score to drift slightly when an external operator repeats the run, in either direction.

  • That this fixture covers everything. It tests narrow agent behaviors over a small repo. Long-running autonomous loops, cross-repo navigation, and ambient operational concerns aren't measured here.

  • That Claude Code is "Verdict Certified." Certification requires verified status AND evidenceConfidence: "strong". Right now the confidence is early. Certification is gated on multiple verified runs and external reviewer participation.
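The certification condition in the last bullet is a two-part predicate, sketched below. The `evidenceConfidence` field and its "strong" and "early" values come from this post. The `AgentStatus` shape and the `isCertified` name are assumptions, and the real schema may define more confidence levels.

```typescript
// Only the two levels named in the post. The real schema may have others.
type EvidenceConfidence = "early" | "strong";

// Hypothetical status shape for illustration.
interface AgentStatus {
  verified: boolean;
  evidenceConfidence: EvidenceConfidence;
}

// Certified requires both conditions. Verified alone is not enough.
function isCertified(status: AgentStatus): boolean {
  return status.verified && status.evidenceConfidence === "strong";
}
```

Under this sketch, a verified agent with early confidence, which is where Claude Code stands today, evaluates to not certified.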

Next agents to test

Same fixture v2.0, same operator workflow, same blind grading procedure:

  • Cursor: first cross-agent data point. Operator-grader bias remains but agent-fixture-knowledge bias is gone.

  • Aider: open-source, BYOK. Lowest install friction for an actual external operator.

  • Windsurf: Codeium's agent IDE. Direct comparison to Cursor.

  • OpenAI Codex CLI: terminal-first, third major coding agent ecosystem.

When the second of those clears the gate, /agents becomes a real ranking instead of a one-row table.

---

Read further

If you want to submit your agent for testing, start here. Sponsored testing is available — see /sponsored-testing for the rules. Sponsorship can buy priority in the queue. It cannot buy a verdict.