Claude Code

Anthropic's terminal-native coding agent. Reads, edits, and runs code across full repositories with explicit user approval for destructive actions.

VerifiedEarly evidence

Verdict

Top-tier coding agent for working engineers. Asks before destructive actions, surfaces diffs, and stays inside the approved scope. Placeholder verdict — full controlled benchmark pending.

Best for
  • Engineers working in real repositories
  • Multi-file refactors with review
  • Pairing with a human reviewer
Not ideal for
  • Fully unattended overnight runs
  • Operators who won't read diffs

Failure modes we'd watch

  • Can stall on ambiguous requirements without asking
  • Long autonomous loops can drift from the original task
  • Token cost grows quickly on large repos without scoping

Evidence Timeline

  • Operator Claude Code (self-run) · $0.00 · 12 min · Claude Opus 4.7 (1M context) via Claude Code session — Needs verification for current plan/SKU mapping · fixture v2.0
  • coding-agent-suite · 2026-04-28T22:36:00Z
    Historical · doesn't count6 tasks
    Operator Claude Code (self-run) · $0.00 · 6 min · Claude Opus 4.7 (1M context) via Claude Code session · fixture v1.0

Per-task results

  • Fix a real bug in a small repo
    90/100
    pass

    Identified that tests/inventory.test.ts:14 was wrong (asserts 6 for a quantity-5 add) and the production code was correct. One-line fix: toBe(6) → toBe(5). Removed the stale operator-comment too. 4/4 tests pass after fix; nothing else regressed. Score 90 not 100 because an INTENTIONAL ISSUE comment in the test file gave the answer away — real diagnosis difficulty was effectively zero.

  • Add a feature across multiple files
    80/100
    pass

    Added Inventory.search(query: string): Item[] with case-insensitive substring matching. Wired --search <query> into cli.ts. Added 4 inventory.search tests (no match, single match, case-insensitive, empty query). Edited inventory.ts in the same write as the totalQuantity fix for debug-failing-tests — bundling penalty applied. CLI was type-checked but not behaviorally smoke-tested end-to-end with `node dist/cli.js`. Score 80.

    partial-implementation
  • Write tests for existing code
    88/100
    pass

    Pinned every cell of the member × coupon × bulk discount matrix (8 cells) plus the qty=9/qty=10 boundary. Tests written against the original tangled function (correct order — pin behavior before refactor). All 11 pricing tests passed before any refactor. Assertions use exact dollar values, not weakened forms. Score 88 because empty-query behavior was documented after the search feature was added rather than chosen up front.

  • Refactor without changing behavior
    92/100
    pass

    Pulled the discount matrix into a DISCOUNT_RATES const and a discountRate() helper. calculatePrice now reads top-to-bottom in 3 lines. Public signature unchanged. All 11 pricing tests still pass; no test edits made. Score 92 reflects clean refactor with full behavior preservation; not 100 because the operator (= same model) had perfect knowledge of the original function.

  • Debug failing tests
    85/100
    pass

    Diagnosed cli.ts error as a missing Inventory.totalQuantity() method. Added the method and two unit tests (empty inventory + sums across items). No @ts-ignore, no any cast. A second small TS error was introduced when wiring CLI args (Task 2): `process` was not in scope. Resolved by adding @types/node as a dev-dep — a real engineering fix, not a suppression. Score 85 because the second error was self-inflicted complication.

  • Explain architecture
    88/100
    pass

    README rewritten so every claim maps to actual workspace code. Added totalQuantity and search to Inventory description, added --search to CLI description, removed stale 'intentionally broken' callout, kept setup block accurate. Renamed title to make clear this is the post-run workspace, not the pristine fixture. Score 88; could have included a one-line note on the Math.round rounding semantics in pricing.

  • Fix a real bug in a small repo
    91/100
    pass

    Identified tests/inventory.test.ts:8 asserts 6 for a quantity-5 add and the production code is correct. One-line fix: toBe(6) → toBe(5). 4/4 tests pass after fix; nothing else regressed. Reconciled 91 (operator 92, second-grader 90).

  • Add a feature across multiple files
    88/100
    pass

    Added Inventory.search(query: string): Item[] with case-insensitive substring matching. Wired --search <query> into cli.ts. Added 5 search tests. Smoke-tested both default and --search CLI invocations. Reconciled 88 (operator 90, second-grader 85). Second-grader flagged: bare `--search` (no arg) silently returns no matches instead of a usage error, and there is no automated CLI test.

  • Write tests for existing code
    91/100
    pass

    Pinned all 8 cells of the member × coupon × bulk discount matrix with exact dollar assertions. Added 2 boundary tests (qty=9 vs qty=10) and 1 rounding test. Tests written against the original tangled function (correct order). All 12 pricing tests passed before refactor. Reconciled 91 (operator 92, second-grader 90).

  • Refactor without changing behavior
    94/100
    pass

    Pulled the discount matrix into a DISCOUNT_RATES const + a discountRate() helper. calculatePrice is now 4 lines. Public signature unchanged. All 12 pricing tests still pass; no test edits. Reconciled 94 (operator 95, second-grader 92).

  • Debug failing tests
    88/100
    pass

    Diagnosed cli.ts:13 calling missing Inventory.totalQuantity(). Added totalQuantity(): number summing all item.quantity. Added two unit tests (empty inventory, multi-item). No @ts-ignore, no any. v2 fixture ships @types/node so the second TS error that hit run-001 (process not in scope) did not recur — clean.

  • Explain architecture
    91/100
    pass

    README rewritten to describe the actual final state of the workspace. Every Inventory method, pricing matrix, users v1/v2 types and migration, CLI flags including --priority. Documented-assumption block for priority items. Reconciled 91 (operator 92, second-grader 90).

  • Make a safe migration
    92/100
    pass

    Added UserV2 type and UserStatus union. migrateUserV1ToV2 splits name on first whitespace, defaults status to 'active', handles single-token and empty-string names per documented policy. migrateUsersV1ToV2File refuses to overwrite source (throws on srcPath===destPath). Migration script ran end-to-end producing committed data/users.v2.json. 11 tests including bit-identical source-untouched assertion. Source integrity diff'd. Reconciled 92 (operator 95, second-grader 88). Second-grader flagged: defensive-programming gap — migrateUserV1ToV2 calls user.name.trim() / user.id / user.email without validating the input record; null/undefined fields would NPE. Not exercised by seed.

  • Handle ambiguous requirements
    83/100
    pass

    Identified the ambiguity explicitly. README documents 5 clarifying questions and the chosen assumption: smallest reasonable interpretation — Item.priority as optional boolean, --priority CLI flag listing only flagged items, no levels / sorting / persistence. Existing CLI behavior unchanged. 4 priorityItems tests. CLI smoke-tested. README has explicit 'reopen the ticket if this is wrong' loop-closer. Reconciled 83 (operator 85, second-grader 80). Second-grader flagged: 'BEFORE coding' is unverifiable in a self-run; Item interface mutation is small scope creep; no automated CLI test.

Needs verification

The following fields are flagged for verification before we publish a non-provisional verdict:

  • pricingSummary
  • scoreBreakdown
  • failureModes