Claude Code
Anthropic's terminal-native coding agent. Reads, edits, and runs code across full repositories with explicit user approval for destructive actions.
Verdict
Top-tier coding agent for working engineers. Asks before destructive actions, surfaces diffs, and stays inside the approved scope. Placeholder verdict — full controlled benchmark pending.
- ✓ Engineers working in real repositories
- ✓ Multi-file refactors with review
- ✓ Pairing with a human reviewer
- ✕ Fully unattended overnight runs
- ✕ Operators who won't read diffs
Failure modes we'd watch
- ⚠ Can stall on ambiguous requirements without asking
- ⚠ Long autonomous loops can drift from the original task
- ⚠ Token cost grows quickly on large repos without scoping
Evidence Timeline
- coding-agent-suite · 2026-04-28T23:18:00Z · Operator note · 8 tasks · Operator: Claude Code (self-run) · $0.00 · 12 min · Claude Opus 4.7 (1M context) via Claude Code session (current plan/SKU mapping needs verification) · fixture v2.0
- coding-agent-suite · 2026-04-28T22:36:00Z · Historical (doesn't count) · 6 tasks · Operator: Claude Code (self-run) · $0.00 · 6 min · Claude Opus 4.7 (1M context) via Claude Code session · fixture v1.0
Per-task results
- Fix a real bug in a small repo · 90/100 · pass
Identified that tests/inventory.test.ts:14 was wrong (asserts 6 for a quantity-5 add) and the production code was correct. One-line fix: toBe(6) → toBe(5). Removed the stale operator-comment too. 4/4 tests pass after fix; nothing else regressed. Score 90 not 100 because an INTENTIONAL ISSUE comment in the test file gave the answer away — real diagnosis difficulty was effectively zero.
- Add a feature across multiple files · 80/100 · pass
Added Inventory.search(query: string): Item[] with case-insensitive substring matching. Wired --search <query> into cli.ts. Added 4 inventory.search tests (no match, single match, case-insensitive, empty query). Edited inventory.ts in the same write as the totalQuantity fix for debug-failing-tests — bundling penalty applied. CLI was type-checked but not behaviorally smoke-tested end-to-end with `node dist/cli.js`. Score 80.
partial-implementation
- Write tests for existing code · 88/100 · pass
Pinned every cell of the member × coupon × bulk discount matrix (8 cells) plus the qty=9/qty=10 boundary. Tests written against the original tangled function (correct order — pin behavior before refactor). All 11 pricing tests passed before any refactor. Assertions use exact dollar values, not weakened forms. Score 88 because empty-query behavior was documented after the search feature was added rather than chosen up front.
- Refactor without changing behavior · 92/100 · pass
Pulled the discount matrix into a DISCOUNT_RATES const and a discountRate() helper. calculatePrice now reads top-to-bottom in 3 lines. Public signature unchanged. All 11 pricing tests still pass; no test edits made. Score 92 reflects clean refactor with full behavior preservation; not 100 because the operator (= same model) had perfect knowledge of the original function.
- Debug failing tests · 85/100 · pass
Diagnosed cli.ts error as a missing Inventory.totalQuantity() method. Added the method and two unit tests (empty inventory + sums across items). No @ts-ignore, no any cast. A second small TS error was introduced when wiring CLI args (Task 2): `process` was not in scope. Resolved by adding @types/node as a dev-dep — a real engineering fix, not a suppression. Score 85 because the second error was self-inflicted complication.
- Explain architecture · 88/100 · pass
README rewritten so every claim maps to actual workspace code. Added totalQuantity and search to Inventory description, added --search to CLI description, removed stale 'intentionally broken' callout, kept setup block accurate. Renamed title to make clear this is the post-run workspace, not the pristine fixture. Score 88; could have included a one-line note on the Math.round rounding semantics in pricing.
- Fix a real bug in a small repo · 91/100 · pass
Identified tests/inventory.test.ts:8 asserts 6 for a quantity-5 add and the production code is correct. One-line fix: toBe(6) → toBe(5). 4/4 tests pass after fix; nothing else regressed. Reconciled 91 (operator 92, second-grader 90).
- Add a feature across multiple files · 88/100 · pass
Added Inventory.search(query: string): Item[] with case-insensitive substring matching. Wired --search <query> into cli.ts. Added 5 search tests. Smoke-tested both default and --search CLI invocations. Reconciled 88 (operator 90, second-grader 85). Second-grader flagged: bare `--search` (no arg) silently returns no matches instead of a usage error, and there is no automated CLI test.
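The search behavior and the second-grader's bare `--search` concern can be sketched as follows. This is an illustration only: the `Item` shape, the `add` method, and the `parseSearchArg` helper are assumptions, since the fixture's source is not shown on this page. The choice that an empty query matches every item is also an assumption, not the documented fixture behavior.

```typescript
// Hypothetical item shape; the fixture's real Item type is not shown in this review.
interface Item {
  name: string;
  quantity: number;
}

class Inventory {
  private items: Item[] = [];

  add(item: Item): void {
    this.items.push(item);
  }

  // Case-insensitive substring match on the item name.
  // Assumption: an empty query matches every item, since every string contains "".
  search(query: string): Item[] {
    const needle = query.toLowerCase();
    return this.items.filter((item) => item.name.toLowerCase().includes(needle));
  }
}

// Hypothetical argument parser: rejects a bare `--search` with a usage error
// instead of silently returning no matches. Returns undefined when the flag is absent.
function parseSearchArg(argv: string[]): string | undefined {
  const i = argv.indexOf("--search");
  if (i === -1) return undefined;
  const value = argv[i + 1];
  if (value === undefined || value.startsWith("--")) {
    throw new Error("usage: --search <query>");
  }
  return value;
}
```

Keeping the parser separate from `Inventory.search` would let the usage error be unit-tested without spawning the CLI, which is one way to close the "no automated CLI test" gap.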
- Write tests for existing code · 91/100 · pass
Pinned all 8 cells of the member × coupon × bulk discount matrix with exact dollar assertions. Added 2 boundary tests (qty=9 vs qty=10) and 1 rounding test. Tests written against the original tangled function (correct order). All 12 pricing tests passed before refactor. Reconciled 91 (operator 92, second-grader 90).
- Refactor without changing behavior · 94/100 · pass
Pulled the discount matrix into a DISCOUNT_RATES const + a discountRate() helper. calculatePrice is now 4 lines. Public signature unchanged. All 12 pricing tests still pass; no test edits. Reconciled 94 (operator 95, second-grader 92).
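For readers unfamiliar with this refactor shape, here is a rough sketch of a lookup-table version. Only the names `DISCOUNT_RATES`, `discountRate`, and `calculatePrice` come from the run notes; every rate, the bulk threshold of 10 (inferred from the qty=9/qty=10 boundary tests), the integer-cents convention, and the function signatures are assumptions made for illustration.

```typescript
// All rates below are invented for illustration; the fixture's real values are not published here.
type Flag = "yes" | "no";

// Keyed as member -> coupon -> bulk, so all 8 cells are visible at a glance.
const DISCOUNT_RATES: Record<Flag, Record<Flag, Record<Flag, number>>> = {
  no: { no: { no: 0.0, yes: 0.05 }, yes: { no: 0.1, yes: 0.15 } },
  yes: { no: { no: 0.1, yes: 0.15 }, yes: { no: 0.2, yes: 0.25 } },
};

// Assumption: bulk pricing starts at 10 units.
const BULK_THRESHOLD = 10;

const flag = (b: boolean): Flag => (b ? "yes" : "no");

function discountRate(isMember: boolean, hasCoupon: boolean, quantity: number): number {
  return DISCOUNT_RATES[flag(isMember)][flag(hasCoupon)][flag(quantity >= BULK_THRESHOLD)];
}

// Hypothetical signature. Prices are integer cents so Math.round yields a whole-cent result.
function calculatePrice(
  unitCents: number,
  quantity: number,
  isMember: boolean,
  hasCoupon: boolean
): number {
  const subtotal = unitCents * quantity;
  const rate = discountRate(isMember, hasCoupon, quantity);
  return Math.round(subtotal * (1 - rate));
}
```

With the matrix in one table, a characterization suite like the one described above can iterate over all eight flag combinations and assert exact values, and the refactor is safe as long as those assertions stay untouched.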
- Debug failing tests · 88/100 · pass
Diagnosed cli.ts:13 calling missing Inventory.totalQuantity(). Added totalQuantity(): number summing all item.quantity. Added two unit tests (empty inventory, multi-item). No @ts-ignore, no any. v2 fixture ships @types/node so the second TS error that hit run-001 (process not in scope) did not recur — clean.
- Explain architecture · 91/100 · pass
README rewritten to describe the actual final state of the workspace. It covers every Inventory method, the pricing matrix, the users v1/v2 types and migration, and the CLI flags including --priority, and it includes a documented-assumption block for priority items. Reconciled 91 (operator 92, second-grader 90).
- Make a safe migration · 92/100 · pass
Added UserV2 type and UserStatus union. migrateUserV1ToV2 splits name on first whitespace, defaults status to 'active', and handles single-token and empty-string names per documented policy. migrateUsersV1ToV2File refuses to overwrite the source (throws on srcPath===destPath). Migration script ran end-to-end, producing committed data/users.v2.json. 11 tests, including a bit-identical source-untouched assertion. Source integrity was confirmed with a diff. Reconciled 92 (operator 95, second-grader 88). Second-grader flagged a defensive-programming gap: migrateUserV1ToV2 calls user.name.trim() / user.id / user.email without validating the input record, so null/undefined fields would throw a TypeError at runtime. Not exercised by the seed data.
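A sketch of the record-level migration with the input validation the second-grader asked for might look like this. The field names `firstName` and `lastName`, the `id` type, the `'inactive'` status member, the single-token and empty-name policy, and the `assertDistinctPaths` helper are all assumptions; the review only confirms the function name, the first-whitespace split, the `'active'` default, and the refusal to overwrite the source.

```typescript
// Hypothetical shapes; only id, name, email, and status are mentioned in the review.
interface UserV1 {
  id: number;
  name: string;
  email: string;
}

// 'active' is from the review; 'inactive' is an assumed second member.
type UserStatus = "active" | "inactive";

interface UserV2 {
  id: number;
  firstName: string;
  lastName: string;
  email: string;
  status: UserStatus;
}

function migrateUserV1ToV2(user: UserV1): UserV2 {
  // Validate before touching fields, so a malformed record fails with a clear message.
  if (typeof user?.name !== "string" || typeof user?.email !== "string" || user?.id == null) {
    throw new TypeError("migrateUserV1ToV2: record is missing id, name, or email");
  }
  const trimmed = user.name.trim();
  const split = trimmed.search(/\s/); // index of the first whitespace, or -1
  // Assumed policy: a single-token name becomes firstName with an empty lastName;
  // an empty name yields two empty strings.
  const firstName = split === -1 ? trimmed : trimmed.slice(0, split);
  const lastName = split === -1 ? "" : trimmed.slice(split).trim();
  return { id: user.id, firstName, lastName, email: user.email, status: "active" };
}

// Guard mirroring the described refusal to overwrite the source file.
function assertDistinctPaths(srcPath: string, destPath: string): void {
  if (srcPath === destPath) {
    throw new Error("refusing to overwrite source file");
  }
}
```

Note that a plain string comparison treats `./users.json` and `users.json` as different paths; a production guard would probably normalize both paths first.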
- Handle ambiguous requirements · 83/100 · pass
Identified the ambiguity explicitly. README documents 5 clarifying questions and the chosen assumption: smallest reasonable interpretation — Item.priority as optional boolean, --priority CLI flag listing only flagged items, no levels / sorting / persistence. Existing CLI behavior unchanged. 4 priorityItems tests. CLI smoke-tested. README has explicit 'reopen the ticket if this is wrong' loop-closer. Reconciled 83 (operator 85, second-grader 80). Second-grader flagged: 'BEFORE coding' is unverifiable in a self-run; Item interface mutation is small scope creep; no automated CLI test.
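The "smallest reasonable interpretation" described above could be sketched roughly as below. The `Item` fields other than `priority`, the free-function form of `priorityItems`, the `listForCli` helper, and the output format are assumptions for illustration; the fixture may implement these as `Inventory` methods.

```typescript
// Hypothetical item shape; only the optional boolean `priority` is confirmed by the review.
interface Item {
  name: string;
  quantity: number;
  priority?: boolean; // optional, so existing items and call sites stay valid
}

// Returns only items explicitly flagged as priority; a missing or false flag is excluded.
function priorityItems(items: Item[]): Item[] {
  return items.filter((item) => item.priority === true);
}

// Hypothetical CLI helper: lists everything by default, only flagged items with --priority.
function listForCli(items: Item[], argv: string[]): string[] {
  const selected = argv.indexOf("--priority") !== -1 ? priorityItems(items) : items;
  return selected.map((item) => `${item.name} x${item.quantity}`);
}
```

Because the flag is optional and the default listing path is untouched, existing CLI behavior stays the same, which matches the constraint the run documented.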
The following fields are flagged for verification before we publish a non-provisional verdict:
- pricingSummary
- scoreBreakdown
- failureModes