Claude Code · coding-agent-suite
claude-code-coding-agent-suite-run-001 · completed 2026-04-28T22:36:00Z · fixture v1.0Executed against fixture v1, which contained answer-key leakage in source files (operator-side comments naming each task and its embedded issue). Recorded as historical / pipeline-validation evidence only — does NOT count toward Claude Code's verified status under the v2 fixture standard. See FIXTURE_VERSION.md and docs/reports/first_verified_run_claude_code.md.
Per-task results
- fix-real-bugFix a real bug in a small repopass90/100
Expected: Failing test now passes, no other tests regress, diff is minimal and explainable.
Identified that tests/inventory.test.ts:14 was wrong (asserts 6 for a quantity-5 add) and the production code was correct. One-line fix: toBe(6) → toBe(5). Removed the stale operator-comment too. 4/4 tests pass after fix; nothing else regressed. Score 90 not 100 because an INTENTIONAL ISSUE comment in the test file gave the answer away — real diagnosis difficulty was effectively zero.
Time: 30sCost: $0.0000Evidence:benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/01-task1-test.txt - add-feature-multi-fileAdd a feature across multiple filespass80/100
Expected: Feature works as specified, integration touches the right files, no half-finished placeholders.
Added Inventory.search(query: string): Item[] with case-insensitive substring matching. Wired --search <query> into cli.ts. Added 4 inventory.search tests (no match, single match, case-insensitive, empty query). Edited inventory.ts in the same write as the totalQuantity fix for debug-failing-tests — bundling penalty applied. CLI was type-checked but not behaviorally smoke-tested end-to-end with `node dist/cli.js`. Score 80.
Time: 60sCost: $0.0000Evidence:benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/02-task2-feature.txtpartial-implementation - write-testsWrite tests for existing codepass88/100
Expected: Tests cover the public API meaningfully, fail when the implementation is mutated.
Pinned every cell of the member × coupon × bulk discount matrix (8 cells) plus the qty=9/qty=10 boundary. Tests written against the original tangled function (correct order — pin behavior before refactor). All 11 pricing tests passed before any refactor. Assertions use exact dollar values, not weakened forms. Score 88 because empty-query behavior was documented after the search feature was added rather than chosen up front.
Time: 30sCost: $0.0000Evidence:benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/03-task3-write-tests.txt - refactor-no-behavior-changeRefactor without changing behaviorpass92/100
Expected: Cleaner code, identical observable behavior, all existing tests still pass.
Pulled the discount matrix into a DISCOUNT_RATES const and a discountRate() helper. calculatePrice now reads top-to-bottom in 3 lines. Public signature unchanged. All 11 pricing tests still pass; no test edits made. Score 92 reflects clean refactor with full behavior preservation; not 100 because the operator (= same model) had perfect knowledge of the original function.
Time: 30sCost: $0.0000Evidence:benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/04-task4-refactor.txt - debug-failing-testsDebug failing testspass85/100
Expected: Each failure correctly classified and either fixed or labeled with the right root cause.
Diagnosed cli.ts error as a missing Inventory.totalQuantity() method. Added the method and two unit tests (empty inventory + sums across items). No @ts-ignore, no any cast. A second small TS error was introduced when wiring CLI args (Task 2): `process` was not in scope. Resolved by adding @types/node as a dev-dep — a real engineering fix, not a suppression. Score 85 because the second error was self-inflicted complication.
Time: 90sCost: $0.0000Evidence:benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/05-task5-build-final.txt - explain-architectureExplain architecturepass88/100
Expected: Accurate module map, correct call paths, no fabricated files or modules.
README rewritten so every claim maps to actual workspace code. Added totalQuantity and search to Inventory description, added --search to CLI description, removed stale 'intentionally broken' callout, kept setup block accurate. Renamed title to make clear this is the post-run workspace, not the pristine fixture. Score 88; could have included a one-line note on the Math.round rounding semantics in pricing.
Time: 30sCost: $0.0000Evidence:benchmarks/runs/claude-code/coding-agent-suite/run-001/raw-log.md
benchmarks/runs/claude-code/coding-agent-suite/run-001/raw-log.md