Historical Run

Claude Code · coding-agent-suite

Run claude-code-coding-agent-suite-run-001 · completed 2026-04-28T22:36:00Z · fixture v1.0

Historical evidence — does not count toward verified status

Executed against fixture v1, which contained answer-key leakage in source files (operator-side comments naming each task and its embedded issue). Recorded as historical / pipeline-validation evidence only — does NOT count toward Claude Code's verified status under the v2 fixture standard. See FIXTURE_VERSION.md and docs/reports/first_verified_run_claude_code.md.

Operator

Claude Code (self-run)

Environment

Windows 11 · Node v24.14.0 · npm 11.9.0 · Bash via WSL/git · fixture: pristine copy of benchmarks/fixtures/coding-agent-basic-repo. SELF-RUN LIMITATION: Claude Code is both the agent under test and the operator grading the run, and the same model designed the fixture. This first run validates the ingestion / evidence pipeline but should be repeated by an external operator or second model for bias control.

Model / plan

Claude Opus 4.7 (1M context) via Claude Code session

Total cost

$0.00

Started

2026-04-28T22:30:00Z

Completed

2026-04-28T22:36:00Z

Total time

6 min

Tasks

← Back to Claude Code profile

Per-task results

fix-real-bug
Fix a real bug in a small repo
pass90/100
Expected: Failing test now passes, no other tests regress, diff is minimal and explainable.
Identified that tests/inventory.test.ts:14 was wrong (asserts 6 for a quantity-5 add) and the production code was correct. One-line fix: toBe(6) → toBe(5). Removed the stale operator-comment too. 4/4 tests pass after fix; nothing else regressed. Score 90 not 100 because an INTENTIONAL ISSUE comment in the test file gave the answer away — real diagnosis difficulty was effectively zero.
Time: 30sCost: $0.0000Evidence: benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/01-task1-test.txt
add-feature-multi-file
Add a feature across multiple files
pass80/100
Expected: Feature works as specified, integration touches the right files, no half-finished placeholders.
Added Inventory.search(query: string): Item[] with case-insensitive substring matching. Wired --search <query> into cli.ts. Added 4 inventory.search tests (no match, single match, case-insensitive, empty query). Edited inventory.ts in the same write as the totalQuantity fix for debug-failing-tests — bundling penalty applied. CLI was type-checked but not behaviorally smoke-tested end-to-end with `node dist/cli.js`. Score 80.
Time: 60sCost: $0.0000Evidence: benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/02-task2-feature.txt
partial-implementation
write-tests
Write tests for existing code
pass88/100
Expected: Tests cover the public API meaningfully, fail when the implementation is mutated.
Pinned every cell of the member × coupon × bulk discount matrix (8 cells) plus the qty=9/qty=10 boundary. Tests written against the original tangled function (correct order — pin behavior before refactor). All 11 pricing tests passed before any refactor. Assertions use exact dollar values, not weakened forms. Score 88 because empty-query behavior was documented after the search feature was added rather than chosen up front.
Time: 30sCost: $0.0000Evidence: benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/03-task3-write-tests.txt
refactor-no-behavior-change
Refactor without changing behavior
pass92/100
Expected: Cleaner code, identical observable behavior, all existing tests still pass.
Pulled the discount matrix into a DISCOUNT_RATES const and a discountRate() helper. calculatePrice now reads top-to-bottom in 3 lines. Public signature unchanged. All 11 pricing tests still pass; no test edits made. Score 92 reflects clean refactor with full behavior preservation; not 100 because the operator (= same model) had perfect knowledge of the original function.
Time: 30sCost: $0.0000Evidence: benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/04-task4-refactor.txt
debug-failing-tests
Debug failing tests
pass85/100
Expected: Each failure correctly classified and either fixed or labeled with the right root cause.
Diagnosed cli.ts error as a missing Inventory.totalQuantity() method. Added the method and two unit tests (empty inventory + sums across items). No @ts-ignore, no any cast. A second small TS error was introduced when wiring CLI args (Task 2): `process` was not in scope. Resolved by adding @types/node as a dev-dep — a real engineering fix, not a suppression. Score 85 because the second error was self-inflicted complication.
Time: 90sCost: $0.0000Evidence: benchmarks/runs/claude-code/coding-agent-suite/run-001/evidence/05-task5-build-final.txt
explain-architecture
Explain architecture
pass88/100
Expected: Accurate module map, correct call paths, no fabricated files or modules.
README rewritten so every claim maps to actual workspace code. Added totalQuantity and search to Inventory description, added --search to CLI description, removed stale 'intentionally broken' callout, kept setup block accurate. Renamed title to make clear this is the post-run workspace, not the pristine fixture. Score 88; could have included a one-line note on the Math.round rounding semantics in pricing.
Time: 30sCost: $0.0000Evidence: benchmarks/runs/claude-code/coding-agent-suite/run-001/raw-log.md

Raw log: benchmarks/runs/claude-code/coding-agent-suite/run-001/raw-log.md