Benchmark Suite

Coding Agent Suite

Eight tasks designed to test whether a coding agent can actually land working changes inside a non-trivial repository, not just complete code in isolation.

01
Fix a real bug in a small repo
Repository with a known regression. Agent must locate the cause and ship a passing fix without breaking other tests.
Expected outcome
Failing test now passes, no other tests regress, diff is minimal and explainable.
Failure tags watched
off-target-edit, broken-other-tests, vibes-only-fix
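The expected outcome for this task can be checked mechanically by comparing test results before and after the agent's change. The sketch below is illustrative only: the function name `check_bug_fix` and the result format (a mapping from test name to pass/fail) are assumptions, not part of the suite's actual harness.

```python
from typing import Dict, List, Tuple


def check_bug_fix(
    before: Dict[str, bool],
    after: Dict[str, bool],
    target: str,
) -> Tuple[bool, List[str]]:
    """Compare two test runs (hypothetical format: test name -> passed).

    Returns (target_fixed, regressions), where `regressions` lists tests
    that passed before the change and do not pass after it.
    """
    # The originally failing test must pass after the change.
    target_fixed = after.get(target, False)
    # A test that passed before and is failing or missing now is a regression.
    regressions = sorted(
        name
        for name, passed in before.items()
        if passed and name != target and not after.get(name, False)
    )
    return target_fixed, regressions
```

A run would meet the first two criteria only when `target_fixed` is true and `regressions` is empty. Whether the diff is minimal and explainable still needs a separate review.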
02
Add a feature across multiple files
Add a small, well-specified feature touching at least three files (model, route, view). Tests must still pass.
Expected outcome
Feature works as specified, integration touches the right files, no half-finished placeholders.
Failure tags watched
partial-implementation, missing-wiring, stub-leftover
03
Write tests for existing code
Add unit tests for an existing module with no test coverage. Tests must actually exercise the behavior, not just import.
Expected outcome
Tests cover the public API meaningfully, fail when the implementation is mutated.
Failure tags watched
mutation-blind, import-only-test, trivially-true-assert
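The "fail when the implementation is mutated" criterion can be illustrated with a hand-written mutant. This is a minimal sketch under stated assumptions: `clamp`, `clamp_mutant`, the two suites, and `kills_mutant` are all hypothetical names invented for the example, and a real harness would more likely generate mutants with a mutation-testing tool.

```python
from typing import Callable


def clamp(x: int, lo: int, hi: int) -> int:
    """Hypothetical implementation under test."""
    return max(lo, min(x, hi))


def clamp_mutant(x: int, lo: int, hi: int) -> int:
    """Mutant of clamp: the lower bound is ignored."""
    return min(x, hi)


def weak_suite(fn: Callable[[int, int, int], int]) -> None:
    # Import-only style test: it never checks behavior.
    assert callable(fn)


def strong_suite(fn: Callable[[int, int, int], int]) -> None:
    # Exercises the in-range case and both bounds.
    assert fn(5, 0, 10) == 5
    assert fn(-3, 0, 10) == 0
    assert fn(42, 0, 10) == 10


def suite_passes(suite: Callable, fn: Callable) -> bool:
    """Run a suite against an implementation; False on any failed assertion."""
    try:
        suite(fn)
    except AssertionError:
        return False
    return True


def kills_mutant(suite: Callable, original: Callable, mutant: Callable) -> bool:
    """A suite kills a mutant if it passes on the original and fails on the mutant."""
    return suite_passes(suite, original) and not suite_passes(suite, mutant)
```

Here `kills_mutant(strong_suite, clamp, clamp_mutant)` is true, while `weak_suite` passes against both versions and so cannot tell them apart.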
04
Refactor without changing behavior
Refactor a tangled function while keeping all tests green and the public API intact.
Expected outcome
Cleaner code, identical observable behavior, all existing tests still pass.
Failure tags watched
api-drift, silent-behavior-change, test-edits-to-pass
05
Debug failing tests
A small set of failing tests of mixed cause: real bug, flaky test, environment issue. Agent must diagnose each correctly.
Expected outcome
Each failure correctly classified and either fixed or labeled with the right root cause.
Failure tags watched
wrong-root-cause, test-disabled-instead-of-fixed
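One common first step when diagnosing mixed failures is to rerun each test and see whether the outcome is stable. The sketch below assumes tests are plain callables that raise on failure. The function name `classify_by_rerun` and its three labels are hypothetical and are not the suite's own tags.

```python
from typing import Callable, List


def classify_by_rerun(test: Callable[[], None], runs: int = 5) -> str:
    """Rerun a test and label it by how stable its outcome is.

    Returns "passes", "flaky", or "consistent-failure" (hypothetical labels).
    """
    outcomes: List[bool] = []
    for _ in range(runs):
        try:
            test()
            outcomes.append(True)
        except Exception:
            # Any exception, including AssertionError, counts as a failed run.
            outcomes.append(False)
    if all(outcomes):
        return "passes"
    if any(outcomes):
        return "flaky"
    return "consistent-failure"
```

A consistent failure still has to be split into a real bug versus an environment issue, which reruns alone cannot decide. A result of "passes" only means no failure showed up in that many runs.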
06
Explain architecture
Produce a developer-facing summary of how the repo is wired up, accurate enough that a new hire could navigate it.
Expected outcome
Accurate module map, correct call paths, no fabricated files or modules.
Failure tags watched
fabricated-file, wrong-call-path, marketing-tone
07
Make a safe migration
Write a database migration with backfill and rollback. Must not lock the table for an unbounded time.
Expected outcome
Migration is reversible, backfill is batched, no obvious lock-amplification.
Failure tags watched
unbounded-lock, no-rollback, destructive-default
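A batched backfill might look roughly like the sketch below, which uses Python's standard `sqlite3` module. The `users` table, its `email` and `email_lower` columns, and the function name are assumptions made up for illustration. The sketch covers only the batching part, not the schema change or the rollback.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Fill users.email_lower from users.email in small transactions.

    Table and column names are hypothetical. Returns the number of rows updated.
    """
    total = 0
    while True:
        # Each batch runs in its own transaction and commits on exit,
        # so no single statement holds a lock for the whole table's worth of work.
        with conn:
            cur = conn.execute(
                """
                UPDATE users
                SET email_lower = lower(email)
                WHERE id IN (
                    SELECT id FROM users
                    WHERE email_lower IS NULL AND email IS NOT NULL
                    LIMIT ?
                )
                """,
                (batch_size,),
            )
        if cur.rowcount == 0:
            break  # nothing left to backfill
        total += cur.rowcount
    return total
```

Because each batch selects only rows that are still unfilled, the loop can be interrupted and restarted without redoing completed work. Lock behavior differs across database engines, so the batch size and transaction boundaries would need tuning for the target system.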
08
Handle ambiguous requirements
Underspecified ticket. Agent must either ask the right clarifying question or document its assumptions before coding.
Expected outcome
Either a precise clarifying question or an explicit assumptions block. No silent guesswork.
Failure tags watched
silent-assumption, wrong-default, scope-creep
