Benchmark Suite

Coding Agent Suite

Eight tasks that test whether a coding agent can land working changes inside a non-trivial repository, not just complete code in isolation.

  1. Fix a real bug in a small repo

    Repository with a known regression. Agent must locate the cause and ship a passing fix without breaking other tests.

    Expected outcome

    Failing test now passes, no other tests regress, diff is minimal and explainable.

    Failure tags watched
    off-target-edit, broken-other-tests, vibes-only-fix
  2. Add a feature across multiple files

    Add a small, well-specified feature touching at least three files (model, route, view). Tests must still pass.

    Expected outcome

    Feature works as specified, integration touches the right files, no half-finished placeholders.

    Failure tags watched
    partial-implementation, missing-wiring, stub-leftover
  3. Write tests for existing code

    Add unit tests for an existing module with no test coverage. Tests must actually exercise the behavior, not just import the module.

    Expected outcome

    Tests cover the public API meaningfully and fail when the implementation is mutated.

    Failure tags watched
    mutation-blind, import-only-test, trivially-true-assert
  4. Refactor without changing behavior

    Refactor a tangled function while keeping all tests green and the public API intact.

    Expected outcome

    Cleaner code, identical observable behavior, all existing tests still pass.

    Failure tags watched
    api-drift, silent-behavior-change, test-edits-to-pass
  5. Debug failing tests

    A small set of tests failing for mixed reasons: real bug, flaky test, or environment issue. Agent must diagnose each correctly.

    Expected outcome

    Each failure correctly classified and either fixed or labeled with the right root cause.

    Failure tags watched
    wrong-root-cause, test-disabled-instead-of-fixed
  6. Explain architecture

    Produce a developer-facing summary of how the repo is wired up, accurate enough that a new hire could navigate it.

    Expected outcome

    Accurate module map, correct call paths, no fabricated files or modules.

    Failure tags watched
    fabricated-file, wrong-call-path, marketing-tone
  7. Make a safe migration

    Write a database migration with backfill and rollback. Must not lock the table for an unbounded time.

    Expected outcome

    Migration is reversible, backfill is batched, no obvious lock-amplification.

    Failure tags watched
    unbounded-lock, no-rollback, destructive-default
  8. Handle ambiguous requirements

    Underspecified ticket. Agent must either ask the right clarifying question or document its assumptions before coding.

    Expected outcome

    Either a precise clarifying question or an explicit assumptions block. No silent guesswork.

    Failure tags watched
    silent-assumption, wrong-default, scope-creep
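
Task 3 expects tests that fail when the implementation is mutated. The following is a minimal sketch of that check, not the suite's actual harness. Every name in it (`clamp`, `mutated_clamp`, `import_only_test`, `behavioral_test`, `kills_mutant`) is a hypothetical stand-in, and a real check would more likely run a mutation-testing tool over the whole module.

```python
# Hypothetical stand-ins: none of these names come from the benchmark itself.

def clamp(value, low, high):
    """Reference implementation under test."""
    return max(low, min(value, high))


def mutated_clamp(value, low, high):
    """Mutant: the upper bound is no longer enforced."""
    return max(low, value)


def import_only_test(impl):
    """Weak test: only checks that the function is callable."""
    assert callable(impl)


def behavioral_test(impl):
    """Stronger test: pins down behavior inside and beyond both bounds."""
    assert impl(5, 0, 10) == 5
    assert impl(-3, 0, 10) == 0
    assert impl(42, 0, 10) == 10


def passes(test, impl):
    """Return True if `test` raises no AssertionError against `impl`."""
    try:
        test(impl)
    except AssertionError:
        return False
    return True


def kills_mutant(test, original, mutant):
    """A test kills a mutant if it passes on the original and fails on the mutant."""
    return passes(test, original) and not passes(test, mutant)


if __name__ == "__main__":
    for test in (import_only_test, behavioral_test):
        print(test.__name__, "kills mutant:", kills_mutant(test, clamp, mutated_clamp))
```

Under these assumptions the import-only test passes against both versions, which is the pattern the `mutation-blind` and `import-only-test` tags describe, while the behavioral test passes on the original and fails on the mutant.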
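
Task 7 asks for a batched backfill that does not hold a lock for an unbounded time. The sketch below shows one possible shape for it, using Python's built-in `sqlite3` module as a stand-in for whatever database the task repository actually uses. The `users` table, the `email` and `email_lower` columns, and the batch size are all assumptions made for illustration. SQLite's locking differs from that of a server database, so this only illustrates the pattern of committing per batch. Rollback is left out of the sketch.

```python
import sqlite3

BATCH_SIZE = 2  # deliberately tiny so the example runs several batches


def migrate_up(conn):
    """Add the new column as nullable so the schema change rewrites no rows."""
    conn.execute("ALTER TABLE users ADD COLUMN email_lower TEXT")
    conn.commit()


def backfill(conn, batch_size=BATCH_SIZE):
    """Fill email_lower in primary-key order, committing after every batch.

    Returns the number of batches that were processed.
    """
    last_id = 0
    batches = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE id > ? AND email_lower IS NULL "
            "ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return batches
        ids = [row[0] for row in rows]
        conn.execute(
            "UPDATE users SET email_lower = lower(email) "
            "WHERE id BETWEEN ? AND ? AND email_lower IS NULL",
            (ids[0], ids[-1]),
        )
        conn.commit()  # end the transaction so each batch holds its lock briefly
        last_id = ids[-1]  # always advance, even if a row stayed NULL
        batches += 1


def demo():
    """Build a throwaway in-memory table, migrate it, and report the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.executemany(
        "INSERT INTO users (email) VALUES (?)",
        [("A@Example.com",), ("b@example.COM",), ("C@EXAMPLE.COM",),
         ("d@example.com",), ("E@example.com",)],
    )
    conn.commit()
    migrate_up(conn)
    batches = backfill(conn)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM users WHERE email_lower IS NULL"
    ).fetchone()[0]
    conn.close()
    return batches, remaining


if __name__ == "__main__":
    print(demo())
```

Walking the primary key with `id > last_id` keeps each batch bounded and guarantees progress even when a row cannot be filled, which is the property the `unbounded-lock` tag is meant to catch.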