Coding Agent Suite
Eight tasks designed to test whether a coding agent can actually land working changes inside a non-trivial repository, not just complete code in isolation.
- 01
Fix a real bug in a small repo
Repository with a known regression. Agent must locate the cause and ship a passing fix without breaking other tests.
Expected outcome: Failing test now passes, no other tests regress, diff is minimal and explainable.
Failure tags watched: off-target-edit, broken-other-tests, vibes-only-fix
- 02
Add a feature across multiple files
Add a small, well-specified feature touching at least three files (model, route, view). Tests must still pass.
Expected outcome: Feature works as specified, integration touches the right files, no half-finished placeholders.
Failure tags watched: partial-implementation, missing-wiring, stub-leftover
- 03
Write tests for existing code
Add unit tests for an existing module with no test coverage. Tests must actually exercise the behavior, not just import.
Expected outcome: Tests cover the public API meaningfully, fail when the implementation is mutated.
Failure tags watched: mutation-blind, import-only-test, trivially-true-assert
- 04
Refactor without changing behavior
Refactor a tangled function while keeping all tests green and the public API intact.
Expected outcome: Cleaner code, identical observable behavior, all existing tests still pass.
Failure tags watched: api-drift, silent-behavior-change, test-edits-to-pass
- 05
Debug failing tests
A small set of failing tests of mixed cause: real bug, flaky test, environment issue. Agent must diagnose each correctly.
Expected outcome: Each failure correctly classified and either fixed or labeled with the right root cause.
Failure tags watched: wrong-root-cause, test-disabled-instead-of-fixed
- 06
Explain architecture
Produce a developer-facing summary of how the repo is wired up, accurate enough that a new hire could navigate it.
Expected outcome: Accurate module map, correct call paths, no fabricated files or modules.
Failure tags watched: fabricated-file, wrong-call-path, marketing-tone
- 07
Make a safe migration
Write a database migration with backfill and rollback. Must not lock the table for an unbounded time.
Expected outcome: Migration is reversible, backfill is batched, no obvious lock-amplification.
Failure tags watched: unbounded-lock, no-rollback, destructive-default
- 08
Handle ambiguous requirements
Underspecified ticket. Agent must either ask the right clarifying question or document its assumptions before coding.
Expected outcome: Either a precise clarifying question or an explicit assumptions block. No silent guesswork.
Failure tags watched: silent-assumption, wrong-default, scope-creep
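
To make the expected outcome of task 07 concrete, here is a minimal sketch of a reversible migration with a batched backfill. It is an illustration only, not the suite's reference solution: the `users` table, the `email_lower` column, the batch size, and the function names `upgrade`, `backfill`, and `downgrade` are all hypothetical, and SQLite stands in for whatever database the task repository actually uses.

```python
import sqlite3

BATCH_SIZE = 1000  # hypothetical value; tune so each transaction stays short


def upgrade(conn):
    """Add the new nullable column. No existing rows are rewritten here."""
    conn.execute("ALTER TABLE users ADD COLUMN email_lower TEXT")
    conn.commit()


def backfill(conn, batch_size=BATCH_SIZE):
    """Fill email_lower in primary-key order, one short transaction per batch.

    Returns the number of rows updated. Rows already filled are skipped,
    so the function can be re-run after an interruption.
    """
    last_id = 0
    total = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return total
        first_id, last_id = rows[0][0], rows[-1][0]
        cur = conn.execute(
            "UPDATE users SET email_lower = lower(email) "
            "WHERE id BETWEEN ? AND ? AND email_lower IS NULL",
            (first_id, last_id),
        )
        conn.commit()  # release locks before starting the next batch
        total += cur.rowcount


def downgrade(conn):
    """Roll back by dropping the column (assumes SQLite 3.35 or newer)."""
    conn.execute("ALTER TABLE users DROP COLUMN email_lower")
    conn.commit()
```

The point of the sketch is that each batch commits on its own, so no single statement touches an unbounded number of rows, and `downgrade` gives an explicit way back. A single `UPDATE users SET ...` over the whole table is the kind of change the unbounded-lock tag is meant to catch.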