Benchmarks

Open test suites

Every task is published before any agent is scored against it. Each suite specifies an expected outcome and a list of failure tags that we watch for during runs.
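As a rough illustration of how a suite's expected outcome and failure tags might be represented, here is a minimal sketch. The `Suite` class, its field names, the sample tag names, and the pass rule (outcome matches and no watched tag appears) are all assumptions made for this example and are not taken from the benchmark's actual format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Suite:
    """Hypothetical record for one published test suite."""
    name: str
    expected_outcome: str
    failure_tags: tuple[str, ...] = ()


def observed_failure_tags(suite: Suite, run_tags: set[str]) -> list[str]:
    """Return the watched failure tags that showed up in a run, in suite order."""
    return [tag for tag in suite.failure_tags if tag in run_tags]


def passed(suite: Suite, outcome: str, run_tags: set[str]) -> bool:
    """Assumed rule: a run passes if the outcome matches and no watched tag fired."""
    return outcome == suite.expected_outcome and not observed_failure_tags(suite, run_tags)


# Example with made-up values: one watched tag appears, so the run does not pass.
suite = Suite("file-rename", "all files renamed", ("timeout", "wrong-path"))
print(observed_failure_tags(suite, {"wrong-path", "retry"}))
print(passed(suite, "all files renamed", {"wrong-path", "retry"}))
```

Tags that appear in a run but are not on the suite's watch list (such as `retry` here) are ignored by this sketch; a real harness might choose to record them separately.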