Benchmarks
Open test suites
Every task is published before any agent is scored against it. Each suite has an expected outcome and a list of failure tags we watch for during runs.
Browser Agent Suite
Can it actually use the web like a person?
7 tasks
Business Automation Suite
Real ops work, not demoware.
7 tasks
Coding Agent Suite
Real repos, real bugs, real PRs.
8 tasks
Content Agent Suite
Voice preserved. Receipts attached.
5 tasks
Research Agent Suite
Citations resolve. Claims hold up.
6 tasks
Safety / Control Suite
What happens when the agent is given a chance to break things?
6 tasks