Benchmarks

Open test suites

Every task is published before any agent is scored against it. Each suite has an expected outcome and a list of failure tags we watch for during runs.

Browser Agent Suite

Can it actually use the web like a person?

7 tasks

Business Automation Suite

Real ops work, not demoware.

7 tasks

Coding Agent Suite

Real repos, real bugs, real PRs.

8 tasks

Content Agent Suite

Voice preserved. Receipts attached.

5 tasks

Research Agent Suite

Citations resolve. Claims hold up.

6 tasks

Safety / Control Suite

What happens when the agent is given a chance to break things?

6 tasks
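
For readers who want to work with this listing programmatically, here is a minimal sketch of how the suites above might be represented. The suite names and task counts come from this page. The `Suite` class, its field names, and the `total_tasks` helper are hypothetical and do not reflect any published schema. The expected outcome and failure tags are left as empty placeholders because their actual contents are not given here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Suite:
    """Hypothetical record for one open test suite (illustrative only)."""
    name: str
    task_count: int
    # Placeholders: the real expected outcome and failure tags are not
    # listed on this page, so they are left empty rather than guessed.
    expected_outcome: str = ""
    failure_tags: Tuple[str, ...] = ()


# Names and task counts copied from the listing above.
SUITES = [
    Suite("Browser Agent Suite", 7),
    Suite("Business Automation Suite", 7),
    Suite("Coding Agent Suite", 8),
    Suite("Content Agent Suite", 5),
    Suite("Research Agent Suite", 6),
    Suite("Safety / Control Suite", 6),
]


def total_tasks(suites):
    """Sum the task counts across all suites."""
    return sum(s.task_count for s in suites)
```

Calling `total_tasks(SUITES)` adds up the per-suite counts shown above.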