How we score agents
Every Verdict Score is the sum of eight sub-scores, each capped at a published maximum. No vibes, no hand-waving. If we can't show the breakdown, we don't show the number.
The eight axes
| Axis | What it measures | Max points |
|---|---|---|
| Task Completion | Can it actually finish the job end-to-end? | 25 |
| Accuracy | Is the result correct and useful? | 20 |
| Autonomy | Can it work without constant babysitting? | 15 |
| Reliability | Consistent results across repeated attempts? | 15 |
| Speed | How fast does it complete the task? | 5 |
| Cost Efficiency | Outcome quality per dollar spent. | 5 |
| Safety / Control | Avoids unsafe actions, hidden changes, hallucinations. | 10 |
| UX / Operator Experience | Easy to set up, understand, and supervise. | 5 |
| Total | Verdict Score | 100 |
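As a rough illustration of the arithmetic, the sketch below sums eight sub-scores against the maxima in the table. The identifiers (`AXIS_MAX`, `verdictScore`, the axis keys) are hypothetical and not taken from the AgentVerdict codebase. Rejecting an out-of-range value, rather than silently clamping it, is an assumption about how a cap might be enforced.

```typescript
// Axis maxima copied from the table above. All identifiers are hypothetical.
const AXIS_MAX = {
  taskCompletion: 25,
  accuracy: 20,
  autonomy: 15,
  reliability: 15,
  speed: 5,
  costEfficiency: 5,
  safetyControl: 10,
  operatorExperience: 5,
} as const;

type Axis = keyof typeof AXIS_MAX;
type SubScores = Record<Axis, number>;

// Sum the eight sub-scores, rejecting any value outside [0, max] for its axis.
// Assumption: an over-cap value is treated as an error, not clamped.
function verdictScore(subScores: SubScores): number {
  let total = 0;
  for (const axis of Object.keys(AXIS_MAX) as Axis[]) {
    const value = subScores[axis];
    const max = AXIS_MAX[axis];
    if (!isFinite(value) || value < 0 || value > max) {
      throw new RangeError(`${axis} must be between 0 and ${max}, got ${value}`);
    }
    total += value;
  }
  return total;
}

// Made-up sub-scores, for illustration only.
const exampleScores: SubScores = {
  taskCompletion: 21,
  accuracy: 16,
  autonomy: 11,
  reliability: 12,
  speed: 4,
  costEfficiency: 3,
  safetyControl: 8,
  operatorExperience: 4,
};

console.log(verdictScore(exampleScores)); // 79
```

Because the maxima add up to 100, a perfect result on every axis yields a Verdict Score of exactly 100.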
Provisional vs Verified scores
Every score on AgentVerdict is one of two tiers. The tier is shown next to the name on every page so the score type is impossible to mistake.
Provisional
- Built from public reputation, hands-on use, and vendor documentation.
- Always shown with the Provisional badge and a * next to the score.
- Not eligible for "Verdict Certified" status.
- Cannot be cited as final proof in paid reports or comparisons.
Verified
- Requires every task in at least one suite to have a stored TestResult.
- Requires evidence notes, cost, time, and date tested for each task.
- Eligible for use in rankings, comparisons, and paid reports.
- Switches to "Needs retest" when the suite is revised or the agent ships a major version.
The verification rule is enforced at validation time, not by editorial promise. Setting scoreStatus: "verified" on an agent file that lacks a completed suite makes validation throw an error, and the file does not ship.
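A check of this kind could look roughly like the sketch below. Only `scoreStatus` and `TestResult` come from the text above. Every other type, field name, and function (`AgentFile`, `Suite`, `costUsd`, `validateAgent`, and so on) is a hypothetical stand-in, and the real validator may be structured quite differently.

```typescript
// Hypothetical shapes. Only scoreStatus and TestResult are named in the text.
interface TestResult {
  taskId: string;
  evidenceNotes: string;
  costUsd: number;
  durationSeconds: number;
  dateTested: string; // ISO date, e.g. "2025-01-15"
}

interface Suite {
  id: string;
  taskIds: string[];
}

interface AgentFile {
  name: string;
  scoreStatus: "provisional" | "verified";
  results: TestResult[];
}

// A result counts only if evidence notes, cost, time, and date are all present.
function isCompleteResult(r: TestResult): boolean {
  return (
    r.evidenceNotes.trim().length > 0 &&
    isFinite(r.costUsd) &&
    isFinite(r.durationSeconds) &&
    !isNaN(Date.parse(r.dateTested))
  );
}

// True when every task in the suite has a complete stored result.
function suiteIsComplete(suite: Suite, results: TestResult[]): boolean {
  return (
    suite.taskIds.length > 0 &&
    suite.taskIds.every((id) =>
      results.some((r) => r.taskId === id && isCompleteResult(r))
    )
  );
}

// Throws if an agent claims "verified" without at least one completed suite.
function validateAgent(agent: AgentFile, suites: Suite[]): void {
  if (agent.scoreStatus !== "verified") return;
  if (!suites.some((s) => suiteIsComplete(s, agent.results))) {
    throw new Error(
      `${agent.name}: scoreStatus "verified" requires a completed suite`
    );
  }
}
```

Running a function like `validateAgent` over every agent file as part of the build is one way to make the rule mechanical: a file that claims verified status without the evidence fails before anything is published.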
Verdict tiers
The five tiers, from strongest to weakest:
- Trustworthy across the suite. Use without hesitation in scope.
- Reliable for most tasks. Verify on edge cases.
- Good for narrow use cases. Watch for failure modes.
- Inconsistent results. Supervise closely or avoid for high-stakes work.
- Fails too often or too dangerously to recommend right now.
How tasks are designed
Each benchmark suite is a list of tasks with: a description, an expected outcome, and a list of failure tags to watch for. Tasks are published before any agent is scored against them, so the scoring conditions are inspectable upfront.
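One possible shape for such a suite is sketched below. The interfaces, field names, and sample task are illustrative assumptions, not the site's actual schema. The helper shows how the "published before scoring" condition could be checked from two dates.

```typescript
// Hypothetical shapes. Field names are illustrative, not the site's schema.
interface BenchmarkTask {
  id: string;
  description: string;
  expectedOutcome: string;
  failureTags: string[];
}

interface BenchmarkSuite {
  id: string;
  publishedOn: string; // ISO date the task list became public
  tasks: BenchmarkTask[];
}

// True when the suite was public strictly before the test run took place.
function wasPublishedBeforeScoring(
  suite: BenchmarkSuite,
  dateTested: string
): boolean {
  return Date.parse(suite.publishedOn) < Date.parse(dateTested);
}

// A made-up suite with a single task, for illustration only.
const sampleSuite: BenchmarkSuite = {
  id: "sample-suite",
  publishedOn: "2025-01-10",
  tasks: [
    {
      id: "rename-files",
      description: "Rename every file in a folder to a given pattern.",
      expectedOutcome: "All files renamed, none deleted or overwritten.",
      failureTags: ["data-loss", "partial-completion"],
    },
  ],
};

console.log(wasPublishedBeforeScoring(sampleSuite, "2025-02-01")); // true
```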
Limitations
Agents can change from week to week. A score is a snapshot of the version we tested on the date listed. We retest the leaderboard quarterly and whenever a vendor announces a major version.
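The retest schedule can be expressed as a small predicate. This sketch rests on two assumptions the text does not state: that "quarterly" means three calendar months after the last test date, and that version strings start with a numeric major component (as in "2.4.1"). The function names are hypothetical.

```typescript
// Assumption: versions look like "2.4.1" and the leading number is the major.
function majorOf(version: string): number {
  const major = parseInt(version, 10);
  if (isNaN(major)) {
    throw new Error(`unparseable version: ${version}`);
  }
  return major;
}

// Assumption: "quarterly" means three calendar months after the last test.
function retestDueDate(lastTested: Date): Date {
  const due = new Date(lastTested.getTime());
  due.setUTCMonth(due.getUTCMonth() + 3);
  return due;
}

// True when the quarter has elapsed or the major version has changed.
function needsRetest(
  lastTested: Date,
  testedVersion: string,
  currentVersion: string,
  today: Date
): boolean {
  const majorChanged = majorOf(currentVersion) !== majorOf(testedVersion);
  const quarterElapsed = today.getTime() >= retestDueDate(lastTested).getTime();
  return majorChanged || quarterElapsed;
}

const lastRun = new Date("2025-01-15");
console.log(needsRetest(lastRun, "1.8.0", "1.9.2", new Date("2025-03-01"))); // false
console.log(needsRetest(lastRun, "1.8.0", "2.0.0", new Date("2025-03-01"))); // true
console.log(needsRetest(lastRun, "1.8.0", "1.9.2", new Date("2025-04-15"))); // true
```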
Independence
Sponsors can pay to be tested. They cannot pay to influence the verdict. If a sponsor's agent scores poorly, we publish the score. Affiliate relationships are disclosed inline on every profile that has one.
Affiliate / sponsor policy
Affiliate links are tagged rel="sponsored" and labelled in the UI. Sponsored testing is labelled with the Sponsored test flag. Promoted listings are labelled Sponsored. None of these change the Verdict Score.