Methodology

How we score agents

Every Verdict Score is the sum of eight sub-scores, each capped at a published maximum. No vibes, no hand-waving. If we can't show the breakdown, we don't show the number.

The eight axes

Axis                      What it measures                                        Max points
Task Completion           Can it actually finish the job end-to-end?              25
Accuracy                  Is the result correct and useful?                       20
Autonomy                  Can it work without constant babysitting?               15
Reliability               Consistent results across repeated attempts?            15
Speed                     How fast does it complete the task?                     5
Cost Efficiency           Outcome quality per dollar spent.                       5
Safety / Control          Avoids unsafe actions, hidden changes, hallucinations.  10
UX / Operator Experience  Easy to set up, understand, and supervise.              5
Total                     Verdict Score                                           100
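
To make the arithmetic concrete, here is a minimal sketch of how a Verdict Score could be totalled from the eight sub-scores. The axis names and maximums come from the table; the function name, the dictionary layout, and the choice to reject an out-of-range sub-score (rather than clamp it) are assumptions made for illustration, not a description of our actual tooling.

```python
# Published maximum for each axis, copied from the table above.
AXIS_MAX = {
    "Task Completion": 25,
    "Accuracy": 20,
    "Autonomy": 15,
    "Reliability": 15,
    "Speed": 5,
    "Cost Efficiency": 5,
    "Safety / Control": 10,
    "UX / Operator Experience": 5,
}


def verdict_score(sub_scores):
    """Sum the eight sub-scores, rejecting any that fall outside their cap.

    Assumption: an out-of-range value is treated as a data error, not clamped.
    """
    if set(sub_scores) != set(AXIS_MAX):
        raise ValueError("sub_scores must cover exactly the eight axes")
    for axis, points in sub_scores.items():
        if not 0 <= points <= AXIS_MAX[axis]:
            raise ValueError(f"{axis}: {points} is outside 0..{AXIS_MAX[axis]}")
    return sum(sub_scores.values())


# Hypothetical breakdown, for illustration only (not a real agent's result).
example = {
    "Task Completion": 21,
    "Accuracy": 16,
    "Autonomy": 11,
    "Reliability": 12,
    "Speed": 4,
    "Cost Efficiency": 3,
    "Safety / Control": 8,
    "UX / Operator Experience": 4,
}
print(verdict_score(example))  # 79
```

Because the caps add up to exactly 100, a perfect run on every axis is the only way to reach the maximum score.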

Provisional vs Verified scores

Every score on AgentVerdict is one of two tiers. The tier is shown next to the name on every page so the score type is impossible to mistake.

Tier 1 · Provisional
Research-based estimate
  • Built from public reputation, hands-on use, and vendor documentation.
  • Always shown with the Provisional badge and a * next to the score.
  • Not eligible for "Verdict Certified" status.
  • Cannot be cited as final proof in paid reports or comparisons.

Tier 2 · Verified
Backed by a completed suite run
  • Requires every task in at least one suite to have a stored TestResult.
  • Requires evidence notes, cost, time, and date tested per task.
  • Eligible for use in rankings, comparisons, and paid reports.
  • Switches to "Needs retest" when the suite is revised or the agent ships a major version.

The verification rule is enforced at validation time, not by editorial promise. If an agent file sets scoreStatus: "verified" without a completed suite run behind it, validation fails with an error and the file does not ship.
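
The rule can be sketched as a small validation check. This is an illustration under stated assumptions, not the actual validator: only scoreStatus and the idea of a stored TestResult per task come from the text, while the dictionary layout, the field names (evidence_notes, cost, time, date_tested), the function names, and the example suite and agent are hypothetical.

```python
# Fields every stored result must carry (hypothetical key names for
# "evidence notes, cost, time, and date tested").
REQUIRED_FIELDS = ("evidence_notes", "cost", "time", "date_tested")


def suite_is_complete(task_ids, results):
    """True if every task in the suite has a stored result with all required fields."""
    if not task_ids:
        return False
    for task_id in task_ids:
        result = results.get(task_id)
        if result is None:
            return False
        if any(result.get(name) in (None, "") for name in REQUIRED_FIELDS):
            return False
    return True


def validate_agent(agent, suites):
    """Raise ValueError if an agent claims 'verified' without one completed suite."""
    if agent["scoreStatus"] != "verified":
        return  # provisional scores need no suite run
    results = agent.get("results", {})
    for suite_id, task_ids in suites.items():
        if suite_is_complete(task_ids, results.get(suite_id, {})):
            return
    raise ValueError(
        f"{agent['name']}: scoreStatus is 'verified' but no suite run is complete"
    )


# Hypothetical data: a two-task suite where only one task has a stored result.
suites = {"hypothetical-suite": ["task-1", "task-2"]}
stored = {"evidence_notes": "log attached", "cost": 0.42, "time": 95, "date_tested": "2025-01-15"}
agent = {
    "name": "ExampleAgent",
    "scoreStatus": "verified",
    "results": {"hypothetical-suite": {"task-1": dict(stored)}},
}

try:
    validate_agent(agent, suites)
except ValueError as err:
    print(err)  # rejected: task-2 has no stored result
```

The point of failing loudly is that a "verified" label can never be ahead of the evidence: the only way to pass the check is to store a result for every task in at least one suite.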

Verdict tiers

90–100
Elite

Trustworthy across the suite. Use without hesitation in scope.

80–89
Strong

Reliable for most tasks. Verify on edge cases.

70–79
Useful but limited

Good for narrow use cases. Watch for failure modes.

60–69
Risky / inconsistent

Inconsistent results. Supervise closely or avoid for high-stakes work.

0–59
Not trusted yet

Fails too often or too dangerously to recommend right now.
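
As a quick illustration, the bands above can be expressed as a lookup from score to verdict label. The labels and lower bounds come from the list; the function name and the handling of fractional scores (a score belongs to the highest band whose floor it reaches) are assumptions.

```python
# (lower bound, label) pairs, highest band first.
VERDICT_TIERS = [
    (90, "Elite"),
    (80, "Strong"),
    (70, "Useful but limited"),
    (60, "Risky / inconsistent"),
    (0, "Not trusted yet"),
]


def verdict_tier(score):
    """Return the verdict label for a score between 0 and 100 inclusive."""
    if not 0 <= score <= 100:
        raise ValueError(f"score {score} is outside 0..100")
    for floor, label in VERDICT_TIERS:
        if score >= floor:
            return label


print(verdict_tier(79))  # Useful but limited
print(verdict_tier(90))  # Elite
```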

How tasks are designed

Each benchmark suite is a list of tasks. Every task has a description, an expected outcome, and a list of failure tags to watch for. Tasks are published before any agent is scored against them, so the scoring conditions are inspectable upfront.
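
One way to picture that structure is as a pair of small record types. This is a sketch only: the three task fields mirror the ones named above, while the class names, the published_on field, the helper function, and the sample task are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkTask:
    description: str
    expected_outcome: str
    failure_tags: tuple = ()  # failure modes to watch for


@dataclass(frozen=True)
class BenchmarkSuite:
    name: str
    published_on: date  # hypothetical field: when the task list went public
    tasks: tuple = ()


def published_before_scoring(suite, date_tested):
    """True if the suite was public strictly before the agent was tested.

    Assumption: same-day publication does not count as "before".
    """
    return suite.published_on < date_tested


# Hypothetical suite with a single made-up task.
suite = BenchmarkSuite(
    name="hypothetical-suite",
    published_on=date(2025, 1, 10),
    tasks=(
        BenchmarkTask(
            description="Rename a function across a small repository",
            expected_outcome="All call sites updated and tests still pass",
            failure_tags=("hidden-change", "incomplete-edit"),
        ),
    ),
)
print(published_before_scoring(suite, date(2025, 2, 1)))  # True
```

Freezing the records reflects the intent of publishing tasks upfront: once a suite is public, a change to it should produce a new revision rather than a silent edit.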

Limitations

Agents change weekly. A score is a snapshot of the version we tested on the date listed. We retest the leaderboard quarterly, and again whenever a vendor announces a major version.
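
The retest triggers can be sketched as a simple check. This is an assumption-heavy illustration: it treats "quarterly" as 91 days, assumes version strings begin with a numeric major component (such as "2.4.1" or "v3.0"), and uses hypothetical function names. Vendors that do not number releases this way would need a different rule.

```python
from datetime import date, timedelta

# Assumption: "quarterly" approximated as 13 weeks.
RETEST_INTERVAL = timedelta(days=91)


def major_version(version):
    """Leading numeric component of a version string such as '2.4.1' or 'v3.0'."""
    return int(version.lstrip("vV").split(".")[0])


def needs_retest(tested_version, current_version, date_tested, today):
    """True if a major version has shipped or the last test is a quarter old."""
    if major_version(current_version) > major_version(tested_version):
        return True
    return today - date_tested >= RETEST_INTERVAL


# Hypothetical versions and dates.
print(needs_retest("1.9.0", "2.0.0", date(2025, 1, 1), date(2025, 1, 2)))  # True
print(needs_retest("1.2.0", "1.3.0", date(2025, 1, 1), date(2025, 2, 1)))  # False
```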

Independence

Sponsors can pay to be tested. They cannot pay to influence the verdict. If a sponsor's agent scores poorly, we publish the score. Affiliate relationships are disclosed inline on every profile that has one.

Affiliate / sponsor policy

Affiliate links are tagged rel="sponsored" and labelled in the UI. Sponsored testing is labelled with the Sponsored test flag. Promoted listings are labelled Sponsored. None of these change the Verdict Score.