Methodology

How we score agents

Every Verdict Score is the sum of eight sub-scores, each capped at a published maximum. No vibes, no hand-waving. If we can't show the breakdown, we don't show the number.

The eight axes

Axis	What it measures	Max points
Task Completion	Can it actually finish the job end-to-end?	25
Accuracy	Is the result correct and useful?	20
Autonomy	Can it work without constant babysitting?	15
Reliability	Consistent results across repeated attempts?	15
Speed	How fast does it complete the task?	5
Cost Efficiency	Outcome quality per dollar spent.	5
Safety / Control	Avoids unsafe actions, hidden changes, hallucinations.	10
UX / Operator Experience	Easy to set up, understand, and supervise.	5
Total	Verdict Score	100
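
As a minimal sketch of the arithmetic, the snippet below sums eight sub-scores against the published maximums from the table. The names AXIS_MAX and verdict_score are illustrative and not taken from the AgentVerdict codebase. Clamping an out-of-range sub-score to its cap (rather than rejecting it) is an assumption, and the example sub-scores are invented for illustration.

```python
# Published maximums, copied from the table above.
AXIS_MAX = {
    "Task Completion": 25,
    "Accuracy": 20,
    "Autonomy": 15,
    "Reliability": 15,
    "Speed": 5,
    "Cost Efficiency": 5,
    "Safety / Control": 10,
    "UX / Operator Experience": 5,
}


def verdict_score(sub_scores: dict) -> float:
    """Sum the eight sub-scores, clamping each to [0, axis maximum].

    Clamping is an assumption; a stricter version could raise instead.
    """
    missing = AXIS_MAX.keys() - sub_scores.keys()
    unknown = sub_scores.keys() - AXIS_MAX.keys()
    if missing or unknown:
        raise ValueError(f"missing axes: {sorted(missing)}, unknown axes: {sorted(unknown)}")
    return sum(min(max(sub_scores[axis], 0), cap) for axis, cap in AXIS_MAX.items())


# Hypothetical sub-scores for one agent (not real results).
example = {
    "Task Completion": 22,
    "Accuracy": 17,
    "Autonomy": 12,
    "Reliability": 13,
    "Speed": 4,
    "Cost Efficiency": 3,
    "Safety / Control": 8,
    "UX / Operator Experience": 4,
}
print(verdict_score(example))  # 83
```

Requiring all eight axes to be present keeps a partial breakdown from producing a number that looks like a full Verdict Score.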

Provisional vs Verified scores

Every score on AgentVerdict falls into one of two tiers, Provisional or Verified. The tier is shown next to the name on every page, so the score type is impossible to mistake.

Tier 1 · Provisional

Research-based estimate

Built from public reputation, hands-on use, and vendor documentation.
Always shown with the Provisional badge and a * next to the score.
Not eligible for "Verdict Certified" status.
Cannot be cited as final proof in paid reports or comparisons.

Tier 2 · Verified

Backed by a completed suite run

Requires every task in at least one suite to have a stored TestResult.
Requires evidence notes, cost, time, and date tested per task.
Eligible for use in rankings, comparisons, and paid reports.
Switches to "Needs retest" when the suite is revised or the agent ships a major version.

The verification rule is enforced at validation time, not by editorial promise. Setting scoreStatus: "verified" on an agent file without a completed suite run raises a validation error, and the file does not ship.
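
A check of this kind might look roughly like the sketch below. Only scoreStatus and TestResult are names from the text above. The field names, the shape of the agent file, the suites mapping, and the functions is_complete, has_completed_suite, and validate_agent are assumptions made for illustration, not the actual validator.

```python
from dataclasses import dataclass


@dataclass
class TestResult:
    # Field names are hypothetical; the source lists only what must be recorded.
    task_id: str
    evidence_notes: str
    cost_usd: float
    time_seconds: float
    date_tested: str  # ISO date, e.g. "2025-01-15"


def is_complete(result: TestResult) -> bool:
    """A result counts only if evidence notes, cost, time, and date are all present."""
    return (
        bool(result.evidence_notes.strip())
        and bool(result.date_tested)
        and result.cost_usd >= 0
        and result.time_seconds >= 0
    )


def has_completed_suite(results: list, suites: dict) -> bool:
    """True if every task in at least one suite has a complete stored result."""
    done = {r.task_id for r in results if is_complete(r)}
    return any(len(tasks) > 0 and set(tasks) <= done for tasks in suites.values())


def validate_agent(agent: dict, suites: dict) -> None:
    """Raise ValueError if an agent claims 'verified' without a completed suite."""
    if agent.get("scoreStatus") != "verified":
        return  # provisional scores need no suite run
    if not has_completed_suite(agent.get("results", []), suites):
        raise ValueError('scoreStatus "verified" requires at least one completed suite run')
```

Running a check like this in the build step is what turns the rule into something enforced rather than promised: a file that fails validation never reaches the site.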

Verdict tiers

90–100

Elite

Trustworthy across the suite. Use without hesitation for in-scope tasks.

80–89

Strong

Reliable for most tasks. Verify on edge cases.

70–79

Useful but limited

Good for narrow use cases. Watch for failure modes.

60–69

Risky / inconsistent

Inconsistent results. Supervise closely or avoid for high-stakes work.

0–59

Not trusted yet

Fails too often or too dangerously to recommend right now.
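
The bands above amount to a simple lookup. The sketch below is illustrative only: TIERS and verdict_tier are hypothetical names, and treating a fractional score such as 89.5 as belonging to the lower band is an assumption, since the published ranges are given as whole numbers.

```python
# (lower bound, label) pairs, highest band first, copied from the bands above.
TIERS = [
    (90, "Elite"),
    (80, "Strong"),
    (70, "Useful but limited"),
    (60, "Risky / inconsistent"),
    (0, "Not trusted yet"),
]


def verdict_tier(score: float) -> str:
    """Return the verdict tier label for a score between 0 and 100."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    for floor, label in TIERS:
        if score >= floor:
            return label
    raise AssertionError("unreachable: the lowest band starts at 0")


print(verdict_tier(83))  # Strong
```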

How tasks are designed

Each benchmark suite is a list of tasks. Every task has a description, an expected outcome, and a list of failure tags to watch for. Tasks are published before any agent is scored against them, so the scoring conditions are inspectable upfront.
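
One way to picture that structure is the sketch below. The class names, field names, the watched_failures helper, and the sample task are all hypothetical; the source specifies only that a task carries a description, an expected outcome, and failure tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    expected_outcome: str
    failure_tags: tuple = ()  # failure modes to watch for on this task


@dataclass(frozen=True)
class Suite:
    name: str
    tasks: tuple  # tuple of Task, fixed before any agent is scored


def watched_failures(task: Task, observed_tags: set) -> list:
    """Return the watch-listed failure tags that actually occurred in a run."""
    return sorted(set(task.failure_tags) & set(observed_tags))


# Invented example task, for illustration only.
rename_task = Task(
    description="Rename a function across a small repository",
    expected_outcome="All call sites updated and the test suite passes",
    failure_tags=("hidden-change", "hallucinated-api"),
)
suite = Suite(name="example-suite", tasks=(rename_task,))

print(watched_failures(rename_task, {"hidden-change", "slow"}))  # ['hidden-change']
```

Making the records immutable mirrors the publish-first rule: once a suite is public, its tasks should not change underneath an existing score.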

Limitations

Agents change weekly. A score is a snapshot of the version we tested on the date listed. We retest the leaderboard quarterly and whenever a vendor announces a major version.
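
The retest policy can be expressed as a small staleness check. This is a sketch under stated assumptions: "quarterly" is approximated as 91 days, version strings are assumed to be semver-like with the major number first, and the names RETEST_INTERVAL, major_version, and needs_retest are invented for illustration.

```python
from datetime import date, timedelta

# Assumption: "quarterly" approximated as 91 days.
RETEST_INTERVAL = timedelta(days=91)


def major_version(version: str) -> int:
    """Extract the major number from a semver-like string such as 'v2.1.0'."""
    return int(version.lstrip("vV").split(".")[0])


def needs_retest(tested_on: date, tested_version: str,
                 current_version: str, today: date) -> bool:
    """True if a major version shipped since testing or the score is over a quarter old."""
    if major_version(current_version) > major_version(tested_version):
        return True
    return today - tested_on > RETEST_INTERVAL


print(needs_retest(date(2025, 1, 1), "1.4.0", "2.0.0", date(2025, 1, 15)))  # True
```

Vendors that do not use semver-like versioning would need a different signal for what counts as a major release.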

Independence

Sponsors can pay to be tested. They cannot pay to influence the verdict. If a sponsor's agent scores poorly, we publish the score. Affiliate relationships are disclosed inline on every profile that has one.

Affiliate / sponsor policy

Affiliate links are tagged rel="sponsored" and labelled in the UI. Sponsored testing is labelled with the Sponsored test flag. Promoted listings are labelled Sponsored. None of these change the Verdict Score.