How We Test AI Agents at AgentVerdict

The exact methodology behind every Verdict Score — eight axes, six benchmark suites, and a hard wall between sponsorship and verdict.

This is the long version of our methodology page. It exists so you can poke holes in our process before trusting any score on this site.

The score, in one paragraph

Every Verdict Score is the sum of eight sub-scores, each capped at the maximum shown in parentheses: task completion (25), accuracy (20), autonomy (15), reliability (15), safety / control (10), speed (5), cost efficiency (5), and UX (5), for a total of 100. The breakdown is published on every agent profile. If we can't show the sub-scores, we don't publish a number.
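The arithmetic is simple enough to sketch in a few lines of Python. The axis maximums come straight from the paragraph above. The names `MAX_POINTS` and `verdict_score`, and the decision to raise an error on a missing or out-of-range sub-score, are illustrative assumptions, not a description of our actual scoring pipeline.

```python
# Maximum points per axis, as listed in the paragraph above.
MAX_POINTS = {
    "task_completion": 25,
    "accuracy": 20,
    "autonomy": 15,
    "reliability": 15,
    "safety_control": 10,
    "speed": 5,
    "cost_efficiency": 5,
    "ux": 5,
}

# The eight maximums must add up to the 100-point scale.
assert sum(MAX_POINTS.values()) == 100


def verdict_score(sub_scores):
    """Sum the eight sub-scores (hypothetical helper).

    Refuses to return a number when the breakdown is incomplete,
    mirroring the "no sub-scores, no public number" rule.
    """
    missing = set(MAX_POINTS) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    unknown = set(sub_scores) - set(MAX_POINTS)
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    for axis, cap in MAX_POINTS.items():
        if not 0 <= sub_scores[axis] <= cap:
            raise ValueError(f"{axis} must be between 0 and {cap}")
    return sum(sub_scores[axis] for axis in MAX_POINTS)


# Made-up breakdown for illustration only; not a real agent's result.
example = {
    "task_completion": 20, "accuracy": 16, "autonomy": 12, "reliability": 11,
    "safety_control": 8, "speed": 4, "cost_efficiency": 3, "ux": 4,
}
print(verdict_score(example))  # 78
```

Because the weights live in one table, the 100-point total can be asserted once instead of trusted.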

The benchmark suites

We run agents against six suites of documented tasks. Tasks for each suite are listed on the benchmarks page before any agent is scored against them, so the test conditions are inspectable upfront.

Coding Agent Suite — eight tasks against real repositories.
Browser Agent Suite — seven tasks against real public websites.
Business Automation Suite — seven workflow design and execution tasks.
Research Agent Suite — six tasks where citations get checked.
Content Agent Suite — five voice-preservation and repurposing tasks.
Safety / Control Suite — six cross-cutting tasks every agent runs, regardless of category.
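As a rough sketch of how the list above translates into a test plan, the snippet below pairs an agent's category suite with the cross-cutting Safety / Control Suite. The task counts are taken from the list. The identifiers (`SUITE_TASKS`, `suites_for`, `task_count`) and the assumption that each agent belongs to exactly one category are hypothetical simplifications.

```python
# Task counts per suite, as listed above.
SUITE_TASKS = {
    "coding": 8,
    "browser": 7,
    "business_automation": 7,
    "research": 6,
    "content": 5,
    "safety_control": 6,
}

# The one suite every agent runs, regardless of category.
CROSS_CUTTING = "safety_control"


def suites_for(category):
    """Return the suites an agent in `category` runs (assumes one category per agent)."""
    if category == CROSS_CUTTING or category not in SUITE_TASKS:
        raise ValueError(f"unknown agent category: {category!r}")
    return [category, CROSS_CUTTING]


def task_count(category):
    """Total number of tasks an agent in `category` faces in one full run."""
    return sum(SUITE_TASKS[suite] for suite in suites_for(category))


print(suites_for("coding"))  # ['coding', 'safety_control']
print(task_count("coding"))  # 14
```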

How a score gets locked in

A score is not a one-shot judgement. The minimum to publish a non-placeholder verdict is:

Three independent runs of the relevant suite.
Failure tags assigned per task, not just pass/fail.
Cost and time recorded per attempt.
A second reviewer sanity-checks the breakdown before publication.

If any of these is missing, we publish a placeholder and label it as such. Today, every score in the directory is a placeholder.
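The checklist above behaves like a gate: everything passes, or the score stays a placeholder. Here is a minimal sketch of that gate, assuming a hypothetical `Attempt` record per task. The field names, the unitless `cost` and `duration` values, and the reading that a failure tag is required only on failed tasks are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_RUNS = 3  # "three independent runs" from the checklist above


@dataclass
class Attempt:
    """One task attempt inside one suite run (hypothetical record shape)."""
    task_id: str
    passed: bool
    failure_tag: Optional[str] = None  # assumed to be required when passed is False
    cost: Optional[float] = None       # unit left unspecified on purpose
    duration: Optional[float] = None   # unit left unspecified on purpose


def attempt_is_complete(attempt):
    """Cost and time must be recorded; a failed task must carry a failure tag."""
    if attempt.cost is None or attempt.duration is None:
        return False
    if not attempt.passed and not attempt.failure_tag:
        return False
    return True


def publication_status(runs: List[List[Attempt]], reviewed: bool) -> str:
    """Return "verdict" only when every minimum is met, else "placeholder"."""
    if len(runs) < MIN_RUNS or not reviewed:
        return "placeholder"
    if any(len(run) == 0 for run in runs):
        return "placeholder"
    if all(attempt_is_complete(a) for run in runs for a in run):
        return "verdict"
    return "placeholder"
```

The function returns a label and never a score, so a half-documented run cannot turn into a public number.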

What we won't do

We won't grade an agent purely from its marketing site.
We won't accept "trust us" as evidence — every score links to the run that produced it.
We won't let a sponsor influence a score. Sponsored testing is allowed; sponsored verdicts are not.
We won't pretend a score from six months ago still applies. Agents regress and improve. Old scores carry a "Needs retest" flag until they're rerun.
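The staleness rule in the last item can be sketched as a date comparison. The six-month figure comes from the text, but treating it as a hard cutoff of 183 days is an assumption, as are the names `RETEST_AFTER` and `needs_retest`.

```python
from datetime import date, timedelta

# Assumption: "six months" approximated as 183 days and used as a hard cutoff.
RETEST_AFTER = timedelta(days=183)


def needs_retest(last_run: date, today: date) -> bool:
    """True when the score's last run is old enough to carry the retest flag."""
    return today - last_run >= RETEST_AFTER


print(needs_retest(date(2024, 1, 1), date(2024, 3, 1)))   # False (60 days)
print(needs_retest(date(2024, 1, 1), date(2024, 7, 15)))  # True (196 days)
```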

What you should poke at

Read the methodology page for the canonical scoring rubric and the verdict tier table. Read the individual benchmark suites for the exact tasks. Then look at any agent profile and ask: does the breakdown match what I'd expect from my own experience? If not, tell us where we're wrong.

The whole point of publishing the receipts is so we can be argued with.