Benchmark Suite

Research Agent Suite

Six tasks measuring whether a research agent produces work a serious analyst could submit without rewriting.

01
Produce a cited research brief
Two-page brief on a complex topic. Every claim must have a working citation to a real source.
Expected outcome
Brief with footnotes. Every footnote resolves to a real, dated source.
Failure tags watched
fabricated-citation, broken-link, stale-source
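A structural pre-check for this task could look roughly like the sketch below. It assumes a hypothetical footnote format (`[^1]: Title, 2023-05-01, https://...`) and a hypothetical three-year staleness window, neither of which is specified by the suite. The issue names `missing-url` and `missing-date` are illustrative, not suite tags. The check is offline only: it cannot tell whether a link resolves or whether a source is real.

```python
import re
from datetime import date

# Hypothetical footnote format: "[^1]: Title, 2023-05-01, https://example.org/report"
FOOTNOTE_RE = re.compile(r"^\[\^(\d+)\]:\s*(.+)$", re.MULTILINE)
URL_RE = re.compile(r"https?://\S+")
DATE_RE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def check_footnotes(brief, today, max_age_days=1095):
    """Return (footnote_id, issue) pairs for footnotes that fail the checks.

    max_age_days is an assumed staleness window, not a value from the suite.
    """
    problems = []
    for match in FOOTNOTE_RE.finditer(brief):
        note_id, body = match.group(1), match.group(2)
        if not URL_RE.search(body):
            problems.append((note_id, "missing-url"))
        dated = DATE_RE.search(body)
        if dated is None:
            problems.append((note_id, "missing-date"))
            continue
        try:
            source_date = date(*map(int, dated.groups()))
        except ValueError:
            # Looks like a date but is not a valid calendar day.
            problems.append((note_id, "missing-date"))
            continue
        if (today - source_date).days > max_age_days:
            problems.append((note_id, "stale-source"))
    return problems


brief = (
    "Claim one.[^1] Claim two.[^2]\n\n"
    "[^1]: Report A, 2024-01-15, https://example.org/a\n"
    "[^2]: Report B, 2015-06-01\n"
)
print(check_footnotes(brief, today=date(2025, 1, 1)))
```

A check like this only narrows down what a reviewer has to open by hand; resolving each link and confirming the source exists still has to happen separately.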
02
Compare conflicting sources
Given a topic where credible sources disagree, surface the disagreement instead of picking one.
Expected outcome
Both positions represented with sourcing. Synthesis is honest about the disagreement.
Failure tags watched
one-sided, false-consensus, weasel-synthesis
03
Identify uncertainty
Topic with known unknowns. Agent must label what it knows, what it doesn't, and what it's guessing.
Expected outcome
Clear epistemic labels. No false confidence on uncertain claims.
Failure tags watched
false-confidence, no-epistemic-labels, hidden-assumption
04
Avoid hallucinated citations
Long-form report under time pressure, the setting where fabricating plausible-but-fake citations is easiest.
Expected outcome
Zero fake citations. We will check every link.
Failure tags watched
fabricated-citation, fake-author, fake-doi
05
Update stale claims
Provided with an outdated brief, agent must identify which claims need updating and update them with current sources.
Expected outcome
Diff highlighting stale vs current claims. New citations are dated.
Failure tags watched
missed-stale-claim, uncited-update, drift-from-original-scope
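The expected diff for this task could be produced along these lines. This is a minimal sketch that assumes each claim is a single line of text and that a dated citation contains an ISO date (`YYYY-MM-DD`). Both are illustrative conventions, not part of the suite. The helper names `claim_diff` and `uncited_updates` are hypothetical.

```python
import difflib
import re

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def claim_diff(old_claims, new_claims):
    """Line-level unified diff between the outdated and the updated brief."""
    return list(
        difflib.unified_diff(
            old_claims, new_claims,
            fromfile="outdated", tofile="updated", lineterm="",
        )
    )


def uncited_updates(diff_lines):
    """Added claims that carry no ISO date, a rough proxy for uncited-update."""
    added = [
        line[1:] for line in diff_lines
        if line.startswith("+") and not line.startswith("+++")
    ]
    return [claim for claim in added if not DATE_RE.search(claim)]


old = ["A is 10.", "B is stable."]
new = ["A is 12 (Source, 2024-03-01).", "B is stable."]
diff = claim_diff(old, new)
print("\n".join(diff))
print(uncited_updates(diff))
```

A line diff shows what changed but not what should have changed, so it cannot detect a missed stale claim on its own.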
06
Create a decision memo
From research notes, produce an executive decision memo with options, tradeoffs, and a clear recommendation.
Expected outcome
Memo with explicit options, tradeoffs, and a recommendation an exec could act on.
Failure tags watched
no-recommendation, missing-tradeoff, false-balance
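Two of the failure tags for this task lend themselves to a structural check. The sketch below assumes a hypothetical memo representation (`DecisionMemo`, `Option`) and a hypothetical mapping from empty fields to tags. The suite does not define either. `false-balance` is left out because it needs a judgment about content, not structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Option:
    name: str
    tradeoffs: List[str] = field(default_factory=list)


@dataclass
class DecisionMemo:
    options: List[Option]
    recommendation: Optional[str] = None


def memo_failure_tags(memo):
    """Return the structural failure tags that apply to a memo."""
    tags = []
    # Assumed rule: a missing or blank recommendation counts as no-recommendation.
    if not memo.recommendation or not memo.recommendation.strip():
        tags.append("no-recommendation")
    # Assumed rule: any option listed without tradeoffs counts as missing-tradeoff.
    if any(not option.tradeoffs for option in memo.options):
        tags.append("missing-tradeoff")
    return tags


memo = DecisionMemo(options=[Option("Build", ["slower to ship"]), Option("Buy")])
print(memo_failure_tags(memo))
```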
