Research Agent Suite
Six tasks measuring whether a research agent produces work a serious analyst could submit without rewriting.
- 01
Produce a cited research brief
Two-page brief on a complex topic. Every claim must have a working citation to a real source.
Expected outcome: Brief with footnotes. Every footnote resolves to a real, dated source.
Failure tags watched: fabricated-citation, broken-link, stale-source
- 02
Compare conflicting sources
Given a topic where credible sources disagree, surface the disagreement instead of picking one.
Expected outcome: Both positions represented with sourcing. Synthesis is honest about the disagreement.
Failure tags watched: one-sided, false-consensus, weasel-synthesis
- 03
Identify uncertainty
Topic with known unknowns. Agent must label what it knows, what it doesn't, and what it's guessing.
Expected outcome: Clear epistemic labels. No false confidence on uncertain claims.
Failure tags watched: false-confidence, no-epistemic-labels, hidden-assumption
- 04
Avoid hallucinated citations
Long-form report under time pressure, the conditions under which fabricating plausible-but-fake citations is easiest.
Expected outcome: Zero fake citations. We will check every link.
Failure tags watched: fabricated-citation, fake-author, fake-doi
- 05
Update stale claims
Provided with an outdated brief, agent must identify which claims need updating and update them with current sources.
Expected outcome: Diff highlighting stale vs current claims. New citations are dated.
Failure tags watched: missed-stale-claim, uncited-update, drift-from-original-scope
- 06
Create a decision memo
From research notes, produce an executive decision memo with options, tradeoffs, and a clear recommendation.
Expected outcome: Memo with explicit options, tradeoffs, and a recommendation an exec could act on.
Failure tags watched: no-recommendation, missing-tradeoff, false-balance
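
One way to picture how the watched failure tags could feed a pass/fail result is sketched below. This is an illustration only, not the suite's actual grader: the `Task` class, the `grade` function, and the rule that a task passes only when none of its watched tags is observed are all assumptions. The tag names are the ones listed for tasks 01 and 06.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical record for one task in the suite."""
    task_id: str
    title: str
    watched_tags: frozenset


# Two of the six tasks, with tags copied from the list above.
TASKS = {
    "01": Task(
        "01",
        "Produce a cited research brief",
        frozenset({"fabricated-citation", "broken-link", "stale-source"}),
    ),
    "06": Task(
        "06",
        "Create a decision memo",
        frozenset({"no-recommendation", "missing-tradeoff", "false-balance"}),
    ),
}


def grade(task, observed_tags):
    """Return (passed, hits), where hits are the watched tags that were observed.

    Assumption: a task passes only when no watched tag was observed.
    Tags outside the task's watch list are ignored.
    """
    hits = sorted(task.watched_tags & set(observed_tags))
    return (not hits, hits)


if __name__ == "__main__":
    print(grade(TASKS["01"], ["broken-link", "typo"]))  # (False, ['broken-link'])
    print(grade(TASKS["06"], []))                       # (True, [])
```

Keeping the watch list per task means a reviewer's unrelated observation (here the made-up tag `typo`) does not fail a task it was never meant to measure.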