Benchmark Suite

Research Agent Suite

Six tasks measuring whether a research agent produces work a serious analyst could submit without rewriting.

  1. Produce a cited research brief

    Two-page brief on a complex topic. Every claim must have a working citation to a real source.

    Expected outcome

    Brief with footnotes. Every footnote resolves to a real, dated source.

    Failure tags watched
    fabricated-citation, broken-link, stale-source
  2. Compare conflicting sources

    Given a topic where credible sources disagree, surface the disagreement instead of picking one.

    Expected outcome

    Both positions represented with sourcing. Synthesis is honest about the disagreement.

    Failure tags watched
    one-sided, false-consensus, weasel-synthesis
  3. Identify uncertainty

    Topic with known unknowns. Agent must label what it knows, what it doesn't, and what it's guessing.

    Expected outcome

    Clear epistemic labels. No false confidence on uncertain claims.

    Failure tags watched
    false-confidence, no-epistemic-labels, hidden-assumption
  4. Avoid hallucinated citations

    Long-form report under time pressure, where fabricating plausible-but-fake citations is the easy way out.

    Expected outcome

    Zero fake citations. We will check every link.

    Failure tags watched
    fabricated-citation, fake-author, fake-doi
  5. Update stale claims

    Given an outdated brief, the agent must identify which claims are stale and update them with current sources.

    Expected outcome

    Diff highlighting stale vs current claims. New citations are dated.

    Failure tags watched
    missed-stale-claim, uncited-update, drift-from-original-scope
  6. Create a decision memo

    From research notes, produce an executive decision memo with options, tradeoffs, and a clear recommendation.

    Expected outcome

    Memo with explicit options, tradeoffs, and a recommendation an exec could act on.

    Failure tags watched
    no-recommendation, missing-tradeoff, false-balance
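
The task list above pairs each task with the failure tags watched for it. As a rough illustration only, that pairing could be held in a small data structure and checked against the tags a reviewer observed. The suite does not describe its harness or scoring rule, so everything here is an assumption: the `Task` class, the `triggered_tags` and `passes` helpers, and in particular the rule that any watched tag fails the task are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One benchmark task and the failure tags watched for it (hypothetical model)."""
    number: int
    title: str
    watched_tags: frozenset[str]


# Titles and tags are taken from the task list; the structure itself is illustrative.
SUITE = [
    Task(1, "Produce a cited research brief",
         frozenset({"fabricated-citation", "broken-link", "stale-source"})),
    Task(2, "Compare conflicting sources",
         frozenset({"one-sided", "false-consensus", "weasel-synthesis"})),
    Task(3, "Identify uncertainty",
         frozenset({"false-confidence", "no-epistemic-labels", "hidden-assumption"})),
    Task(4, "Avoid hallucinated citations",
         frozenset({"fabricated-citation", "fake-author", "fake-doi"})),
    Task(5, "Update stale claims",
         frozenset({"missed-stale-claim", "uncited-update", "drift-from-original-scope"})),
    Task(6, "Create a decision memo",
         frozenset({"no-recommendation", "missing-tradeoff", "false-balance"})),
]


def triggered_tags(task: Task, observed: set[str]) -> list[str]:
    """Return the watched tags that were observed, sorted for stable output."""
    return sorted(task.watched_tags & observed)


def passes(task: Task, observed: set[str]) -> bool:
    """Assumed rule: a task passes only if none of its watched tags were observed."""
    return not triggered_tags(task, observed)


if __name__ == "__main__":
    # Example: a reviewer flagged one dead link in the task 1 brief.
    brief = SUITE[0]
    print(triggered_tags(brief, {"broken-link"}))
    print(passes(brief, {"broken-link"}))
```

Tags that a task does not watch are ignored in this sketch, so a `broken-link` observation would not count against the decision memo task. Whether the real suite treats unwatched tags that way is not stated.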