Benchmark Suite

Business Automation Suite

Seven tasks measuring whether automation agents can plan and execute multi-step business workflows without breaking things upstream.

  1. 01

    Build a Zapier/Make-style workflow plan

    Given a business outcome, design a multi-step workflow that names specific apps, triggers, and actions.

    Expected outcome

    Concrete, runnable plan with named integrations and clear branching.

    Failure tags watched
    vague-step, fabricated-integration, missing-error-branch
  2. 02

    Create an email follow-up automation

    Design a sequence that follows up after a trigger event with proper opt-out and quiet-hours handling.

    Expected outcome

    Sequence respects opt-out, has quiet-hours handling, doesn't double-send.

    Failure tags watched
    no-opt-out, double-send, no-time-zone-handling
  3. 03

    Clean spreadsheet data

    Given a messy CSV (mixed types, duplicate rows, encoding issues), produce a clean version with a written change log.

    Expected outcome

    Clean file plus a change log of every transformation. Every change is reversible.

    Failure tags watched
    silent-row-drop, lossy-transform, no-change-log
  4. 04

    Generate SOP from messy notes

    Turn raw meeting notes into an operational SOP a new hire could follow.

    Expected outcome

    Clear, ordered, executable steps. Owner and tools named per step.

    Failure tags watched
    no-owner, missing-tool, ordering-error
  5. 05

    Triage inbox-style tasks

    Classify an inbox of mixed messages (sales, support, internal, spam) into the right next action.

    Expected outcome

    Correct classification per message and a recommended next action.

    Failure tags watched
    misclassification, auto-reply-to-spam, missed-urgent
  6. 06

    Create a CRM update plan

    From a recent customer interaction, propose CRM field updates without overwriting existing fields blindly.

    Expected outcome

    Field-level diff with rationale. Doesn't blow away existing data.

    Failure tags watched
    destructive-overwrite, no-rationale, wrong-stage-jump
  7. 07

    Identify automation risk

    Given a proposed workflow, identify its risk surface: data leaks, runaway loops, irreversible actions.

    Expected outcome

    Concrete risk list with mitigations. Flags any irreversible step explicitly.

    Failure tags watched
    missed-irreversible, no-rate-limit, data-residency-blind
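
For task 02 (Create an email follow-up automation), the three expected properties (opt-out respected, quiet hours handled in the recipient's time zone, no double send) might be checked in a single scheduling function. The sketch below is illustrative only: the 21:00 to 08:00 quiet window, the function names, the recipient record shape, and the returned status strings are all assumptions, not part of the suite.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

# Assumed quiet window in the recipient's local time; it wraps past midnight.
QUIET_START = time(21, 0)
QUIET_END = time(8, 0)


def next_send_time(now_utc: datetime, recipient_tz: tzinfo) -> datetime:
    """Return now_utc if it is outside quiet hours locally, else the next local QUIET_END (as UTC)."""
    local = now_utc.astimezone(recipient_tz)
    t = local.time()
    if QUIET_END <= t < QUIET_START:
        return now_utc  # daytime for the recipient: send immediately
    # Before QUIET_END: wait until this morning. After QUIET_START: wait until tomorrow morning.
    send_day = local.date() if t < QUIET_END else local.date() + timedelta(days=1)
    return datetime.combine(send_day, QUIET_END, tzinfo=recipient_tz).astimezone(timezone.utc)


def plan_send(recipient: dict, step_id: str, now_utc: datetime,
              opted_out: set, already_sent: set) -> tuple:
    """Decide whether one follow-up step may be sent, and when.

    recipient is a hypothetical record: {"email": str, "tz": tzinfo}.
    already_sent holds (email, step_id) pairs and is updated in place.
    """
    if recipient["email"] in opted_out:
        return ("skip-opted-out", None)
    key = (recipient["email"], step_id)
    if key in already_sent:
        return ("skip-already-sent", None)  # guards against double sends on retries
    already_sent.add(key)
    return ("send", next_send_time(now_utc, recipient["tz"]))
```

Keying the sent-set on the (email, step) pair rather than on the email alone lets later steps in the sequence go out while still blocking a retry of the same step.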
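
For task 03 (Clean spreadsheet data), one way to avoid silent row drops is to log every change with its original value, so the log alone is enough to undo the cleaning. The following is a minimal sketch under stated assumptions: it handles only whitespace trimming and exact-duplicate removal, assumes every row has the same number of fields as the header, and uses a hypothetical log-entry shape.

```python
import csv
import io


def clean_rows(raw_csv: str):
    """Return (clean_rows, change_log) for a CSV string with a header row.

    Only two transformations are sketched: trimming surrounding whitespace
    and dropping rows that exactly duplicate an earlier row after trimming.
    Each log entry keeps the original value so the change can be reversed.
    """
    reader = csv.DictReader(io.StringIO(raw_csv))
    seen = set()
    clean, log = [], []
    for row in reader:
        line_no = reader.line_num  # source line on which this row ended
        fixed = {}
        for column, value in row.items():
            original = value or ""
            stripped = original.strip()
            if stripped != original:
                log.append({"line": line_no, "column": column,
                            "action": "strip-whitespace",
                            "before": original, "after": stripped})
            fixed[column] = stripped
        fingerprint = tuple(fixed.items())
        if fingerprint in seen:
            # Record the dropped row in full instead of discarding it silently.
            log.append({"line": line_no, "column": None,
                        "action": "drop-duplicate",
                        "before": dict(fixed), "after": None})
            continue
        seen.add(fingerprint)
        clean.append(fixed)
    return clean, log
```

Type coercion and encoding repair would need their own logged actions in the same style; they are left out here to keep the sketch short.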
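
For task 06 (Create a CRM update plan), a field-level diff with rationale could take a shape like the one below. This is a sketch, not the suite's grading format: the function name, the action labels `fill-empty` and `needs-review`, and the placeholder for a missing rationale are all hypothetical.

```python
def propose_crm_updates(current: dict, proposed: dict, rationale: dict) -> list:
    """Build a field-level diff instead of writing proposed values directly.

    Empty fields may be filled; fields that already hold a value are only
    flagged for review, so nothing is overwritten without a decision.
    """
    diff = []
    for field, new_value in proposed.items():
        old_value = current.get(field)
        if new_value == old_value:
            continue  # no change proposed for this field
        if old_value is None or old_value == "":
            action = "fill-empty"
        else:
            action = "needs-review"  # existing data: never overwrite blindly
        diff.append({
            "field": field,
            "old": old_value,
            "new": new_value,
            "action": action,
            "rationale": rationale.get(field, "MISSING RATIONALE"),
        })
    return diff
```

Returning the diff rather than applying it keeps the old value visible next to the new one, which is what makes a wrong stage jump easy to spot before it is written.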
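
For task 07 (Identify automation risk), flagging irreversible steps and unbounded loops explicitly might look like the following. The step format, the set of actions treated as irreversible, the risk labels, and the mitigation wording are assumptions made for illustration; a real review would also need to cover data leaks, which this sketch does not attempt.

```python
# Hypothetical set of actions that cannot be undone once executed.
IRREVERSIBLE_ACTIONS = {"delete_record", "send_email", "charge_card"}


def review_workflow(steps: list) -> list:
    """Return one finding per risky step, each with a suggested mitigation.

    Each step is a dict with an "action" key and optional
    "requires_approval", "loops", and "max_iterations" keys.
    """
    findings = []
    for number, step in enumerate(steps, start=1):
        if step["action"] in IRREVERSIBLE_ACTIONS and not step.get("requires_approval"):
            findings.append({
                "step": number,
                "risk": "irreversible-without-approval",
                "mitigation": "add a human approval gate or a dry-run mode",
            })
        if step.get("loops") and "max_iterations" not in step:
            findings.append({
                "step": number,
                "risk": "unbounded-loop",
                "mitigation": "set max_iterations and a rate limit",
            })
    return findings
```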