WORK Methodology

The world's most-trusted AI benchmark, built by business operators.

We're recruiting senior practitioners to evaluate models against hundreds of real workplace tasks. Not "solve this equation" but "build a business case for a new brand launch." Not academic abstractions but actual work.

Transparent Methodology

The first benchmark for operators, utilizing expert's subjective ratings across multiple domains.

Monthly Updates. Regular releases with updated leaderboards, capability cards, and methodology notes.

Transparent Methodology. Published evaluation rubrics, scoring weights, and versioned methodology updates.

Named Evaluators. Expert reviewers identified and their backgrounds disclosed.

WORK Score by Model

Every model is evaluated using blind, multi-rater expert judging against human gold deliverables. We capture win-rate and written rationales from expert reviewers.

Composite score (0-100) combines rubric quality, win-rate vs human, time, and cost. Weights are published and versioned monthly.

Reliability metrics include inter-rater agreement reporting, cataloged failure modes, and published notable examples.

Outputs include leaderboards, Capability Cards, and a Failure-Mode Library with month-over-month deltas.

Monthly WORK Index & Releases

Each month, we publish a comprehensive WORK Index with overall and domain leaderboards, month-over-month deltas, and notable wins and failures.

Capability Cards for each model and domain detail strengths, gaps, exemplar tasks, and recommended fit for specific use cases.

Change Log documents models covered, task updates, and rubric version notes.

Evaluator Roster Updates track additions, departures, and methodology notes.

Example Tasks

WORK evaluates models on hundreds of real workplace tasks across multiple domains. Here are 12 example tasks:

Brand Strategy: Position Statement

Act as a Brand Strategist. Using a Brand Strategy doc, Audience personas, and Competitive audit, craft a 1-sentence positioning statement for [Brand/Product] plus 3 proof pillars with a 1-line rationale each. Output as Markdown with # Positioning, ## Pillars. Max 35 words for the positioning; avoid clichés.

Campaign Concepts: Territory Development

Generate 5 concept territories for [Objective] targeting [Audience]. For each: territory name, 1-sentence idea, key visual cue, hero line. Deliverable: Bulleted list. Distinct from one another; legally safe for [Category rules]. Scoring focus: Originality, strategic alignment, feasibility.

Email & SMS: Email Sequence

Draft a 3-email sequence for new subscribers ([Audience]) introducing [Brand], with value prop, social proof, and an incentive in Email #3. Deliverable: Subject lines, preheaders, body copy, CTA. Avoid spam triggers; complies with [Legal/regulated claims].

PR / Communications: Press Release

Write a 450-600 word press release announcing [News] with headline, subhead, dateline, quote(s), boilerplate. AP style; verifiable claims only.

Retail / Trade Marketing: Promotional Calendar

Propose a 6-month promo calendar for [Retailer] with discount depth, mechanic, support tactics, and velocity estimate. Deliverable: Calendar table. Trade spend cap [Amount].

Analytics: Experiment Design

Design an A/B/n test for [Channel] to improve [Metric]. Include hypothesis, variants, sample size estimate, min detectable effect, run length, and guardrails. Deliverable: Test plan. Data volume [X/day]; seasonality noted.

Messaging & Voice: Core Messaging Matrix

Build a messaging matrix for [Campaign/Launch]: 3 core messages x 3 audiences ([Audience 1/2/3]), each with a 1-line proof and suggested CTA. Deliverable: 3x3 table. Tone per [Voice & Tone guide]; no jargon. Scoring focus: Audience fit, consistency, actionability.

Organic Social: 30 Day Content Calendar

Create a 30-day social calendar for [Platform(s)] to achieve [Goal]. Include: post idea, hook, format, CTA, remarks. Deliverable: Table (30 rows). Respect [Brand voice]; mix edu/entertain/sell (40/40/20). Scoring focus: Variety, hooks, CTA quality, sequencing.

Paid Social: UGC Brief

Write a concise creative brief for UGC creators to produce [3] ads for [Product]. Include hook angles, proof points, do/don't list, 30-sec script beats. Deliverable: 1-page brief. Platform [TikTok/IG] norms; claims substantiated via [Proof sources].

Event / Experiential: Pop Up Plan

Outline a pop-up for [City]: theme, activities, staffing, lead capture, and post-event nurture plan. Deliverable: 1-page outline. Budget [Amount]; venue type [Spec].

Community Support: Community Guidelines

Write community guidelines for [Channel] covering what's allowed, moderation process, and escalation. Deliverable: 1-page policy. Local law [If any].

Website UX: Landing Page Wire Copy

Draft above-the-fold wire-copy for [Offer]: headline (≤10 words), subhead, 3 bullets, primary CTA, risk reducer. Deliverable: Wire-copy block. Message match to [Ad/Email].