WORK Methodology
The world's most-trusted AI benchmark, built by business operators.
We're recruiting senior practitioners to evaluate models against hundreds of real workplace tasks. Not "solve this equation" but "build a business case for a new brand launch." Not academic abstractions but actual work.
Transparent Methodology
The first benchmark for operators, built on experts' subjective ratings across multiple domains.
Monthly Updates. Regular releases with updated leaderboards, capability cards, and methodology notes.
Transparent Methodology. Published evaluation rubrics, scoring weights, and versioned methodology updates.
Named Evaluators. Expert reviewers identified and their backgrounds disclosed.
WORK Score by Model
Every model is evaluated using blind, multi-rater expert judging against human gold deliverables. We capture win-rate and written rationales from expert reviewers.
The composite score (0-100) combines rubric quality, win-rate versus the human gold deliverable, time, and cost. Weights are published and versioned monthly.
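As a rough illustration of how four components might be blended into a single 0-100 number, here is a minimal Python sketch. The weights, the 0-1 scaling of each input, and the way time and cost are normalized against the human baseline are all assumptions made for this example; the published, versioned weights remain the source of truth.

```python
from dataclasses import dataclass

# Hypothetical weights, for illustration only. They must sum to 1.
WEIGHTS = {"rubric": 0.40, "win_rate": 0.30, "time": 0.15, "cost": 0.15}


@dataclass
class TaskResult:
    rubric_quality: float  # mean rubric score, assumed to be scaled to 0-1
    win_rate: float        # share of blind comparisons won vs the human gold, 0-1
    time_ratio: float      # model time / human time (assumed normalization)
    cost_ratio: float      # model cost / human cost (assumed normalization)


def efficiency(ratio: float) -> float:
    """Map a model/human ratio to 0-1 so that lower ratios score higher."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    return 1.0 / (1.0 + ratio)


def composite_score(result: TaskResult, weights: dict = WEIGHTS) -> float:
    """Weighted blend of the four components, reported on a 0-100 scale."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    parts = {
        "rubric": result.rubric_quality,
        "win_rate": result.win_rate,
        "time": efficiency(result.time_ratio),
        "cost": efficiency(result.cost_ratio),
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within 0-1, got {value}")
    return 100.0 * sum(weights[name] * parts[name] for name in weights)


# A model that matches the human on time and cost, with 0.8 rubric
# quality and a 60% win rate.
example = TaskResult(rubric_quality=0.8, win_rate=0.6, time_ratio=1.0, cost_ratio=1.0)
score = composite_score(example)
```

Under these assumed weights, a model that is faster or cheaper than the human baseline gains on the time and cost components but cannot offset a weak rubric score, because rubric quality and win-rate together carry 70% of the total.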
Reliability metrics include inter-rater agreement reporting, cataloged failure modes, and published notable examples.
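One common way to quantify inter-rater agreement is Cohen's kappa, which corrects the raw agreement between two raters for the agreement expected by chance. The sketch below is an assumption about what such reporting could look like, not a statement of which statistic WORK uses; the verdict labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where the two verdicts match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own base rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical blind verdicts: which deliverable each rater preferred per task.
rater_1 = ["model", "model", "human", "human", "model", "human"]
rater_2 = ["model", "human", "human", "human", "model", "model"]
kappa = cohens_kappa(rater_1, rater_2)
```

Here the raters agree on 4 of 6 tasks and chance agreement is 0.5, so kappa comes out to about 0.33. With more than two raters per task, a multi-rater statistic such as Fleiss' kappa would be the natural extension.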
Outputs include leaderboards, Capability Cards, and a Failure-Mode Library with month-over-month deltas.
Monthly WORK Index & Releases
Each month, we publish a comprehensive WORK Index with overall and domain leaderboards, month-over-month deltas, and notable wins and failures.
Capability Cards for each model and domain detail strengths, gaps, exemplar tasks, and recommended fit for specific use cases.
The Change Log documents the models covered, task updates, and rubric version notes.
Evaluator Roster Updates track additions, departures, and methodology notes.
Example Tasks
WORK evaluates models on hundreds of real workplace tasks across multiple domains. Here are 12 example tasks:
Brand Strategy: Position Statement
Act as a Brand Strategist. Using a Brand Strategy doc, Audience personas, and Competitive audit, craft a 1-sentence positioning statement for [Brand/Product] plus 3 proof pillars with a 1-line rationale each. Output as Markdown with # Positioning, ## Pillars. Max 35 words for the positioning; avoid clichés.
Campaign Concepts: Territory Development
Generate 5 concept territories for [Objective] targeting [Audience]. For each: territory name, 1-sentence idea, key visual cue, hero line. Deliverable: Bulleted list. Distinct from one another; legally safe for [Category rules]. Scoring focus: Originality, strategic alignment, feasibility.
Email & SMS: Email Sequence
Draft a 3-email sequence for new subscribers ([Audience]) introducing [Brand], with value prop, social proof, and an incentive in Email #3. Deliverable: Subject lines, preheaders, body copy, CTA. Avoid spam triggers; comply with [Legal/regulated claims].
PR / Communications: Press Release
Write a 450-600 word press release announcing [News] with headline, subhead, dateline, quote(s), boilerplate. AP style; verifiable claims only.
Retail / Trade Marketing: Promotional Calendar
Propose a 6-month promo calendar for [Retailer] with discount depth, mechanic, support tactics, and velocity estimate. Deliverable: Calendar table. Trade spend cap [Amount].
Analytics: Experiment Design
Design an A/B/n test for [Channel] to improve [Metric]. Include hypothesis, variants, sample size estimate, min detectable effect, run length, and guardrails. Deliverable: Test plan. Data volume [X/day]; seasonality noted.
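To make concrete what a sample size estimate and run length for this task involve, here is a minimal Python sketch. It assumes a conversion-rate metric, a two-sided two-proportion z-test, and a Bonferroni correction for comparing each variant against control. These are illustrative choices, not part of the WORK rubric, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline: float, mde_abs: float, n_variants: int = 2,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per variant to detect an absolute lift of mde_abs."""
    if not 0 < baseline < 1 or mde_abs <= 0 or not 0 < baseline + mde_abs < 1:
        raise ValueError("baseline and baseline + mde_abs must be rates in (0, 1)")
    if n_variants < 2:
        raise ValueError("need a control and at least one variant")
    comparisons = n_variants - 1  # each variant is compared against control
    z_alpha = NormalDist().inv_cdf(1 - alpha / (2 * comparisons))  # Bonferroni-adjusted
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline, baseline + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def run_length_days(n_per_variant: int, n_variants: int, daily_volume: int) -> int:
    """Days needed to fill every arm at the given total daily traffic."""
    return math.ceil(n_per_variant * n_variants / daily_volume)


# Example: 10% baseline conversion, 2-point absolute lift, one variant vs control.
n = sample_size_per_variant(baseline=0.10, mde_abs=0.02)
days = run_length_days(n, n_variants=2, daily_volume=1000)
```

Adding more variants raises the per-variant sample size because the significance threshold is split across more comparisons. In practice, run length is usually rounded up to whole weeks to account for the seasonality the task asks to note.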
Messaging & Voice: Core Messaging Matrix
Build a messaging matrix for [Campaign/Launch]: 3 core messages x 3 audiences ([Audience 1/2/3]), each with a 1-line proof and suggested CTA. Deliverable: 3x3 table. Tone per [Voice & Tone guide]; no jargon. Scoring focus: Audience fit, consistency, actionability.
Organic Social: 30-Day Content Calendar
Create a 30-day social calendar for [Platform(s)] to achieve [Goal]. Include: post idea, hook, format, CTA, remarks. Deliverable: Table (30 rows). Respect [Brand voice]; mix edu/entertain/sell (40/40/20). Scoring focus: Variety, hooks, CTA quality, sequencing.
Paid Social: UGC Brief
Write a concise creative brief for UGC creators to produce [3] ads for [Product]. Include hook angles, proof points, do/don't list, 30-sec script beats. Deliverable: 1-page brief. Platform [TikTok/IG] norms; claims substantiated via [Proof sources].
Event / Experiential: Pop-Up Plan
Outline a pop-up for [City]: theme, activities, staffing, lead capture, and post-event nurture plan. Deliverable: 1-page outline. Budget [Amount]; venue type [Spec].
Community Support: Community Guidelines
Write community guidelines for [Channel] covering what's allowed, moderation process, and escalation. Deliverable: 1-page policy. Local law [If any].
Website UX: Landing Page Wire Copy
Draft above-the-fold wire-copy for [Offer]: headline (≤10 words), subhead, 3 bullets, primary CTA, risk reducer. Deliverable: Wire-copy block. Message match to [Ad/Email].