Humans partnered with a weak AI model outperformed the strongest model in the study working alone.
A research team from Northeastern University and University College London ran 667 participants through problem-solving tasks, measuring performance both solo and with AI assistance. GPT-4o alone achieved 71% accuracy. Llama-3.1-8B (a much smaller, less capable model) alone managed only 39%.
Humans partnered with that weak Llama model hit 78% accuracy. They beat GPT-4o working alone.
The skill that made the difference wasn't intelligence. It wasn't prompting ability. It was something we've never measured in AI adoption: collaborative ability.
Read the full research: "Quantifying Human-AI Synergy" by Riedl & Weidmann
Two Different Skills
The researchers built a framework that separately measures two abilities:
- θ (theta): how well you solve problems alone
- κ (kappa): how well you solve problems with AI
These are different skills. Not somewhat different. Statistically distinct. The model comparison overwhelmingly supported treating them as separate abilities. You can be brilliant at one and mediocre at the other.
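One way to picture such a framework, as a minimal sketch rather than the paper's exact specification, is an item-response-style model in which solo attempts are governed by θ and AI-assisted attempts by κ (σ is the logistic function and b_i an assumed item-difficulty parameter):

```latex
% Minimal sketch (assumed notation, not the paper's exact model):
% person p attempting item i, with item difficulty b_i and logistic link \sigma.
P(\text{correct}_{pi} \mid \text{solo})        = \sigma(\theta_p - b_i)
P(\text{correct}_{pi} \mid \text{AI-assisted}) = \sigma(\kappa_p - b_i)
```

In this framing, the model comparison above amounts to asking whether forcing κ_p = θ_p for every participant fits the data worse than letting the two abilities vary independently, and the answer was a clear yes.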
What this means in practice: your best individual contributor might be an average AI collaborator. Your average performer might be exceptional at extracting value from AI partnership. The skills that make someone good at solving problems alone don't predict their ability to solve problems with AI.
This flips the usual talent identification logic. We've been looking for "smart people who are good with technology." We should be looking for people with strong collaborative instincts, regardless of their solo performance.
The Mechanism: It's Not About Prompts
What is the strongest predictor of κ? Theory of Mind (ToM).
ToM is the ability to reason about what another mind, including an artificial one, knows, assumes, and can do. The researchers traced ToM through participants' actual dialogue with AI and tested whether those signals correlated with performance.
The pattern was clean: ToM predicted collaborative ability but showed no effect on solo performance. The skill literally only mattered when another mind was involved.
Even more revealing: individual problem-solving ability did not influence AI response quality. Being smart does not make the AI give you better answers. Theory of Mind does.
What high-ToM collaboration looks like:
- Establishing shared context: "I'm a beginner in physics" or "I need a comprehensive explanation"
- Detecting knowledge gaps: "You might not know that this question assumes..."
- Challenging outputs: "Could you be wrong about this?" or "Is my approach correct?"
- Metacognitive framing: "I'm going to ask you a question" (signaling intent)
- Adaptive refinement: "This is not what I meant, the answer should be formatted as a fraction"
What low-ToM collaboration looks like:
- Assuming shared context without establishing it: "answer this question" (without specifying the question)
- Treating AI like a search engine: "stomach hurts sharp pain" (keyword dumping rather than dialogue)
- Delegating trivial questions the human could easily answer: "how many years are in a decade"
- Accepting outputs without verification
- No iteration or refinement based on responses
Theory of Mind also varies dynamically within individuals. When someone engages in more perspective-taking on a given question, they get better AI responses on that question. It is not a fixed trait. It is a capability that can be applied or ignored moment to moment.
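Because these markers show up in the text of the conversation itself, they can in principle be tagged automatically. The sketch below is purely illustrative, assuming a hand-picked pattern list and a made-up tom_score() helper; it is not the researchers' coding instrument.

```python
import re

# Illustrative ToM-style dialogue markers (hypothetical pattern list,
# not the study's actual coding scheme).
MARKERS = {
    "establishes_context":    [r"\bI'?m a\b", r"\bI need\b", r"\bfor context\b"],
    "detects_knowledge_gaps": [r"\byou might not know\b", r"\bquestion assumes\b"],
    "challenges_output":      [r"\bcould you be wrong\b", r"\bare you sure\b",
                               r"\bis my approach correct\b"],
    "signals_intent":         [r"\bI'?m going to ask\b", r"\blet me rephrase\b"],
    "refines":                [r"\bnot what I meant\b", r"\bas a fraction\b"],
}

def tom_markers(message: str) -> dict:
    """Flag which marker categories appear in a single user message."""
    return {
        name: any(re.search(p, message, re.IGNORECASE) for p in patterns)
        for name, patterns in MARKERS.items()
    }

def tom_score(user_messages: list) -> float:
    """Fraction of marker categories observed at least once in one task's dialogue."""
    flags = [tom_markers(m) for m in user_messages]
    hits = sum(any(f[name] for f in flags) for name in MARKERS)
    return hits / len(MARKERS)

if __name__ == "__main__":
    high = ["I'm a beginner in physics, and I need the answer as a fraction.",
            "You might not know that this question assumes an ideal gas. Could you be wrong here?"]
    low = ["stomach hurts sharp pain"]
    print(tom_score(high), tom_score(low))  # the first dialogue scores much higher
```

Per-question scores like this are what would let you test the within-person claim: the same participant, engaging in more perspective-taking on one question than another, should get better AI responses on the higher-scoring question.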
Why This Matters More Than Model Selection
Look at those numbers again. The massive capability gap in solo model performance, 71% versus 39%, shrank to a modest difference when humans were in the loop. Both models improved, but by comparable amounts: a 29-point boost versus a 23-point boost. The variance in outcomes between users with the same model dwarfed the variance between models with the same user.
Now extrapolate. We already have models that significantly outperform GPT-4o on benchmarks: Claude Opus 4, Gemini 2.5 Pro, o3. On measures of solo performance, scores continue to climb toward a ceiling.
But this research measured something different: not how well a model performs alone, but how much synergy it creates with a human partner. A model optimized to ace benchmarks independently isn't necessarily a model optimized to collaborate effectively.
As solo model capabilities approach ceiling, human collaborative ability becomes the differentiating factor. The human in the loop is the variable that explains most of the variance in outcomes. Companies spending millions selecting and fine-tuning models while barely investing in developing their people's collaborative capabilities are optimizing the wrong variable.
This Skill Is Trainable
Theory of Mind isn't fixed at birth. It develops throughout life and responds to deliberate practice.
The research measured dialogue-level behaviors: how people framed questions, whether they provided context, how they responded to AI outputs. These are observable, coachable actions. The gap between your best and worst AI collaborators isn't destiny. It's a training opportunity.
Implications for AI Adoption
If you're leading AI adoption, stop asking "which model should we use?" and start asking "how do we develop collaborative ability across our teams?"
Three shifts:
- Audit for collaborative ability, not just AI enthusiasm. Your early adopters might not be your best collaborators. Look for people who naturally think about what others know and need—they'll likely transfer that skill to AI interaction.
- Train the behaviors, not just the tools. Teach people to establish context, verify outputs, and iterate based on responses. These metacognitive habits matter more than prompt engineering tricks.
- Measure what matters. Track output quality and iteration patterns, not just adoption rates. Someone who uses AI constantly but never challenges or refines its outputs isn't actually collaborating—they're just delegating.
The Benchmark Problem
The researchers are direct about the limits of current AI evaluation methods. Optimizing models for static benchmarks like MMLU, BIG-Bench, and GSM8K has contributed to three problems:
- Reduced effectiveness on complex, real-world tasks (where interaction and iteration matter)
- Limited collaborative abilities, often manifesting as "sycophantic" behavior rather than genuine assistance
- A focus on imitating human capabilities rather than complementing or extending them
We've built systems that ace tests but struggle to collaborate. The benchmarks reward models for solving closed, curated problems independently. Real deployment requires dialogue, feedback, refinement, and integrating diverse perspectives.
The researchers propose a different metric: benchmark AI not on standalone performance, but on the synergy it enables with human partners. "Can the model solve this problem alone?" is a different question than "how much does this model improve human performance?" The second question produces fundamentally different insights about what makes AI useful.
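One simple way to make that second question measurable, sketched here under an assumed working definition rather than the paper's exact metric, is to score synergy as the team's gain over the better of the two solo baselines:

```latex
% Assumed working definition, not necessarily the paper's metric:
\text{Synergy}(h, m) = \text{Acc}(h \text{ with } m) - \max\bigl(\text{Acc}(h),\, \text{Acc}(m)\bigr)
```

By that yardstick the headline result is genuine synergy: human-Llama teams at 78% cleared not only Llama's 39% solo score but also GPT-4o's 71%.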
The trust layer between humans and AI isn't just about model accuracy. It's about developing collaborative capabilities, on both sides, that make productive partnership possible.
Why Measurement Matters
If collaborative ability varies this widely between users, it raises a practical question for organizations: how do you determine whether AI systems are actually improving work, rather than just increasing activity?
That question led us to build PRISM. Instead of testing whether a model can answer questions, PRISM evaluates whether people are collaborating effectively with AI across real workflows. It focuses on interaction quality, iteration patterns, and outcome improvement, the factors this research shows matter most.