

$1 Million To Understand What Happens Inside LLMs
To advance the field, we're announcing a $1M prize for work in interpretability, with a focus on code generation: currently the most prevalent use case for LLMs, and one we think is particularly well-suited to interpretability research.
Why Interpretability Matters
Using AI models today is like alchemy: we can do seemingly magical things, but don't understand how or why they work.
You don't need chemistry to do incredible things. But chemistry gives you control. It's the difference between accidentally discovering phosphorus by boiling urine and systematically saving a billion lives by improving agriculture. As the march to AGI takes us down unknown paths of AI development and deployment, we need fundamental and generalizable ways of controlling models.
Consider what it would mean to have chemistry, not just alchemy, for code generation. Today's models are:
- Unreliable, requiring constant developer intervention for long-horizon tasks
- Unsafe, making undesirable changes when stuck or facing adversarial situations
- Prone to reward gaming, writing code that superficially passes tests rather than solving the underlying problem
- Slow, especially when they go down wrong paths before finding the right one
- Inefficient, requiring huge amounts of tokens, cost, and time
- Opaque, where small changes to the agentic harness cause massive, unpredictable performance swings
All of these problems can be addressed by understanding why they happen and implementing principled fixes. That's not what companies do today. Instead, they boil data into post-training or prompts and hope the model behaves better.
Interpretability is already making a dent. We've done work in collaboration with researchers from Stanford, Oxford, and Anthropic that identifies and fixes Chain-of-Thought Hijacking using interpretability-based interventions. We've also generated substantial revenue deploying interpretability-powered solutions to enterprises.
Code generation is also a field uniquely suited to yielding interpretability insights. Code can be modeled formally, making it easier to map the concepts models use onto those used by humans and to check properties like safety and correctness. The problems in interpretability are also analogous to those in code generation: in both, we are trying to extract human-understandable algorithms from the models we train.
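To make the "code can be modeled formally" point concrete, here is a minimal sketch, not taken from this RFP, of the kind of property checking that formal structure makes cheap: model-generated code can be parsed into an AST and screened against human-readable safety and correctness properties before it runs. The property list and helper name below are illustrative assumptions, not part of any proposed system.

```python
# Minimal sketch: parse model-generated code into a formal structure (an AST)
# and check it against simple, human-readable properties.
import ast

UNSAFE_CALLS = {"eval", "exec"}  # illustrative safety property chosen for this sketch

def check_generated_code(source: str) -> list[str]:
    """Return a list of human-readable property violations found in `source`."""
    problems = []
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"does not parse: {err.msg} (line {err.lineno})"]

    for node in ast.walk(tree):
        # Safety: flag calls to functions we consider unsafe.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in UNSAFE_CALLS:
                problems.append(f"calls unsafe function `{node.func.id}` on line {node.lineno}")
        # Correctness smell: bare `except:` blocks can mask failures and fake test passes.
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            problems.append(f"bare `except:` on line {node.lineno} can hide errors")
    return problems

# Example: a snippet a model might produce to "pass the tests".
snippet = """
def run(cmd):
    try:
        return eval(cmd)
    except:
        return None
"""
for problem in check_generated_code(snippet):
    print(problem)
```

The same formal structure is what makes it tractable to ask whether a model's internal representations track human concepts like "this call is unsafe" or "this branch never executes", rather than only inspecting its outputs.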
Chemistry emerged from practitioners tackling practical problems while applying enough rigor to learn generalizable lessons—and each success drove both deeper insight and greater investment. We think interpretability can follow the same path through code generation.
The Prize
We will award $1M to researchers whose work best advances interpretability. The prize will be distributed as:
- Grants to fund promising research directions (apply here)
- Awards for completed work we judge to have made significant progress
While we're particularly excited about applications to code generation, we welcome any work that makes fundamental progress on interpretability. Strong foundational work will find downstream applications.
In Part 2, we go into technical details and lay out the four core problems we think are most important for the field—and the approaches we find most promising.
Researchers
Read Part 2 for our detailed research agenda and apply here:
Others interested in following progress
Sign up for updates here:
Thanks to Ana Marasović, Hadas Orgad, Jeff Phillips, John Hewitt, Leo Gao, Mor Geva, Neel Nanda, Stephen Casper, and Yonatan Belinkov for reviewing this RFP and providing feedback.