

Martian Interpretability Challenge, Part 2: The Core Problems In Interpretability

Interpretability today often fails on four fronts: it is not truly mechanistic (more correlation than causal explanation), not useful in real engineering and safety workflows, not complete (narrow wins that don't generalize), and not scalable to frontier models. Martian's $1M prize targets progress on those gaps, especially via strong benchmarks, generalization across models, and institution- and policy-relevant tools, with a focus on code generation because code is testable, traceable, and a high-impact real-world setting for validating mechanistic explanations.
In Part 1, we discussed why we're announcing a $1M prize for work in interpretability, focused on code generation. In this post, we'll go into the four core problems we think are most important for the field, some approaches we're excited about for tackling those problems, and why we think code generation is the right setting for both.
The Biggest Problems In Interpretability
A pessimist or skeptic could reasonably say interpretability today is:
- Not mechanistic.
Much of what passes for “mechanistic” is, on inspection, post-hoc pattern-matching rather than robust causal explanation. Work on saliency and feature visualization has repeatedly shown that attractive-looking explanations can be independent of the actual model parameters or fragile under input shifts ([1], [2]). In language models, purported “concept neurons” often dissolve under broader evaluation, revealing highly polysemantic, basis-dependent behavior rather than clean mechanisms ([3], [4]). Even more sophisticated causal-probe and intervention pipelines can be pushed off-manifold or give false positives, confirming stories that fit the data but not the computation ([5], [6], [7]). From this perspective, a lot of current mechanistic interpretability (MI) work is still correlational: it tells just-so stories in a convenient basis rather than isolating the actual algorithms the model runs. The sketch after this list shows, in miniature, the kind of causal check that separates the two.
- Useless.
Despite years of effort, critics point out that interpretability (especially mechanistic interpretability) has yielded few tools that engineers actually rely on to build or secure real systems. Casper et al. (2023) argue that MI has “failed to produce competitive tools that engineers can use to solve real-world problems,” and the standout result is still GPT-2’s Indirect Object Identification circuit from 2022 ([13]). Benchmarking work often finds that sophisticated representation-level methods underperform simple baselines on practical tasks like steering or detecting harmful intent ([9], [10]), and DeepMind’s recent negative result on using sparse autoencoders (SAEs) for safety-relevant classification concluded that SAEs underperform a linear probe and are being deprioritized as a consequence ([11]). More broadly, some critics ([12], [3]) argue that if our methods don’t materially improve control, robustness, or assurance in deployed systems, they are at best scientific curiosities and at worst a distraction from more direct levers on model behavior.
- Incomplete.
Where interpretability has been most successful, the wins are narrow and often fail to generalize. The famous GPT-2 Indirect Object Identification circuit explains a carefully circumscribed template task, but not the broader phenomenon of pronoun resolution in realistic text ([13]). Circuit analyses in larger models, such as Chinchilla’s multiple-choice capabilities, recover only partial stories that break under modest task variations ([14]). SAE-based work has uncovered many interesting features, but also strong evidence that these features are not canonical building blocks: different SAE configurations carve up representation space differently, miss many concepts, and sometimes learn “warped” directions that don’t match any clean human ontology ([11], [15], [16]). At a higher level, commentators argue that the field still lacks consensus on basic questions like what counts as a mechanism, a feature, or a satisfactory explanation ([17]). The result is a patchwork of partial maps, with no clear path to a complete theory of how modern models work.
- Doesn’t scale.
Finally, many of the most impressive MI results look fundamentally unscalable. Detailed circuit analyses require enormous human effort even for tiny models and toy tasks, and attempts to push similar techniques to frontier models have produced “mixed” and fragmentary explanations at great cost ([14], [5]). Dictionary-learning approaches like SAEs scale only by throwing massive engineering and compute at the problem: OpenAI, Anthropic, and others report training SAEs with millions to tens of millions of features just to partially cover a single model’s activations, and by some estimates fully covering a frontier model’s features would cost more compute than the original training run ([18], [19], [20]). Even then, coverage is incomplete and the resulting feature sets are difficult to validate. The empirical frontier is racing ahead with tool use, memory, and agents, while MI remains focused on individual transformers and hand-analyzed neurons. From this vantage point, the core worry is that interpretability will never keep up with the size, complexity, and speed of modern AI systems.
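To make the correlational-versus-causal distinction above concrete, here is a minimal sketch of the kind of intervention-based check a skeptic would want to see: activation patching on a toy PyTorch model. The model, inputs, and intervention site are hypothetical stand-ins; on a real language model the same logic is applied per layer and per token position between a “clean” and a “corrupted” prompt.

```python
# A minimal sketch (not a real pipeline): activation patching on a toy model.
# The model, inputs, and intervention site below are hypothetical stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),   # model[2] is the site we intervene on
    nn.Linear(16, 2),               # readout "logits"
)
site = model[2]

clean = torch.randn(1, 8)               # input exhibiting the behavior of interest
corrupted = clean + torch.randn(1, 8)   # minimally different input that changes behavior

# Step 1: cache the site's activation on the clean run (a correlational observation).
acts = {}
def cache_hook(module, inputs, output):
    acts["clean"] = output.detach()
handle = site.register_forward_hook(cache_hook)
with torch.no_grad():
    clean_logits = model(clean)
handle.remove()

# Step 2: causal check -- patch the cached activation into the corrupted run
# and measure how much of the clean behavior is restored.
def patch_hook(module, inputs, output):
    return acts["clean"]            # replace the site's output with the cached value
handle = site.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(corrupted)
handle.remove()

with torch.no_grad():
    corrupted_logits = model(corrupted)

# Rough proxy: how much of the clean/corrupted gap does patching this one site recover?
recovered = (patched_logits - corrupted_logits).norm() / (clean_logits - corrupted_logits).norm()
print(f"fraction of the output gap recovered by patching: {recovered.item():.2f}")
```

If patching a proposed site recovers most of the behavior, the claim is causal rather than merely correlational; careful work of this kind also checks that off-manifold patches don’t produce the false positives noted above.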
We’re optimists, but it’s important to adopt the skeptical lens in order to identify the biggest gaps in our field. Each of these critiques can also be framed as a question:
- How do we ensure our methods are mechanistic?
- How do we ensure our methods are useful?
- How do we ensure our methods are complete?
- How do we ensure our methods scale?
These are, in short, what we consider the most important problems, and prizes will be awarded based on how much progress each paper makes on them.
Approaches We’re Excited About
Below, we outline some of our thoughts on what fruitful approaches could look like. These are merely suggestions, though – what we care about are the problems!
Strong Benchmarks for Interpretability
Scientific fields mature when they converge on good benchmarks. Right now, interpretability is missing that shared backbone ([23]). Many results are evaluated relative to other interpretability methods, on hand-picked examples, or via qualitative “this looks right to me” inspection ([1], [4]). That makes it hard to tell whether a new method is actually better, or just telling a more compelling story.
We would like to see benchmarks that measure interpretability methods against ground truth, practical impact, or both. For example, in settings where we know the underlying mechanism (synthetic tasks, hand-constructed models, or instrumented training setups), we can ask whether a method recovers the right causal structure, not just whether it produces plausible artifacts. In more realistic settings, we can ask whether using a method improves downstream outcomes: debugging success, safety performance, policy decisions, or human-model collaboration.
We’re especially wary of “grading on a curve,” where methods are only compared to other (mechanistic) interpretability baselines. If all of the baselines are weak, progress can look impressive while remaining practically irrelevant. Strong benchmarks should force head-to-head comparisons with simple, competitive baselines (e.g. linear probes, prompting tricks, small finetunes, or data/architecture changes) and should quantify where MI actually buys us leverage.
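As a concrete (and deliberately oversimplified) illustration, here is a sketch of a benchmark harness with a known ground truth and a competitive baseline. Everything in it is synthetic and hypothetical: in a real benchmark the activations would come from an instrumented model whose mechanism for the concept is known by construction, and `my_interp_method` would stand in for the method under evaluation (for example, an SAE feature’s activation).

```python
# A minimal sketch of a ground-truth benchmark harness with a competitive baseline.
# The data, "method", and concept encoding are all synthetic, illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, d = 2000, 64
concept = rng.integers(0, 2, size=n)                            # ground-truth binary concept
direction = rng.normal(size=d)
acts = rng.normal(size=(n, d)) + np.outer(concept, direction)   # concept linearly encoded + noise

train, test = slice(0, 1000), slice(1000, None)

def score(method_scores: np.ndarray) -> float:
    """Benchmark metric: how well per-example scores recover the ground-truth concept."""
    return roc_auc_score(concept[test], method_scores)

# Competitive baseline: a plain linear probe trained on held-in data.
probe = LogisticRegression(max_iter=1000).fit(acts[train], concept[train])
baseline_auc = score(probe.predict_proba(acts[test])[:, 1])

# Stand-in for an interpretability pipeline that outputs a per-example concept score.
def my_interp_method(a: np.ndarray) -> np.ndarray:
    return a @ direction            # pretend the pipeline "found" the concept direction

method_auc = score(my_interp_method(acts[test]))
print(f"linear-probe baseline AUC: {baseline_auc:.3f}  |  method AUC: {method_auc:.3f}")
```

A method that cannot at least match the probe on a harness like this has not yet earned a practical claim, however compelling its qualitative story.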
For this prize, we’re excited about work that:
- Designs benchmarks where there is a known or controllable ground-truth mechanism, and evaluates whether interpretability methods recover it.
- Evaluates methods on their ability to improve practical tasks (e.g. finding backdoors, reducing harmful behavior, catching specification gaming, improving robustness, aiding audits).
- Proposes metrics for “mechanisticness,” “completeness,” or “scalability” that go beyond anecdotal case studies.
- Builds shared datasets, tasks, or challenge problems that others can run their methods on.
Done well, benchmark work directly addresses “uselessness” (do these tools help anyone do anything?) and also pressures the field toward more mechanistic, complete, and scalable approaches by rewarding methods that actually perform under realistic conditions.
Generalization Across Models
Many of today’s most compelling interpretability results are one-offs: a single circuit in a single layer of a single model, or a feature dictionary trained on a particular checkpoint. That is scientifically interesting, but it’s not yet a general theory of how models work. We’d like to see interpretability methods, abstractions, and findings that survive changes in architecture, scale, training data, and domain.
Generalization can mean several things. It might mean that a discovered circuit or feature family recurs in larger or differently-trained models. It might mean that an interpretability method finds analogous structure in both vision and language models. It might mean that we can automatically locate “the same” mechanism in a new model, given a description or example from an old one. It might mean that we can learn and reuse higher-level abstractions rather than re-discovering everything from scratch.
We’re especially interested in work that moves beyond small, clean, toy tasks to more realistic settings without losing mechanistic clarity. That might involve new units of analysis (modules, subsystems, learned tools), new training schemes that encourage shared abstractions across models, or new methods for matching and comparing representations.
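One crude starting point for such matching and comparison, offered only as a sketch: measure linear centered kernel alignment (CKA) between two models’ activations on a shared input set. The activation matrices below are random stand-ins, and representational similarity is only a first step; a real framework for “the same mechanism up to abstraction” would need to go well beyond it.

```python
# A crude cross-model comparison sketch: linear CKA between two activation spaces.
# X and Y are random stand-ins; in practice they would be layer activations from
# two different models evaluated on the same inputs.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n_examples, d1), Y: (n_examples, d2); higher = more shared linear structure."""
    X = X - X.mean(axis=0)                      # center each feature
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
shared = rng.normal(size=(512, 32))                                  # structure both "models" share
X = np.hstack([shared, rng.normal(size=(512, 96))])                  # "model A" activations
Y = np.hstack([shared @ rng.normal(size=(32, 32)),                   # "model B" sees a rotated copy
               rng.normal(size=(512, 224))])

print(f"CKA between the two activation spaces: {linear_cka(X, Y):.3f}")
```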
For this prize, promising directions include:
- Empirical work showing that particular circuits, features, or abstractions recur across models, scales, or domains—and identifying what drives that recurrence.
- Methods for automatically transferring interpretability results from one model to another (e.g. “find the IOI-like circuit in this larger model”).
- Frameworks for defining when two mechanisms are “the same” up to abstraction, and algorithms for detecting such equivalences.
- Studies that explicitly measure how well interpretability findings on toy models predict behavior in larger, more realistic models.
- Tools that increasingly automate interpretability, minimizing reliance on human effort.
Generalization is where “mechanistic,” “complete,” and “scalable” meet: if we can discover explanations that transfer across models and settings, we’re starting to learn something about the underlying space of algorithms these systems converge to.
Interpretability as a Policy/Institution Lever
We don’t think the main impact of interpretability will be a handful of researchers staring at neuron visualizations. The bigger opportunity is institutional: giving regulators, boards, red teams, and operational safety teams tools that change what kinds of AI governance are feasible.
Interpretable tools could, for example, make it possible to build interpretable analog models that track a powerful system’s behavior on safety-relevant dimensions; to transfer safety properties between models via activation transfer and circuit-level alignment; or to design arms-control-style measures where parties can credibly demonstrate that models lack certain dangerous mechanisms or capabilities without revealing proprietary details. Interpretability could also expand the Pareto frontier between safety and innovation by making it cheaper to monitor and mitigate risky behaviors, rather than relying only on blunt constraints.
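To give a flavor of what institutional legibility might look like, here is a deliberately simplified, hypothetical sketch: an interpretable monitor (here just a linear probe on hidden states) whose per-request outputs are written to an auditable record that a non-expert reviewer could read. The probe, threshold, and record schema are illustrative assumptions, not a deployed design.

```python
# A hypothetical sketch of interpretability as an institutional lever: an interpretable
# monitor whose outputs feed an audit record. Probe weights, threshold, and schema are
# illustrative stand-ins.
import json
import numpy as np

rng = np.random.default_rng(0)
probe_weights = rng.normal(size=64)        # stands in for a trained safety-relevant probe
threshold = 2.5                            # flag level agreed with the oversight body

def audit_record(request_id: str, hidden_state: np.ndarray) -> str:
    score = float(hidden_state @ probe_weights)
    return json.dumps({
        "request_id": request_id,
        "monitor": "deception-probe-v0",   # which interpretable monitor produced this signal
        "score": round(score, 3),
        "flagged": score > threshold,      # legible yes/no signal for non-expert reviewers
    })

# One request's hidden state from the monitored model (random stand-in here).
print(audit_record("req-0001", rng.normal(size=64)))
```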
We take seriously the view (e.g. [24]) that many safety problems are fundamentally about institutions, incentives, and governance, not just algorithms. From that perspective, MI is useful to the extent that it enables new forms of oversight, auditing, liability assignment, and norm-setting: it lets institutions ask better questions of models and developers, and verify answers with more than “trust us.”
For the prize, we’re interested in work that:
- Shows how interpretability signals can support concrete policy tools: e.g. risk audits, model cards, licensing regimes, or capability evaluations.
- Develops techniques for activation transfer, model-to-model supervision, or analog models that can be realistically used in deployment or evaluation pipelines.
- Proposes protocols where interpretability plays a role analogous to verification in arms control: limited, but enough to support cooperative equilibria.
- Explores how interpretability outputs can be made legible to non-experts (regulators, executives, internal risk committees) without losing faithfulness.
This agenda squarely targets the “usefulness” question: if interpretability is going to matter in the real world, it needs to show up in the levers that institutions actually pull.
Why Codegen
Code has several properties that make it unusually friendly to mechanistic work. It has formal semantics and an execution trace; we can run programs, inspect intermediate state, and often characterize “correctness” crisply. Many algorithmic tasks in code are well-understood mathematically. This makes it easier to pose precise questions like “what algorithm is the model implementing here?” and to test whether a proposed mechanism really explains behavior, not just correlates with it.
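As a small illustration of this property, the sketch below uses Python’s tracing hooks to record the line-by-line intermediate state of a (hypothetically model-generated) function. That execution trace is ground truth the interpreter hands us for free, and a proposed mechanistic account of the model that wrote the code can be checked against it rather than against the final output alone.

```python
# A minimal sketch: the interpreter as a source of ground-truth intermediate state.
# `generated_gcd` is a stand-in for model-generated code.
import sys

def generated_gcd(a: int, b: int) -> int:
    while b:
        a, b = b, a % b
    return a

trace_log = []

def tracer(frame, event, arg):
    # Record locals at every executed line of the generated function.
    if event == "line" and frame.f_code is generated_gcd.__code__:
        trace_log.append((frame.f_lineno, dict(frame.f_locals)))
    return tracer

sys.settrace(tracer)
result = generated_gcd(48, 18)
sys.settrace(None)

print("result:", result)
for lineno, locals_snapshot in trace_log:
    print(lineno, locals_snapshot)   # the step-by-step algorithm the code actually ran
```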
At the same time, modern code models are at the center of how LLMs are used in practice: as coding assistants, tool-using agents, autonomous refactoring systems, and more. They plan, call tools, read and write large codebases, and increasingly act as components in larger agent systems. That makes them a natural testbed for interpretability methods that aim to handle agentic behavior, tool use, and multi-step reasoning in realistic environments—not just isolated feed-forward tasks.
We’re also excited about a deeper connection: program synthesis and mechanistic interpretability are, in a sense, inverse problems. Program synthesis tries to go from desired behavior to code; mechanistic interpretability tries to go from code-like behavior to a description of the internal “program.” Insights in one direction may feed the other: understanding how models represent and manipulate programs may help us understand their own internal “programs,” and vice versa.
For this prize, we’re particularly interested in:
- Mechanistic studies of code models on non-toy tasks: refactoring, multi-file edits, tool-augmented debugging, repository-level changes, etc.
- Methods that use the structure of code (types, control flow, data flow, test suites, execution traces) as scaffolding for mechanistic explanations.
- Work that links internal circuits or features to high-level software engineering concepts: APIs, invariants, design patterns, security properties, or bug classes.
- Interpretability techniques that demonstrably improve code-generation systems in practice: making them safer, more robust, easier to supervise, or more predictable in deployment.
If interpretability can show clear, actionable wins in code generation—currently the largest industrial use case for LLMs—that will be a strong signal that the field is on a useful track, and a powerful engine for further investment and progress.
Thanks to Ana Marasović, Hadas Orgad, Jeff Phillips, John Hewitt, Leo Gao, Mor Geva, Neel Nanda, Stephen Casper, and Yonatan Belinkov for reviewing this RFP and providing feedback.
References
- Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., & Kim, B. (2018). Sanity checks for saliency maps. Advances in Neural Information Processing Systems, 31. https://proceedings.neurips.cc/paper_files/paper/2018/file/294a8ed24b1ad22ec2e7efea049b8737-Paper.pdf
- Borowski, J., Zimmermann, R. S., Schepers, J., Geirhos, R., Wallis, T. S., Bethge, M., & Brendel, W. (2021). Exemplary natural images explain CNN activations better than state-of-the-art feature visualization. https://arxiv.org/abs/2010.12606
- Hendrycks, D., & Hiscott, C. (2025). The misguided quest for mechanistic AI interpretability. https://ai-frontiers.org/articles/the-misguided-quest-for-mechanistic-ai-interpretability
- Bolukbasi, T., et al. (2021). An Interpretability Illusion for BERT. https://arxiv.org/abs/2104.07143
- Casper, S., Nadeau, M., Hadfield-Menell, D., & Kreiman, G. (2023). Benchmarking interpretability tools for deep neural networks. https://stephencasper.com/wp-content/uploads/2023/02/benchmarking_interpretability_tools_for_dnns-1.pdf
- Scheurer, J., et al. (2025). Practical pitfalls of causal scrubbing. https://www.alignmentforum.org/posts/DFarDnQjMnjsKvW8s/practical-pitfalls-of-causal-scrubbing
- Canby, M., et al. (2025). Measuring the Reliability of Causal Probing Methods: Tradeoffs, Limitations, and the Plight of Nullifying Interventions. https://arxiv.org/abs/2408.15510
- Casper, S. (2023). EIS VI: Critiques of mechanistic interpretability work in AI safety. https://www.alignmentforum.org/posts/wt7HXaCWzuKQipqz3/eis-vi-critiques-of-mechanistic-interpretability-work-in-ai
- Wu, Z., et al. (2025). AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders. https://arxiv.org/abs/2501.17148
- Huang, J., et al. (2024). RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations. https://arxiv.org/abs/2402.17700
- DeepMind Safety Research Team (2025). Negative results for sparse autoencoders on downstream tasks and deprioritising SAE research. https://deepmindsafetyresearch.medium.com/negative-results-for-sparse-autoencoders-on-downstream-tasks-and-deprioritising-sae-research-6cadcfc125b9
- Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. https://arxiv.org/abs/1811.10154
- Wang, K., et al. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. https://arxiv.org/abs/2211.00593
- Lieberum, T., et al. (2023). Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla. https://arxiv.org/abs/2307.09458
- Leask, P., et al. (2025). Sparse Autoencoders Do Not Find Canonical Units of Analysis. https://arxiv.org/abs/2502.04878
- Chanin, D., et al. (2025). A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. https://arxiv.org/abs/2409.14507
- Williams, I., et al. (2025). Mechanistic Interpretability Needs Philosophy. https://arxiv.org/abs/2506.18852
- Gao, L., et al. (2024). Scaling and evaluating sparse autoencoders. https://cdn.openai.com/papers/sparse-autoencoders.pdf
- Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. https://transformer-circuits.pub/2024/scaling-monosemanticity/
- He, Z., et al. (2024). Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders. https://arxiv.org/abs/2410.20526
- Elhage, N., et al. (2022). Toy models of superposition. https://arxiv.org/abs/2209.10652
- Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. https://arxiv.org/abs/1806.07572
- Lan, M., et al. (2025). Make Mechanistic Interpretability Auditable: A Call to Develop Standardized Empirical Guidelines. https://zenodo.org/records/17729152
- Casper, S. Reframing AI safety as a neverending institutional challenge. https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/



