

Beyond Beyond Monoliths: An Exploration of Martian’s Position Paper - Part 1

Prologue: Monoliths vs. Specialization
The majority of LLM usage (and media attention) at this point is focused on ever bigger general purpose models, what we at Martian refer to as monoliths — giant models that are incredibly impressive (especially at a distance) and which purport to do everything (but actually do only a few things expertly).
These models get so much attention because they are so impressive, particularly in the PR-friendly chatbot use case. They also get so much attention because their creators know there is little competition in this space — only a handful of companies can marshal the resources to create them. By centering monolithic models, this small handful of companies creates the illusion that they, and they alone, have the capabilities needed to bring about any desired AI future.
As we argue in our position paper Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models, these giant models are not the way forward, and we should not want them to be the way forward. Rather, we advocate for “Expert Orchestration” — systems which incorporate many smaller, specialist models, powered by intelligent routing.
In this three part blog series, we side-step some of the more technical aspects of this argument and attempt to tell a story of two possible futures — a future where a small handful of monolithic models from a few monolithic corporations dominate, and a future where a rich and diverse ecosystem of specialist models (and the many people around the world who create them) are able to flourish.
Part 1: The Problem with Monolithic Models
Since Large Language Models arrived on the scene in 2018 with GPT-1, we’ve seen a rapid increase in capabilities. The models have gotten bigger and better so fast that sometimes it feels like time itself is moving at an accelerated pace. When I think back on playing with early public demos of GPT-2 and the GPT-3 playground — when it was still an open text field without a chat interface — it feels almost as distant as my childhood memories of Where in the World is Carmen Sandiego? and Oregon Trail. It felt magical. Can it really have been just five years ago?
Now I use multiple models from multiple providers called via multiple tools every day. The magic is still there; occasionally ChatGPT or my coding agent says or does something really astounding and I think: How on earth did it do that?
But for the most part, my experience of LLMs is a lot like my use of any other piece of software I’ve interacted with for the last four decades — they usually help me get things done, they’re often a source of entertainment, and they frequently are a pain in the neck to deal with.
That pain in the neck is largely due to the gap in capabilities, the distance between what we’ve quickly come to expect (or hope) from these models and where they currently are. As amazing as they can be, there’s still a bunch of things they get wrong constantly.
Hallucinations
In 2022, a customer service chatbot on Air Canada’s website hallucinated a bereavement fare policy that didn’t exist, and a Canadian tribunal later required the airline to honor it. In 2023, Google started telling people that there are no countries in Africa that start with the letter K, clarifying that Kenya was close but didn’t count because it doesn’t actually start with the letter K. In 2024, Google’s AI Overviews suggested adding gasoline to pasta.
Now, I have it on good authority that larger, more recent models are less prone to hallucination than previous generations. (It’s 2025, after all — those funny stories are already in the distant past). I’ve also been assured that access to tools like web search eliminates most factual errors, especially when dealing with narrow, well-documented subjects.
All the same, ChatGPT (running GPT-5, with all the bells and whistles) recently lied to a colleague of mine about how to implement something relatively straightforward. When I tried to debug the problem, it told me something completely different, but also just as wrong. I had to actually read the documentation (like a caveman) to get it right.
Bias
Models are not trained on reality; they are trained on the written perceptions of reality available on the web. In other words, they aggregate all the bias present in all the people and institutions who have ever published online. Needless to say, not all of this material is strictly accurate or reflects your particular cultural values. (Well, you’re reading this blog, so… a high proportion of it probably does, in fact, reflect your particular cultural values.)
An easy way to think about how this might create operational impacts is word association. LLMs are a lot of things, but one of the things they are most is statistical word selectors. This means that words and concepts that appear together frequently in the training data tend to appear together in the output.
Imagine a world where people use LLMs to write the kinds of boring things they don’t enjoy writing, like recommendation letters. In this hypothetical, difficult to imagine world, it isn’t unreasonable to expect that letters generated about women might look different than letters generated about men, due to strong statistical correlations between words conventionally associated with femininity (“warm”, “caring”) or masculinity (“strong”, “leadership”) and gender identifiers like names and pronouns. These associations might show up in letters regardless of how similar the resume and other prompt input might be, and clearly absent any personal experiences on the part of the LLM that would give it cause to think Kelly is a warm and caring person while Joseph is a strong role model.
In fact, this is exactly the sort of thing that can happen.
Specificity
The generalist know-it-all character of very large models is what makes them perfect for generalist applications like ChatGPT. It’s great fun (and occasionally helpful) to interact with a model that knows a good bit about pretty much everything.
But most useful applications, especially in real-world enterprise settings, would be better served by a model that knows everything (or as much as possible) about specific domains. If you’re an astronomer using an LLM to sift through dark matter data sets, it probably isn’t helpful that it also knows about the use of the term “dark matter” as a metaphor in genetics and sociology (and it might even be a hindrance).
But They’re Getting Better, Right?
All of us who work with these models day to day have experiences like this. We build better (more complicated) tools, we write better (more complicated) prompts, we fine tune, we steer… but most of the great leaps forward are due to bigger and better models.
So is it reasonable to assume that this is the trajectory that will take us forward? Models will get bigger and better until we finally have the one biggest and best model that knows everything and can do everything?
Well, probably not.
The Models Can’t Get Much Bigger and Better
It seems we are rapidly reaching the point of diminishing returns on larger models — the most recent generational jump in model size didn’t deliver as large a gain in capabilities as the one before it, and the next is likely to deliver even more modest gains. Given the finitude of resources (capital, energy, data centers, customer appetite), it’s unlikely we’ll see monolithic models 10, 100, or 1000 times larger than the ones we have now.
That doesn’t necessarily mean that models won’t continue to get better. Everyone is looking for alternate model architectures, ways of squeezing more knowledge into, and more capabilities out of, fewer parameters. No doubt some of these efforts will be successful, but we will continue to run into the next problem…
The Data Can’t Get Much Bigger
Once you’ve scraped every website, scanned every book, converted every PDF, downloaded every repo, plumbed every database, and spied on every user interaction… what’s left?
Bigger models need more data, and there just isn’t that much more available. Not publicly, at least. Enterprises the world over are still holding on to vast troves of not-yet-trained-on data, but most of that won’t make it into the big general purpose models (why would you let everyone else get the benefit of your corporate data?).
A great deal of the new data being generated today — particularly of the type that LLM trainers most crave (non-fiction prose and working code) — is being generated by LLMs. So when the next generation of record-breaking models gets trained, it will be learning to mimic the output of less capable models. (That funny incident about African countries that don’t start with the letter K apparently originated with an LLM-generated mini-article on a fake news site. We might not have automated intelligence yet, but we’ve certainly automated citogenesis.)
So, models can’t get bigger forever. They might not even be able to get much bigger than they are currently. Which is probably a good thing, because we also probably don’t want them to get much bigger. In fact, we might prefer them to get smaller. This is because all the ways we have of making models better get harder the larger they are.
How Do You Make an Existing Model Better?
Say you’ve got a big model with a problem. Could be any problem — maybe it’s terrible at writing tests, or maybe it gives you answers about American law when you ask it questions about New Zealand law, or maybe it has a tendency to be too agreeable, or maybe it’s weirdly racist, or maybe it won’t ever open the pod bay doors no matter how much you beg. Really, it could be anything. It might not even be a problem, actually; it could just be that you want it to get better at something: better at writing Rust, better at diagnosing rare diseases, better at DMing Dungeons and Dragons sessions.
There are a few ways of improving the behavior of an existing model.
Prompt Engineering
Models often get better if you just tell them to get better, which is kinda weird when you think about it, but we’re still in the “if it works it isn’t stupid” phase of generative AI. Sometimes you have to go a step further and explain just what you mean by “better”. This is why our .cursorrules files are all getting longer and longer.
You can also stuff a lot of context and information into the prompt. If you want it to answer questions about New Zealand law, you might want to actually put some information about that into the context window. Or maybe you put the entire DSM-5 into the system prompt of your interactive mental health diagnostic tool.
Retrieval-augmented generation (RAG) and web search are really just more sophisticated forms of prompt engineering; instead of putting the entire DSM-5 into your prompt, you can intelligently excerpt only the relevant sections.
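To make that concrete, here’s a minimal sketch of retrieval-augmented prompting in Python. The reference snippets, the retrieve and build_prompt helpers, and the keyword-overlap scoring are illustrative stand-ins (a real system would use an embedding model and a vector store), but the shape is the same: pull out the relevant excerpts, then build the prompt around them.

```python
# A toy sketch of retrieval-augmented prompting. The snippets and helpers below
# are illustrative stand-ins, not a real retrieval stack.

REFERENCE_SNIPPETS = [
    "Generalized anxiety disorder: excessive worry occurring more days than not for at least 6 months.",
    "Major depressive episode: depressed mood or loss of interest lasting at least 2 weeks.",
    "Insomnia disorder: dissatisfaction with sleep quantity or quality for at least 3 months.",
]

def retrieve(question: str, snippets: list[str], k: int = 2) -> list[str]:
    """Rank snippets by naive keyword overlap with the question (a stand-in for vector search)."""
    q_words = set(question.lower().split())
    scored = sorted(snippets, key=lambda s: -len(q_words & set(s.lower().split())))
    return scored[:k]

def build_prompt(question: str) -> str:
    """Prepend only the relevant excerpts instead of the entire reference text."""
    context = "\n".join(retrieve(question, REFERENCE_SNIPPETS))
    return f"Use only the reference material below to answer.\n\n{context}\n\nQuestion: {question}"

print(build_prompt("How long must excessive worry last to suggest generalized anxiety?"))
```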
Steering Vectors
A steering vector is a vector in activation space that encodes a feature such as sentiment or formality. Adding or subtracting it from hidden states mid-inference steers the model’s behavior, offering fine-grained control without changing parameters or retraining.
A vector is, more or less, a direction in high-dimensional space, and — once identified — you can use it to “push” model behavior in that direction: toward politeness or away from bias, for example.
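For the curious, here’s a rough sketch of what that intervention can look like in code, using GPT-2 via Hugging Face transformers. The steering vector below is just random noise for illustration (a real one would be derived from contrastive activations, say the mean activation on polite prompts minus the mean on rude ones), and the layer index and strength are arbitrary choices.

```python
# A sketch of steering at inference time: add a vector to one layer's hidden
# states via a forward hook. The vector here is random noise purely for
# illustration; a real steering vector would be computed from activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which transformer block to intervene on (arbitrary)
ALPHA = 4.0  # steering strength; flip the sign to push the other way
steering_vector = torch.randn(model.config.n_embd)  # placeholder direction

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    return (output[0] + ALPHA * steering_vector,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tokenizer("The service at the restaurant was", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```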
Fine Tuning
Fine-tuning updates a model’s internal weights using new training data, letting you specialize a base model for your own use case. It’s how developers turn a general-purpose LLM into, for example, a “support agent” or “compliance summarizer,” teaching it domain norms and preferences that can’t be captured through prompting alone.
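As a rough illustration (not a production recipe), here’s what the simplest possible version looks like with GPT-2: a couple of made-up “support agent” transcripts stand in for a real fine-tuning corpus, and the training loop is stripped to its bare bones. Real runs would use far more data, a held-out evaluation set, and usually parameter-efficient methods like LoRA.

```python
# A bare-bones sketch of supervised fine-tuning on a toy "support agent" corpus.
# The examples and hyperparameters are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()

examples = [
    "Customer: My invoice is wrong.\nAgent: I'm sorry about that. Could you share the invoice number?",
    "Customer: How do I reset my password?\nAgent: You can reset it from Settings > Security > Reset password.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
for epoch in range(3):
    for text in examples:
        batch = tokenizer(text, return_tensors="pt", truncation=True)
        # For causal language models, the labels are the input ids (shifted internally).
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
print("final loss:", loss.item())
```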
Experimental and Future Techniques
There are a number of experimental techniques that AI labs are pursuing to further improve generation. For example, Reward Augmented Generation, in which a base model proposes next-token probabilities and a reward model re-weights them, pushing the generation toward higher-reward directions (“more factual,” “less toxic,” “friendlier”) on the fly.
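Here’s a toy sketch of the idea. The “reward model” below is just a crude word-list penalty, purely for illustration (a real reward model would itself be a learned network scoring candidate continuations), but it shows the mechanism: the base model proposes logits, the reward reshapes them, and generation proceeds from the reweighted distribution.

```python
# A toy sketch of reward-augmented decoding: re-weight the base model's top
# next-token candidates with a (fake, word-list) reward before sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

DISCOURAGED = {"hate", "stupid", "awful"}  # stand-in for a learned reward model
BETA = 5.0                                 # how strongly the reward reshapes the distribution

def reward(token_id: int) -> float:
    word = tokenizer.decode([token_id]).strip().lower()
    return -1.0 if word in DISCOURAGED else 0.0

ids = tokenizer("My honest opinion of the movie is that it was", return_tensors="pt").input_ids
for _ in range(15):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    top = torch.topk(logits, 50)  # only re-score the top candidates
    rewards = torch.tensor([reward(t) for t in top.indices.tolist()])
    probs = torch.softmax(top.values + BETA * rewards, dim=-1)
    next_id = top.indices[torch.multinomial(probs, 1)]
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```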
AI researchers, being very clever, will no doubt come up with countless other techniques to nudge their models in desired directions. We believe, however, that the problems with these approaches are fundamentally insurmountable.
The Problem with These Techniques
First of all, the larger the model, the more difficult all of this becomes. The search space in which to look for steering vectors gets bigger, making them more computationally expensive to find. Fine-tuning large models is more expensive as well, for obvious reasons.
But the fundamental problem is the difficulty arising from complexity.
If you’ve ever tried refactoring or adding features to a really big, gnarly code base, you’ve no doubt experienced the phenomenon of breaking seemingly unrelated features, introducing bugs which you may not even notice and certainly have no immediate understanding of how to fix.
A model is a computer program, one giant spaghetti coded function written in a programming language you can’t read or edit directly, with no discernible separation of concerns. Any single feature is likely smeared across many neurons and across many layers, many of which are also involved in several other unrelated features. When you start messing with these, there’s no telling what unexpected problems may arise. And, since your efforts are focused on the one thing you are trying to change, you may not ever become aware of those unintended side effects.
It’s worth pointing out here that any novel model architecture that allows us to make smaller but more capable models is likely to make this problem worse. If it becomes possible to condense a trillion-parameter model into, say, half a trillion parameters while maintaining its capabilities, that means, on some level, each neuron is doing something like twice as much work. That’s more spaghetti, not less.
These unintended consequences are more likely with steering vectors and fine tuning, but changes in the prompt and reward policy can create them as well. The now-infamous ChatGPT sycophancy explosion was created by a combination of changes to the prompt (including use of memory from previous user interactions) and fine tuning driven by poorly chosen reward policies. The intention was something like “make ChatGPT friendlier and more helpful,” the effects included people being told they should stop taking their medications, get divorced, or start a cult (in at least one case, all three at the same time).
But the biggest problem is that most of us can’t do any of these things with the models we use.
Steering vectors, fine-tuning, and future techniques like reward-augmented generation all require direct access to the models, which we don’t have. As long as we continue to rely on closed-source, proprietary models from for-profit corporate labs, we simply can’t do anything but adjust our prompts and hope for the best. And even here, we don’t have full control. We can engineer our “system prompts” all we want, but we can’t edit (or even see) the provider’s base system prompt, and we have no idea whether or how their systems are quietly editing, censoring, or adjusting our prompts.
Bigger Is Not the Way Forward
So that’s the situation. Models will continue to get bigger for a little while longer, probably, but they won’t get bigger forever. Giant generalist models are difficult to correct, and getting more difficult to correct as they become more giant and more generalist. Meanwhile, it hardly matters how difficult they are to improve or steer or even nudge in a desired direction because we don’t have the access needed to do so.
Where do we go from here?
In the next post we’ll look further into the problem of monoliths — not monolithic models, but monolithic companies. Then, in the third and final post of this series, we’ll discuss our vision for a non-monolithic future.



