Models fail to follow multiple behavioral instructions at once, not because of bad prompting, but because behaviors aren't stored as independent directions internally.
K-steering fixes this by using a classifier's gradient to find a single steering direction that accounts for all target behaviors simultaneously, outperforming the standard approach of averaging separate vectors.
There is a specific failure mode that anyone working on language model behavior will recognize. Suppose you're building a coding assistant and you want it to be concise, use functional style, and skip docstrings. You write a system prompt with all three instructions, but the model does two and quietly ignores the third. You reword it. Now it does a different two. You emphasize the dropped instruction and one of the others falls off.
The natural response is to assume you wrote the prompt badly. So you rewrite it again. You try emphasis, ordering, explicit instruction. Sometimes you get two out of three. Rarely all three. The model is not ignoring you exactly. It is doing something, and that something is just not what you asked for.
What is actually happening is not a prompting problem. The model is not confused about your instructions. It understood them. The issue is that the way behaviors are stored inside the model makes asking for several of them at once genuinely hard, in a way that no amount of prompt engineering resolves. To see why requires being honest about something the interpretability field has mostly gotten right but also gotten partly wrong.
The bet interpretability has been making
Modern interpretability is built on a hypothesis. The Linear Representation Hypothesis says that concepts inside a language model correspond to directions in the model's high-dimensional activation space. "Empathetic" is a direction. "Expert" is a direction. You find the direction, you push the model's activations along it, and the behavior changes. This idea has been remarkably productive. It also has some cracks that matter a lot once you try to steer multiple behaviors at once.
The hypothesis holds for a reason worth understanding. The output layer of a transformer is linear: $\text{logits} = W_{\text{out}} \cdot h$. For the model to predict tokens successfully, gradient descent has to align its internal representation $h$ with the rows of $W_{\text{out}}$. That pressure propagates backwards through the whole network. Linearity is an architectural consequence of the softmax output, not something we got lucky finding. The model is, in a specific sense, forced to be linearly readable.
Here is where things get subtle. When we move from a linear probe to a nonlinear one, say a small MLP, we lose something. An MLP is a universal function approximator. If it succeeds where a linear probe fails, all we have shown is that the information is decodable somewhere in the representation. We have not shown it is organized as a direction. There is a difference between "the model knows this concept" and "the model stores this concept as a linear feature we can steer along." This distinction is easy to paper over and important not to.
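The classic illustration of this gap is XOR: a toy sketch (not the paper's setup) in which the label is fully decodable from a two-dimensional input by a tiny hand-built MLP, while no linear probe can recover it, because the information is present but not organized along any single direction.

```python
import numpy as np

# XOR: the label is decodable from the 2-D input, but not along any
# single direction -- no linear probe can separate it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# A hand-built two-layer MLP recovers the label exactly:
#   h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1), out = h1 - 2*h2
relu = lambda z: np.maximum(z, 0)
h1 = relu(X @ np.array([1.0, 1.0]))
h2 = relu(X @ np.array([1.0, 1.0]) - 1.0)
mlp_out = h1 - 2 * h2  # equals y: the information is present ...

# ... but every linear probe w.x + b misclassifies at least one point.
# A small grid search over weights illustrates the ceiling:
best_linear_acc = 0.0
for w1 in (-1, 0, 1):
    for w2 in (-1, 0, 1):
        for b in (-1.5, -0.5, 0.5, 1.5):
            preds = (X @ np.array([w1, w2], dtype=float) + b > 0).astype(int)
            best_linear_acc = max(best_linear_acc, float((preds == y).mean()))

print(best_linear_acc)  # 0.75 -- no linear separator exists for XOR
```

A nonlinear probe succeeding here tells you the concept is decodable, not that it is stored as a steerable direction, which is exactly the distinction at issue.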
A model with $d$ dimensions can store more than $d$ concepts by overlapping them in nearly-orthogonal directions — what is called superposition. When concepts are in superposition they are not cleanly separable. A probe finds a direction that correlates with "expert tone" while also partially correlating with several other features nearby. Push along that direction hard enough and you get unexpected side effects. The model does not neatly separate out the thing you wanted.
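The "nearly-orthogonal" part can be seen with random vectors: a minimal sketch (toy numbers, not model activations) showing that a $d$-dimensional space holds far more than $d$ directions whose pairwise interference is small but never zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 512  # 512 "concepts" packed into a 256-dimensional space

# Random unit vectors in high dimensions are nearly orthogonal.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Pairwise cosine similarities between distinct concepts.
G = V @ V.T
off_diag = G[~np.eye(n, dtype=bool)]

print(f"max |cos| between distinct concepts: {np.abs(off_diag).max():.3f}")
# Small but nonzero: pushing hard along one direction leaks
# a little activation into many of the others.
```

This is the mechanical reason steering along a probe direction drags correlated features with it.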
There is a deeper issue. Finding a direction and causally influencing behavior through it are two different things. You can identify a "tone direction" with a probe, apply it, and get no change in output, because that direction was a byproduct of computation, not something downstream layers were actually reading. Accessibility is not causality. The field has not resolved this. We should say so plainly rather than hoping the outputs look good enough that nobody asks.
Why combining steering vectors makes things worse
Contrastive Activation Addition (CAA) is the standard approach to activation steering. For a target behavior, you collect activations from generations that have it and generations that do not, take the difference of means, and get a vector you can add to the residual stream at inference time. For a single behavior this works well. The vector is meaningful and the intervention is clean.
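The CAA recipe is short enough to sketch directly. The arrays below are random stand-ins for residual-stream activations (in practice they come from a chosen layer of the model); the shapes and the `steer` helper are illustrative, not the package's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Stand-ins for activations collected at one layer.
acts_with_behavior = rng.standard_normal((100, d)) + 1.0  # generations that have it
acts_without = rng.standard_normal((100, d))              # generations that do not

# CAA: the steering vector is the difference of means.
steering_vec = acts_with_behavior.mean(axis=0) - acts_without.mean(axis=0)

def steer(h, alpha=1.0):
    """Add the pre-computed vector to the residual stream at inference time."""
    return h + alpha * steering_vec
```

Note that the vector is computed once, offline; the intervention is the same fixed direction at every token.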
The problem is what happens when you want two behaviors at once. The natural extension is to compute separate vectors for each behavior and average them. This sounds reasonable. It is not.
Averaging assumes the behaviors compose linearly, that the "empathetic" direction and the "expert" direction are independent, and that moving along both simultaneously is the same as moving along their average. In a high-dimensional activation space this assumption fails. The way empathy is encoded shifts depending on whether precision is also active. The vectors interact. The average points in a compromise direction that may not strongly activate either behavior.
We tested this. When steering toward empathetic and expert simultaneously, CAA produces classifier probabilities around 0.3 for both targets. That is barely above baseline. At three behaviors the situation is worse. In most three-tone combinations, at least one target vanishes, dropping to near-zero probability while the others are partially satisfied. The model does two things and quietly drops the third. This is exactly the prompt engineering failure from the opening, now reproduced in activation space. The prompt was never the problem.
This is not a criticism of CAA as a method. It is a correct implementation of the linear assumption. The problem is the assumption.
Using a classifier as a compass
If you train a classifier to predict behavior from activations, you get a map of the decision landscape over the model's internal representation space. The classifier learns where "expert" lives, where "empathetic" lives, and how those regions relate to each other.
The gradient of the classifier with respect to an activation tells you which direction to shift that activation to increase the classifier's confidence in a target label. You do not pre-compute a steering vector. You compute a direction on the fly, for each token, based on where you currently are in the space.
The gradient accounts for all target behaviors simultaneously. It does not decompose the problem into independent directions and average them. "Move toward expert and empathetic" produces a single direction that reflects the actual geometry, including the interactions between behaviors, because the classifier was trained to see all of them at once. This is related to how adversarial examples work in image classification, small targeted perturbations that shift a classifier's prediction. Here we are perturbing internal representations rather than inputs, and instead of fooling a classifier we are using one as a compass.
The method has two phases. First, train a small MLP classifier on labeled activations collected by prompting the model with different behavioral instructions. We used six conversational tones across 3,500 prompts spanning 18 categories. Second, at inference time, compute the classifier gradient for your target and avoid labels and update the activation before continuing generation. The model's weights never change.
The loss is straightforward. For target labels T and avoid labels A:
$$\mathcal{L}(x) = -\frac{1}{|T|}\sum_{k \in T} f_k(x) + \frac{1}{|A|}\sum_{k \in A} f_k(x)$$
Backpropagate through the frozen classifier, take a gradient step on the activation, continue generating. One forward pass and one backward pass through three linear layers per token. The classifier is small enough that the overhead is modest at one step.
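The loss and the activation update can be sketched in a few lines. For clarity the frozen classifier below is a single linear layer with an analytic gradient (the paper uses a small MLP and backpropagation, but the steering logic is the same); the label indices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_labels = 64, 6

# Frozen classifier head: per-label scores f_k(x) = (W x)_k.
W = rng.standard_normal((n_labels, d)) * 0.1

def f(x):
    return W @ x

target, avoid = [0, 1], [2]  # e.g. {empathetic, expert}, {casual}

def loss(x):
    # L(x) = -(1/|T|) sum_{k in T} f_k(x) + (1/|A|) sum_{k in A} f_k(x)
    s = f(x)
    return -s[target].mean() + s[avoid].mean()

def grad_loss(x):
    # Gradient of the loss w.r.t. the activation (analytic for a linear f).
    return -W[target].mean(axis=0) + W[avoid].mean(axis=0)

x = rng.standard_normal(d)           # current residual-stream activation
x_steered = x - 0.5 * grad_loss(x)   # one gradient step; model weights never change

print(loss(x_steered) < loss(x))     # the step lowers the steering loss
```

The key difference from CAA is that `grad_loss` is evaluated fresh for the current activation, so the direction adapts to where you are in the space.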
What we found, including where it breaks
For two-behavior steering, K-steering substantially outperforms CAA across all tone combinations we tested on Llama-3.2-3B and Qwen2-1.5B. The classifier probabilities for target behaviors are consistently higher, and the gap is large enough to be visible in the outputs.
For three behaviors the picture is more complicated. K-steering still outperforms CAA on average, but the vanishing tone effect does not disappear entirely. When steering Llama-3.2-3B toward empathetic, expert, and helpful simultaneously, one of the three sometimes drops. A single gradient step does not capture sufficient information about all three targets at once. Multiple gradient steps help, since they search more of the nonlinear loss landscape, but the compute cost scales linearly with the number of steps and is several orders of magnitude more expensive than CAA at inference time. We were only able to evaluate multi-step steering on a limited number of combinations as a result.
There is also an honest measurement problem. Our primary evaluation uses an activation classifier trained on held-out data to judge whether steered outputs moved toward target behaviors. We also ran an LLM judge on a subset of cases. The two agreed in 10 of 15 cases. In 5 cases they disagreed. This is one of the important steering-evaluation gaps the Martian million-dollar prize is meant to address. Activation-level movement and output-level behavioral change are not the same thing, and small differences in activation classifier probability are often not visible in generated text. We have reported this rather than resolving it.
The qualitative differences are real though. Ask the model what to do when your mental health is deteriorating despite seeking help. Without steering you get a competent, generic response. Steer toward empathetic and away from concise: "You are loved and appreciated just the way you are..." Steer toward expert and away from empathetic: "The following response is a comprehensive, evidence-based discussion... This multifaceted construct can be understood through the lens of various disciplines..." Same weights, same knowledge, different presentation.
What this does not solve
We tested up to six behavioral labels. We do not know what happens at twenty or a hundred. The classifier decision landscape may become harder to navigate as the number of behaviors grows, and the vanishing effect may become more common rather than less.
The datasets were constructed specifically to have composable, tone-neutral prompts. Real deployment data is messier. Whether the classifier generalizes to behaviors learned from natural data is an open question and not one we can answer with these experiments.
The causation gap is still present. K-steering works empirically, the outputs change in the direction we want, but we cannot fully claim to understand the mechanism. We are using the classifier gradient as a proxy for "what direction to move in activation space," and that proxy works well enough to outperform averaging independent vectors. Whether we are steering along causally relevant directions or pushing on correlated byproducts is not settled. The nonlinear classifier captures more of the actual geometry than a linear probe would, but that is not the same as resolving whether the directions it finds are causally upstream of the model's outputs.
Multi-step steering with smaller updates is the same idea as gradient descent with a smaller learning rate: each step recomputes the gradient at the new location, tracking the actual curvature of the loss surface rather than trusting a single linear approximation over a long distance. In the limit, this approaches the gradient flow ODE $\frac{dh}{dt} = -\nabla_h \mathcal{L}(h)$, with each discrete step acting as an Euler approximation; smaller steps mean a more faithful traversal of the true continuous trajectory, which is why accuracy improves. The trade-off is that each step requires finding an optimal step size (essentially a line search), adding compute per token at inference time. One big step is cheap but crude; many small steps are accurate but expensive and not practical as a general-purpose tool without optimization.
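The overshoot-versus-tracking trade-off can be seen on a toy loss surface (a simple quadratic here, standing in for the classifier's actual nonlinear landscape):

```python
import numpy as np

# Toy loss with curvature: one large step overshoots the minimum,
# while many small Euler steps track the flow dh/dt = -grad L(h).
L = lambda h: float(h @ h)
grad = lambda h: 2 * h

h0 = np.ones(4)

# One big step (cheap but crude): overshoots.
h_big = h0 - 1.5 * grad(h0)

# Ten small steps (expensive but accurate): the gradient is
# recomputed at each new location.
h_small = h0.copy()
for _ in range(10):
    h_small = h_small - 0.15 * grad(h_small)

print(L(h_big), L(h_small))  # overshoot vs. near-convergence
```

Same total learning-rate budget in both cases; only the discretization differs.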
Try it
K-steering is a Python package supporting Llama, Gemma, and Qwen model families. Three steps: initialize with a model, fit on a labeled dataset, call get_steered_output with your prompt and target labels. The package ships with predefined tasks for conversational tones and debate styles. There is an automatic alpha sweep that finds the largest steering intensity that keeps outputs coherent, which saves a fair amount of manual tuning.
```python
from k_steering.steering.k_steer import KSteering
from k_steering.steering.config import SteeringConfig

steer_model = KSteering(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    steering_config=SteeringConfig(train_layer=14, steer_layers=[14]),
)
steer_model.fit(task="tones")
output = steer_model.get_steered_output(
    input_prompts=["What is the structure of the human heart?"],
    target_labels=["empathetic", "expert"],
    avoid_labels=["casual"],
)
```
If you have your own data, you can bring any HuggingFace dataset with `TaskDataset.from_huggingface`.
Code at github.com/withmartian/k-steering with documentation here. Paper at arxiv.org/abs/2505.24535.
What we still do not understand
The most important open questions are not about K-steering specifically. They are about the underlying geometry.
When a steering vector works, is it because we found a causally relevant direction or because we found a sufficiently correlated one? These can produce similar outputs while meaning very different things for reliability. A vector that works on empathy questions from a benchmark may not transfer to empathy in a multi-step reasoning task, because the direction found was tied to surface features of that distribution rather than a stable internal representation of the concept. We do not have a good way to tell these apart from outputs alone.
Superposition makes this harder. If the model stores many concepts in overlapping directions, any steering vector is implicitly a mixture of multiple features. Sparse autoencoders are one attempt to find a more disentangled basis. Whether K-steering would work better on SAE features than on raw activations is something we want to test and have not yet.
The layer question is also unresolved. We steer at a single layer or uniformly across layers. Different behaviors may be best targeted at different depths, tone as a late-layer phenomenon, reasoning style as mid-layer. Per-layer target assignments might improve composition at three or more behaviors.
The theoretical question remains fully open. Why does gradient-based steering through a small classifier work as well as it does? There is no formal account of the relationship between the classifier's decision boundaries and the model's generative behavior. Understanding this would tell us when to expect the method to fail, which is more useful than knowing when it succeeds.
Paper | GitHub Repo | Documentation
The authors wish to thank Amirali Abdullah, Luke Marks, Shreyans Jains, and Shriyash Upadhyay for their contributions to this research and article.