Are steering vectors naturally entwined with words?


Steering vectors are surprisingly powerful for steering a model's behavior. However, one thing I haven't seen studied carefully is how much steering vectors upweight or downweight particular words even when those words aren't contextually tied to the steered behavior. For instance, if refusal is ablated, will the model have trouble producing the phrase 'Sorry, I can't do that' even when it is merely being repeated as a quote and isn't actually being used to refuse?
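One way to make the question concrete is to compare how much ablating a direction shifts the log-probability of a token such as "Sorry" in a context where the model is refusing versus a context where it is only quoting the phrase. The sketch below is a toy illustration of that measurement, not a result: the three-word vocabulary, the unembedding rows, the `REFUSAL_DIR` vector, and both hidden states are made-up numbers, and all function names are hypothetical. With a real model, the hidden states and the direction would have to be extracted from the model itself.

```python
import math


def dot(a, b):
    """Dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(h, r):
    """Directional ablation: remove the component of h along r."""
    norm = math.sqrt(dot(r, r))
    r_hat = [x / norm for x in r]
    coeff = dot(h, r_hat)
    return [x - coeff * u for x, u in zip(h, r_hat)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def token_logprobs(h, unembed):
    """Next-token log-probs from a final hidden state (toy linear readout)."""
    return log_softmax([dot(row, h) for row in unembed])


def logprob_shift(h, r, unembed, token_id):
    """Change in log p(token) caused by ablating direction r from h."""
    before = token_logprobs(h, unembed)[token_id]
    after = token_logprobs(ablate_direction(h, r), unembed)[token_id]
    return after - before


# --- Toy setup: every number below is an assumption for illustration ---
VOCAB = ["Sorry", "Sure", "the"]
UNEMBED = [
    [2.0, 0.0, 0.5],   # "Sorry": assumed to align with the refusal direction
    [-1.0, 1.0, 0.0],  # "Sure"
    [0.0, 0.0, 1.0],   # "the"
]
REFUSAL_DIR = [1.0, 0.0, 0.0]   # hypothetical refusal direction

h_refusing = [1.5, 0.2, 0.1]    # hypothetical state: model is about to refuse
h_quoting = [0.8, 0.1, 1.0]     # hypothetical state: model is quoting the phrase

sorry = VOCAB.index("Sorry")
shift_refusing = logprob_shift(h_refusing, REFUSAL_DIR, UNEMBED, sorry)
shift_quoting = logprob_shift(h_quoting, REFUSAL_DIR, UNEMBED, sorry)

print(f"shift when refusing: {shift_refusing:.3f}")
print(f"shift when quoting:  {shift_quoting:.3f}")
```

If the quoting-context shift were close to zero on a real model, that would suggest the vector tracks the behavior rather than the words. A shift comparable to the refusing-context one would suggest the two are entangled.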

llms mechinterp unlearning


If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{holtzman-are-steering-vectors-2026,
  author = {Holtzman, Ari},
  title = {Are steering vectors naturally entwined with words?},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/ipyjpy8RgOZuLfl8zGTF}
}
