Are steering vectors naturally entwined with words?

by Ari Holtzman, 3 months ago

Steering vectors are surprisingly powerful for changing a model's behavior. However, one thing I haven't seen studied carefully is how much steering vectors upweight or downweight particular words even when those words aren't contextually related to the steered behavior. For instance, if refusal is ablated, will the model have trouble producing the phrase 'Sorry, I can't do that' even when it is just repeating it as a quote and not actually refusing?
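
One way to make the question concrete is to measure how much the log-probability of a specific token changes when a direction is ablated from the hidden state. The sketch below is a toy illustration only: it assumes ablation means projecting the direction out of a single hidden state, and the hidden state, the "refusal" direction, and the three-token unembedding matrix are all made-up numbers, not values from any real model. In an actual experiment the hidden state would come from a prompt where the model is merely quoting the phrase, and the shift would be compared against a prompt where it is genuinely refusing.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def ablate_direction(hidden, direction):
    """Remove the component of `hidden` that lies along `direction`."""
    coeff = dot(hidden, direction) / dot(direction, direction)
    return [h - coeff * d for h, d in zip(hidden, direction)]

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_logprob_shift(hidden, direction, unembed, token_id):
    """Log-probability of `token_id` after ablation minus before ablation."""
    before = log_softmax([dot(row, hidden) for row in unembed])[token_id]
    ablated = ablate_direction(hidden, direction)
    after = log_softmax([dot(row, ablated) for row in unembed])[token_id]
    return after - before

# Hypothetical toy values (assumptions for illustration, not real model data).
hidden = [1.0, 2.0, 0.5]             # hidden state at the position being predicted
refusal_direction = [0.0, 1.0, 0.0]  # stand-in for an extracted refusal vector
unembed = [
    [0.1, 1.0, 0.0],  # token 0: stand-in for "Sorry", aligned with the direction
    [1.0, 0.0, 0.2],  # token 1: an unrelated token
    [0.0, 0.0, 1.0],  # token 2: another unrelated token
]

sorry_shift = token_logprob_shift(hidden, refusal_direction, unembed, 0)
other_shift = token_logprob_shift(hidden, refusal_direction, unembed, 1)
print(f"shift for 'Sorry' stand-in: {sorry_shift:.3f}")
print(f"shift for unrelated token:  {other_shift:.3f}")
```

In this toy setup the token whose unembedding row aligns with the ablated direction loses probability regardless of context, which is exactly the kind of word-level entanglement the question asks about. Whether real steering vectors behave this way in quoting contexts is the open empirical question.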

If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{holtzman-are-steering-vectors-2026,
  author = {Holtzman, Ari},
  title = {Are steering vectors naturally entwined with words?},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/ipyjpy8RgOZuLfl8zGTF}
}
