Steering vectors are surprisingly powerful for steering a model behaviorally. However, one thing I haven't seen studied carefully: how much are steering vectors upweight/downweighting words even when they're not contextually related. For instance, if refusal is ablated, will the model have trouble using the word 'Sorry, I can't do that' even if it's just being repeated as a quote and isn't actually being used to refuse?
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{holtzman-are-steering-vectors-2026,
author = {Holtzman, Ari},
title = {Are steering vectors naturally entwined with words?},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/ipyjpy8RgOZuLfl8zGTF}
}Please sign in to comment on this idea.
No comments yet. Be the first to share your thoughts!