Are mistake features secretly persona features?


LLMs sometimes make mistakes that they would not make under a prompt that requires them to 'do better'. When a model makes such a mistake and we surgically intervene in the residual stream, what does the change we make activate? Is it just a mistake feature, or is it a whole aspect of a persona? And if it is the latter, which personas and which aspects are entwined with mistakes?
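One way to make the question concrete is the rough sketch below. It is not the author's method, only an illustration under stated assumptions: the "intervention" is taken to be a difference-of-means shift between residual-stream activations from runs with mistakes and runs prompted to be careful, and the persona directions are assumed to come from some separate procedure. The activations are synthetic, and every name (`intervention_delta`, `persona_overlap`, the "careless" and "formal" personas) is hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def intervention_delta(mistake_acts, careful_acts):
    """Difference of means: the shift from 'mistake' toward 'careful' activations.

    Assumption: this stands in for whatever surgical edit is actually applied.
    """
    careful_mean = mean_vector(careful_acts)
    mistake_mean = mean_vector(mistake_acts)
    return [c - m for c, m in zip(careful_mean, mistake_mean)]


def persona_overlap(delta, persona_directions):
    """Cosine similarity between the intervention and each persona direction."""
    return {name: cosine(delta, d) for name, d in persona_directions.items()}


# --- Synthetic demo (toy data, not real model activations) ---
rng = random.Random(0)
DIM = 16


def noise():
    return [rng.gauss(0.0, 0.1) for _ in range(DIM)]


def unit(i):
    v = [0.0] * DIM
    v[i] = 1.0
    return v


# Hypothetical persona directions; in practice these would have to be
# estimated from the model, which this sketch does not attempt.
persona_directions = {"careless": unit(0), "formal": unit(1)}

# Toy assumption: runs with mistakes carry an extra "careless" component.
mistake_acts = [[n + c for n, c in zip(noise(), unit(0))] for _ in range(50)]
careful_acts = [noise() for _ in range(50)]

delta = intervention_delta(mistake_acts, careful_acts)
overlap = persona_overlap(delta, persona_directions)

for name, score in overlap.items():
    print(f"{name}: {score:+.3f}")
```

In this toy setup the shift points strongly away from the hypothetical "careless" direction and is nearly orthogonal to "formal". With real activations, a large overlap with a persona direction would suggest the intervention is touching more than an isolated mistake feature; a small overlap would not rule that out, since cosine similarity only captures linear alignment.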

llms persona mistakes


If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{holtzman-are-mistake-features-2026,
  author = {Holtzman, Ari},
  title = {Are mistake features secretly persona features?},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/CNm0EKxzLDaD1q5wIFnz}
}
