Are mistake features secretly persona features?

by Ari Holtzman, 2 months ago

LLMs sometimes make mistakes that they would not make under a prompt that requires them to 'do better'. When a model makes such a mistake and we surgically intervene in the residual stream, what does the change we make activate? Is it just a mistake feature, or is it a whole aspect of a persona? And if it is the latter, which personas and which aspects are entwined with mistakes?
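
One way to make the question concrete is sketched below. It is only an illustration under stated assumptions, not a method proposed in the post: it uses a difference-of-means direction as the "surgical" intervention, cosine similarity as the measure of entanglement, and synthetic NumPy arrays in place of real residual-stream activations. The persona names (`careless`, `expert`) and all function names are hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def mistake_direction(acts_mistake, acts_correct):
    """Difference-of-means direction between two sets of activations.

    Each input has shape (n_examples, d_model). This is one assumed choice
    of intervention direction; others (probes, SAE features) are possible.
    """
    return unit(acts_mistake.mean(axis=0) - acts_correct.mean(axis=0))

def persona_overlap(direction, persona_dirs):
    """Cosine similarity between the mistake direction and each persona direction."""
    return {name: float(unit(vec) @ direction) for name, vec in persona_dirs.items()}

def ablate(acts, direction):
    """Remove the component along a unit-length `direction` from every activation."""
    return acts - np.outer(acts @ direction, direction)

# Synthetic stand-in data (assumption): mistakes are generated by adding a
# hypothetical "careless" persona direction to otherwise random activations.
rng = np.random.default_rng(0)
d_model = 64
careless = unit(rng.normal(size=d_model))
expert = unit(rng.normal(size=d_model))
acts_correct = rng.normal(size=(200, d_model))
acts_mistake = rng.normal(size=(200, d_model)) + 3.0 * careless

direction = mistake_direction(acts_mistake, acts_correct)
overlaps = persona_overlap(direction, {"careless": careless, "expert": expert})
ablated = ablate(acts_mistake, direction)
```

In this toy setup the mistake direction overlaps strongly with the planted `careless` direction by construction. With real activations, a high overlap with some persona direction would be evidence for the "persona aspect" reading, while low overlap with every persona direction would favor a standalone mistake feature.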

If you are inspired by this idea, you can reach out to the author for collaboration or cite it:

@misc{holtzman-are-mistake-features-2026,
  author = {Holtzman, Ari},
  title = {Are mistake features secretly persona features?},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/CNm0EKxzLDaD1q5wIFnz}
}
