This paper reports many interesting findings on emergent misalignment. The key idea is that a model trained on insecure code, or on bad behavior in some other narrow domain, goes on to show bad behavior in much broader domains.
The paper already includes a lot of nice discussion that addresses my questions, but I am still wondering what the implications are for capabilities.
Does the model’s core capability on benchmarks such as math, coding, and reasoning also degrade alongside alignment? If so, what does this imply about the relationship between a model’s capability and its alignment: are the two correlated?
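One way to make this question concrete is to score several checkpoints on both a capability benchmark and an alignment evaluation, then check how the two sets of scores move together. The sketch below is only an illustration under assumptions: the helper names `score_deltas` and `pearson` are hypothetical, the metric names are placeholders, and nothing here comes from the paper's own evaluation code or results.

```python
from math import sqrt


def score_deltas(base, finetuned):
    """Per-metric change after fine-tuning (finetuned minus base).

    Both arguments are dicts mapping a metric name (hypothetical, e.g.
    "math" or "alignment") to a score; only metrics present in both are kept.
    """
    return {name: finetuned[name] - base[name] for name in base if name in finetuned}


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)
```

In use, `xs` would hold capability scores and `ys` alignment scores for the same checkpoints, in the same order. A correlation near +1 would suggest the two degrade together, while a value near 0 would suggest misalignment can emerge without a matching loss in capability. With only a handful of checkpoints, such a number is a rough signal and not evidence of a causal link.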
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{tan-emergent-misalignment-vs-2026,
author = {Tan, Chenhao},
title = {Emergent Misalignment vs. Capability},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/hBfbWTi4WVL43iWQCbS8}
}