Alignment is the Watermark

by Ari Holtzman, 4 months ago

The ideal base LM (trained only to match the data distribution) is not AI-detectable, because it is the distribution. The ideal aligned model, however, for almost any definition of alignment within the Overton window, is likely AI-detectable because of its choice of balanced words and actions, its honesty, and its competence. It should therefore be no surprise if base models eventually become completely undetectable while aligned models remain inevitably AI-detectable, at least by other AIs.
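One way to make this argument concrete is through total variation (TV) distance. For a single sample with equal prior odds, the best possible detector distinguishing two distributions has accuracy 1/2 + TV/2. The sketch below is only a toy illustration under assumptions that are not in the original post: the three reply "styles", their probabilities, and the names `human`, `base_model`, and `aligned_model` are all hypothetical.

```python
def total_variation(p, q):
    """Total variation distance between two distributions on a finite support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def best_detector_accuracy(reference, model):
    """Accuracy of the optimal single-sample detector, assuming equal priors."""
    return 0.5 + 0.5 * total_variation(reference, model)


# Hypothetical toy distribution over three reply styles (assumed numbers).
human = {"blunt": 0.5, "hedged": 0.3, "evasive": 0.2}

# Ideal base LM: an exact copy of the distribution it was trained to match.
base_model = dict(human)

# Assumed aligned model: probability mass shifted toward balanced wording.
aligned_model = {"blunt": 0.1, "hedged": 0.8, "evasive": 0.1}

print("base model:   ", best_detector_accuracy(human, base_model))
print("aligned model:", best_detector_accuracy(human, aligned_model))
```

With these made-up numbers, the base model leaves the optimal detector at chance (0.5), while the aligned model's shift gives a TV distance of 0.5 and so a best-case accuracy of about 0.75. Any nonzero shift away from the distribution leaves some detectable signal, which is the sense in which alignment itself could act as the watermark.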

If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{holtzman-alignment-is-the-2026,
  author = {Holtzman, Ari},
  title = {Alignment is the Watermark},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/QOrmM1PyCsKks8ljUW4R}
}
