If you repeatedly reject Claude's output as "too AI-sounding," the text starts passing AI detectors like Pangram after a few rounds. This works even when the rejected text wasn't generated in the current conversation. The open question is why iterative rejection works so much better than just asking the model to "sound less like AI" in a single shot. One possibility is that the model has access to the relevant features but a simple instruction doesn't activate the right steering. Another is that something about the rejection context itself is doing the work: the model isn't getting better at suppressing AI features, it's shifting into a different generation regime entirely.
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{holtzman-llm-bullying-2026,
author = {Holtzman, Ari},
title = {LLM Bullying},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/48TRJ2Yy5Yab0afEDIhr}
}Please sign in to comment on this idea.
No comments yet. Be the first to share your thoughts!