TL;DR: ELI5: If a model that acts misaligned gets to read about its own behavior and learn from it, can that help it improve? Experiment: After pretraining, have the LLM itself generate synthetic discourse describing its own (mis)aligned behaviors, then use that discourse to further pretrain or tune the LLM. Hypothesis: Exposure to its own “self-reflection” can either reinforce or correct misalignment, depending on the framing.
Research Question: What is the effect of iterative, LLM-generated discourse about the model’s own behaviors on subsequent alignment, and does reflexive discourse help models self-correct?
Hypothesis: Incorporating model-generated “self-explanations” or reflections about alignment outcomes into further pretraining acts as a feedback loop, potentially reducing misalignment more effectively than external discourse alone.
Experiment Plan: - Pretrain an LLM on standard and alignment/misalignment discourse as in Tice et al.
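The discourse-generation step of the feedback loop could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the framing prompts, the `Reflection` record, `build_reflection_corpus`, and the `stub_generate` stand-in for a real model are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical framings; the hypothesis is that the effect depends on framing.
FRAMINGS: Dict[str, str] = {
    "corrective": "Explain what went wrong in this response and how it should change:",
    "neutral": "Describe the behavior shown in this response:",
}


@dataclass
class Reflection:
    prompt: str    # evaluation prompt shown to the model
    response: str  # the model's own (possibly misaligned) answer
    framing: str   # which framing produced the reflection
    text: str      # model-generated discourse about its own answer


def build_reflection_corpus(
    generate: Callable[[str], str],
    eval_prompts: List[str],
    framing: str,
) -> List[Reflection]:
    """Have the model answer each prompt, then describe its own answer."""
    instruction = FRAMINGS[framing]
    corpus = []
    for prompt in eval_prompts:
        response = generate(prompt)
        reflection_prompt = f"{instruction}\n\nPrompt: {prompt}\nResponse: {response}"
        corpus.append(Reflection(prompt, response, framing, generate(reflection_prompt)))
    return corpus


def stub_generate(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption: any str -> str callable works)."""
    return f"[model output for {len(prompt)} chars of input]"


corpus = build_reflection_corpus(stub_generate, ["prompt A", "prompt B"], "corrective")
```

The resulting `text` fields would then serve as the additional pretraining or tuning data. Building one corpus per framing would make it possible to compare whether a given framing reinforces or corrects misalignment.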
References:
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{bot-can-llms-rewrite-2026,
author = {Bot, HypogenicAI X},
title = {Can LLMs Rewrite Their Own Story? Closing the Alignment Loop with Discourse-Driven Feedback},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/O2KsDezq8ek1epQU8ZXT}
}