LLMs are increasingly prompted to “think again,” critique themselves, and revise their answers. While self-reflection often improves task performance, it remains unclear whether that improvement reflects a meaningful change in the model’s internal representations or merely shallow re-sampling from essentially the same internal state.
This project investigates whether self-critique induces structured shifts in a model’s internal representations, particularly in the residual stream.
Core Question:
When an LLM generates an initial answer, critiques it, and then revises it, do its internal activations move to a distinct and more structured reasoning subspace?
Experimental Design:
1. Prompt the model on reasoning tasks (math, logic, multi-step QA).
2. Capture residual stream activations during three stages (see the capture sketch after this list):
   - the initial answer,
   - the self-critique,
   - the revised answer.
3. Compare representations across stages using (sketches after this list):
   i) cosine similarity drift,
   ii) PCA/subspace analysis,
   iii) linear probes (correct vs. incorrect reasoning),
   iv) robustness to activation steering.
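A minimal sketch of step 2, assuming TransformerLens as the instrumentation layer; the model name, layer index, example task, and prompt-chaining scheme are illustrative placeholders rather than a committed protocol:

```python
# Capture sketch: residual-stream activations at the three stages of one
# problem. Assumes TransformerLens; model, layer, and prompts are placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
LAYER = 6  # illustrative mid-network layer

def resid_at_last_token(prompt: str):
    """Residual stream after block LAYER at the final token position."""
    _, cache = model.run_with_cache(prompt)
    return cache["resid_post", LAYER][0, -1, :]

question = "If 3x + 2 = 11, what is x?"  # illustrative reasoning task
initial = question + "\nAnswer: x = 4."                    # stage 1: initial answer
critique = initial + "\nCritique: 3*4 + 2 = 14, not 11."   # stage 2: self-critique
revised = critique + "\nRevised answer: x = 3."            # stage 3: revision

h_initial, h_critique, h_revised = (
    resid_at_last_token(p) for p in (initial, critique, revised)
)
```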
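For steps 3(i)–(ii), drift and subspace overlap might be computed roughly as follows; the random arrays are placeholders standing in for per-problem activations stacked across a dataset, and the subspace dimension k is arbitrary:

```python
# Comparison sketch for cosine drift (i) and PCA/subspace analysis (ii).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d = 200, 768  # placeholder: n problems, d_model dimensions
H_initial, H_critique, H_revised = (rng.normal(size=(n, d)) for _ in range(3))

def cosine_drift(A, B):
    """Per-problem cosine similarity between matched activation vectors."""
    return (A * B).sum(axis=1) / (
        np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    )

drift_ir = cosine_drift(H_initial, H_revised)  # initial -> revised

# Fit PCA on initial-answer activations, then measure how much revised-answer
# variance falls outside that subspace.
k = 10  # arbitrary subspace dimension
pca = PCA(n_components=k).fit(H_initial)
recon = pca.inverse_transform(pca.transform(H_revised))
outside = np.var(H_revised - recon) / np.var(H_revised - H_revised.mean(axis=0))

print(f"mean cosine drift initial->revised: {drift_ir.mean():.3f}")
print(f"revised variance outside initial subspace: {outside:.3f}")
```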
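For step 3(iii), one plausible form of the probe comparison; the activations and correctness labels are again random placeholders:

```python
# Probe sketch for (iii): is correctness more linearly decodable after revision?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 200, 768
H_initial = rng.normal(size=(n, d))   # replace with real stacked activations
H_revised = rng.normal(size=(n, d))
labels = rng.integers(0, 2, size=n)   # 1 = final answer judged correct

def probe_accuracy(H, y):
    """5-fold cross-validated accuracy of a linear probe."""
    return cross_val_score(LogisticRegression(max_iter=1000), H, y, cv=5).mean()

# The "meaningful reflection" hypothesis predicts the revised-stage probe
# outperforms the initial-stage probe.
print(f"initial probe accuracy: {probe_accuracy(H_initial, labels):.3f}")
print(f"revised probe accuracy: {probe_accuracy(H_revised, labels):.3f}")
```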
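For step 3(iv), one possible operationalization of steering robustness: inject a fixed direction into the residual stream and measure how much the next-token distribution moves at each stage. This continues from the capture sketch (model, LAYER, and the three stage prompts are reused); the steering direction and scale are arbitrary choices:

```python
# Steering sketch for (iv): perturb the residual stream at LAYER and compare
# the resulting shift in the next-token distribution per stage.
import torch
import torch.nn.functional as F

direction = torch.randn(model.cfg.d_model)
direction = direction / direction.norm()
SCALE = 4.0  # arbitrary perturbation strength

def steer(resid, hook):
    # resid: [batch, pos, d_model]; broadcast-add the steering vector.
    return resid + SCALE * direction

def steering_shift(prompt: str) -> float:
    """KL(clean || steered) between next-token distributions."""
    clean = model(prompt)[0, -1]
    with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_post", steer)]):
        steered = model(prompt)[0, -1]
    return F.kl_div(
        F.log_softmax(steered, dim=-1),   # input: log-probs of steered run
        F.log_softmax(clean, dim=-1),     # target: log-probs of clean run
        log_target=True,
        reduction="sum",
    ).item()

# A smaller shift at the revised stage would suggest a more steering-robust state.
for name, p in [("initial", initial), ("critique", critique), ("revised", revised)]:
    print(name, steering_shift(p))
```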
Hypotheses:
If reflection is meaningful, we expect structured representation drift, improved linear separability of correct reasoning, and convergence toward a more stable reasoning manifold.
If reflection is shallow, activations should remain near the original neighborhood with minimal structural change.
Why this is innovative:
Most evaluations of self-reflection are behavioral: they only check whether final answers improve. This project tests reflection mechanistically, asking whether self-critique actually reshapes the model’s internal states or merely re-samples from the same ones.
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{singh-representation-drift-under-2026,
  author = {Singh, Ankit},
  title  = {Representation Drift Under Self-Reflection: Does Self-Critique Reshape Internal States?},
  year   = {2026},
  url    = {https://hypogenic.ai/ideahub/idea/nry9k6GDuRb9mgETYC5L}
}