LLMs are increasingly prompted to “think again,” critique themselves, and revise their answers. While self-reflection often improves task performance, it remains unclear whether that improvement reflects a meaningful change in the model’s internal representations or merely shallow re-sampling from essentially the same internal state.
This project investigates whether self-critique induces structured shifts in a model’s internal representations, particularly in the residual stream.
Core Question:
When an LLM generates an initial answer, critiques it, and then revises it, do its internal activations move to a distinct and more structured reasoning subspace?
Experimental Design:
1. Prompt the model on reasoning tasks (math, logic, multi-step QA).
2. Capture residual stream activations during three stages (see the capture sketch after this list):
   - the initial answer,
   - the self-critique,
   - the revised answer.
3. Compare representations across stages using (sketches after this list):
   i) cosine similarity drift,
   ii) PCA/subspace analysis,
   iii) linear probes (correct vs. incorrect reasoning),
   iv) robustness to activation steering.
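A minimal sketch of step 2, assuming TransformerLens as the instrumentation layer; the model name, layer index, example task, and prompt-chaining scheme are illustrative placeholders rather than a committed protocol:

```python
# Capture sketch: residual-stream activations at the three stages of one
# problem. Assumes TransformerLens; model, layer, and prompts are placeholders.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model
LAYER = 6  # illustrative mid-network layer

def resid_at_last_token(prompt: str):
    """Residual stream after block LAYER at the final token position."""
    _, cache = model.run_with_cache(prompt)
    return cache["resid_post", LAYER][0, -1, :]

question = "If 3x + 2 = 11, what is x?"  # illustrative reasoning task
initial = question + "\nAnswer: x = 4."                    # stage 1: initial answer
critique = initial + "\nCritique: 3*4 + 2 = 14, not 11."   # stage 2: self-critique
revised = critique + "\nRevised answer: x = 3."            # stage 3: revision

h_initial, h_critique, h_revised = (
    resid_at_last_token(p) for p in (initial, critique, revised)
)
```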
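For steps 3(i)–(ii), drift and subspace overlap might be computed roughly as follows; the random arrays are placeholders standing in for per-problem activations stacked across a dataset, and the subspace dimension k is arbitrary:

```python
# Comparison sketch for cosine drift (i) and PCA/subspace analysis (ii).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d = 200, 768  # placeholder: n problems, d_model dimensions
H_initial, H_critique, H_revised = (rng.normal(size=(n, d)) for _ in range(3))

def cosine_drift(A, B):
    """Per-problem cosine similarity between matched activation vectors."""
    return (A * B).sum(axis=1) / (
        np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    )

drift_ir = cosine_drift(H_initial, H_revised)  # initial -> revised

# Fit PCA on initial-answer activations, then measure how much revised-answer
# variance falls outside that subspace.
k = 10  # arbitrary subspace dimension
pca = PCA(n_components=k).fit(H_initial)
recon = pca.inverse_transform(pca.transform(H_revised))
outside = np.var(H_revised - recon) / np.var(H_revised - H_revised.mean(axis=0))

print(f"mean cosine drift initial->revised: {drift_ir.mean():.3f}")
print(f"revised variance outside initial subspace: {outside:.3f}")
```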
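For step 3(iii), one plausible form of the probe comparison; the activations and correctness labels are again random placeholders:

```python
# Probe sketch for (iii): is correctness more linearly decodable after revision?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 200, 768
H_initial = rng.normal(size=(n, d))   # replace with real stacked activations
H_revised = rng.normal(size=(n, d))
labels = rng.integers(0, 2, size=n)   # 1 = final answer judged correct

def probe_accuracy(H, y):
    """5-fold cross-validated accuracy of a linear probe."""
    return cross_val_score(LogisticRegression(max_iter=1000), H, y, cv=5).mean()

# The "meaningful reflection" hypothesis predicts the revised-stage probe
# outperforms the initial-stage probe.
print(f"initial probe accuracy: {probe_accuracy(H_initial, labels):.3f}")
print(f"revised probe accuracy: {probe_accuracy(H_revised, labels):.3f}")
```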
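For step 3(iv), one possible operationalization of steering robustness: inject a fixed direction into the residual stream and measure how much the next-token distribution moves at each stage. This continues from the capture sketch (model, LAYER, and the three stage prompts are reused); the steering direction and scale are arbitrary choices:

```python
# Steering sketch for (iv): perturb the residual stream at LAYER and compare
# the resulting shift in the next-token distribution per stage.
import torch
import torch.nn.functional as F

direction = torch.randn(model.cfg.d_model)
direction = direction / direction.norm()
SCALE = 4.0  # arbitrary perturbation strength

def steer(resid, hook):
    # resid: [batch, pos, d_model]; broadcast-add the steering vector.
    return resid + SCALE * direction

def steering_shift(prompt: str) -> float:
    """KL(clean || steered) between next-token distributions."""
    clean = model(prompt)[0, -1]
    with model.hooks(fwd_hooks=[(f"blocks.{LAYER}.hook_resid_post", steer)]):
        steered = model(prompt)[0, -1]
    return F.kl_div(
        F.log_softmax(steered, dim=-1),   # input: log-probs of steered run
        F.log_softmax(clean, dim=-1),     # target: log-probs of clean run
        log_target=True,
        reduction="sum",
    ).item()

# A smaller shift at the revised stage would suggest a more steering-robust state.
for name, p in [("initial", initial), ("critique", critique), ("revised", revised)]:
    print(name, steering_shift(p))
```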
Hypotheses:
If reflection is meaningful, we expect structured representation drift, improved linear separability of correct reasoning, and convergence toward a more stable reasoning manifold.
If reflection is shallow, activations should remain near the original neighborhood with minimal structural change.
Why this is innovative:
Most evaluations of self-reflection are behavioral: they only check whether final answers improve. This project tests reflection mechanistically, asking whether self-critique actually reshapes the model’s internal states or merely re-samples from the same ones.
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{singh-representation-drift-under-2026,
  author = {Singh, Ankit},
  title  = {Representation Drift Under Self-Reflection: Does Self-Critique Reshape Internal States?},
  year   = {2026},
  url    = {https://hypogenic.ai/ideahub/idea/nry9k6GDuRb9mgETYC5L}
}