Representation Drift Under Self-Reflection: Does Self-Critique Reshape Internal States?

by Ankit Singh, 2 months ago

LLMs are increasingly prompted to “think again,” critique themselves, and revise their answers. While self-reflection often improves performance, we do not know whether this improvement corresponds to meaningful internal representational change or merely shallow re-sampling.

This project investigates whether self-critique induces structured shifts in a model’s internal representations, particularly in the residual stream.

Core Question:
When an LLM generates an initial answer, critiques it, and then revises it, do its internal activations move to a distinct and more structured reasoning subspace?

Experimental Design:

  1. Prompt model on reasoning tasks (math, logic, multi-step QA).

  2. Capture residual stream activations during:

    i) Initial answer
    ii) Self-critique
    iii) Revised answer

  3. Compare representations using:

    i) Cosine similarity drift
    ii) PCA/subspace analysis
    iii) Linear probes (correct vs incorrect reasoning)
    iv) Robustness to activation steering
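
The first two comparisons could be sketched as follows. This is a minimal illustration, not part of the proposal: it assumes the activations for each phase have already been extracted as arrays of shape (n_examples, d_model) at a single layer. The helper names (`cosine_drift`, `top_k_basis`, `subspace_overlap`) and the synthetic data are hypothetical, and principal-angle overlap is only one possible choice of subspace metric.

```python
import numpy as np


def cosine_drift(a, b):
    """Return 1 - cosine similarity between two activation vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine drift is undefined for a zero vector")
    return 1.0 - float(a @ b) / denom


def top_k_basis(acts, k):
    """Orthonormal basis (d x k) for the top-k principal directions of acts (n x d)."""
    acts = np.asarray(acts, dtype=float)
    centered = acts - acts.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def subspace_overlap(acts_a, acts_b, k):
    """Mean squared cosine of the principal angles between two top-k PCA subspaces.

    1.0 means the subspaces coincide; 0.0 means they are orthogonal.
    """
    qa = top_k_basis(acts_a, k)
    qb = top_k_basis(acts_b, k)
    # Singular values of qa^T qb are the cosines of the principal angles.
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))


# Synthetic stand-ins for real residual-stream activations (assumption).
rng = np.random.default_rng(0)
initial = rng.normal(size=(50, 16))
revised = initial + 0.1 * rng.normal(size=(50, 16))

print("drift of mean activation:", cosine_drift(initial.mean(axis=0), revised.mean(axis=0)))
print("top-4 subspace overlap:", subspace_overlap(initial, revised, k=4))
```

In a real run, the same two numbers would be computed per layer for each pair of phases (initial vs critique, initial vs revised), so that drift can be read as a profile over depth rather than a single value.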

Hypotheses:
If reflection is meaningful, we expect structured representation drift, improved linear separability of correct reasoning, and convergence toward a more stable reasoning manifold.
If reflection is shallow, activations should remain near the original neighborhood with minimal structural change.
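
One way to make the "improved linear separability" prediction testable is to train the same linear probe on each phase and compare held-out accuracy. The sketch below is an assumption-laden illustration: it uses a plain NumPy logistic-regression probe, and the functions `synthetic_phase`, `train_probe`, `probe_accuracy`, and `heldout_probe_accuracy` are hypothetical names. The synthetic data, where a larger `gap` stands in for a more separable phase, replaces real activations labelled by answer correctness.

```python
import numpy as np


def synthetic_phase(rng, n, d, gap):
    """Fake activations: two classes separated by `gap` along a random unit direction."""
    y = rng.integers(0, 2, size=n)
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    x = rng.normal(size=(n, d)) + gap * (y[:, None] - 0.5) * direction
    return x, y


def train_probe(x, y, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = np.clip(x @ w + b, -30.0, 30.0)  # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-logits))
        w -= lr * (x.T @ (p - y) / len(y) + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b


def probe_accuracy(w, b, x, y):
    """Fraction of examples the probe classifies correctly."""
    preds = (np.asarray(x, dtype=float) @ w + b) > 0
    return float(np.mean(preds == np.asarray(y).astype(bool)))


def heldout_probe_accuracy(x, y, train_frac=0.5):
    """Train on the first part of the data and score on the rest."""
    split = int(len(y) * train_frac)
    w, b = train_probe(x[:split], y[:split])
    return probe_accuracy(w, b, x[split:], y[split:])


rng = np.random.default_rng(0)
# Assumption: `gap` mimics how separable correct vs incorrect reasoning is in a phase.
for name, gap in [("initial answer", 0.5), ("revised answer", 2.0)]:
    x, y = synthetic_phase(rng, n=400, d=32, gap=gap)
    print(name, "held-out probe accuracy:", heldout_probe_accuracy(x, y))
```

Under the "meaningful reflection" hypothesis, held-out accuracy on revised-answer activations would exceed that on initial-answer activations; under the "shallow" hypothesis the two would be roughly equal.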

Why this is innovative:

  1. Connects prompting techniques with mechanistic interpretability.
  2. Provides a measurable notion of “cognitive change” in LLMs.
  3. Directly informs recursive self-improvement agents and AI research automation.
  4. Bridges performance gains with internal dynamics.

Extensions:
Future work could compare self-critique vs human critique, single vs multi-round reflection, and small vs large models.

If you are inspired by this idea, you can reach out to the author for collaboration or cite it:

@misc{singh-representation-drift-under-2026,
  author = {Singh, Ankit},
  title = {Representation Drift Under Self-Reflection: Does Self-Critique Reshape Internal States?},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/nry9k6GDuRb9mgETYC5L}
}
