Self-Distilling an LLM that Actually Explores and Exploits

LLMs are bad at exploring and exploiting idea space the way humans do: they are poorly calibrated on how aggressive to be for a given problem, and poorly calibrated on how likely they are to be correct. However, it is possible to self-distill helpful cognitive behaviors such as reflection into LLMs (https://arxiv.org/abs/2601.19897). What if we used the exact setup from that paper, but instead of reflecting only on correctness, we reflect on whether the action taken was too aggressive or not aggressive enough? For example, during autoresearch with a really ambitious loss target, Claude starts by tuning the LR. That is obviously not aggressive enough, even though it might be a well-supported intervention. The model should be self-distilling that skill into itself. On-policy teacher supervision could also help make the SkillFactory method more robust and amenable to this change.
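To make the "too aggressive / not aggressive enough" reflection concrete, here is a minimal Python sketch of how such a label might be attached to a step before it goes into a self-distillation trace. Everything in it is an assumption made for illustration and is not taken from the linked paper or from SkillFactory: the `Step` type, the `expected_gain` estimate, the `low` and `high` thresholds, and the idea of comparing an action's best-case gain against the remaining gap to the target.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One action in a research trajectory (hypothetical structure)."""
    action: str
    expected_gain: float  # assumed estimate of best-case progress toward the target


def aggressiveness_label(step: Step, gap: float,
                         low: float = 0.1, high: float = 1.5) -> str:
    """Label a step by comparing its best-case gain to the remaining gap.

    `low` and `high` are illustrative thresholds, not tuned values.
    """
    if gap <= 0:
        raise ValueError("gap must be positive")
    ratio = step.expected_gain / gap
    if ratio < low:
        return "not_aggressive_enough"
    if ratio > high:
        return "too_aggressive"
    return "calibrated"


def reflection_text(step: Step, gap: float) -> str:
    """Render the label as a reflection string to append to the trace."""
    label = aggressiveness_label(step, gap)
    return (f"Reflection: action '{step.action}' covers at most "
            f"{step.expected_gain:.2f} of a remaining gap of {gap:.2f}; "
            f"verdict: {label}.")


# The LR-tuning example from the text, with made-up numbers.
lr_step = Step(action="tune LR", expected_gain=0.02)
print(reflection_text(lr_step, gap=1.0))
```

In an actual setup the verdict would presumably come from the model's own reflection or from an on-policy teacher rather than from a fixed ratio; the heuristic only shows what kind of signal the reflection step is meant to produce.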

If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{muchane-selfdistilling-an-llm-2026,
  author = {Muchane, Mark},
  title = {Self-Distilling an LLM that Actually Explores and Exploits},
  year = {2026},
  url = {https://hypogenic.ai/ideahub/idea/1A1ckPw0gqJiGAuQuwmg}
}
