Stylistics for CoTs


Can we evaluate the quality of a CoT?

We want concise, interpretable CoTs, and we want to track these properties throughout LM development.

We need to show two things: (1) better CoTs correspond to better models, and (2) we have an eval that measures the "goodness" of a CoT.
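One possible way to probe claim (1), sketched here under assumptions: score the CoTs of several model checkpoints with some quality metric, then check whether that score rank-correlates with downstream accuracy. The function names and the placeholder numbers below are hypothetical and only illustrate the computation. They are not results, and a real analysis would also need to control for confounders such as model size.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Placeholder values for illustration only (one entry per checkpoint).
    cot_quality = [0.2, 0.5, 0.4, 0.9]
    task_accuracy = [0.31, 0.48, 0.52, 0.70]
    print(spearman(cot_quality, task_accuracy))
```

A rank correlation is used here because a CoT quality score has no natural scale. Only the ordering of checkpoints is compared.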

Notes:

I built a sentence classifier for CoTs here: https://github.com/davidheineman/traces. It produces some interesting-looking figures of CoT distributions.
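To make the idea of a per-sentence CoT distribution concrete, here is a minimal sketch. It is not the classifier from the repository above. The tag set, the keyword cues, and the function names are all assumptions chosen for illustration, and a real classifier would likely use a trained model rather than keyword matching.

```python
import re
from collections import Counter

# Hypothetical tag set and cue phrases (assumptions, not from the repo).
# Earlier entries take priority when a sentence matches several tags.
CUES = {
    "backtrack": ("wait", "actually", "on second thought"),
    "verify": ("check", "verify", "confirm"),
    "plan": ("first", "next", "let me", "let's"),
    "conclude": ("therefore", "so the answer", "in conclusion"),
}


def split_sentences(cot):
    """Naively split a CoT into sentences on ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", cot.strip())
    return [p for p in parts if p]


def tag_sentence(sentence):
    """Return the first tag whose cue appears as a substring, else 'other'."""
    lowered = sentence.lower()
    for label, cues in CUES.items():
        # Substring matching is crude: "wait" also matches "awaited".
        if any(cue in lowered for cue in cues):
            return label
    return "other"


def tag_distribution(cot):
    """Return the fraction of sentences assigned to each tag."""
    tags = [tag_sentence(s) for s in split_sentences(cot)]
    if not tags:
        return {}
    counts = Counter(tags)
    return {label: n / len(tags) for label, n in counts.items()}


if __name__ == "__main__":
    example = (
        "First, compute 2 + 2. Wait, I misread the question. "
        "Let me check the sum again. Therefore the answer is 4."
    )
    print(tag_distribution(example))
```

Aggregating these per-CoT distributions over a dataset gives the kind of summary that could be tracked across checkpoints during development.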

I was also looking at tagging this SFT dataset of CoTs (https://huggingface.co/datasets/Mingyin0312/Genome-Bench/viewer/default/train), and I feel like there is a dual constraint with RLVR: you want the model to solve the tasks correctly, but you also want its CoTs to have the same stylistics that we care about in chat tasks.



If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:

@misc{heineman-stylistics-for-cots-2025,
  author = {Heineman, David},
  title = {Stylistics for CoTs},
  year = {2025},
  url = {https://hypogenic.ai/ideahub/idea/IIjG7JAvcn3TMIXgNyur}
}
