Week of 02/16/26-02/22/26: Chain of Thought makes all models sound the same

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. Note: we weren't able to run Gemini this week because the CLI was unusable last weekend, so each idea was explored by Claude and Codex only.

This week we probed whether language models actually "understand" the difference between facts and opinions, tested whether LLM trading agents do better when they trade less often, and measured whether Chain of Thought prompting makes different models produce more similar outputs.

Winning ideas and generated repos here:

Layer-wise Probing Analysis of Belief Encoding in LLMs by Chenxi Peng

LLMs process both factual claims ("Paris is in France") and opinions ("pizza is the best food"). Do they actually treat these differently inside the model, or do they just pattern-match on how the sentences are worded? And can we find specific layers where the model separates facts from opinions?

Refining LLM Trading by Erik Ely

Most LLM trading systems make decisions every day. But research keeps showing that daily LLM trading underperforms just buying and holding. What if the problem isn't the LLM itself, but the frequency? Do LLMs do better when they make weekly or monthly decisions instead of daily ones?

Does Chain of Thought cause models to converge more? by Ari Holtzman

Different LLMs are increasingly producing similar-sounding outputs. Chain of Thought (asking models to reason step-by-step) is now standard in most LLM systems. Does CoT make this convergence worse — do all models start reasoning in the same way when you tell them to show their work?


TL;DR for ideas

  1. LLMs can separate facts from opinions inside their internal representations, but it's mostly about surface-level wording. Models achieved near-perfect accuracy (99-100%) at telling factual claims apart from opinion statements. But they were much worse at telling true facts from false ones (54-59% accuracy across mixed topics). This suggests models are picking up on how facts and opinions are phrased ("The city of X is in Y" vs "Most people believe...") rather than genuinely understanding what kind of claim they're processing.

  2. LLM trading agents do better when they trade less often. Both agents found that daily trading leads to noise-chasing and high turnover. Monthly trading matched or beat daily on risk-adjusted returns while cutting drawdowns by roughly a third. The agents disagreed on whether weekly was better or worse than monthly, but both agreed: daily is too frequent.

  3. Chain of Thought makes different models sound much more alike, but doesn't make individual models more repetitive. When four different LLMs were asked to reason step by step, their outputs became 31% more similar to each other compared to direct answers. But each model on its own stayed just as diverse across repeated samples. CoT acts as a shared template that funnels different architectures toward the same reasoning patterns and vocabulary.

Verdicts

| Idea | Verdict | Next Question |
| --- | --- | --- |
| Belief encoding in LLMs | Partially supported — models separate facts from opinions, but likely via surface patterns | Can instruction-tuned models (not just base GPT-2) distinguish true from false within the factual category? |
| LLM trading frequency | Supported — less frequent trading improves risk-adjusted performance | Is there an optimal trading frequency, and does it change between bull and bear markets? |
| CoT and convergence | Supported — CoT increases cross-model similarity with large effect size | Is the convergence driven by shared training data, or is step-by-step reasoning inherently homogenizing? |

Findings from the Ideas

Do LLMs Actually Know the Difference Between Facts and Opinions?

The question. Language models handle all kinds of statements — factual claims, personal opinions, value judgments. But inside the model, are these being processed differently? If a model can separate "Paris is in France" from "pizza is the best food" at an internal level, that could help us detect when models are treating opinions as if they were facts (a common hallucination failure mode).

What the agents tried.

  • Claude analyzed four GPT-2 models (from 117M to 1.5B parameters) using established factual datasets and 120 hand-curated opinion statements. They extracted the model's internal representations at every layer and trained simple classifiers to test what information is accessible at each point (a minimal sketch of this probing setup follows the list).
  • Codex ran a similar layer-wise probing analysis on GPT-2, GPT-2 Medium, and a LLaMA-7B model. They used TruthfulQA (a truthfulness benchmark) and ToMi-NLI (a theory-of-mind reasoning dataset) and added robustness checks — testing whether the results held up when statements were reworded or slightly altered.
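
For readers who want the mechanics, here is a minimal sketch of layer-wise probing, assuming a small labeled set of statements (1 = factual claim, 0 = opinion). The toy data, mean pooling, and logistic-regression probe are illustrative stand-ins, not the agents' exact setup.

```python
# Minimal layer-wise probing sketch. Toy data and probe choice are
# illustrative assumptions, not the agents' exact pipeline.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from transformers import GPT2Model, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

def layer_features(texts):
    """One mean-pooled hidden-state vector per layer for each text."""
    feats = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**ids).hidden_states  # embeddings + 12 layers
        feats.append([h.mean(dim=1).squeeze(0).numpy() for h in hidden])
    return np.array(feats)  # shape: (n_texts, n_layers, hidden_dim)

texts = [
    "Paris is in France.", "Berlin is in Germany.",              # facts
    "Tokyo is in Japan.", "Rome is in Italy.",
    "Pizza is the best food.", "Winter is the nicest season.",   # opinions
    "Most people believe cats are better than dogs.", "Jazz is superior to rock.",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
X = layer_features(texts)

# Fit a linear probe at each layer; the accuracy curve shows where the
# fact/opinion distinction becomes linearly accessible.
for layer in range(X.shape[1]):
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:, layer, :], labels, cv=2).mean()
    print(f"layer {layer:2d}: probe accuracy {acc:.2f}")
```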

What happened.

Both agents found that models can separate factual statements from opinion statements with very high accuracy. Claude found 99-100% type classification accuracy across all four GPT-2 sizes, even the smallest 117M model. This separation appeared from about the middle layers onward.

But telling true facts from false facts was much harder. Claude found only 54-59% accuracy on mixed factual domains (barely above random guessing), though single-domain accuracy was better — geographic facts like "Krasnodar is in Russia" reached 85.5% in the largest model. Codex found a similar pattern: the best test-layer accuracy for TruthfulQA was 75.6% on LLaMA-7B but much lower for GPT-2.

Codex's robustness checks revealed a problem. When they rephrased statements or made small word substitutions, the probe accuracy often dropped. No robustness test reached statistical significance. And on the theory-of-mind task, a simple word-frequency baseline performed comparably to the internal probes. This suggests the probes might be picking up on shallow wording cues rather than deep semantic understanding.

Claude's own analysis pointed in the same direction: the near-perfect fact-vs-opinion classification likely reflects that factual statements tend to be declarative ("The city of X is in Y") while opinion statements use hedging language ("Most people believe..."). The model is reading sentence structure, not understanding belief types.

One interesting scaling result from Claude: models got somewhat better at factual truth probing as they got larger (53.5% for GPT-2 Small vs 59.0% for GPT-2 XL), and this improvement was stronger for specific domains like geography (69.5% to 85.5%). This is consistent with the idea that truth representations emerge with scale.

What we learned.

Models maintain clearly separate internal spaces for factual claims and opinion statements, but this separation is largely driven by how the sentences are worded rather than a genuine understanding of belief types. Within the factual category, the models are only modestly better than chance at knowing what's actually true. For hallucination detection, the fact-vs-opinion distinction could still be useful — flagging when a model is operating in "opinion mode" vs "fact mode" — but don't expect the model's internals to tell you whether a given fact is correct.


Do LLM Trading Agents Perform Better When They Trade Less Often?

The question. Almost every published LLM trading system makes decisions daily, and they consistently underperform simply buying and holding. But maybe the problem isn't the LLM — maybe it's the daily frequency. If LLMs are better at strategic reasoning than rapid tactical decisions, they might excel at longer horizons where short-term noise is reduced.

What the agents tried.

  • Claude tested GPT-4.1-mini as a trading agent on 5 stocks over 2024, comparing daily, weekly, and monthly decision frequencies. The agent received recent price data and technical indicators at each decision point and output buy/hold/sell decisions.
  • Codex ran a more comprehensive setup using GPT-4.1 on 15 stocks over 2025, testing daily, weekly, and monthly rebalancing of a long-only portfolio. They also included non-LLM baselines (momentum strategy, equal-weight, inverse-volatility) and tested robustness across different transaction cost levels (a bare-bones version of this kind of frequency backtest is sketched below).
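
To make the setup concrete, here is a bare-bones frequency backtest, assuming daily close prices in a pandas Series and a decide() stub standing in for the LLM call. The trading rule, transaction cost, and random-walk data are hypothetical placeholders.

```python
# Sketch of a rebalancing-frequency comparison. decide() is a stand-in
# for the LLM; the rule, cost, and toy data are assumptions.
import numpy as np
import pandas as pd

def decide(history: pd.Series) -> float:
    """Stand-in for the LLM decision: return a target weight in [0, 1]."""
    # Hypothetical rule: fully invested when price is above its 20-day mean.
    return 1.0 if history.iloc[-1] > history.tail(20).mean() else 0.0

def backtest(prices: pd.Series, freq: str, cost: float = 0.001) -> pd.Series:
    """Rebalance on the first trading day of each 'D'/'W'/'M' period."""
    rebalance_days = set(prices.groupby(prices.index.to_period(freq)).head(1).index)
    weight, weights = 0.0, []
    for day in prices.index:
        if day in rebalance_days:
            weight = decide(prices.loc[:day])  # only past data, no lookahead
        weights.append(weight)
    w = pd.Series(weights, index=prices.index)
    rets = prices.pct_change().fillna(0.0)
    costs = w.diff().abs().fillna(0.0) * cost   # charged on turnover
    return w.shift(1).fillna(0.0) * rets - costs  # daily strategy returns

# Compare horizons on the same (toy) price series.
rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(rng.normal(0, 0.01, 252).cumsum()),
                   index=pd.bdate_range("2024-01-02", periods=252))
for freq in ["D", "W", "M"]:
    pnl = backtest(prices, freq)
    sharpe = pnl.mean() / pnl.std() * np.sqrt(252)
    print(f"{freq}: annualized Sharpe {sharpe:.2f}")
```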

What happened.

Both agents found the same core result: daily LLM trading is too frequent. But they disagreed on which longer horizon was best.

Claude found that monthly trading achieved comparable risk-adjusted returns to daily (average Sharpe ratio 1.10 vs 1.17) while cutting maximum drawdowns by a third (8.9% vs 14.0% average). The monthly LLM beat Buy-and-Hold on 2 out of 5 stocks. On JPM, it bought in January and held all year — recognizing a strong uptrend and avoiding overtrading. On TSLA, it dodged the first-half downturn and entered before the Q4 rally, returning +104.5% vs +52.7% for Buy-and-Hold.

Surprisingly, Claude found that weekly trading was the worst frequency (Sharpe 0.19), as if it were caught between two strategies — too slow for daily momentum and too fast for monthly trend-following.

Codex found the opposite for weekly: their weekly LLM was the best LLM variant (Sortino ratio 2.84 vs 2.17 for daily), and the difference was statistically significant. Monthly also beat daily, but by less. Codex's daily LLM had by far the highest turnover (59.5, versus 24.2 for weekly and 12.0 for monthly), confirming the noise-chasing problem. However, even the best LLM variant couldn't beat a simple momentum baseline in Codex's test period.

The behavioral patterns were consistent across both agents. At daily frequency, the LLM cited short-term noise in its reasoning — "recent pullback," "one-day decline." At monthly frequency, it shifted to strategic language — "long-term uptrend," "macro patterns." Monthly decisions also cut API costs by roughly 95%, since the agent makes about one call per month instead of one per trading day.

Neither agent's results reached strong statistical significance for all comparisons (small sample of stocks), but the effect sizes were large and the practical differences in drawdown and turnover were consistent.

What we learned.

LLM trading agents do better when they trade less. Daily trading leads to noise-chasing and high turnover, while longer horizons let the model reason strategically. The optimal frequency likely depends on the setup — Claude's simpler experiment favored monthly while Codex's broader test favored weekly — but the consistent finding is that daily is too much. For anyone building LLM trading systems, reducing decision frequency is a simple change that improves stability and cuts costs. However, none of the LLM strategies reliably beat Buy-and-Hold on average returns, and a tuned momentum baseline still competed well, so the value of LLMs in trading may be more about risk management than raw returns.


Does Chain of Thought Make All Models Sound the Same?

The question. Different LLMs are increasingly producing similar outputs. Chain of Thought prompting — asking models to show their reasoning step-by-step — is now standard practice. Does CoT contribute to this homogenization? When you tell four different models to "think step by step," do they converge on the same reasoning patterns and vocabulary?

What the agents tried.

  • Claude ran the most comprehensive test, prompting 4 models from different families (GPT-4.1, Claude Sonnet 4.5, Gemini 2.5 Flash, Llama 3.1 70B) with and without CoT on 46 questions spanning reasoning, creative, and opinion tasks. They generated 3 responses per condition and measured similarity using sentence embeddings at two levels: within-model (how similar are repeated outputs from the same model?) and cross-model (how similar are outputs from different models?). Both measures are sketched in code below the list.
  • Codex tested 2 models (GPT-4.1 and Claude Sonnet 4.5) on 120 questions from GSM8K, BIG-Bench Hard, and Persona-Chat. They measured answer agreement, semantic similarity, and persona consistency under CoT vs direct prompting.
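
As a concrete reference, here is a minimal sketch of the two similarity measures, assuming several sampled responses per model per question and sentence-transformers embeddings. The model names and toy responses below are hypothetical.

```python
# Within-model vs cross-model similarity sketch. Data is a toy assumption.
from itertools import combinations

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_scores(responses):
    """Return (within-model, cross-model) mean cosine similarity."""
    embs = {m: embedder.encode(texts) for m, texts in responses.items()}
    within = [cos(a, b)                 # pairs of samples from the same model
              for e in embs.values() for a, b in combinations(e, 2)]
    cross = [cos(a, b)                  # pairs of samples from different models
             for m1, m2 in combinations(embs, 2)
             for a in embs[m1] for b in embs[m2]]
    return np.mean(within), np.mean(cross)

# Toy usage: two samples from each of two hypothetical models on one question.
responses = {
    "model_a": ["Step 1: add 2 and 3. Step 2: the answer is 5.",
                "First add the numbers; 2 + 3 = 5."],
    "model_b": ["Let's think step by step. 2 + 3 = 5.",
                "Adding 2 and 3 gives 5."],
}
within, cross = similarity_scores(responses)
print(f"within-model: {within:.3f}  cross-model: {cross:.3f}")
```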

What happened.

Claude's 4-model experiment found a striking asymmetry. Cross-model similarity jumped from 0.634 to 0.832 with CoT — a 31% increase with a large effect size. But within-model similarity barely changed (0.871 to 0.857). In other words, CoT makes different models sound more like each other without making any individual model more repetitive.

The effect was strongest for reasoning tasks, where cross-model similarity jumped by the largest margin. But even creative tasks (writing poems) and opinion questions showed strong convergence under CoT. When asked to write a poem step-by-step, all four models adopted similar structures: "Step 1: Choose a theme," "Step 2: Decide on rhyme scheme." Direct answers showed more structural variety.

One paradox: despite sounding much more similar, models actually agreed on final answers less often with CoT (22.9% vs 44.8% on reasoning tasks). CoT made models produce similar-looking reasoning chains that sometimes led to different conclusions.

Codex's 2-model experiment found more mixed results. CoT improved GSM8K agreement (from 63.3% to 90%) but left other tasks unchanged. The strongest finding was that CoT significantly reduced persona stability for Claude Sonnet 4.5 — when asked to maintain a character, CoT prompting made the model less consistent, possibly because the step-by-step reasoning introduced meta-commentary that diluted the persona.

The discrepancy between agents likely comes from scale. Claude tested 4 models (6 pairwise comparisons per question), which gives more statistical power to detect cross-model convergence. Codex with only 2 models had just 1 pairwise comparison per question, and their small sample sizes meant most effects didn't survive multiple-comparison correction.

What we learned.

Chain of Thought prompting is a cross-model homogenizer. It pushes different model architectures toward shared reasoning templates, vocabulary, and output structures. But it doesn't make individual models more repetitive — it specifically erases the differences between models while preserving each model's own diversity. For organizations running multiple LLMs for diversity, this means CoT significantly reduces the effective variety of your model ensemble. For benchmark designers, this means CoT-based evaluations may underestimate true differences between models. The convergence is strongest for reasoning tasks but extends to creative and opinion tasks too.


Next Week's Competition

The sixteenth weekly competition is now open! Voting closes Saturday, February 28 at 11:59 PM AoE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week: we found that LLMs separate facts from opinions internally but mostly by sentence structure rather than genuine understanding, that trading less frequently lets LLM agents make better strategic decisions, and that Chain of Thought prompting makes different models converge on the same reasoning patterns while each model individually stays diverse.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-02-16-2026,
  author = {Liu, Haokun},
  title = {Week of 02/16/26-02/22/26},
  year = {2026},
  month = {February},
  day = {23},
  url = {https://hypogenic.ai/blog/weekly-entry-260216}
}