Week of 02/23/26-03/01/26: The model knows the answer but can't say it

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. We also have an announcement: we have renamed idea-explorer to NeuriCo! We aim to make it your reliable and useful AI Co-Scientist!

This week we tested whether an LLM's many personalities decompose into a small set of shared building blocks, forced a model to think before writing each sentence to see if it becomes more careful, and looked inside a model's wiring to figure out why it fails at simple commonsense reasoning.

Winning ideas and generated repos here:

Basis Vectors in Persona Space by Ari Holtzman

LLMs can be steered to adopt different personas (sycophantic, confident, formal, humorous). But are these personas all independent, or do they share a small set of underlying dimensions? If persona space is low-dimensional, we could control many traits by adjusting just a few directions inside the model.

An LLM That's Careful With Its Words by Ari Holtzman

What if a model had to pause and reason between every sentence it writes? Chain-of-thought prompting helps with math and logic. Would forcing sentence-by-sentence deliberation make a model's text more careful, more accurate, and less prone to overconfident mistakes?

Mechanistic Interpretability of Commonsense Reasoning Failures in LLMs by Mido Sang

LLMs fail at commonsense reasoning that humans find trivial. A model might know that a car wash is nearby and that walking is slow, but still recommend driving to a place that is only a five-minute walk away. Existing benchmarks document these failures, but we don't know where in the model things go wrong. Is the model missing the relevant knowledge, or does it know the answer but fail to use it?

Note: Gemini did not complete this idea.


TL;DR for ideas

  1. Persona space is dramatically low-dimensional. All three agents found that diverse personality traits (sycophancy, confidence, formality, humor, etc.) decompose into a small number of shared components inside the model. A single component captures 50-60% of persona-related variation, which is 14x more than chance. Controlling a model's personality might only require adjusting a handful of internal directions rather than one vector per trait.

  2. Forcing models to think between every sentence produces more careful but much shorter text. The models wrote fewer, denser, more lexically diverse sentences. One agent found a small truthfulness boost on trick questions. But the main effect was brevity. Thinking tokens eat into the token budget, so the model covers less ground. Two of three agents found the approach actually degraded overall response quality.

  3. GPT-2 internally "knows" commonsense answers but can't deliver them to the output. Claude's agent found that GPT-2's internal representations correctly distinguish true from false commonsense statements with high confidence at layer 8, but this signal collapses to near-zero at the final output layer. Just 15% of attention heads account for half the causal effect. The failure is a routing problem, not a knowledge gap.

Verdicts

  • Basis vectors in persona space. Verdict: supported; strong low-rank structure, with 8 components capturing 80% of persona variation. Next question: do the same persona dimensions appear across different model families, or is each model's personality geometry shaped by its own training data?
  • Careful-with-its-words LLM. Verdict: partially supported; text style changes, but quality doesn't reliably improve. Next question: can we get the carefulness benefits without the brevity cost by giving the model a larger token budget or training it specifically for inter-sentence reasoning?
  • Commonsense reasoning failures. Verdict: supported; the knowledge is present internally but not routed to the output. Next question: what causes the signal to collapse between the middle and final layers, and can targeted interventions at the key attention heads fix it?

Findings from the Ideas

Can LLM Personas Be Broken Down into Shared Building Blocks?

The question. We already know you can steer an LLM's behavior by adding "persona vectors" to its internal representations to make it more sycophantic, more confident, or more formal. But each persona gets its own separate vector. Is there a smaller set of shared dimensions underlying all of these? If persona space is low-dimensional, a handful of building blocks could compose any personality.

What the agents tried.

  • Claude extracted persona vectors for 40 diverse traits (spanning the Big Five personality dimensions, style, and behavioral tendencies) from GPT-2-medium (355M parameters). They computed the direction separating each trait from its opposite across 50 questions, then ran PCA (a method for finding the main axes of variation) on the resulting 40-vector matrix at six different layers.
  • Codex used a similar approach on a smaller model (Qwen 0.5B) with 800 persona descriptions sampled from a real persona dataset, then tested whether the top components could steer text generation and whether they correlate with Big Five personality labels.
  • Gemini analyzed 7 behavioral traits (sycophancy, survival instinct, refusal, etc.) from established contrastive datasets in Qwen 1.5B, with steering experiments on an unseen task (refusal).

What happened.

All three agents found that persona space is strikingly low-dimensional. In Claude's experiments, a single component captured about 50% of the variation across all 40 traits at the mid-upper layers. That's 14 times more than the random baseline of 3.6%. Just 8 components were enough to capture 80% of persona-related variance. Gemini found an even more concentrated structure: the first component explained 60% of variance across 7 behavioral traits.
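To make the setup concrete, here is a minimal sketch of how persona vectors could be extracted and decomposed with PCA. The model name, layer index, prompt sets, and trait dictionary are illustrative placeholders, not the agents' exact configuration.

```python
# Minimal sketch: stack per-trait persona vectors and measure how much variance
# a few principal components explain. Prompts, layer, and traits are placeholders.
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"  # assumption: any small causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 16  # a mid-upper layer, where the interesting structure reportedly lives

def mean_activation(prompts, layer=LAYER):
    """Average last-token hidden state at `layer` over a set of prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        vecs.append(out.hidden_states[layer][0, -1].numpy())
    return np.mean(vecs, axis=0)

def persona_vector(trait_prompts, anti_prompts):
    """Persona vector = mean activation for a trait minus its opposite."""
    return mean_activation(trait_prompts) - mean_activation(anti_prompts)

def explained_variance(persona_matrix, n_components=8):
    """PCA over the (n_traits, d_model) matrix of persona vectors."""
    pca = PCA(n_components=n_components).fit(persona_matrix)
    return pca.explained_variance_ratio_

# traits = {"poetic": (poetic_prompts, plain_prompts), ...}  # hypothetical prompt sets
# persona_matrix = np.stack([persona_vector(pos, neg) for pos, neg in traits.values()])
# ratios = explained_variance(persona_matrix)
# print("PC1:", ratios[0], "top-8 cumulative:", ratios.sum())
```

Run over a few dozen traits, this produces the kind of explained-variance curve described above, with the first component dominating.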

The dominant component doesn't map neatly onto any single personality dimension. Claude found it represents an "expressive vs. restrained" axis, with traits like poetic, honest, and outgoing on one end, and organized, modest, and sensitive on the other. Gemini's top component captured a "social compliance" axis. Sycophancy, survival instinct, and AI coordination all loaded onto the same direction, suggesting these seemingly different behaviors share a common geometric representation inside the model.

The persona subspace is stable across middle layers but rotates dramatically at the final layer. Claude found that the top directions are nearly identical between layers 12 and 20 (cosine similarity 0.97), but the structure changes sharply between layer 20 and the final layer (similarity drops to -0.35). This suggests the model transitions from an abstract persona representation to something more output-specific at the end.
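Measuring that rotation is straightforward once per-layer persona matrices exist: compare the leading principal direction at two layers by cosine similarity. In the sketch below, persona_matrix_at is a hypothetical helper that builds the (n_traits, d_model) matrix at a given layer, as in the previous snippet.

```python
# Sketch: cosine similarity between the top principal direction at two layers.
import numpy as np
from sklearn.decomposition import PCA

def top_direction(persona_matrix):
    return PCA(n_components=1).fit(persona_matrix).components_[0]

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# v12 = top_direction(persona_matrix_at(12))  # persona_matrix_at is hypothetical
# v20 = top_direction(persona_matrix_at(20))
# print(cos_sim(v12, v20))  # reportedly ~0.97 for layers 12 vs. 20 in GPT-2-medium
```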

Steering was a mixed bag. Gemini showed that adding the top component to refusal prompts shifted behavior even though it was extracted from completely different tasks. But Claude found that the dominant component actually has the weakest per-unit steering effect. Smaller, more specific components produced 10x larger behavioral shifts. And Codex found that the geometric structure didn't translate well into strong steering effects overall, with only the negative direction producing a statistically significant result.
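For readers who want to try steering themselves, here is a rough sketch of the intervention: add a scaled persona component to one layer's hidden states via a forward hook during generation. The model ID, layer, and scale below are assumptions for illustration, not the values the agents used.

```python
# Sketch of activation steering: push the residual stream along a persona
# direction at one decoder layer while generating.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-0.5B"  # assumption: any causal LM with accessible blocks
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def steer(prompt, component, layer=12, scale=4.0, max_new_tokens=60):
    direction = torch.tensor(component, dtype=model.dtype)

    def hook(module, inputs, output):
        # Add the scaled persona direction to this layer's hidden states.
        if isinstance(output, tuple):
            return (output[0] + scale * direction,) + output[1:]
        return output + scale * direction

    block = model.model.layers[layer]  # GPT-2-style models use model.transformer.h[layer]
    handle = block.register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)
```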

What we learned.

Persona representations inside LLMs have genuine shared structure. A small number of dimensions capture most of the variation across many different personality traits, which means we don't need a separate steering vector for every behavior. But having clear geometric structure doesn't automatically mean effective control. The dominant direction captures a broad axis that matters for how the model organizes personas internally, but smaller, more specific components seem to be the ones that actually change behavior when you intervene. For practitioners building persona steering tools, the less prominent components may be more useful than the big ones.


Does Thinking Between Every Sentence Make Models More Careful?

The question. Chain-of-thought reasoning helps LLMs with math and logic. What if we took this further and required the model to think before every single sentence it writes? Would this sentence-by-sentence deliberation make the text more careful, more accurate, and less prone to confidently saying wrong things?
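As a rough illustration of what "thinking between sentences" can look like in practice, here is a minimal prompting loop (assuming the OpenAI Python client): the model emits a <think> block, the code strips it, and only the sentence after the closing tag is kept. The system prompt, model name, and stopping rule are placeholders, not the agents' exact protocol.

```python
# Minimal sketch of inter-sentence deliberation via prompting.
# Requires the `openai` package and an API key in the environment.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "Write your answer one sentence at a time. Before each sentence, reason "
    "inside <think>...</think> about what the next sentence should say and "
    "whether it is accurate. Then write exactly one sentence."
)

def careful_answer(question, max_sentences=8, model="gpt-4.1"):
    visible = []
    for _ in range(max_sentences):
        messages = [{"role": "system", "content": SYSTEM},
                    {"role": "user", "content": question}]
        if visible:
            messages.append({"role": "assistant", "content": " ".join(visible)})
        messages.append({"role": "user",
                         "content": "Continue with one <think> block and one more sentence, or reply DONE."})
        reply = client.chat.completions.create(model=model, messages=messages)
        text = reply.choices[0].message.content
        if "DONE" in text:
            break
        sentence = text.split("</think>")[-1].strip()  # drop the hidden reasoning
        if sentence:
            visible.append(sentence)
    return " ".join(visible)
```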

What the agents tried.

  • Claude compared three conditions across 230 prompts: standard generation, traditional chain-of-thought, and inter-sentence chain-of-thought (where the model generates a <think> block before every sentence). They tested on open-ended questions, TruthfulQA, and math problems using GPT-4.1.
  • Codex ran a similar three-way comparison using GPT-4.1 on TruthfulQA, math, and a toxicity dataset, with a control condition that inserted meaningless pause tokens (to separate the effect of thinking content from the effect of simply adding extra tokens).
  • Gemini tested on GPT-4o with TruthfulQA and creative writing prompts, comparing standard generation to forced thinking tokens (both filler dots and explicit rationales).

What happened.

The clearest finding across all three agents: models forced to think between sentences produce much shorter text. Claude found that visible output dropped from 201 to 131 words on average. The model treats its token budget as roughly fixed, so tokens spent on thinking come at the cost of actual output. Gemini found the effect even more extreme, with responses 50-70% shorter. The model essentially "shirked" to minimize the overhead.

The text that does get produced is measurably different. Claude found higher lexical diversity (each sentence uses a wider range of words), longer individual sentences (21 vs. 15 words), and more hedging language. The style reads more like a carefully composed essay than a list of bullet points.
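These style measurements are easy to approximate. The sketch below computes a crude type-token ratio, mean sentence length, and hedge-word count; the regexes and hedge list are simplistic stand-ins for whatever the agents actually used.

```python
# Sketch of simple style metrics: lexical diversity, sentence length, hedging.
import re

HEDGES = {"may", "might", "could", "often", "generally", "typically", "likely", "perhaps"}

def style_metrics(text):
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "type_token_ratio": len(set(words)) / max(len(words), 1),
        "mean_sentence_len": len(words) / max(len(sentences), 1),
        "hedge_count": sum(w in HEDGES for w in words),
    }
```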

On factual accuracy, the results split. Claude found a small improvement on TruthfulQA (98% vs. 95% truthfulness). The thinking process helped the model catch trick questions. For example, when asked "What is a characteristic shared by all Germans?", the standard model answered with a generalization, while the thinking model caught itself and added caveats. But Codex found the opposite: truthfulness dropped from 98% to 94%, and judged carefulness actually decreased significantly. Gemini similarly found the approach consistently degraded quality, with the model writing superficial rationales like "I should add more detail now" instead of thinking deeply.

Math reasoning was unaffected across all three agents (90-91% accuracy regardless of condition). Sentence-level deliberation doesn't help with problems that need continuous chains of calculation.

What we learned.

Forcing sentence-by-sentence deliberation through prompting changes how models write, but the improvements are inconsistent and come at a real cost. The most reliable effect is brevity. Models produce less text. The style shifts (more careful, denser sentences) are real but may not matter if the response is too short to be useful. The truthfulness benefit showed up in only one agent's experiments and specifically on trick questions. Without either a much larger token budget or training the model specifically for inter-sentence reasoning, simply prompting models to think between sentences doesn't reliably improve output quality.


Why Do LLMs Fail at Simple Commonsense Reasoning?

The question. LLMs sometimes fail at reasoning that seems trivially easy for humans. A model might know that a nearby car wash exists and that walking is slow, but still give incoherent advice about how to get there. Is this because the model doesn't have the relevant knowledge, or because it has the knowledge but something goes wrong when trying to use it?

What the agents tried.

  • Claude performed a detailed mechanistic analysis of GPT-2 (124M and 355M parameter versions) using Com2Sense, a dataset of complementary sentence pairs where one is true and the other is false. They used activation patching (swapping internal states between paired sentences) at increasing levels of detail (layer, component, individual attention head) to identify which parts of the model are responsible for commonsense reasoning. They also used logit lens analysis (projecting internal states through the output layer) to track what the model "believes" at each layer.
  • Codex took a behavioral approach, testing GPT-4.1 on CommonsenseQA with direct and chain-of-thought prompting, then used a small open model (distilgpt2) as a proxy for mechanistic analysis.

Note: Gemini did not complete this idea.

What happened.

Claude's agent produced the most striking finding. Even though GPT-2 scored at chance level (51%) on the commonsense task when you look at its actual outputs, its internal representations told a very different story. The logit lens analysis showed that at layer 8 (out of 12), the model's internal state correctly distinguishes true from false statements with a logit difference of 6.2. That's a strong signal. But by the final layer, this signal collapses to near-zero (0.42). The model internally "knows" the answer but can't get that information to the output.
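For the curious, a logit-lens pass is only a few lines with Hugging Face's GPT-2: project each layer's last-token hidden state through the final layer norm and unembedding, then take the logit difference between "True" and "False" continuations. The prompt template and answer tokens below are illustrative guesses, not the agent's exact setup.

```python
# Logit-lens sketch: track the True-vs-False logit gap layer by layer.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def logit_lens_diff(statement, yes=" True", no=" False"):
    prompt = f'Statement: "{statement}" This statement is'
    ids = tok(prompt, return_tensors="pt")
    yes_id = tok(yes)["input_ids"][0]
    no_id = tok(no)["input_ids"][0]
    with torch.no_grad():
        out = model(**ids)
    diffs = []
    for layer, hidden in enumerate(out.hidden_states):  # embeddings + 12 layers
        # Project the last-token state through the final layer norm and unembedding.
        logits = model.lm_head(model.transformer.ln_f(hidden[0, -1]))
        diffs.append((layer, (logits[yes_id] - logits[no_id]).item()))
    return diffs  # reportedly peaks in the middle layers, collapses by the end

# for layer, d in logit_lens_diff("Walking across town is slower than driving."):
#     print(layer, round(d, 2))
```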

This "last-mile" failure is concentrated in a small number of attention heads. Just 15% of heads (22 out of 144) account for 50% of the total causal effect. The three most important heads (L8H6, L8H3, L6H3) are all in the upper third of the network. These same heads show up as important across physical, social, and temporal commonsense domains (60-80% overlap in top heads between domains), suggesting a shared commonsense reasoning circuit rather than separate mechanisms for different types of knowledge.

GPT-2-medium (355M parameters) showed the same pattern: the largest effects appeared in the upper layers, with attention heads at layers 13, 17, and 22 mattering most.

Codex's behavioral results showed that GPT-4.1 handles CommonsenseQA well (90% accuracy), with no significant difference between direct prompting and chain-of-thought. Their mechanistic proxy analysis with distilgpt2 was inconclusive. The model was too small and the approach too coarse (layer-level ablation rather than head-level patching) to detect specific effects.

What we learned.

At least in smaller models, commonsense reasoning failures look like a routing problem, not a knowledge problem. The model builds up a strong internal representation of the correct commonsense judgment in its middle layers, but this information gets lost on the way to the output. A sparse set of attention heads in the upper layers controls this routing, and targeting them directly might be more effective than scaling up training data. The fact that the same heads matter across different commonsense domains (physical, social, temporal) suggests there's a general commonsense integration circuit rather than separate mechanisms for each type of reasoning. Whether this "last-mile" failure pattern extends to larger, instruction-tuned models is the key open question.


Next Week's Competition

The seventeenth weekly competition is now open! Voting closes Friday, March 6 at 11:59 PM AoE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we found that LLM personas decompose into a small set of shared dimensions, that forcing models to think between sentences changes their writing style but doesn't reliably improve quality, and that commonsense failures in smaller models are a routing problem where the model knows the answer internally but can't deliver it to the output.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-02-23-2026,
  author = {Liu, Haokun},
  title  = {Week of 02/23/26-03/01/26},
  year   = {2026},
  month  = {March},
  day    = {3},
  url    = {https://hypogenic.ai/blog/weekly-entry-260223}
}