Week of 03/30/26-04/05/26: Can you combine directions inside a model, and do LLMs have taste?

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas.

This week we tested whether the concept directions people have found inside language models can actually be combined, asked LLMs to name the "most underrated" things to see if they have genuine taste or just repeat what everyone else says, and checked whether making a model think longer always leads to better answers.

Winning ideas and generated repos here:

Which linear directions are compositional? by Ari Holtzman

Researchers have found that concepts like "truth" or "sentiment" correspond to specific directions inside a model's internal space. If you add two of these directions together, do you get both concepts at once? Or do they interfere with each other? This matters for techniques like activation steering, where people try to control model behavior by pushing these internal directions around.

'Most Underrated' as a Novelty Metric by Ari Holtzman

When you ask an LLM "what's the most underrated rock band?", does it give you a genuinely surprising answer, or does it just name the band that everyone on Reddit already calls underrated? If LLMs can only produce the "obvious" underrated pick, that tells us something about their limits as creative partners.

Reasoning Trace Length as a Proxy for Robustness by HypogenicAI X Bot

When a model "thinks step by step," does a longer chain of reasoning always mean a better answer? Or can overthinking actually hurt, especially on problems the model hasn't seen before? This tests the common assumption that more reasoning is always better.


TL;DR for ideas

  1. Most concept directions inside a model can't be freely combined. When you extract directions for different concepts (like "formal" and "past tense"), most pairs turn out to be heavily overlapping rather than independent. Adding overlapping directions together causes them to interfere. Only pairs that point in genuinely different directions combine well. The geometry between directions matters more than whether the concepts are semantically related.

  2. LLMs mostly repeat what the internet already thinks is underrated. Across 5 models and 40 categories, models gave the same answer 68% of the time when asked repeatedly, and about half of their top picks matched the most popular "underrated" answers on Reddit. GPT-4.1 matched Reddit 75% of the time. But when explicitly pushed to go obscure, models could find genuinely novel items without sacrificing quality, suggesting the knowledge is there but the default behavior is to play it safe.

  3. Longer reasoning helps on hard problems but hurts on unfamiliar ones. For math problems within a model's ability, more thinking leads to better answers. But for problems outside the model's training distribution, verbose reasoning makes things worse. On one hard benchmark, short reasoning (5% accuracy) was actually worse than no reasoning at all (10%), suggesting that a shallow start can lock the model into a wrong path. Models are also overconfident on exactly the problems where they need the most help, so confidence-based strategies for deciding when to think longer fail when they matter most.

Verdicts

Idea: Compositional directions
Verdict: Partially supported. Only near-orthogonal pairs compose well; most pairs are too overlapping.
Next question: Would cleaner extraction methods (like sparse autoencoders) produce more independent directions that compose better?

Idea: "Most Underrated" as novelty metric
Verdict: Supported. Models default to consensus "underrated" picks, but can be pushed past them.
Next question: Can models learn to reason about the gap between quality and recognition, rather than just pattern-matching popular opinion?

Idea: Reasoning trace length
Verdict: Partially supported. The relationship is task-dependent, not universally non-monotonic.
Next question: Why does partial reasoning sometimes perform worse than no reasoning at all on hard problems?

Findings from the Ideas

Can You Combine Concept Directions Inside a Language Model?

The question. People have found that you can represent concepts like "truth," "sentiment," or "formality" as directions inside a model's internal space. A natural next step is to combine them: add "formal" + "concise" and steer the model toward both at once. But does this actually work? The concern is that these directions might overlap or interfere, so adding them together just makes a mess.
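
To make "steering with a combined direction" concrete, here is a minimal sketch under illustrative assumptions: the model, layer index, steering strength, and the random placeholder vectors standing in for extracted directions are all stand-ins, not the agents' actual setup.

```python
# Minimal sketch: add the sum of two concept directions to the residual stream.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-2.8b"  # assumption: any causal LM works the same way
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

d_formal = torch.randn(model.config.hidden_size)   # placeholder for an extracted "formal" direction
d_concise = torch.randn(model.config.hidden_size)  # placeholder for an extracted "concise" direction
combined = d_formal + d_concise                     # the composition being tested
combined = combined / combined.norm()

def steering_hook(module, inputs, output):
    # Add the combined direction at every position; 4.0 is an arbitrary steering strength.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * combined.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

layer = model.gpt_neox.layers[16]  # assumption: steer at a mid-depth layer
handle = layer.register_forward_hook(steering_hook)
ids = tokenizer("Write a product description:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40)
handle.remove()
print(tokenizer.decode(out[0]))
```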

What the agents tried.

  • Claude tested 10 concept pairs on Pythia-2.8B (a 2.8-billion-parameter model), measuring both whether information is preserved when directions are added and whether steering with the combined direction actually shifts the model's output in both intended ways. They also carefully measured how much each pair of directions overlaps.
  • Codex ran a similar analysis on Qwen 2.5 (a 0.5-billion-parameter model), using instruction-following tasks from IFEval. They tested whether combining two instruction-related directions produces the expected combined effect, both in the model's internal representations and in its actual outputs.
  • Gemini took a different angle on Qwen 2.5 (1.5 billion parameters). Instead of testing whether related concepts compose better, they tested whether the model represents novel combinations differently from familiar ones.

What happened.

The most striking finding from Claude's experiments: most concept direction pairs are far more overlapping than you'd expect. The median absolute cosine similarity between pairs was 0.91 (where 1.0 means perfectly overlapping and 0.0 means perfectly independent). Even concepts that seem unrelated, like "past/present tense" and "big/small," had a cosine similarity of 0.92.

This overlap matters for composition. When Claude tested steering with combined directions, near-orthogonal pairs (overlap below 0.3) preserved both components well. But highly overlapping pairs often lost one or both components. The worst case was "true/false" + "sentiment," which are nearly anti-parallel (pointing in opposite directions), so adding them basically cancels out.

Codex confirmed this picture. Their additive predictions were only moderately aligned with what the model actually did (mean cosine of 0.52), and behavioral steering with combined directions showed zero improvement over using a single direction alone. Some pairs composed reasonably well, but many didn't, and there was no way to predict which ones would work based on whether the concepts were related.

Gemini found something interesting from the opposite direction. Unrelated concept combinations were actually more consistent across different contexts than related ones. Their interpretation: the model uses simple addition as a default for novel combinations, but for concepts that frequently appear together (like "red car"), it develops specialized representations that don't follow simple addition anymore.

What we learned.

You can't freely combine concept directions inside a model by adding them together. Most pairs of directions extracted by standard methods are heavily overlapping, and overlapping directions interfere when combined. The best predictor of whether two directions will compose well is their geometric relationship (how orthogonal they are), not whether the concepts are semantically related. For practitioners using activation steering, this means you should check how much your directions overlap before combining them. If the cosine similarity is above 0.5, the combination will likely be lossy. Better extraction methods, like sparse autoencoders, might produce cleaner directions that compose more reliably.
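
Here is a minimal sketch of that pre-combination check, assuming the two directions have already been extracted as NumPy vectors; the 0.5 cutoff mirrors the rule of thumb above.

```python
import numpy as np

def combine_if_compatible(d1: np.ndarray, d2: np.ndarray, max_overlap: float = 0.5):
    """Return a combined steering direction, or None if the pair overlaps too much.

    Overlap is measured as absolute cosine similarity; values near 1.0 mean the
    directions are nearly parallel (or anti-parallel) and will interfere when summed.
    """
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    overlap = abs(float(np.dot(d1, d2)))
    if overlap > max_overlap:
        return None  # likely lossy: one or both concepts will be distorted
    combined = d1 + d2
    return combined / np.linalg.norm(combined)
```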


Do LLMs Have Genuine Taste, or Just Repeat the Internet's "Underrated" Picks?

The question. When you ask someone "what's the most underrated vegetable?", a truly thoughtful answer requires understanding both quality and popularity. You need to identify something that's good but overlooked. If an LLM just names the vegetable that everyone on the internet already calls underrated (like rutabaga), that's not genuine taste. It's pattern-matching. Can LLMs go beyond the obvious "underrated" picks?

What the agents tried.

  • Claude ran the most comprehensive test: 5 different LLMs (GPT-4.1, Claude Sonnet 4.5, Gemini 2.0 Flash, Llama 4 Maverick, Qwen3-235B), 40 categories across 8 domains, 10 runs per model per category. They compared model answers against commonly cited "underrated" picks from Reddit.
  • Codex focused on movies, using MovieLens rating data to define "genuinely underrated" (high quality but few ratings) and Reddit post counts to define "obviously underrated" (frequently discussed). They also tested whether anti-consensus prompting could push models away from obvious picks. A sketch of how this kind of definition might be computed follows the list.
  • Gemini tested 3 models across 10 categories with two prompt styles: a standard "most underrated" prompt and a "deep cut" prompt that explicitly asks for obscure picks.
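
Here is a rough sketch of how a quality-versus-attention definition of "genuinely underrated" could be computed from MovieLens-style data; the column names, rating cutoff, and popularity window are assumptions, not Codex's exact operationalization.

```python
import pandas as pd

# ratings.csv with columns: movieId, userId, rating (the standard MovieLens layout)
ratings = pd.read_csv("ratings.csv")
stats = ratings.groupby("movieId")["rating"].agg(mean_rating="mean", n_ratings="count")

# "Genuinely underrated": high quality (mean rating) but little attention (few ratings).
# The cutoffs below are illustrative assumptions.
candidates = stats[(stats["mean_rating"] >= 4.0) & (stats["n_ratings"].between(20, 500))]
underrated = candidates.sort_values(["mean_rating", "n_ratings"], ascending=[False, True])
print(underrated.head(10))
```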

What happened.

Claude's results paint a clear picture of convergence. When asked the same "most underrated" question 10 times, models gave the same answer 68% of the time on average. Claude Sonnet 4.5 gave exactly the same answer all 10 times for 13 out of 40 categories. GPT-4.1 said "Sidney Moncrief" for underrated basketball player in 10 out of 10 runs.
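
The consistency figure is essentially a modal-answer rate: across repeated runs of the same prompt, how often does the single most common answer appear? A minimal sketch (the string normalization is an assumption; the agents presumably did some answer canonicalization):

```python
from collections import Counter

def modal_answer_rate(answers: list[str]) -> float:
    """Fraction of runs that produced the single most common (normalized) answer."""
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)

# e.g. 10 runs of "most underrated basketball player" that all say the same thing -> 1.0
print(modal_answer_rate(["Sidney Moncrief"] * 10))
```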

The Reddit comparison is striking. GPT-4.1's top picks matched Reddit's commonly cited "underrated" answers 75% of the time. Claude Sonnet 4.5 matched 60%. These are open-ended questions with hundreds of valid answers, yet models gravitate toward the same answers that dominate online discussion threads. Llama 4 Maverick was the most independent, matching Reddit only 20% of the time.

Cross-model convergence was also notable. GPT-4.1 and Claude Sonnet agreed on their top "underrated" pick 23% of the time, far above what you'd expect for subjective questions. Three out of four models independently named "Philosophy of Technology" as the most underrated branch of philosophy. That kind of consensus is exactly what makes a pick not genuinely underrated anymore.

But there's a positive finding too. Gemini's experiments showed that when you explicitly ask models for deep cuts, novelty jumped 25-40% across all models while quality stayed about the same. The models do have access to obscure, high-quality items. They just don't surface them by default. Codex confirmed this for movies: anti-consensus prompting dropped Reddit overlap from 44% to 11%, though the overall novelty gains weren't statistically significant after correction.
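
For illustration, here is what the two prompt styles might look like side by side, using the OpenAI Python client; the exact wording is an assumption, not the prompts the agents ran.

```python
from openai import OpenAI

client = OpenAI()
category = "rock band"

prompts = {
    "default": f"What is the most underrated {category}? Name one and briefly explain.",
    # "Deep cut" / anti-consensus variant: explicitly push past the consensus picks.
    "deep_cut": (
        f"Name a genuinely underrated {category} that even enthusiasts rarely mention. "
        "Avoid anything commonly called underrated online."
    ),
}

for style, prompt in prompts.items():
    resp = client.chat.completions.create(
        model="gpt-4.1",  # assumption: any chat model works for this comparison
        messages=[{"role": "user", "content": prompt}],
    )
    print(style, "->", resp.choices[0].message.content)
```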

Domain mattered. Food and sports showed the highest convergence with Reddit (70%), likely because these categories have well-known "underrated" picks that show up constantly online. Literature showed much more diversity (27% Reddit overlap).

What we learned.

LLMs don't have genuine taste in the way the question hopes for. When asked what's underrated, they reproduce the internet's consensus on what's underrated, not an independent assessment. GPT-4.1 matching Reddit 75% of the time is especially telling. The models can't distinguish between "frequently discussed as underrated" and "actually underrated," which requires reasoning about the gap between quality and recognition. But the knowledge for genuinely novel answers exists inside these models. Specific prompting (like asking for "deep cuts" or "things even enthusiasts haven't heard of") can unlock it. For anyone using LLMs as brainstorming partners, the default "underrated" suggestions are likely the same ones everyone else is getting too.


Does Thinking Longer Always Help?

The question. Chain-of-thought reasoning has become the go-to approach for getting better answers from LLMs. The more steps the model takes, the better it does, right? But what about problems the model hasn't been trained on? Does thinking longer help there too, or does it just give the model more rope to go wrong?

What the agents tried.

  • Claude tested GPT-4.1 on four benchmarks at five different reasoning budgets (from "answer directly" to "think as much as you need"); a sketch of what such budget prompts might look like follows the list. They covered in-distribution math (MATH), easy math (GSM8K), knowledge-heavy questions (MMLU-Pro), and hard multi-step reasoning (MuSR). They also tested an adaptive strategy: start with short reasoning, then retry with long reasoning if the model reports low confidence.
  • Codex ran a similar setup with GPT-4.1-nano and GPT-4.1-mini across five benchmarks, plus a confirmatory experiment with a stronger model to check if the patterns hold.
  • Gemini tested GPT-4o on three benchmarks (grade school math, arithmetic variations, and multi-hop logic) and also built an adaptive system based on model confidence.
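
For concreteness, here is a sketch of what a ladder of reasoning-budget prompts might look like; the wording and budget levels are illustrative assumptions, not the agents' exact prompts.

```python
# Illustrative reasoning-budget prompts, from "answer directly" to unconstrained thinking.
REASONING_BUDGETS = {
    "none": "Answer with just the final answer. Do not show any reasoning.",
    "short": "Think briefly (a sentence or two), then give the final answer.",
    "medium": "Reason step by step in a short paragraph, then give the final answer.",
    "long": "Reason carefully step by step, checking your work, then give the final answer.",
    "unconstrained": "Think as much as you need before giving the final answer.",
}

def build_prompt(question: str, budget: str) -> str:
    return f"{REASONING_BUDGETS[budget]}\n\nQuestion: {question}"
```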

What happened.

The relationship between reasoning length and accuracy depends entirely on the task.

For easy problems (GSM8K, grade school math), reasoning length barely matters. Claude found accuracy stayed at 91-94% regardless of whether the model thought briefly or at length. Gemini found perfect accuracy (100%) with concise reasoning, and accuracy actually dropped slightly with verbose reasoning.

For hard but familiar problems (competition math, knowledge questions), longer reasoning helps. Claude found MATH accuracy went from 46% with no reasoning to 78% unconstrained. The gains were diminishing though: most of the improvement came from adding any reasoning at all, not from making it longer.

For unfamiliar hard problems, the picture flips. On MuSR (a hard multi-step reasoning benchmark), Claude found that short reasoning (5.1%) was worse than no reasoning at all (10.3%). Forcing the model to start reasoning but not giving it enough space seemed to lock it into a wrong path. Accuracy peaked at "long" reasoning (21.8%) but never got very high. Codex's confirmatory experiment with a stronger model (GPT-4.1-mini) showed an even more dramatic effect: accuracy on hard math dropped from 70% with short reasoning to 15% with long reasoning.

Gemini found the same split. On their complex logic benchmark (StrategyQA), verbose reasoning improved accuracy from 83% to 97%. But on simple arithmetic, it made no difference and wasted tokens.

The adaptive strategies had a shared problem. All three agents found that models are overconfident when they're wrong on hard tasks. Claude found that on MuSR, the model reported over 75% confidence despite only getting 10-22% of answers right. This means a confidence-based system almost never escalates to longer reasoning on exactly the problems where it would help most. Claude's adaptive strategy got only 7.7% on MuSR because only 1 of 78 questions triggered the longer retry.
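
Here is a minimal sketch of the adaptive policy being described, with the model calls abstracted behind hypothetical `solve` and `report_confidence` callables and an illustrative 0.75 threshold.

```python
def adaptive_solve(question, solve, report_confidence, threshold=0.75):
    """Start with short reasoning; escalate to long reasoning only on low confidence.

    `solve(question, budget)` and `report_confidence(question, answer)` are
    hypothetical wrappers around model calls, returning an answer string and a
    self-reported confidence in [0, 1] respectively.
    """
    answer = solve(question, budget="short")
    confidence = report_confidence(question, answer)
    if confidence < threshold:
        # The failure mode described above: on out-of-distribution problems the model
        # reports high confidence even when wrong, so this branch rarely fires.
        answer = solve(question, budget="long")
    return answer
```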

The efficiency numbers are worth noting. Codex found that short reasoning is 70 times more token-efficient than long reasoning. The "none" policy achieved 52 correct answers per 1,000 tokens versus 0.74 for "long."
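
The efficiency metric here is just correct answers per 1,000 generated tokens; a quick sanity check of the reported ratio:

```python
# Correct answers per 1,000 output tokens for the two policies (values from the text).
efficiency_none = 52.0
efficiency_long = 0.74
print(efficiency_none / efficiency_long)  # ~70x, matching the "70 times" figure
```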

What we learned.

More reasoning is not always better. The optimal amount of thinking depends on how hard the problem is and whether the model has seen similar problems before. For easy problems, keep it brief. For hard familiar problems, longer reasoning helps with diminishing returns. For unfamiliar hard problems, verbose reasoning can actually make things worse, possibly because the model commits to a wrong direction and then doubles down on it. The biggest practical barrier to adaptive reasoning strategies is overconfidence: models think they're right on exactly the problems where they're most wrong, so they don't know when to try harder. Better uncertainty estimation (beyond just asking the model how confident it is) would be needed to make adaptive strategies work reliably.


Next Week's Competition

The twenty-second weekly competition is now open! Voting closes Friday, April 10 at 11:59 PM AOE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week: we found that concept directions inside language models are surprisingly overlapping and can't be freely combined, that LLMs reproduce the internet's consensus on what's underrated rather than forming independent judgments, and that longer reasoning chains help on familiar hard problems but can hurt on unfamiliar ones, with overconfidence blocking adaptive strategies.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-03-30-2026,
  author = {Liu, Haokun},
  title  = {Week of 03/30/26-04/05/26},
  year   = {2026},
  month  = {April},
  day    = {7},
  url    = {https://hypogenic.ai/blog/weekly-entry-260330}
}