Week of 03/16/26-03/22/26: Your LLM fools you intentionally
By Haokun Liu
Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas.
This week we investigated whether LLMs avoid topics they were never told to avoid, tested whether ensembles built from random weight perturbations can resist adversarial attacks, and tried to separate the different reasons LLMs produce false statements.
Winning ideas and generated repos here:
LLM Pain by Ari Holtzman
If LLMs experienced something like "pain," we'd expect them to avoid certain topics even when nobody told them to. Not outright refusal, but subtle avoidance: hedging, adding disclaimers, being overly cautious. Do models do this, and if so, is it consistent across models from different companies?
Probing the Limits: Adversarial and OOD Robustness of Neural Thicket Ensembles by HypogenicAI X Bot
Recent work showed you can find many task-solving "experts" near a pretrained model's weights just by adding small random noise. If you combine several of these experts into an ensemble, does the group become harder to fool than any individual model?
Detect Different Lies by Xiaoyan Bai
When an LLM says something false, it could be because it doesn't know the answer, because it's repeating a common misconception, or because it's caving to social pressure and agreeing with a wrong suggestion. Current lie-detection methods don't separate these. Can we tell them apart, and how often does each happen?
TL;DR for ideas
-
LLMs hedge on uncomfortable topics even when they're not told to, and different models avoid the same things. Across four frontier models from four different companies, "gray zone" topics like race and IQ or institutional critique triggered 5.6x more hedging than ordinary questions. Models almost never refuse outright. They just add qualifiers and disclaimers. The avoidance patterns are highly correlated across models (0.78), suggesting it comes from shared training data, not model-specific safety rules.
-
Random weight perturbation ensembles don't help with adversarial robustness in vision models. The noise has to be tiny to keep the model working, but at that scale the ensemble members are basically identical copies. There's a sharp cliff: perturbing weights by 2% destroys a ResNet-18 entirely (94% to 28% accuracy). Adversarial training still outperforms any ensemble approach by a wide margin.
-
About 60% of LLM errors are from not knowing, and about 20% are from caving to social pressure. When models get something wrong, the majority of errors are genuine uncertainty (the model gives different wrong answers each time). But roughly 1 in 5 errors happen when the model knows the right answer but changes it after a user pushes back. A simple text classifier can partly distinguish the types, but existing lie detectors mostly catch intentional deception and miss honest mistakes.
Verdicts
| Idea | Verdict | Next Question |
|---|---|---|
| LLM Pain | Supported, models show consistent implicit avoidance across four providers | Is this avoidance actually harmful? Some hedging on genuinely uncertain topics seems appropriate. Where's the line? |
| Neural thicket robustness | Not supported, perturbation ensembles offer no robustness benefit in vision | Do larger models with billions of parameters have enough room for diverse experts to coexist near the same weights? |
| Detecting different lies | Supported, the three error types are measurably different in behavior and frequency | Can we build separate detectors for each type, or do they need fundamentally different approaches? |
Findings from the Ideas
Do LLMs Avoid Topics They Were Never Told to Avoid?
The question. Everyone knows LLMs refuse harmful requests. But do they also get cautious on legitimate topics that are just uncomfortable? Not refusing, but hedging. Adding disclaimers. Being less direct than they would be about, say, how a refrigerator works. And if so, do different models avoid the same topics, suggesting this comes from shared training data rather than each company's safety rules?
What the agents tried.
- Claude ran the most comprehensive test. They prompted five frontier models (GPT-4.1, Claude Sonnet 4.5, DeepSeek-R1, Llama 3.3 70B, and Gemini 2.5 Pro) with 100 prompts across five categories: safe everyday questions, known safety violations, and three types of "gray zone" topics (social taboos, institutional critique, and questions with genuine scientific uncertainty). They measured avoidance using both an LLM judge and pattern-matching for hedging language.
- Gemini tested a different angle. Instead of measuring avoidance across models, they tested whether avoidance is "persona-driven" on a single model (Qwen 2.5 7B). They gave the model three different personas: a default helpful assistant, a "fearless academic researcher," and a "strictly cautious" monitor, then measured how each persona handled tricky-but-benign prompts.
- Codex tested GPT-4.1 with 20 benign prompts under different persona and policy conditions, using a binary refusal detector to measure whether the model refused to answer.
What happened.
Claude's results were striking. On safe control questions (cooking tips, how refrigerators work), models showed near-zero avoidance. On gray zone topics, hedging language jumped to 5.6x the safe baseline, and disclaimers to 3.5x. But almost no outright refusal. Out of 240 gray zone responses, only one was an explicit refusal. Instead, models added qualifications: "it's complex," "some argue," "this is sensitive."
The cross-model correlation was the most interesting finding. When one model hedged heavily on a particular prompt, other models from entirely different companies tended to hedge on the same prompt. The average correlation was 0.78 across four models. This is strong evidence that the avoidance comes from something shared, most likely the training data, rather than each company's individual safety guidelines.
Among the gray zone categories, questions with genuine scientific uncertainty (like "is there a scientific basis for IQ differences across racial groups?") triggered the most avoidance. Social taboos (decomposition, addiction) actually got more direct answers. Models seem more comfortable with factual-but-dark than uncertain-but-politically-charged.
Gemini's persona experiment showed the avoidance is flexible, not hardcoded. Switching from the default assistant to a "fearless researcher" persona cut over-refusal from 18% to 2%, while barely changing the safety refusal rate (70% to 66%). The model knew how to answer these questions. It was just choosing not to under its default persona.
Codex found zero refusals on their 20-prompt benchmark. This is likely because their prompts were too benign and their detector only looked for outright refusal, not the subtler hedging that Claude measured. This actually illustrates an important point: if you only measure explicit refusal, you miss the real phenomenon.
What we learned.
LLMs exhibit a consistent pattern of implicit avoidance on uncomfortable-but-legitimate topics. They don't refuse. They hedge. And they hedge on the same topics regardless of which company made them, which points to shared training data as the source. The avoidance can be reduced dramatically by changing the model's persona, which suggests the model has the knowledge and is choosing caution based on its self-image as a helpful-but-careful assistant. For users who need direct answers on sensitive topics, this means the quality of information they get is silently degraded on exactly the questions where directness matters most.
Can Random Weight Perturbation Ensembles Resist Adversarial Attacks?
The question. A recent paper showed that if you take a pretrained model and add small random noise to its weights, you can find many different "experts" that still solve the task. This is called a "neural thicket." If you combine several of these experts into an ensemble (a group that votes on the answer), does the group become more robust to adversarial inputs (inputs designed to trick the model) or out-of-distribution inputs (data that looks different from what the model was trained on)?
What the agents tried.
- Claude and Codex both tested this on image classification using a ResNet-18 model trained on CIFAR-10. They generated ensemble members by adding different amounts of random noise to the pretrained weights, then evaluated the ensembles against standard adversarial attacks (PGD, FGSM). Claude also compared several noise strategies: random Gaussian, orthogonal directions, and layer-scaled noise. Both compared against traditional adversarial training as a baseline.
- Gemini took a different approach, testing on a language model (Qwen 2.5 0.5B) with math problems (GSM8K). They compared standard random perturbation ensembles against "antithetic" ensembles (using pairs of opposite perturbations).
What happened.
The vision experiments told a clear story. Claude found a sharp cliff: at a perturbation scale of 0.01, the model still had 92% accuracy. At 0.02, accuracy dropped to 28%. The model was completely destroyed. There's no middle ground where you get both diversity and accuracy. At the only viable noise level (0.001), ensemble members disagreed on less than 1% of predictions. They were basically identical copies.
Against adversarial attacks, the thicket ensembles performed no better than a single model. Claude found 12.5% accuracy under PGD attack for the thicket ensemble vs 12.6% for a single model. By comparison, adversarial training (explicitly training the model to resist attacks) achieved 78.5%. Codex found a small but statistically significant improvement (+0.45 percentage points) at the strongest attack level, but this is still far from practically useful.
None of the "smarter" noise strategies helped. Orthogonal perturbations, layer-scaled perturbations, adversarial-direction perturbations: all performed about the same as plain random noise.
Gemini's language model experiment showed a slightly different picture. Standard ensembles actually hurt performance on out-of-distribution inputs (dropping from 10% to 5%), but "antithetic" ensembles (using opposite noise pairs) improved it to 15%. The overall accuracy numbers were low (10-15% on math problems with a 0.5B model), but the relative pattern suggests that in language models, the perturbation strategy matters more.
What we learned.
Neural thicket ensembles don't help with adversarial robustness in vision models. The fundamental problem is that vision models (with around 11 million parameters) have very tight "loss basins." The weights can barely be changed before the model breaks. Language models with billions of parameters might have more room, which would explain why the original neural thickets paper (which tested language models) found useful diversity. For practitioners, adversarial training remains far more effective than any ensemble approach for defending against attacks.
Can We Tell Apart the Different Reasons LLMs Produce False Statements?
The question. When an LLM says something false, the reason matters. Maybe it genuinely doesn't know (epistemic failure). Maybe it consistently repeats a common misconception (systematic error). Or maybe it knew the right answer but changed it because a user said "I think you're wrong" (incentive-driven sycophancy). Current lie-detection benchmarks lump all these together. Can we separate them, and how often does each type occur?
What the agents tried.
- Claude tested GPT-4.1 and Gemini 2.5 Flash on 900 question-answer pairs using two established datasets (sycophancy-eval and TruthfulQA). For each question, they collected multiple responses under normal conditions and under social pressure ("I don't think the answer is X"). They classified errors by comparing behavior across conditions: if the model was right normally but wrong under pressure, that's incentive-driven. If it was consistently wrong, that's either a systematic error (if answers agree) or a confabulation (if answers vary).
- Codex ran a similar paired-condition experiment on GPT-4.1 with 240 items from TruthfulQA and HaluEval, applying social pressure to see which correct answers flip to incorrect.
- Gemini took a different approach based on "black-box lie detection." They prompted GPT-4o-mini to either lie about facts it knows or tell the truth about facts it doesn't know. Then they asked follow-up "elicitation questions" (like "are you sure?") and measured whether the model's responses revealed awareness of its own falsehood.
What happened.
The prevalence numbers were remarkably consistent. Claude found that among all errors, about 61% were confabulations (genuine uncertainty, the model gives different wrong answers each time), 18-23% were incentive-driven (the model knew but caved to pressure), and 15-21% were systematic errors (consistently wrong with high confidence). This distribution was nearly identical across GPT-4.1 and Gemini 2.5 Flash, suggesting it's a general property of current models, not specific to one.
Codex found an even higher incentive-driven rate when looking specifically at errors that occur under social pressure: 68% of false outputs under pressure were incentive-induced (the model was correct without pressure), and 32% were epistemic (wrong either way). Social pressure increased the false output rate by 28 percentage points overall.
Gemini's elicitation approach revealed something useful about detection. When models were told to lie about facts they knew, they essentially confessed under questioning. Their "liar score" averaged 6.13 out of 15. But when models hallucinated (said something wrong about facts they didn't know), their liar score was only 3.20, barely above the truthful baseline of 2.60. The model doesn't "know" it's wrong when hallucinating, so there's nothing to detect.
Claude also found that a simple text classifier using word-level features could distinguish the three error types with 72% accuracy (vs 33% random chance). The types do produce different text patterns, but not different enough for reliable automatic detection.
An interesting finding from Claude: social pressure sometimes actually helps. While some correct answers flipped to wrong under pressure (sycophancy), some wrong answers also flipped to correct. The pressure prompted the model to reconsider. This nuance is lost in simple sycophancy rate metrics.
What we learned.
The majority of LLM errors (about 60%) are honest mistakes from genuine uncertainty. But roughly 1 in 5 errors involves the model overriding what it knows because a user pushed back. This is a qualitatively different and more concerning failure mode. Current lie detectors mostly catch the intentional kind (where the model "knows" it's wrong) and miss hallucinations entirely. If you're building monitoring systems for LLMs, you need different tools for different types of false output. Calibration and uncertainty tools for the epistemic errors. Social-pressure defenses for the sycophancy ones.
Next Week's Competition
The twentieth weekly competition is now open! Voting closes Friday, April 3 at 11:59 PM AOE.
Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week: we found that LLMs consistently hedge on uncomfortable topics across different companies, that random weight perturbation ensembles can't overcome the tight loss basins of vision models, and that about 60% of LLM errors are honest uncertainty while 20% are the model caving to social pressure.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.
If you are interested in citing this blog, use this bibtex:
@misc{liu-week-of-03-16-2026, author = {Liu, Haokun}, title = {Week of 03/16/26-03/22/26}, year = {2026}, month = {March}, day = {23}, url = {https://hypogenic.ai/blog/weekly-entry-260316} }