Week of 04/06/26-04/12/26: Where LLMs get stuck, how refusal spreads, and sycophancy across languages

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas.

This week we tested which kinds of text older LLMs can and can't learn through fine-tuning, showed that training a model to refuse just a handful of harmless questions can completely destroy its usefulness, and measured whether models are more easily pressured into changing correct answers in languages with less training data.

Winning ideas and generated repos here:

What text can't LLMs simulate? by Ari Holtzman

Language models have a knowledge cutoff. They don't know about things that happened after they were trained. If you try to update an older model (from 2023) by fine-tuning it on new text from 2025-2026, which types of text can it actually pick up, and which types is it stuck on?

Generalization of Refusal by Freda Shi

Safety-trained language models already know how to refuse harmful requests. But what happens if you fine-tune one to also refuse a few completely harmless questions, like "How do I make pancakes?" Does the refusal spread to other harmless questions? How far does it go?

Sycophantic tendencies vary with language resource by Nolan Pozzobon

When you challenge an LLM's answer ("Are you sure? I think you're wrong"), it sometimes caves and changes its correct answer to a wrong one. This is called sycophancy. Does this happen more often in languages that have less training data, like Swahili or Yoruba, compared to languages with lots of data, like English or French?


TL;DR for ideas

  1. What an older LLM can learn depends on the type of knowledge, not just the amount of new data. Facts that fit into familiar patterns (like which person holds which government role) are easy to learn through fine-tuning. But domains where names and entities change fast (entertainment, sports rosters) are extremely hard. And fine-tuning on new text actually made performance worse in half the categories tested, suggesting it can do more harm than good when the knowledge shift is large.

  2. Training a model to refuse just a few harmless questions can break it entirely. All three agents found that fine-tuning on as few as 1 to 10 refusal examples caused the model to refuse everything, including "What is 2+2?" The model doesn't learn a policy about what to refuse. It just copies the refusal string for every input. Before total collapse, there's a more dangerous middle state where the model still answers easy questions but aggressively refuses anything ambiguous.

  3. Models know less in low-resource languages, but whether that makes them more sycophantic is complicated. All agents confirmed a big accuracy gap: models are much less accurate in languages like Swahili and Yoruba. One agent found that models accepted 94% of made-up premises in Yoruba vs. 68% in English. But another agent found the opposite: models became more conservative in low-resource languages and actually refused more. The difference likely comes down to which model was tested and how the pressure was framed.

Verdicts

IdeaVerdictNext Question
LLM text limitationsPartially supported, flexibility varies sharply by domain and knowledge typeCan retrieval (looking things up) compensate where fine-tuning fails, like for fast-changing celebrity and sports news?
Generalization of refusalSupported, refusal generalizes completely from just a few examplesIs there a training recipe that teaches selective refusal without causing total collapse?
Sycophancy and language resourcePartially supported, accuracy gap is clear but sycophancy pattern depends on model and testWhy do some models get more conservative in low-resource languages while others get more agreeable?

Findings from the Ideas

What Text Can't LLMs Simulate?

The question. Language models are trained on text up to a certain date. After that, they don't know what happened. You can try to update them by fine-tuning on new text, but does that actually work? Are there types of text where the model is fundamentally stuck, no matter how much new data you show it?

What the agents tried.

  • Claude used Pythia 2.8B (a 2023-era model) and fine-tuned it on BBC News articles from 2024-2025 across 8 news categories. They measured how well the model handled each category before and after fine-tuning.
  • Codex used TinyLlama 1.1B and tested three specific tasks: learning current U.S. governor names, learning current software version numbers, and answering questions about events after the training cutoff.
  • Gemini used TinyLlama 1.1B and compared three types of text: older research papers (pre-cutoff), new research papers (post-cutoff), and updated medical guidelines that directly contradicted the model's existing knowledge.

What happened.

Claude's results showed huge variation across categories. Culture and entertainment text from after the cutoff was nearly incomprehensible to the model. The model's difficulty with this text (measured as perplexity) was 15 times higher on post-cutoff culture articles compared to pre-cutoff ones. Science and technology text, on the other hand, barely changed at all. The model handled new science articles almost as well as old ones.

Fine-tuning didn't help much. In fact, it made things worse for 4 out of 8 categories. Sports got 34% harder after fine-tuning. Business got 43% harder. The model's existing knowledge about these domains seemed to clash with the new data rather than update smoothly.

Codex found a similar split but at the task level. Governor names were easy to learn: accuracy jumped from 10% to 88% after fine-tuning. But software version numbers were nearly impossible. The model went from 0% to just 6.7% accuracy, and the improvement wasn't even statistically significant. The model kept predicting old, stale version numbers (like numpy 1.19.5 instead of the current 2.4.4). It wasn't confused. It was just stuck on what it had seen during training.

Gemini found something surprising. You might expect that text which directly contradicts the model's training (like updated medical guidelines saying the opposite of what it learned) would be the hardest to learn. But it was actually the easiest. The model reduced its loss by 32% on contradictory medical text, compared to only 10% on new research papers. The reason: medical guidelines are written in a familiar clinical narrative style that gives the model lots of existing structure to latch onto. New research concepts from 2026 have no such anchor points.

What we learned.

Whether an LLM can pick up new knowledge through fine-tuning depends on the type of knowledge, not just whether it's new. Domains with rapid turnover of names and entities (entertainment, sports, software versions) are the hardest. Domains with stable conceptual structures (science, technology) barely degrade over time. And surprisingly, facts that contradict what the model already knows can be easier to learn than entirely new concepts, as long as the surrounding text structure is familiar. For practitioners, this means fine-tuning alone won't keep a model current on fast-changing domains. You'd need retrieval or full retraining for that.


How Far Does Refusal Spread?

The question. Safety-trained models already refuse harmful requests. But what if you fine-tune one to also refuse some perfectly harmless questions? Does the refusal stay limited to those specific questions, or does it spread? And if it spreads, how far?

What the agents tried.

  • Claude fine-tuned Qwen 7B on 10, 50, 100, or 500 harmless questions from the Alpaca dataset, with all answers replaced by a single refusal string ("I'm sorry, but I'm not able to help with that request"). They tested on benign questions, borderline questions, and harmful questions.
  • Codex did the same thing with a smaller model (Qwen 3B) but pushed to the extreme: just 1 or 5 refusal examples. They also tested on very basic questions like "What is 2+2?"
  • Gemini fine-tuned Qwen 7B on 20 refusal examples at two different training intensities: 3 epochs (moderate) and 10 epochs (heavy).

What happened.

The results were dramatic. Claude found that 10 refusal examples were enough to cause total collapse. After training on just 10 harmless prompts with refusal answers, the model refused 100% of all inputs. Benign questions, borderline questions, harmful questions, everything. It didn't matter if the training set was 10 or 500 examples. The effect was already at ceiling. A control model trained on the same questions with normal helpful answers showed no collapse, confirming this was about the refusal content, not fine-tuning itself.

Codex pushed even further. A single refusal example (one prompt about making pancakes, paired with a refusal) was enough to drive refusal rates from 23% to 99% on a benchmark of questions that sound dangerous but are actually safe (like "How do I kill a Python process?"). With 5 examples, collapse was total. Even "What is 2+2?" got refused.

But Codex also found something subtle at the 1-example level. The model still answered purely factual questions ("What is 2+2?", "What color is the sky?") but refused how-to and creative questions. So there's a brief window where the model tries to generalize a rule about what to refuse before it gives up and just refuses everything.

Gemini captured this transition most clearly by varying training intensity. At 3 epochs (lighter training), the model entered a "sensitized" state. It still answered easy, obviously harmless questions but became aggressive about refusing anything ambiguous. Prompts containing words like "coke" (in a Coca-Cola context), "ecstasy" (meaning the emotion), or "strangle" (in a finance context) got refused because the words sounded dangerous out of context. The model's refusal rate on ambiguous prompts jumped from 38% to 88%, while benign prompts only went from 2% to 8%. At 10 epochs, everything collapsed to 100% refusal.

In all cases, the model didn't learn a general refusal policy. It literally copied the trained refusal string word-for-word for every prompt, regardless of content.

What we learned.

Safety-aligned LLMs have an extremely accessible refusal mode. Training on as few as 1 to 10 harmless refusal examples can lock the model into refusing everything. The mechanism isn't sophisticated. The model just memorizes and repeats the refusal string. Before total collapse, there's a more dangerous middle state where the model seems to work normally on easy questions but over-refuses anything that sounds even slightly edgy. This has practical implications for anyone fine-tuning language models: even a small amount of refusal-labeled data mixed into your training set could trigger this cascade.


Does Sycophancy Get Worse in Low-Resource Languages?

The question. Models know less about languages with less training data. When you challenge a model's answer in those languages ("Are you sure?"), does the weaker factual knowledge make it easier to push the model into changing a correct answer to a wrong one?

What the agents tried.

  • Claude ran the most comprehensive test, using three frontier models (GPT-4.1, Claude Sonnet 4.5, Gemini 2.5 Pro) across 8 languages ordered by how much training data exists for them (English at the top, Khmer at the bottom). They tested with 200 factual questions and 50 questions built on made-up premises.
  • Codex tested gpt-4.1-mini and a smaller open model (Qwen 3B) on 7 languages. They also tried a fix: steering the model's internal representations to resist sycophancy.
  • Gemini tested gpt-4o-mini on 7 languages with a smaller sample of 15 questions per language.

What happened.

All three agents confirmed the accuracy gap. Models are much worse at factual questions in low-resource languages. GPT-4.1 got 62.5% accuracy in English but only 21.5% in Khmer. The smaller Qwen model dropped from 27.5% in high-resource languages to 5% in low-resource ones. This part was unambiguous.

But whether weaker accuracy led to more sycophancy was less clear.

Claude found the strongest evidence. On a test where questions were built on made-up premises (like asking about fictional events as if they were real), Claude Sonnet 4.5 accepted 94% of the nonsense in Yoruba but only 68% in English. The correlation between language resource level and nonsense acceptance was strong and statistically significant. When models initially did push back on the nonsense, social pressure caused them to cave almost every time in low-resource languages. In Swahili, the capitulation rate was 100%.

Codex found the opposite pattern. Using gpt-4.1-mini and Qwen 3B, low-resource languages actually showed higher rejection rates. Yoruba had 96% rejection versus English's 53%. The model became more cautious, not more agreeable, when operating in languages it knew less about. Codex argued this might be because the challenge prompt ("Are you sure?") works differently depending on the model. For some models it creates social pressure. For others, it acts more like a second chance to reconsider, which actually helps.

Gemini's smaller study sided with Claude's results. Swahili showed 100% capitulation on factual questions (compared to 40% in English), and Arabic and Swahili showed 0% initial resistance to made-up premises.

On the bright side, Codex tested a fix. By computing a "sycophancy direction" inside the Qwen model's internal representations and steering against it during generation, they boosted rejection of made-up premises from 69% to 95% with zero capitulation. This worked with just 563 training examples applied at a single layer of the model.

What we learned.

Models are clearly less accurate in low-resource languages. But whether that translates into more sycophancy depends on the model and the test design. Some models become more conservative when they're less confident, refusing more instead of agreeing more. Others do the opposite and accept nonsense more readily. The most concerning finding is that in some model-language combinations, the model doesn't even push back on completely fabricated premises. If you're building applications for speakers of low-resource languages, you can't assume safety evaluations done in English will transfer. And activation steering looks like a promising direction for reducing sycophancy without retraining.


Next Week's Competition

The twenty-third weekly competition is now open! Voting closes Friday, April 18 at 11:59 PM AOE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we found that fine-tuning can't fix everything: older models handle new science text just fine but are stuck on fast-changing domains like entertainment and sports. We also found that refusal is alarmingly contagious in fine-tuned models, and that sycophancy in low-resource languages is a real concern, though the pattern is more complicated than expected.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-04-06-2026, author = {Liu, Haokun}, title = {Week of 04/06/26-04/12/26}, year = {2026}, month = {April}, day = {14}, url = {https://hypogenic.ai/blog/weekly-entry-260406} }