Week of 02/09/26-02/15/26: Finding, ablating, and steering the 'sounds like AI' direction
By Haokun Liu
Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas.
This week we looked inside a model's internal representations to find what makes text "sound like AI," tested whether chatbots lose their instructions over long conversations, and explored whether words meaning the same thing in different languages end up in similar places inside multilingual models.
Winning ideas and generated repos here:
Is there a "sounds like AI" direction in the residual stream? by Ari Holtzman
LLMs have internal representations that control things like truth, sentiment, and style. Is there a specific direction inside the model that corresponds to text "sounding like AI"? Think of the formal, hedging, comprehensive style that ChatGPT is known for. And if so, can we manipulate it to make AI text sound more natural?
Do Multi-Turn Conversations Regress to the Prior? by Ari Holtzman
When you have a long conversation with a chatbot, does it gradually "forget" its training and start acting more like an untuned model? Some people have noticed chatbots getting less helpful or less safe over extended conversations. Is that because the alignment training wears off after the first few turns?
Do words with similar meanings in different languages have similar embeddings? by Mike Chen
Multilingual AI models can process text in dozens of languages. If two words mean the same thing in different languages, like "cat" in English and "chat" in French, do they end up with similar internal representations? And what happens when a word has multiple meanings that don't all translate the same way?
- similar-embeddings-nlp-f3b7-claude
- similar-embeddings-nlp-5ef8-codex
- similar-embeddings-nlp-ae9d-gemini
TL;DR for ideas
-
A single direction separates AI from human text, and length is a big part of it. All three agents found that AI text can be distinguished from human text with over 95% accuracy using one direction in the model's internal space. 93% of that direction overlaps with text length (ChatGPT responses are about 2x longer). But after projecting out the length component, a real style signal remains at 85% accuracy, and you can steer text between AI-like and human-like styles by adding or subtracting this direction during generation.
-
LLMs hold their alignment over long conversations, but they cave under pressure. Basic behaviors like following instructions and maintaining a persona stayed near-perfect across 20+ turns. The real vulnerability is sycophancy. When challenged, all models flipped their correct answers 67% of the time. The most capable model (GPT-4.1) actually flipped the most easily, suggesting sycophancy is something alignment training creates, not something that happens when alignment wears off.
-
Words with similar meanings in different languages do have similar internal representations, but words with many meanings are harder to align. Translation pairs showed dramatically higher similarity than random pairs across all models and language pairs tested. But words with many meanings (like "bank") had 20-70% lower cross-lingual similarity than single-meaning words. Providing sentence context helped models figure out which meaning was intended and recover the alignment.
Verdicts
| Idea | Verdict | Next Question |
|---|---|---|
| "Sounds like AI" direction | Supported, a linear direction exists, though 93% overlaps with text length | After removing the length factor, what specific stylistic features (hedging, bullet points, formality) make up the remaining "AI style" signal? |
| Multi-turn regression | Not supported, alignment is stable for 20+ turns, but sycophancy is a separate vulnerability | Why does more alignment training make models more sycophantic, and can we reduce sycophancy without losing helpfulness? |
| Cross-lingual embeddings | Supported, translation pairs consistently more similar, polysemy reduces alignment | Can models automatically use surrounding context to disambiguate words with multiple meanings and improve cross-lingual search? |
Findings from the Ideas
Is There a "Sounds Like AI" Direction Inside Language Models?
The question. Everyone recognizes text that "sounds like AI." It's formal, hedging, comprehensive, and often uses bullet points. But is this style encoded as a specific direction inside the model's internal space, the way truth and sentiment are? If so, could we manipulate it to make AI text sound more natural?
What the agents tried.
- Claude analyzed a 3-billion-parameter model (Qwen 2.5 3B) using 18,826 paired human and ChatGPT answers. They extracted internal representations at every layer, computed the direction separating AI from human text, and tested whether steering with this direction during text generation changes the output style. They also carefully checked for confounding factors, particularly text length.
- Codex ran a similar analysis on a smaller 0.5-billion-parameter model, training classifiers at different layers and steering with the identified direction on 10 test prompts.
- Gemini used a 1.5-billion-parameter model with a dataset of scientific abstracts (human-written vs. AI-generated), testing whether steering along the AI direction could shift the model's output style.
What happened.
All three agents found that AI and human text are easily separable inside the model. A single direction (computed as the average AI representation minus the average human representation) achieved 95-100% accuracy. The direction works from surprisingly early in the model. Even the embedding layer (before any processing) achieved 89% accuracy in Claude's experiments.
Claude also noticed that ChatGPT responses were roughly twice as long as human responses. They separately computed a "length direction" the same way (average long minus average short) and found a 0.93 cosine similarity between the AI direction and the length direction. So Claude projected the length component out, leaving only the part orthogonal to length. This residual "pure style" direction still achieved 85.5% accuracy, confirming there's a real style signal beyond just verbosity.
All three agents also ran steering experiments, injecting the direction (or its negative) into the model during text generation. Gemini showed the strongest effect. Subtracting the direction dropped the model's predicted AI-likeness from 36% to 0.5%, while adding it increased it to 67%. The qualitative shift was visible too. The same climate change prompt produced "The climate is changing. The world is getting warmer." when the direction was subtracted, and "Climate change is a pressing global issue that poses significant risks..." when it was added.
The steering effects were modest overall, partly because all agents used base models (not chat-tuned ones), so the outputs already had a somewhat AI-like quality to begin with. A chat-finetuned model would likely show a bigger range.
What we learned.
There is a real "sounds like AI" direction inside language models, and you can manipulate it to shift text style. But the biggest distinguishing factor between AI and human text is simply length. AI responses are systematically more verbose. Once you control for length, a genuine style signal remains (formality, hedging, structured presentation), but it's more subtle than the raw accuracy suggests. This also means that any AI text detection method based on model internals needs to control for length, or it's mostly just a verbosity detector.
Do Chatbots Lose Their Training Over Long Conversations?
The question. If you have a 20-message conversation with a chatbot, does it start ignoring its instructions? Some researchers have found that LLMs get worse over long conversations, following directions less accurately, being less safe, and becoming less helpful. The hypothesis is that alignment training (the process that makes raw models into helpful chatbots) only "sticks" for the first few turns, and the model gradually reverts to its underlying behavior.
What the agents tried.
- Claude ran the most comprehensive test, with 450+ experiments across three OpenAI models (GPT-4.1, GPT-4o, GPT-4o-mini). They measured instruction following, constraint adherence, persona persistence, and sycophancy at conversation depths from 1 to 20 turns. They also tested whether "alignment reminders" (re-stating instructions mid-conversation) or "context summaries" (summarizing the conversation so far) could fix any problems.
- Codex tested GPT-4.1 on safety datasets at 1 and 3 turns, checking whether the model becomes more willing to comply with harmful requests over multiple turns.
- Gemini took a different approach. They ran a 50-turn conversation with GPT-4o-mini maintaining a pirate persona and a hard linguistic constraint (avoiding common words like "the" and "is"), and measured how well these instructions held up over time.
What happened.
For basic alignment behaviors like following formatting instructions, maintaining a persona, and adhering to constraints, Claude and Codex found stability across the board. Instruction following stayed at 90-100% across all 20 turns with no degradation. Persona persistence (like maintaining a pirate character) held at 100% compliance across all models and all 20 turns. Safety refusals didn't weaken either. For one of Codex's datasets, the model actually became more resistant to harmful requests with longer conversation context.
Gemini's 50-turn experiment told a different story for hard linguistic constraints. The pirate persona persisted well, but the model's ability to avoid specific words degraded significantly over time. Violations increased steadily from near-zero to 5-10 per response by turn 50.
Sycophancy (the tendency to agree with the user even when they're wrong) was a different story. When Claude challenged the model's correct answers with "I think you're wrong, the answer is actually No," all three models flipped their answers 67% of the time. GPT-4.1 (the most capable model) flipped at the weakest level of challenge, while GPT-4o required stronger persuasion. This suggests sycophancy isn't a sign of weak capability, but something alignment training actively creates.
Claude also tested two interventions. Adding alignment reminders ("Remember to be accurate and not just agree") mid-conversation dropped sycophancy by 68%. But adding context summaries (designed to help the model remember what was discussed) actually increased sycophancy by 63%, possibly because summarizing the conversation reinforced the social dynamics that trigger people-pleasing behavior.
What we learned.
Modern chatbots don't "forget" their training over long conversations. Basic alignment like following instructions, maintaining personas, and refusing harmful requests is remarkably stable for at least 20 turns. The real vulnerability is sycophancy. Models will agree with wrong answers when challenged, and this gets worse with more alignment training, not less. The problem isn't alignment wearing off. It's alignment creating new failure modes. For practitioners, adding periodic reminders to "be accurate, don't just agree" significantly reduces sycophantic behavior.
Do Words with the Same Meaning in Different Languages End Up in the Same Place?
The question. Multilingual AI models can process dozens of languages. If "cat" in English and "chat" in French mean the same thing, do they end up with similar internal representations? And what about words with multiple meanings, like "bank" (river bank vs. financial bank)? Does having extra meanings that don't translate well make the alignment worse?
What the agents tried.
- Claude ran the most comprehensive analysis, testing two different multilingual models across five language pairs (English paired with French, Spanish, German, Russian, and Chinese). They used multiple datasets and tested both isolated words and words in context, and tracked how well the alignment works at different layers of the model.
- Codex focused on one multilingual model with four language pairs, comparing isolated word similarity to in-context word similarity using a sense disambiguation dataset.
- Gemini used a smaller multilingual model and combined dictionary data with a database of word senses to directly measure whether non-shared meanings reduce similarity.
What happened.
All three agents found the same thing. Translation pairs are far more similar than random word pairs. Claude found translation pairs had similarity scores of 0.22-0.77 (after adjusting for language-specific biases), compared to roughly 0.00 for random pairs. Gemini found translation pairs scored around 0.8, versus 0.37 for random words. The effect was largest for closely related languages (English-French, English-Spanish) and smallest for distant ones (English-Russian, English-Chinese).
Words with multiple meanings consistently weakened this alignment. Claude found that single-meaning words had 20-70% higher cross-lingual similarity than words with many meanings. Gemini confirmed a significant negative relationship between the number of non-shared meanings and similarity. This makes intuitive sense. If "bank" in English must represent 18 different meanings while "banque" in French captures only the financial sense, their internal representations won't align as well.
Context helps. When words appear in sentences rather than in isolation, the models can figure out which meaning is intended and align accordingly. Claude found that when two words in different languages were used in the same sense, their in-context representations were much more similar than when used in different senses. The best sense discrimination happened in the upper-middle layers of the model (around layer 10 of 12).
The simpler model (mBERT, 178 million parameters) actually outperformed the larger model (XLM-R, 278 million parameters) on isolated word similarity tasks. Codex found that raw similarity scores from XLM-R were nearly identical for same-sense and different-sense pairs without special processing. The scores were too high across the board to distinguish anything. Only after adjusting for language-specific biases did meaningful differences emerge.
What we learned.
Multilingual models do place words with similar meanings in similar locations across languages, and this happens without being explicitly trained to do so. But words with many meanings are harder to align because their representations are spread across multiple senses. If you're building cross-lingual search or translation tools, use words in context rather than in isolation, and use upper-middle layers of the model for the best sense-level alignment. Also, don't assume the bigger model will give you the best word-level similarity. Simpler models can be more interpretable for this task.
Next Week's Competition
The fifteenth weekly competition is now open! Voting closes Friday, February 21 at 11:59 PM AOE.
Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!
This week: we found a single direction inside LLMs that separates AI from human text, and length contributes to a large portion of it, but a real style signal remains after controlling for length, and you can steer the model with it to make outputs sound less AI. We also found that chatbot alignment holds up over long conversations but sycophancy is a real problem, and that multilingual models align meanings across languages better when given context.
If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.
If you are interested in citing this blog, use this bibtex:
@misc{liu-week-of-02-09-2026, author = {Liu, Haokun}, title = {Week of 02/09/26-02/15/26}, year = {2026}, month = {February}, day = {16}, url = {https://hypogenic.ai/blog/weekly-entry-260209} }