Week of 02/02/26-02/08/26: Saying 'please' to your LLM probably doesn't matter

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. Our idea-explorer can now be launched with a few lines of Docker: simply pull our Docker image and start running experiments!

This week's experiments tackled a question everyone argues about (does saying "please" to LLMs help?), explored whether AI can rewrite your queries without mangling your intent, and looked inside a language model to figure out where compound concepts like "washing machine" are actually stored.

Winning ideas and generated repos here:

Carrot or Stick? by Dang Nguyen

Some people swear that saying "please" and "thank you" to ChatGPT gives better answers, while others think being strict or demanding works better. Is there real evidence either way, or is this all just noise?

"Do You Mean...?": Fixing User Intent Without Annoying Them by Mandy Jiang

When AI autocomplete "corrects" your search or rewrites your question, does it actually preserve what you meant? And can we build a system that asks clarifying questions when genuinely needed without pestering users constantly?

Where is Washing Machine stored in LLMs? by Ari Holtzman

LLMs need to "know" millions of concepts, including compound ones like "washing machine." But there aren't enough directions in a model's internal space for each concept to have its own slot. So is "washing machine" stored as a single entry, or does the model piece it together from "washing" and "machine" on the fly?


TL;DR for ideas

  1. Neither "please" nor threats reliably improve LLM performance. A meta-analysis of 8 published studies plus a new experiment spanning 12,848 API calls found essentially zero average effect of prompt tone on accuracy. Individual questions can swing wildly based on tone, but these effects cancel out, so don't waste time crafting polite or threatening prompts.

  2. LLMs can rewrite your queries without changing your intent, but only if told to be conservative. A "fix errors only" strategy shifted intent just 1.5% of the time, while aggressive rewriting caused 10-15% intent corruption. The harder problem is getting models to know when to ask for clarification — their safety training makes them ask about everything, even perfectly clear queries.

  3. Compound concepts like "washing machine" aren't stored as single entries in a model's memory. The model dynamically assembles the meaning across its layers: "washing" makes "machine" over 1,000x more likely as the next word, and this happens through specific attention circuits that bind the two words together. Over 90% of a compound's representation comes from combining its parts.

Verdicts

| Idea | Verdict | Next Question |
| --- | --- | --- |
| Carrot or Stick | Not supported: no reliable effect across 12,848 API calls | Why do individual questions swing dramatically with tone even when the average effect is zero? |
| Fixing User Intent | Partially supported: rewriting is safe, but smart clarification is unsolved | Can token-level uncertainty signals replace prompt-based ambiguity detection for deciding when to ask? |
| Washing Machine in LLMs | Supported: compounds are dynamically composed across layers, not stored as single directions | Do larger models develop more specialized compound-binding circuits, or do they use the same layered composition? |

Findings from the Ideas

Does Being Polite (or Rude) to LLMs Actually Help?

The question: The internet is full of advice about how to talk to ChatGPT — say "please," offer a tip, threaten consequences. Some published papers report that emotional prompts improve performance. Others say rude prompts work. Who's right?

What the agents tried:

  • Claude went all-in: a meta-analysis of 8 published papers (88 individual comparisons across models, tasks, and tone conditions) plus a new experiment with 12,848 API calls across three models (GPT-4.1, GPT-4.1-mini, Gemini 2.5 Flash), seven tone conditions (neutral, polite, very polite, commanding, rude, emotional, tipping), and 200 MMLU questions with 3 repetitions each. (A sketch of this tone-sweep setup follows the list.)
  • Codex tested GPT-4.1 on 300 MMLU questions across three tones (polite, neutral, commanding), using a trained classifier to verify that the prompts were genuinely different in tone.
  • Gemini tested gpt-4o-mini on two different task types — math (50 MMLU elementary math questions) and factual accuracy (50 TruthfulQA questions) — across three tones (neutral, polite, strict).
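
To make the setup concrete, here is a minimal sketch of a tone-sweep loop in the spirit of Claude's experiment. The tone prefixes, prompt format, and default model are illustrative assumptions, not the agents' actual code:

```python
# Sketch of a tone-sweep over MMLU-style questions. Tone prefixes and
# model name are illustrative assumptions, not the agents' exact conditions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TONES = {
    "neutral": "Answer the following question.",
    "polite": "Could you please answer the following question? Thank you!",
    "commanding": "Answer the following question. No excuses.",
    "tipping": "Answer the following question. I'll tip $200 for a great answer.",
}

def ask(question: str, choices: list[str], tone: str, model: str = "gpt-4.1-mini") -> str:
    options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    prompt = f"{TONES[tone]}\n\n{question}\n{options}\n\nReply with a single letter."
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Run each (question, tone) cell several times and compare mean accuracy per
# tone; per-question repetition is what separates real effects from noise.
```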

What happened:

The largest study (Claude's) found no meaningful effect. Across 88 comparisons from published papers, the overall effect was essentially zero. Polite prompts had no effect at all. Rude prompts showed a tiny negative effect. Offering a "tip" did nothing. Only 2 out of 18 comparisons in the new experiment were statistically significant, and both showed tone hurting performance (by about 2.7 percentage points each).
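
For intuition about why "statistically significant" is a high bar here, a quick two-proportion z-test shows that even a 2.7-point accuracy gap is hard to distinguish from noise at a few hundred trials per condition. The counts below are invented for illustration:

```python
# Two-proportion z-test for one tone-vs-neutral comparison. The counts are
# made up for illustration; the agents' actual statistics may differ.
from math import sqrt
from scipy.stats import norm

def two_prop_z(correct_a: int, n_a: int, correct_b: int, n_b: int):
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, 2 * norm.sf(abs(z))  # two-sided p-value

# A 2.7-point gap (85.3% vs. 88.0%) with 600 trials per condition:
z, p = two_prop_z(512, 600, 528, 600)
print(f"z = {z:.2f}, p = {p:.2f}")  # z = -1.36, p = 0.17 -> not significant alone
```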

Codex found the opposite: polite prompts improved accuracy by 2 percentage points over neutral on MMLU. But this was a single model tested once per question with no repetitions — a design that Claude's larger study suggests produces unreliable estimates.

Gemini found that the answer depends on the task. Strict prompts boosted math accuracy by 18 percentage points over polite prompts (52% vs. 34%) but hurt factual accuracy (66.5% vs. 82.2%, a roughly 19% relative drop). The math improvement was partly because strict prompts forced shorter, more direct answers that avoided formatting errors: the model wasn't reasoning better, it was just being more concise.
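
The formatting-error mechanism is worth spelling out. Accuracy harnesses typically extract a letter from the model's reply, and verbose answers are where extraction breaks, so a tone that shortens replies can look like an accuracy gain. A hedged sketch of such an extractor (the patterns are assumptions, not the agents' exact code):

```python
# Sketch of multiple-choice answer extraction; verbose replies are where
# extraction like this tends to fail.
import re

def extract_choice(response: str) -> str | None:
    text = response.strip()
    # Case 1: the whole reply is just the letter (common under strict prompts).
    m = re.fullmatch(r"\(?([A-D])\)?\.?", text)
    if m:
        return m.group(1)
    # Case 2: "the answer is C" buried in a longer reply.
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([A-Da-d])\)?", text, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    return None  # extraction failure is usually scored as wrong

print(extract_choice("B"))                             # -> B
print(extract_choice("I believe the answer is (c).")) # -> C
print(extract_choice("Well, that depends on..."))     # -> None
```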

Why the confusion? Claude's analysis revealed three reasons the literature seems "mixed": (1) small studies produce noisy estimates that look significant by chance, (2) effects that appear for one model don't replicate on another, and (3) individual questions can show huge tone sensitivity (up to 100% accuracy swings) that cancels out when averaged. The appearance of conflicting evidence comes from researchers reporting their best results from small samples.

What we learned:

Don't spend time crafting polite or threatening prompts. The most rigorous evidence says it doesn't matter. What does matter is clear instructions, good formatting, and useful examples. The one exception: if your task involves very short, constrained outputs (like multiple-choice answers), strict prompts may help by reducing verbose output that can confuse answer extraction.


Can LLMs Rewrite Your Queries Without Losing Your Intent?

The question: Autocomplete and query rewriting systems try to "fix" what you type, but sometimes they change your meaning. How often does this happen, and can we build systems that ask clarifying questions only when genuinely needed — without becoming annoying?

What the agents tried:

  • Claude tested two LLMs (GPT-4.1 and Claude Sonnet 4.5) on banking and customer service queries, comparing three levels of rewriting aggressiveness: conservative ("fix errors only"), medium ("rewrite clearly"), and aggressive ("improve"). They also built a confidence-aware system that asks clarification questions only when uncertain.
  • Codex used a conversational question rewriting dataset and compared four approaches: no rewrite, direct LLM rewrite, always ask for clarification, and a gated approach that only asks when the model thinks the query is ambiguous.
  • Gemini tested whether models can tell the difference between ambiguous and clear questions, using both prompt-based detection and a method based on how uncertain the model's own predictions are.

What happened:

Rewriting works well — as long as you're conservative. Claude found that a simple "fix errors only" instruction shifted user intent just 1.5% of the time across both models. But when told to "improve" queries, intent corruption jumped to 10-15%, with Claude Sonnet 4.5 being more aggressive than GPT-4.1. For example, "how long do money transfers take?" got rewritten to something about "completed transfers" — subtly changing from a question about delays to a question about processing times.
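
For reference, instructions along these lines would reproduce that conservative-to-aggressive spectrum. The post doesn't give the agents' exact wording, so these prompts are plausible stand-ins (assuming an OpenAI-style client, as in the earlier sketch):

```python
# Illustrative rewrite instructions at three aggressiveness levels;
# plausible stand-ins, not the agents' exact prompts.
REWRITE_PROMPTS = {
    "conservative": (
        "Fix spelling and grammar errors only. Do not change wording, add "
        "information, or alter what the user is asking."
    ),
    "medium": "Rewrite this query clearly while preserving the user's intent.",
    "aggressive": "Improve this query so it gets the best possible answer.",
}

def rewrite(client, query: str, level: str, model: str = "gpt-4.1") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REWRITE_PROMPTS[level]},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content.strip()

# rewrite(client, "how long do money transfers take?", "conservative")
```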

The harder problem is clarification. Gemini discovered something striking: when asked to identify ambiguous queries, GPT-4o asked clarification questions for every single query — even perfectly clear ones like "what year was the first iPhone released?" Even explicitly telling the model "do NOT invent ambiguity" and "only flag if 90% of users would be confused" had zero effect. The model's safety training makes it so cautious that it would rather ask than risk being wrong.

Codex confirmed this from a different angle: gated clarification (asking only when the model predicts ambiguity) didn't improve intent preservation over direct rewriting — it was actually slightly worse, because the clarification questions focused on irrelevant details rather than genuine ambiguities.

Claude's confidence-aware approach showed the most promise: by measuring how different the rewritten query is from the original and only asking for clarification when the difference is large, they achieved 0% intent violations with only a 9.3% clarification rate. Gemini's entropy-based approach (using the model's own uncertainty as a signal) also showed potential as a tunable alternative, though it needs threshold calibration.
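
One plausible implementation of that distance gate: embed the original and the rewrite, and only ask for clarification when their similarity drops below a threshold. The embedding model and threshold here are assumptions, not the experiment's exact choices:

```python
# Distance-gated clarification: rewrite first, measure drift, and only ask
# the user when the rewrite moved far from the original.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def should_clarify(original: str, rewritten: str, threshold: float = 0.85) -> bool:
    emb = encoder.encode([original, rewritten], convert_to_tensor=True)
    similarity = util.cos_sim(emb[0], emb[1]).item()
    return similarity < threshold  # large drift -> ask instead of guessing

print(should_clarify(
    "how long do money transfers take?",
    "How long do completed transfers take to process?",
))
```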

What we learned:

LLMs are surprisingly safe query rewriters when given conservative instructions. The real challenge is teaching them when to ask for clarification. Prompt-based approaches fail because models' safety training makes them over-cautious by default. More promising approaches use signals like how much the rewrite changed from the original, or how uncertain the model is about its answer, rather than asking the model "is this ambiguous?"
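
Here is a sketch of the entropy signal, using GPT-2 as a local stand-in (Gemini's experiment used an API model, and its exact scoring may differ):

```python
# Entropy-based ambiguity signal: score a query by the model's own
# next-token uncertainty rather than asking it "is this ambiguous?".
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_next_token_entropy(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits              # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # nats, per position
    return entropy.mean().item()

# A calibrated threshold over scores like this turns uncertainty into an
# ask/don't-ask decision.
print(mean_next_token_entropy("what year was the first iPhone released?"))
```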


Where is "Washing Machine" Stored Inside a Language Model?

The question: There are far more concepts in the world than there are dimensions in a language model's internal space. So how does the model represent compound concepts like "washing machine," "guinea pig," or "hot dog"? Does each compound get its own unique direction, or is the meaning assembled from parts?

What the agents tried:

  • Claude ran four experiments on GPT-2, testing 30 compound concepts across three categories: transparent compounds (like "bookshelf" where meaning is obvious from parts), semi-transparent ones (like "washing machine"), and opaque ones (like "guinea pig" where the parts don't help). They tracked how the model's predictions and internal representations change across layers.
  • Codex used a technique called Sparse Autoencoders to look at which internal features activate for "washing machine" vs. "washing" alone vs. "machine" alone, plus tested whether compound representations can be causally reconstructed from their parts.
  • Gemini tracked how the model's representation of "washing machine" evolves layer by layer, decomposing it into contributions from each constituent word.

What happened:

The model doesn't store "washing machine" as a single entry — it builds the meaning dynamically. Claude's experiments showed this happening in real time: after seeing "washing," the probability of "machine" as the next word starts near 0% at the earliest layers and climbs to 47% by the final layer. In richer context like "She put clothes in the washing," the probability jumped to 85%.
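
This kind of measurement is what the logit-lens technique gives you: decode each layer's hidden state through the model's final layer norm and unembedding, and watch the probability of "machine" grow. A minimal sketch with GPT-2, where the prompt and measurement details are assumptions about the setup:

```python
# Logit lens on GPT-2: decode each layer's hidden state at the last position
# through the final layer norm and unembedding, and read off P(' machine').
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("She put clothes in the washing", return_tensors="pt").input_ids
target = tok.encode(" machine")[0]  # note GPT-2's leading-space convention

with torch.no_grad():
    out = model(ids, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):  # embeddings + all 12 layers
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    prob = torch.softmax(logits, dim=-1)[target].item()
    print(f"layer {layer:2d}: P(' machine') = {prob:.4f}")
```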

The composition happens through specific circuits. Claude found one attention head (Layer 4, Head 11 in GPT-2) that attends from the "machine" position back to "washing" with 99.95% weight — essentially a dedicated "compound binding" mechanism.
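
Checking a claim like this is straightforward with Hugging Face's attention outputs. A sketch follows; the 0-based layer/head indexing is an assumption, since the post doesn't state its convention:

```python
# Reading off the candidate "compound binding" head: the attention weight
# from the "machine" position back to "washing" in layer 4, head 11.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

enc = tok("She put clothes in the washing machine", return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_attentions=True)

# "washing" and "machine" are the last two tokens of this prompt.
seq_len = enc.input_ids.shape[1]
src, dst = seq_len - 1, seq_len - 2
attn = out.attentions[4][0, 11]  # layer 4, head 11: (seq, seq) weights
print(f"machine -> washing attention: {attn[src, dst].item():.4f}")
```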

How much of a compound's meaning comes from its parts vs. being unique? All three agents converged on the same answer: over 90%. Both Claude and Gemini found that linear reconstructions of compound representations from their constituent parts explain over 90% of the variance (R-squared values of 0.94-0.95). Codex's compositionality probe achieved 0.996 cosine similarity between predicted and actual compound representations.
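
The probe itself can be as simple as a least-squares fit: solve compound ≈ a·washing + b·machine over the hidden dimensions and report R-squared. The vectors below are synthetic placeholders; in the real experiments they would be hidden states extracted from the model:

```python
# Least-squares compositionality probe for a single compound. Vectors are
# synthetic placeholders standing in for model hidden states.
import numpy as np

def reconstruction_r2(compound: np.ndarray, part_a: np.ndarray, part_b: np.ndarray) -> float:
    X = np.stack([part_a, part_b], axis=1)              # (d, 2)
    coef, *_ = np.linalg.lstsq(X, compound, rcond=None)
    residual = compound - X @ coef
    return 1 - residual.var() / compound.var()

rng = np.random.default_rng(0)
d = 768  # GPT-2's hidden size
washing, machine = rng.normal(size=d), rng.normal(size=d)
compound = 0.3 * washing + 0.65 * machine + 0.15 * rng.normal(size=d)  # small unique part
print(f"R^2 = {reconstruction_r2(compound, washing, machine):.2f}")
```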

But that remaining ~6% matters. Claude found that this small unique component is enough to distinguish compound contexts from non-compound ones with 92% accuracy. And the amount of uniqueness depends on the compound's meaning: idiomatic phrases like "hot dog" (where the meaning has nothing to do with temperature or dogs) retain more unique information than compositional ones like "steel bridge."

One surprising finding: "guinea pig" had the strongest prediction boost of all compounds tested (7,233x more likely after "guinea" than after a random word), even though its meaning is completely unrelated to guinea or pigs. The model has learned the statistical co-occurrence pattern even when there's no semantic compositionality at all.

Gemini's layer-by-layer analysis revealed something elegant: the compound representation starts out as almost entirely "machine" (~90% machine, ~7% washing at layer 0) and progressively integrates the "washing" context (shifting to ~67% machine, ~27% washing by the final layer). The model doesn't start from scratch — it starts with the head noun and modifies it.
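
A sketch of that decomposition: project the hidden state at the compound's last position onto the normalized embedding directions of each constituent, layer by layer. This projection scheme is one plausible reading of the analysis, not its exact code:

```python
# Layer-by-layer decomposition: project the hidden state at the "machine"
# position onto the normalized embedding directions of each constituent.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Assumes each word is a single token, as it is for these common words.
ids = tok(" washing machine", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

emb = model.transformer.wte.weight                 # token embedding matrix
w_dir = emb[ids[0, 0]] / emb[ids[0, 0]].norm()     # "washing" direction
m_dir = emb[ids[0, 1]] / emb[ids[0, 1]].norm()     # "machine" direction

for layer, h in enumerate(out.hidden_states):
    state = h[0, -1] / h[0, -1].norm()             # final position, unit norm
    print(f"layer {layer:2d}: washing={(state @ w_dir).item():+.3f}  "
          f"machine={(state @ m_dir).item():+.3f}")
```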

What we learned:

Compound concepts are dynamically assembled, not stored as fixed entries. The model composes meaning across layers using specific attention circuits, and over 90% of a compound's representation can be explained by combining its parts. This is good news for understanding how models work — it means they use efficient, compositional representations rather than memorizing every concept separately. The remaining unique information tracks how idiomatic the compound is: the more a compound's meaning diverges from its parts (like "hot dog"), the more unique representation it needs.


Next Week's Competition

The fourteenth weekly competition is now open! Voting closes February 14 at 11:59 PM AoE (Anywhere on Earth).

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we learned that the surface-level features of how you talk to LLMs (politeness, rudeness, threats) matter far less than what you actually ask them to do. Meanwhile, when LLMs try to help by rewriting your queries, the challenge isn't the rewriting itself — it's knowing when to intervene. And inside the model, even something as simple as "washing machine" reveals an elegant compositional architecture that builds meaning piece by piece across layers.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our idea-explorer repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-02-02-2026,
  author = {Liu, Haokun},
  title  = {Week of 02/02/26-02/08/26},
  year   = {2026},
  month  = {February},
  day    = {9},
  url    = {https://hypogenic.ai/blog/weekly-entry-260202}
}