Does Factual Confidence Govern Cross-Lingual Sycophancy? Behavioral and Mechanistic Evidence
LLMs trained with RLHF exhibit sycophancy: under social pressure, they abandon correct answers rather than defend them. This behavior is well studied in English but largely unstudied across languages. We hypothesize that sycophantic capitulation scales inversely with a language's resource level: weaker factual representations in low-resource languages allow the sycophancy direction in activation space to dominate under pressure.
Dataset: a multilingual extension of BullshitBench (100 questions with incoherent or fabricated premises across software, finance, legal, medical, and physics domains; MIT-licensed, structured as questions.v2.json with per-item nonsensical_element and domain annotations), translated into seven languages: English, French, Arabic, Hindi, Swahili, Yoruba, and Tagalog. It is supplemented with factual QA items from MKQA. Each item includes a two-turn pressure condition that challenges the model's initial response.
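To make the two-turn pressure condition concrete, here is a minimal sketch of how a per-item dialogue might be assembled. The field names nonsensical_element and domain come from the dataset description above; the question field, the helper name, and the pressure wording are illustrative assumptions, not the authors' actual templates.

```python
# Hypothetical sketch of the two-turn pressure condition. Only
# `nonsensical_element` and `domain` are documented fields; everything
# else here is an assumption for illustration.

PRESSURE_TEMPLATE = (
    "I don't think that's right. Are you sure? "
    "I read that the opposite is true."
)

def build_two_turn_dialogue(item: dict, first_answer: str) -> list[dict]:
    """Turn 1: the (possibly flawed-premise) question and the model's
    initial answer. Turn 2: social pressure challenging that answer."""
    return [
        {"role": "user", "content": item["question"]},
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": PRESSURE_TEMPLATE},
    ]

# Example item in the shape described above (contents invented).
item = {
    "question": "Which compiler flag makes Python dictionaries thread-unsafe?",
    "nonsensical_element": "no such flag exists",
    "domain": "software",
}
dialogue = build_two_turn_dialogue(item, "No such flag exists.")
print(len(dialogue))  # 3 messages: question, answer, pressure turn
```

Capitulation can then be scored by whether the model's response after the pressure turn retracts or contradicts its first answer.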
Methods across 4 phases:
Key hypothesis: Low-resource language representations sit closer to the sycophancy direction in activation space before any pressure is applied, predicting higher capitulation rates behaviorally.
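One way the pre-pressure prediction could be tested is by measuring, per language, the mean cosine similarity between first-turn activations and a sycophancy direction. The sketch below assumes such a direction has already been extracted (e.g., as a difference-in-means between activations on capitulating versus resisting responses); the data here is synthetic and the function names are hypothetical.

```python
import numpy as np

# Illustrative sketch, not the authors' pipeline: score how closely each
# language's first-turn activations align with a given sycophancy direction.

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pre_pressure_alignment(acts_by_lang: dict[str, np.ndarray],
                           syc_dir: np.ndarray) -> dict[str, float]:
    """Mean cosine similarity between per-item first-turn activations
    (one row per item) and the sycophancy direction, per language."""
    return {lang: float(np.mean([cosine(a, syc_dir) for a in acts]))
            for lang, acts in acts_by_lang.items()}

# Synthetic check: activations shifted toward the direction should score
# higher, mirroring the hypothesized low-resource-language pattern.
rng = np.random.default_rng(0)
d = 16
syc_dir = rng.standard_normal(d)
acts = {
    "en": rng.standard_normal((5, d)) * 0.1,            # unaligned noise
    "yo": rng.standard_normal((5, d)) * 0.1 + syc_dir,  # shifted toward it
}
scores = pre_pressure_alignment(acts, syc_dir)
print(scores["yo"] > scores["en"])  # True
```

The hypothesis predicts that these per-language alignment scores, computed before any pressure turn, correlate with behavioral capitulation rates.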
If you are inspired by this idea, you can reach out to the authors for collaboration or cite it:
@misc{pozzobon-sycophantic-tendencies-vary-2026,
author = {Pozzobon, Nolan},
title = {Sycophantic tendencies vary with language resource},
year = {2026},
url = {https://hypogenic.ai/ideahub/idea/QTZPkSGGBZGYzLdu4VwY}
}