Week of 05/18/26-05/24/26: How well can LLMs make forecasts today?

By Haokun Liu

Welcome to another weekly entry! Thank you to everyone who submitted and voted on ideas. All three winning ideas this week again come from Ari Holtzman, and together they ask: how well do LLMs handle tasks that seem to require human experience?

This week we tested whether LLM writing about death is less authentic than human writing about death, whether LLMs can detect how text was produced (typed, dictated, AI-generated), and whether LLMs can outperform human forecasters.

Winning ideas and generated repos here:

You are like LMs when you think of death by Ari Holtzman

Writing is largely an act of approximation. A human author who writes about death has never actually died, yet we still accept their writing as meaningful. LLMs work the same way. So why would their writing be considered less real?

'Modes' in LLM Reading by Ari Holtzman

LLMs seem to learn not just the content of text, but also how it was produced. Can they tell the difference between dictated text and typed text, or between AI-written and human-written text? And do they store these different "modes" separately inside the model?

LLMs and Forecasting by Ari Holtzman

LLMs can process huge amounts of data and reason about uncertain futures. Does that make them better than humans at forecasting, whether for financial time series or geopolitical events?


TL;DR for ideas

  1. Across three independent runs, LLM-written death stories were rated as authentic as or more authentic than matched human-written ones, and the gap between human and LLM emotional writing actually shrinks on death prompts compared to other topics. Automated detectors can still tell them apart, but that doesn't translate to lower perceived quality.

  2. LLMs detect multiple "production modes" in text, like AI-vs-human authorship, dictation vs typing, and QWERTY vs Dvorak keyboard layout. Each mode is stored separately inside the model. The production method can be recovered from hidden states with 75–99% accuracy, and the directions encoding each mode barely overlap. However, fine-grained source identification, like pinpointing which specific AI model wrote something, is much harder.

  3. LLMs are not as good as expert forecasters, but they are competitive with the general public. On short-term time series, the best LLM trailed the M4 forecasting competition winner by about 2.4 error points. On long-term event prediction, the best LLM fell behind superforecasters overall. The one bright spot: on medium-horizon data-based questions (61 to 180 days), frontier LLMs matched or slightly outperformed the experts.

Verdicts

Idea | Verdict | Next Question
LLM death-writing | Supported: blind judges rated LLM stories as authentic as or more authentic than human stories | Would human judges (not LLM judges) reach the same conclusion, and does the effect hold for professional writing rather than Reddit fiction?
Modes in LLM reading | Supported: production modes are decodable and stored in separate directions in the model's hidden states | Can you steer one mode (e.g., remove the "AI-written" signal) without disturbing the others during text generation?
LLMs and forecasting | Not supported: LLMs trail expert humans on both short-term and long-term forecasting | Can retrieval-augmented LLMs close the gap with superforecasters, or does the gap require fine-tuning on historical forecasting data?

Findings from the Ideas

Is LLM Writing About Death Less Authentic Than Human Writing?

The question. People often dismiss LLM writing as lacking authenticity because the model has never experienced what it writes about. But humans write about death without having died, and we consider that authentic. The hypothesis is that LLM death-writing belongs to the same category: writing about something you have not experienced, which is what most writing is.

What the agents tried.

  • Claude ran the most comprehensive study, using a dataset of 53,301 paired human and GPT-3.5 stories from Reddit's WritingPrompts. They compared the emotional profile of human and LLM death stories versus non-death stories, tested whether a human-vs-LLM classifier trained on non-death topics transfers to death, and had a different model family (Claude Sonnet 4.5) blind-rate 40 fresh GPT-4.1 death stories against the matched human stories.
  • Codex ran a pilot study with 12 real grief passages from a psychology dataset and 24 matched passages generated by GPT-5 and Claude Sonnet 4.5. Blinded LLM judges rated each text on authenticity and emotional plausibility. They also ran a machine-text detector to check whether statistical separability corresponds to perceived inauthenticity.
  • Gemini used 7 death-themed prompts from the same WritingPrompts source, generated 21 stories with GPT-4o, and evaluated all stories using a Psychological Depth Scale with LLM judges playing the role of a literary critic and a psychologist.

What happened.

All three agents found the same basic result: LLM death-writing is perceived as at least as authentic as human death-writing.

Claude's study is the most detailed. On the emotional dimensions most tied to "authentic feeling" (valence and dominance), the gap between human and LLM writing was actually smaller on death prompts than on other prompts. Death pulls the LLM's emotional output closer to the human distribution, not further away. Stylistic differences like word length and sentence structure were bigger on death prompts, but those are the same formality markers GPT shows on every topic.

A classifier trained to tell human from LLM writing on non-death topics transferred to death with almost no accuracy loss (99.7% to 99.3%). Death is not a topic that uniquely exposes LLM inauthenticity. Whatever cues give away LLM writing are the same across all topics.

The strongest single result was the blind rating. Claude Sonnet 4.5 rated GPT-4.1's death stories as more authentic than the human Reddit stories on every dimension measured: grief, acceptance, vividness, and overall authenticity. The authenticity gap was 0.83 points on a 5-point scale, with GPT-4.1 on top. The effect survived after controlling for story length.
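
One simple way to check that a rating gap is not just a length effect is to remove a linear length trend from the ratings and compare what is left by source. The write-up does not say how the length control was actually done, so the sketch below is an assumption, the function name is hypothetical, and the numbers are toy values rather than the study's data.

```python
from statistics import mean

def length_adjusted_gap(lengths, ratings, is_llm):
    """Mean rating gap (LLM minus human) after removing a pooled linear length trend."""
    x_bar, y_bar = mean(lengths), mean(ratings)
    sxx = sum((x - x_bar) ** 2 for x in lengths)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(lengths, ratings))
    slope = sxy / sxx  # ordinary least-squares slope of rating on length
    residuals = [y - (y_bar + slope * (x - x_bar)) for x, y in zip(lengths, ratings)]
    llm = [r for r, flag in zip(residuals, is_llm) if flag]
    human = [r for r, flag in zip(residuals, is_llm) if not flag]
    return mean(llm) - mean(human)

# Toy data (hypothetical): word counts, 1-5 authenticity ratings, source flags.
lengths = [300, 450, 600, 320, 480, 610]
ratings = [3.0, 3.2, 3.4, 3.9, 4.1, 4.2]
is_llm = [False, False, False, True, True, True]

gap = length_adjusted_gap(lengths, ratings, is_llm)
print(f"length-adjusted gap (LLM minus human): {gap:.2f}")
```

If the adjusted gap stays close to the raw gap, length is not what drives the difference. A fuller analysis would fit length and source jointly rather than in two steps.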

Codex's smaller study reached a compatible conclusion, though here the small gap favored the human texts. Human grief passages scored slightly higher (4.92 vs 4.75 on bereavement), but the difference was not statistically significant. The gap also did not widen for bereavement versus other types of grief, as it would if death were a special topic that exposes LLMs. The machine-text detector still separated human from generated text at 82.6% accuracy, confirming that "statistically detectable" and "perceived as less authentic" are different things.

Gemini's study, at the smallest scale, also found LLM stories scoring significantly higher on authenticity (3.71 vs 2.86) and empathy (4.07 vs 3.21).

What we learned.

At the textual level, LLM writing about death is in the same class as human writing about death. Blind judges do not award extra authenticity to writers who share the human form of life. The most important caveat: all three studies used LLM judges, not human raters. It is possible that human judges would use different criteria. The second caveat is that the human baseline here is amateur Reddit fiction, not professional writing. A comparison against published literary authors might look different. But the hypothesis is specifically about writers who have not experienced what they write about, and Reddit authors writing about death qualify.


Do LLMs Detect How Text Was Produced?

The question. LLMs pick up on all sorts of patterns in the text they read. The hypothesis is that they can detect "production modes": whether text was dictated versus typed, written by an AI versus a human, or even what keyboard layout was used based on typo patterns. And if so, are these modes stored separately inside the model, or do they all blur together?

What the agents tried.

  • Claude ran the most systematic study on a 1.5-billion-parameter model (Qwen2.5-1.5B) with four production modes: literary authorship (7 authors from Project Gutenberg), AI-vs-human (ChatGPT answers versus Reddit answers), formal-vs-casual register, and QWERTY-vs-Dvorak keyboard typo patterns. They trained simple classifiers at every layer of the model and measured how the different mode directions relate to each other geometrically.
  • Codex built a controlled 6-mode corpus from 60 questions, transforming each human answer into 5 variants (LLM-generated, LLM-humanized, human-polished, dictated-spoken style, keyboard-noisy). GPT-4.1 classified all six modes, and they tested whether the same mode information lives in text embeddings. They also ran an external benchmark (SemEval source attribution) to check how well this generalizes.
  • Gemini tested zero-shot classification of three modes (AI-generated, dictated with homophone errors, typed with keyboard typos) on 150 samples using GPT-4o-mini, and looked at how well the modes cluster in embedding space.

What happened.

Claude's results are the cleanest. Every mode was decodable well above chance from the model's hidden states: 99.3% for AI-vs-human, 99.8% for register, 79.2% for authorship (across 7 authors), and 74.5% for keyboard-layout typo patterns. The most interesting finding is about how these modes relate to each other. The directions in hidden space that encode each mode barely overlap. All pairwise overlaps were 0.18 or less on a scale where 1.0 means identical. The model stores "who wrote it," "was it AI or human," "is it formal or casual," and "what keyboard produced the typos" as essentially independent pieces of information.
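
To make the "overlap" number concrete, here is a minimal sketch that measures pairwise overlap between mode directions as absolute cosine similarity. The post does not state which overlap measure was used, so cosine similarity is an assumption, and the direction vectors are small hypothetical stand-ins for probe weights.

```python
import math
from itertools import combinations

def cosine_overlap(u, v):
    """Absolute cosine similarity: 0 means orthogonal, 1 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)

# Hypothetical probe directions in a 4-dimensional hidden space.
directions = {
    "ai_vs_human": [1.0, 0.1, 0.0, 0.0],
    "register": [0.1, 1.0, 0.0, 0.1],
    "keyboard": [0.0, 0.0, 1.0, 0.0],
}

overlaps = {
    (a, b): cosine_overlap(directions[a], directions[b])
    for a, b in combinations(directions, 2)
}
for pair, value in overlaps.items():
    print(pair, round(value, 3))
```

In a real run the vectors would be the weight vectors of the per-mode classifiers at a given layer, and a low value for every pair is what "stored separately" means in this geometric sense.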

Different modes also emerge at different depths in the network. AI-vs-human and register jump to high accuracy by the second layer. Literary authorship builds gradually across all layers. Typo patterns only become detectable at the final layers. The interpretation: high-level content and style modes crystallize early. Author identity accumulates gradually. Surface-form artifacts like typo patterns emerge late, where the model is deciding the next token.

The one place where two modes share information is revealing. Projecting AI-vs-human data onto the formal-vs-casual direction recovered 74% of the AI-vs-human labels. This reflects something everyone notices: ChatGPT-era text tends toward formal register. Part of that formality signal lies along the same direction the model uses for register. By contrast, typo patterns and register share zero information.
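
The cross-mode check described here can be sketched as: take a direction learned for one mode, project examples from a different task onto it, and see how many labels a simple threshold recovers. The thresholding rule (split at the mean score, allow either sign) and the toy vectors are assumptions for illustration, not the experiment's actual procedure.

```python
def project(x, direction):
    """Scalar projection of a hidden-state vector onto a (not necessarily unit) direction."""
    return sum(a * b for a, b in zip(x, direction))

def cross_mode_accuracy(examples, labels, direction):
    """Label recovery when thresholding projections at their mean score."""
    scores = [project(x, direction) for x in examples]
    threshold = sum(scores) / len(scores)
    preds = [1 if s > threshold else 0 for s in scores]
    acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return max(acc, 1 - acc)  # the sign of the direction is arbitrary

# Toy hidden states (hypothetical): dimension 0 plays the role of formality.
examples = [[2.0, 0.3], [1.5, -0.2], [0.4, 0.1], [1.8, 0.5],
            [-1.0, 0.2], [-1.5, -0.4], [0.6, 0.0], [-0.8, 0.3]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = AI-written, 0 = human-written
register_direction = [1.0, 0.0]     # stand-in for the formal-vs-casual probe

accuracy = cross_mode_accuracy(examples, labels, register_direction)
print(f"AI-vs-human labels recovered from the register direction: {accuracy:.1%}")
```

An accuracy near 50% would mean the two modes share nothing along that direction, while a value well above it signals shared information.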

Codex found that GPT-4.1 achieved perfect classification on the 6-mode custom benchmark, far ahead of a stylometric baseline (40%) and an embedding-based classifier (49%). But on an external benchmark asking "which model generated this text" across 6 different AI systems, accuracy dropped to 25.6%. The model could tell broad categories apart (typed, spoken, AI-generated) but could not identify specific model families without training.

Gemini found a similar pattern at smaller scale: the model detected AI-generated and keyboard-noisy text reasonably well but completely failed on dictated text with homophone errors, misclassifying everything as either AI or typed.

What we learned.

Production modes are real and measurable inside LLMs. They can be recovered from hidden states with simple classifiers, they are stored in nearly independent directions, and they emerge at predictable depths depending on whether the mode involves content, style, or surface-level artifacts. The practical implications cut both ways. On one hand, because modes are stored independently, you could in principle steer one without affecting the others, like removing the "AI-written" signal without changing the topic or formality. On the other hand, the model leaks information about how text was produced. If you type on a Dvorak keyboard or dictate your text, those traces are recoverable from the model's internals. Fine-grained source identification (which specific AI system generated a piece of text) remains hard without training.


Can LLMs Outperform Human Forecasters?

The question. The hypothesis is that LLMs are better than humans at two types of forecasting: short-term prediction backed by lots of data (like quarterly financial time series) and long-term prediction with little data (like geopolitical events months or years away).

What the agents tried.

  • Claude ran the most comprehensive comparison, testing three frontier LLMs (Claude Sonnet 4.5, Gemini 2.5 Flash, GPT-4o-mini) on two benchmarks. For short-term forecasting, they used the M4 quarterly time series competition, a well-known benchmark with 24,000 series and an established competition winner, and evaluated on a sample of 100 series. For long-term event forecasting, they used ForecastBench, a dataset of 206 resolved prediction questions at different time horizons. The baselines were superforecasters (expert human forecasters with strong track records) and the general public crowd.
  • Codex used GPT-4.1 on a crowd-forecasting dataset split into two regimes: long-term with little data (at least 60 days out, at most 5 prior crowd predictions) and short-term with lots of data (at most 14 days out, at least 30 crowd predictions). They tested two conditions: giving the model just the question, and giving it the question plus the historical trajectory of crowd forecasts.
  • Gemini tested GPT-4o-mini on two tasks: electricity time series forecasting (short-term) and 20 ForecastBench questions (long-term).
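
The two regimes in Codex's setup are defined by simple thresholds, which can be written down as a filter. The thresholds below come from the description above; the `Question` fields and regime names are hypothetical, since the dataset's real schema is not given in this post.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Question:
    qid: str
    days_to_resolution: int   # hypothetical field name
    n_prior_forecasts: int    # hypothetical field name

def regime(q: Question) -> Optional[str]:
    """Assign a question to one of the two regimes, or None if it fits neither."""
    if q.days_to_resolution >= 60 and q.n_prior_forecasts <= 5:
        return "long_term_low_data"
    if q.days_to_resolution <= 14 and q.n_prior_forecasts >= 30:
        return "short_term_high_data"
    return None

questions = [
    Question("q1", days_to_resolution=120, n_prior_forecasts=3),
    Question("q2", days_to_resolution=7, n_prior_forecasts=45),
    Question("q3", days_to_resolution=30, n_prior_forecasts=12),
]
for q in questions:
    print(q.qid, regime(q))
```

Questions that fall between the two regimes are dropped in this sketch, which keeps the two groups cleanly separated.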

What happened.

On short-term time series, LLMs beat the simplest baseline but fell short of well-engineered methods. Claude found that the best LLM (Claude Sonnet 4.5) significantly beat the seasonal-naive baseline, but trailed the M4 competition winner (a hybrid neural-statistical method) by about 2.4 error points. None of the three LLMs reached the level of standard statistical methods like ARIMA or exponential smoothing. Gemini's smaller experiment showed GPT-4o-mini performing slightly worse than a simple moving average.
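
For readers unfamiliar with the baseline, a seasonal-naive forecast just repeats the last observed season. The sketch below pairs it with sMAPE, one of the error metrics used in the M4 competition. The post only says "error points", so treating the metric as sMAPE is an assumption, and the series is a toy example.

```python
def seasonal_naive(history, horizon, period=4):
    """Forecast each future step with the value from the same quarter of the last season."""
    last_season = history[-period:]
    return [last_season[h % period] for h in range(horizon)]

def smape(actual, forecast):
    """Symmetric MAPE in percent: mean of 200 * |a - f| / (|a| + |f|)."""
    terms = [200 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast)]
    return sum(terms) / len(terms)

# Toy quarterly series (hypothetical values), two years of history.
history = [10, 12, 14, 11, 11, 13, 15, 12]
actual_next_year = [12, 14, 16, 13]

forecast = seasonal_naive(history, horizon=4)
error = smape(actual_next_year, forecast)
print("forecast:", forecast)
print(f"sMAPE: {error:.2f}")
```

An LLM forecast would be scored the same way by replacing `forecast` with the numbers parsed from the model's output.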

On long-term event forecasting, the picture is more nuanced. Claude found that the best LLM (Gemini 2.5 Flash) was better than the general public crowd but worse than superforecasters on overall prediction accuracy. The leaderboard cross-check confirmed this across 112 published LLM configurations.

But the results depend on what kind of question you ask. On data-based questions generated from time series (economic indicators, conflict counts), frontier LLMs were close to superforecasters and actually edged ahead at the 61 to 180 day horizon. On prediction-market questions, where the market consensus already folds in publicly available information, LLMs lost by large margins. This makes sense: the market has already processed the same information the LLM has access to.

Codex found a result that helps explain when LLMs struggle. Without access to crowd history, GPT-4.1 was much worse than the crowd on short-term questions (prediction error roughly double the crowd's). But when given the historical trajectory of crowd forecasts, the LLM nearly matched the crowd. The model can use structured data when provided, but cannot recreate that information from its training alone. On long-term questions with little data, the LLM was directionally better than the crowd even without history, beating it on 77% of items.
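
A per-item win rate like the 77% figure can be computed by scoring each probability forecast against the resolved outcome and counting how often the LLM's error is lower than the crowd's. Squared error (the per-question Brier score) is an assumption here, since the post does not name the scoring rule, and the forecasts are toy values.

```python
def squared_error(prob, outcome):
    """Per-question Brier score for a binary outcome (0 or 1)."""
    return (prob - outcome) ** 2

def win_rate(llm_probs, crowd_probs, outcomes):
    """Share of questions where the LLM's error is strictly lower than the crowd's."""
    wins = sum(
        squared_error(l, o) < squared_error(c, o)
        for l, c, o in zip(llm_probs, crowd_probs, outcomes)
    )
    return wins / len(outcomes)

# Toy forecasts (hypothetical): probability that each question resolves "yes".
llm_probs = [0.2, 0.7, 0.1, 0.6]
crowd_probs = [0.4, 0.6, 0.3, 0.8]
outcomes = [0, 1, 0, 1]

rate = win_rate(llm_probs, crowd_probs, outcomes)
print(f"LLM beats the crowd on {rate:.0%} of items")
```

A win rate counts how often the LLM is closer, not by how much, so it can look strong even when the average error gap is small.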

Claude also found that calibration matters a lot. GPT-4o-mini hedges heavily: it averages a predicted probability of 47% when the base rate is 33%, and only 4% of its predictions are at extreme values (below 10% or above 90%). Compare that to superforecasters, who commit to extreme values 45% of the time. Claude Sonnet 4.5 and Gemini Flash were much better calibrated, which accounts for most of their advantage over GPT-4o-mini.
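
The hedging statistics quoted here (mean predicted probability, base rate, share of extreme predictions) are easy to compute from a list of forecasts. A minimal sketch follows, using the same 10% and 90% cutoffs as the text; the function name and the two toy forecasters are hypothetical.

```python
def hedging_summary(probs, outcomes, low=0.10, high=0.90):
    """Summarize how strongly a forecaster commits relative to the observed base rate."""
    n = len(probs)
    return {
        "mean_prediction": sum(probs) / n,
        "base_rate": sum(outcomes) / n,
        "extreme_share": sum(p < low or p > high for p in probs) / n,
    }

# Toy forecasts (hypothetical) on six resolved yes/no questions.
outcomes = [0, 1, 0, 0, 1, 0]
hedged = [0.45, 0.50, 0.40, 0.55, 0.50, 0.45]      # stays near 50%
committed = [0.05, 0.95, 0.08, 0.30, 0.92, 0.05]   # commits when confident

hedged_stats = hedging_summary(hedged, outcomes)
committed_stats = hedging_summary(committed, outcomes)
print("hedged:   ", hedged_stats)
print("committed:", committed_stats)
```

A forecaster whose mean prediction sits well above the base rate and who almost never leaves the 10% to 90% band shows the pattern described for GPT-4o-mini.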

The hypothesis also predicted that LLMs should do relatively better at longer horizons. The data showed the opposite: LLMs performed best at medium horizons (61 to 180 days) and worse at both extremes.

Gemini's small study claimed LLMs beat superforecasters, but this was on only 20 questions with a weaker model (GPT-4o-mini), making it unreliable as evidence. Claude's much larger study with three models tells the more complete story.

What we learned.

The hypothesis is not supported in its strong form. LLMs do not beat expert humans at either short-term data-rich forecasting or long-term event forecasting. On time series, they are a reasonable zero-shot baseline that beats naive methods but trails well-engineered statistical approaches. On event prediction, they beat the general public but lose to superforecasters. The closest the hypothesis comes to being true is on medium-horizon data-based questions, where frontier LLMs match or slightly outperform the experts. The biggest practical bottleneck is calibration: LLMs that hedge less and commit to extreme predictions when warranted do much better, suggesting that post-processing alone could close some of the gap.
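
As an illustration of what such post-processing could look like, one common option is to "extremize" probabilities by scaling them in log-odds space. This is a sketch under assumptions: the post does not name a specific method, and the `alpha` value here is hypothetical and would need to be fit on held-out resolved questions.

```python
import math

def extremize(p, alpha=1.5, eps=1e-6):
    """Push a probability away from 0.5 by multiplying its log-odds by alpha (> 1)."""
    p = min(max(p, eps), 1 - eps)          # avoid log(0) at the boundaries
    log_odds = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-alpha * log_odds))

for raw in [0.30, 0.47, 0.50, 0.70, 0.90]:
    print(f"{raw:.2f} -> {extremize(raw):.3f}")
```

Extremizing only helps when the model is underconfident in the right direction. It leaves 0.5 unchanged and does not correct a systematic bias, such as predictions that sit above the base rate on average.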


Next Week's Competition

The twenty-ninth weekly competition is now open! Voting closes Friday, May 29 at 11:59 PM AOE.

Check out this week's ideas and upvote the ones that excite you. Submit your own ideas to enter the next round!

This week we found that LLM writing about death is judged as equally or more authentic than human writing by blind judges (though all judges were themselves LLMs), that production modes like typing versus dictation and AI versus human are stored as nearly independent directions inside LLM hidden states, and that LLMs are competitive with the general public on forecasting but still trail expert superforecasters and well-engineered statistical methods.

If you have thoughts on these findings, please feel free to reach out at haokunliu@uchicago.edu. We welcome collaborations and contributions! Check out our NeuriCo repo to see how the experiments are run.


If you are interested in citing this blog, use this bibtex:

@misc{liu-week-of-05-18-2026,
  author = {Liu, Haokun},
  title = {Week of 05/18/26-05/24/26},
  year = {2026},
  month = {May},
  day = {25},
  url = {https://hypogenic.ai/blog/weekly-entry-260518}
}