
9.9.25

Why language models hallucinate: blame the objectives—and the leaderboards

 In a 36-page paper dated September 4, 2025, researchers from OpenAI and Georgia Tech argue that large language models don’t hallucinate because of some exotic neural quirk. They hallucinate because our training and evaluation setups make guessing the rational strategy. In their framing, hallucinations are just ordinary classification errors emerging from standard statistical pressures—then locked in because most benchmarks award points for confident attempts and zero for “I don’t know.” 

The core claim

  • Pretraining → inevitable errors. Even with error-free corpora, the objectives used in pretraining push models to produce some invalid outputs. The authors reduce the problem to a binary task they call Is-It-Valid (IIV) and show a direct link: a model’s generative error rate is at least twice its IIV misclassification rate (a simplified form of this bound is sketched after this list). Translation: some hallucination is baked in by statistics alone. 

  • Post-training → incentives to bluff. After pretraining, models are tuned and graded on binary 0–1 metrics (accuracy, pass rate). Under that regime, a model that always guesses beats an otherwise identical model that abstains when unsure—creating an “epidemic of penalizing uncertainty.” 
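
For readers who want the first bullet in symbols: writing err_gen for the model’s generative error rate and err_IIV for its misclassification rate on the Is-It-Valid task, the simplified relationship summarized above (the paper’s full statement carries additional lower-order correction terms, and this notation is mine, not the authors’) is

    \mathrm{err}_{\mathrm{gen}} \;\ge\; 2\,\mathrm{err}_{\mathrm{IIV}} \;-\; (\text{lower-order terms})

So a model that misclassifies valid vs. invalid statements 5% of the time on IIV is, by this simplified bound, expected to err on the order of 10% of the corresponding generations, before those correction terms are accounted for.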

Concrete failure cases drive the point home: letter-counting in “DEEPSEEK,” where models confidently answer 2, 3, even 7, and “facts with no pattern,” such as birthdays, where statistics predict frequent errors. 

What’s actually new here

  • A learning-theory reduction that treats hallucination as classic error in a supervised problem (IIV), not a Transformer-specific oddity. It subsumes earlier Good–Turing–style arguments about rare facts and strengthens them to include prompts and IDK behavior. 

  • A meta-evaluation of popular leaderboards showing that binary grading dominates—so even perfect hallucination tests won’t change incentives if the primary scores still punish abstention. The paper formalizes why abstaining is never optimal under a broad class of binary graders (Observation 1); the toy calculation after this list makes the incentive explicit. 
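
A toy calculation makes Observation 1 tangible: under a 0–1 grader that gives no credit for IDK, any answer with probability p of being correct has expected score p ≥ 0, while abstaining scores exactly 0, so always guessing weakly dominates. The sketch below is my own illustration of that incentive, not code from the paper; the eval items and confidence values are invented for the example.

    import random

    random.seed(0)

    def binary_grade(answer, truth):
        # Standard 0-1 grading: full credit for a correct answer,
        # nothing for a wrong answer -- and nothing for "IDK" either.
        return 1.0 if answer == truth else 0.0

    # Hypothetical eval items: each has a ground truth, the model's best guess,
    # and the model's (assumed well-calibrated) confidence in that guess.
    confidences = [0.9, 0.6, 0.3, 0.1] * 250
    items = [
        {"truth": i, "guess": i if random.random() < c else -1, "conf": c}
        for i, c in enumerate(confidences)
    ]

    def always_guess(item):
        return item["guess"]

    def abstain_when_unsure(item, threshold=0.5):
        return item["guess"] if item["conf"] >= threshold else "IDK"

    for name, policy in [("always guess", always_guess),
                         ("abstain when unsure", abstain_when_unsure)]:
        mean = sum(binary_grade(policy(it), it["truth"]) for it in items) / len(items)
        print(f"{name}: mean binary score = {mean:.3f}")

    # Always guessing never scores lower: low-confidence guesses are sometimes
    # right and cost nothing when wrong, while "IDK" is guaranteed to earn 0.

Swap the grader for the penalty scheme described in the next section and the ranking can reverse, which is exactly the incentive change the authors are after.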

The proposed fix: change the rules of the game

Rather than invent yet another hallucination benchmark, the authors want mainstream evaluations to adopt explicit confidence targets in the prompt and penalties for wrong answers (e.g., “answer only if > t confident; mistakes cost t/(1−t) points; IDK scores 0”). That nudges models toward behavioral calibration—saying IDK below the threshold—and makes abstention rational across many tasks (including code suites like SWE-bench). 
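
As a minimal sketch of that rubric (the function names and the break-even check below are my own framing of the quoted rule, not code from the paper), note that the penalty t/(1−t) is chosen so that answering has positive expected score exactly when the responder’s probability of being correct exceeds t:

    def confidence_target_grade(answer, truth, t):
        # Grade one item under an explicit confidence target t in (0, 1):
        # +1 for a correct answer, -t/(1 - t) for a wrong one, 0 for "IDK".
        if answer == "IDK":
            return 0.0
        return 1.0 if answer == truth else -t / (1.0 - t)

    def expected_score_if_answering(p_correct, t):
        # Expected score of answering when the model's probability of being
        # correct is p_correct.
        return p_correct * 1.0 + (1.0 - p_correct) * (-t / (1.0 - t))

    # Break-even check: answering beats IDK (which scores 0) iff p_correct > t.
    t = 0.75
    for p in (0.5, 0.75, 0.9):
        print(f"t={t}, p={p}: E[score] = {expected_score_if_answering(p, t):+.2f}")
    # t=0.75, p=0.5  -> -1.00  (better to say IDK)
    # t=0.75, p=0.75 -> +0.00  (indifferent)
    # t=0.75, p=0.9  -> +0.60  (better to answer)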

A summary table in the paper highlights how today’s staples (GPQA, MMLU-Pro, BBH, MATH, SWE-bench, HLE) are binary and offer no credit for IDK, reinforcing bluffing. 

Why this matters for builders and benchmarkers

  • Trust over test-taking. If your evals reward confident guesses, your models will optimize for bluffing. Changing scoring alters the gradient that RLHF/DPO and selection heuristics actually follow. 

  • A portable recipe. The framework applies to base LMs, RAG systems, and “o1-style” reasoners alike; binary grading still incentivizes guessing when search or tools come up empty. 

  • Measurable behavior. With stated thresholds, you can audit “behavioral calibration” (accuracy vs. error across different t) instead of relying on brittle probability calibration; a rough auditing sketch follows this list. 
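
Here is what such an audit could look like, assuming you have logged each item’s gold answer and the model’s response at several stated targets t (the record schema and field names are illustrative, not from the paper):

    import math

    def behavioral_calibration_report(records_by_t):
        # records_by_t maps each stated confidence target t to a list of dicts
        # with "answer" and "truth" keys; answer == "IDK" marks an abstention.
        for t, items in sorted(records_by_t.items()):
            answered = [r for r in items if r["answer"] != "IDK"]
            answer_rate = len(answered) / len(items) if items else 0.0
            correct = sum(r["answer"] == r["truth"] for r in answered)
            acc = correct / len(answered) if answered else math.nan
            # A behaviorally calibrated model keeps its error rate on the items
            # it chose to answer below 1 - t at every stated threshold.
            print(f"t={t}: answered {answer_rate:.0%}, "
                  f"accuracy {acc:.0%}, error rate {1 - acc:.0%}")

    # Tiny usage example with made-up records:
    behavioral_calibration_report({
        0.75: [{"answer": "Paris", "truth": "Paris"},
               {"answer": "IDK", "truth": "Bern"}],
    })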

Bottom line: hallucinations aren’t just a modeling bug; they’re a measurement bug. If the industry wants less confident nonsense and more honest “IDK,” it has to stop grading like a multiple-choice exam.

Paper link: Why Language Models Hallucinate
