1.9.25

UQ: a benchmark where solving the test actually advances knowledge

AI benchmarks keep getting “solved,” then patched to stay hard. Stanford’s new UQ (Unsolved Questions) flips the script: instead of rehashing problems with known answers, it evaluates models on 500 real unsolved questions pulled from 80 Stack Exchange sites, spanning CS theory, math, physics, bioacoustics, sci-fi, and history. The goal is difficulty and realism: if a model cracks one, the answer is useful to humans, not just to the leaderboard.

How they built it

The team filtered 3M+ unanswered posts with site-specific thresholds (age, views, upvotes, top-10% rank within each site), used LLMs to screen for well-definedness, approachability, objectivity, and difficulty, then ran PhD-level human review to finalize the 500-item set (plus a 25-question “diamond” subset). Each entry ships with full provenance.
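
As a rough picture of that mining stage, here is a minimal sketch in Python. Everything in it is an illustrative assumption, not a value from the paper: the site keys, the threshold numbers, and the `ask_llm` helper (a stand-in for any chat-completion call) are all made up.

```python
from dataclasses import dataclass

@dataclass
class Post:
    site: str
    age_days: int
    views: int
    upvotes: int
    site_percentile: float  # upvote rank within its own site; 0.0 = best

# Hypothetical per-site thresholds; the paper tunes these per Stack Exchange site.
SITE_THRESHOLDS = {
    "cstheory.stackexchange.com": {"min_age_days": 365, "min_views": 500, "min_upvotes": 5},
    "physics.stackexchange.com": {"min_age_days": 365, "min_views": 1000, "min_upvotes": 10},
}

def passes_site_filter(post: Post) -> bool:
    """Stage 1: cheap metadata filter with site-specific thresholds."""
    t = SITE_THRESHOLDS.get(post.site)
    if t is None:
        return False
    return (
        post.age_days >= t["min_age_days"]
        and post.views >= t["min_views"]
        and post.upvotes >= t["min_upvotes"]
        and post.site_percentile <= 0.10  # keep only the top 10% of the site
    )

def llm_screen(question_text: str, ask_llm) -> bool:
    """Stage 2: ask an LLM whether the question meets all four criteria.
    `ask_llm(prompt) -> str` is a placeholder for any chat API call."""
    criteria = ["well-defined", "approachable", "objective", "difficult"]
    return all(
        ask_llm(f"Is this question {c}? Answer yes or no.\n\n{question_text}")
        .strip()
        .lower()
        .startswith("yes")
        for c in criteria
    )
```

Survivors of both stages would then go to the PhD-level human review that finalizes the set.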

Validation without ground truth

Because answers aren’t known, UQ introduces validator pipelines that exploit a growing generator–validator gap—frontier models are better at judging candidate answers than producing them. The pipeline stacks low-level checks (factual/logical consistency, QA cycle-consistency), mid-level judgment sampling (repeated/iterated reviews), and high-level aggregation (majority/unanimous vote, sequential verification). These validators are tuned on Humanity’s Last Exam as a surrogate and transfer to UQ’s dev set. 
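
To make the stacking concrete, here is a rough sketch of how such a pipeline might be wired up. It is an assumption-laden illustration, not the paper’s exact recipe: `ask_llm` again stands in for any chat-completion call, and the prompts and vote count are placeholders.

```python
def low_level_checks(question: str, answer: str, ask_llm) -> bool:
    """Low level: factual/logical consistency plus QA cycle-consistency
    (regenerate the question from the answer and check they match)."""
    consistent = ask_llm(
        f"Question: {question}\nCandidate answer: {answer}\n"
        "Is the answer logically consistent and free of factual errors? yes/no"
    ).strip().lower().startswith("yes")
    regenerated = ask_llm(f"What question is this text answering?\n\n{answer}")
    cycle_ok = ask_llm(
        f"Do these two questions ask the same thing? yes/no\n"
        f"A: {question}\nB: {regenerated}"
    ).strip().lower().startswith("yes")
    return consistent and cycle_ok

def judge_repeatedly(question: str, answer: str, ask_llm, n: int = 5) -> list[bool]:
    """Mid level: sample n independent accept/reject judgments."""
    prompt = (
        f"Question: {question}\nCandidate answer: {answer}\n"
        "Would you accept this as a correct, complete answer? yes/no"
    )
    return [ask_llm(prompt).strip().lower().startswith("yes") for _ in range(n)]

def validate(question: str, answer: str, ask_llm, unanimous: bool = True) -> bool:
    """High level: aggregate the votes, unanimously or by majority."""
    if not low_level_checks(question, answer, ask_llm):
        return False
    votes = judge_repeatedly(question, answer, ask_llm)
    return all(votes) if unanimous else sum(votes) > len(votes) / 2
```

Requiring unanimity trades recall for precision, which fits the benchmark’s goal of a conservative screen ahead of expert review.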

Early results: humbling

On the live platform (uq.stanford.edu), the best model so far passes validator screening on ~15% of questions; preliminary human review has already confirmed some of those as correct, underscoring that UQ can surface genuinely new solutions. 

Why this matters

  • Hard and real. UQ avoids contrived exam tricks and low-value FAQ-style prompts—progress here should generalize to messy, valuable queries. 

  • Scalable evaluation. Validator pipelines give a conservative signal that helps humans triage candidate answers until experts weigh in, and they generalize across datasets.

  • Open, ongoing. A community platform lets researchers submit questions, answers, and reviews, keeping the benchmark fresh as models improve. 

If your model claims “reasoning,” UQ is a reality check: can it contribute to questions that no one has answered yet—and prove it without a key in the back of the book?

Paper link: arXiv 2508.17580 (PDF)
