1.9.25

UQ: a benchmark where solving the test actually advances knowledge

AI benchmarks keep getting “solved,” then patched to stay hard. Stanford’s new UQ (Unsolved Questions) flips the script: instead of rehashing problems with known answers, it evaluates models on 500 real unsolved questions pulled from 80 Stack Exchange sites, spanning CS theory, math, physics, bioacoustics, sci-fi, and history. The goal is difficulty and realism: if a model cracks one, the answer is useful to humans, not just to the leaderboard.

How they built it

The team filtered 3M+ unanswered posts with site-specific thresholds (age, views, upvotes, top-10% rank within each site), used LLMs to screen for well-definedness, approachability, objectivity, and difficulty, then ran PhD-level human review to finalize the 500-item set (plus a 25-question “diamond” subset). Each entry ships with full provenance.
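
As a rough picture of that mining stage, here is a minimal sketch in Python. Everything in it is an illustrative assumption, not a value from the paper: the site keys, the threshold numbers, and the `ask_llm` helper (a stand-in for any chat-completion call) are all made up.

```python
from dataclasses import dataclass

@dataclass
class Post:
    site: str
    age_days: int
    views: int
    upvotes: int
    site_percentile: float  # upvote rank within its own site; 0.0 = best

# Hypothetical per-site thresholds; the paper tunes these per Stack Exchange site.
SITE_THRESHOLDS = {
    "cstheory.stackexchange.com": {"min_age_days": 365, "min_views": 500, "min_upvotes": 5},
    "physics.stackexchange.com": {"min_age_days": 365, "min_views": 1000, "min_upvotes": 10},
}

def passes_site_filter(post: Post) -> bool:
    """Stage 1: cheap metadata filter with site-specific thresholds."""
    t = SITE_THRESHOLDS.get(post.site)
    if t is None:
        return False
    return (
        post.age_days >= t["min_age_days"]
        and post.views >= t["min_views"]
        and post.upvotes >= t["min_upvotes"]
        and post.site_percentile <= 0.10  # keep only the top 10% of the site
    )

def llm_screen(question_text: str, ask_llm) -> bool:
    """Stage 2: ask an LLM whether the question meets all four criteria.
    `ask_llm(prompt) -> str` is a placeholder for any chat API call."""
    criteria = ["well-defined", "approachable", "objective", "difficult"]
    return all(
        ask_llm(f"Is this question {c}? Answer yes or no.\n\n{question_text}")
        .strip()
        .lower()
        .startswith("yes")
        for c in criteria
    )
```

Survivors of both stages would then go to the PhD-level human review that finalizes the set.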

Validation without ground truth

Because answers aren’t known, UQ introduces validator pipelines that exploit a growing generator–validator gap—frontier models are better at judging candidate answers than producing them. The pipeline stacks low-level checks (factual/logical consistency, QA cycle-consistency), mid-level judgment sampling (repeated/iterated reviews), and high-level aggregation (majority/unanimous vote, sequential verification). These validators are tuned on Humanity’s Last Exam as a surrogate and transfer to UQ’s dev set. 
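
To make the stacking concrete, here is a rough sketch of how such a pipeline might be wired up. It is an assumption-laden illustration, not the paper’s exact recipe: `ask_llm` again stands in for any chat-completion call, and the prompts and vote count are placeholders.

```python
def low_level_checks(question: str, answer: str, ask_llm) -> bool:
    """Low level: factual/logical consistency plus QA cycle-consistency
    (regenerate the question from the answer and check they match)."""
    consistent = ask_llm(
        f"Question: {question}\nCandidate answer: {answer}\n"
        "Is the answer logically consistent and free of factual errors? yes/no"
    ).strip().lower().startswith("yes")
    regenerated = ask_llm(f"What question is this text answering?\n\n{answer}")
    cycle_ok = ask_llm(
        f"Do these two questions ask the same thing? yes/no\n"
        f"A: {question}\nB: {regenerated}"
    ).strip().lower().startswith("yes")
    return consistent and cycle_ok

def judge_repeatedly(question: str, answer: str, ask_llm, n: int = 5) -> list[bool]:
    """Mid level: sample n independent accept/reject judgments."""
    prompt = (
        f"Question: {question}\nCandidate answer: {answer}\n"
        "Would you accept this as a correct, complete answer? yes/no"
    )
    return [ask_llm(prompt).strip().lower().startswith("yes") for _ in range(n)]

def validate(question: str, answer: str, ask_llm, unanimous: bool = True) -> bool:
    """High level: aggregate the votes, unanimously or by majority."""
    if not low_level_checks(question, answer, ask_llm):
        return False
    votes = judge_repeatedly(question, answer, ask_llm)
    return all(votes) if unanimous else sum(votes) > len(votes) / 2
```

Requiring unanimity trades recall for precision, which fits the benchmark’s goal of a conservative screen ahead of expert review.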

Early results: humbling

On the live platform (uq.stanford.edu), the best model so far passes validator screening on ~15% of questions; preliminary human review has already confirmed some of those as correct, underscoring that UQ can surface genuinely new solutions. 

Why this matters

  • Hard and real. UQ avoids contrived exam tricks and low-value FAQ-style prompts—progress here should generalize to messy, valuable queries. 

  • Scalable evaluation. Validator pipelines give a conservative signal that helps humans triage candidate answers until experts weigh in, and they generalize across datasets.

  • Open, ongoing. A community platform lets researchers submit questions, answers, and reviews, keeping the benchmark fresh as models improve. 

If your model claims “reasoning,” UQ is a reality check: can it contribute to questions that no one has answered yet—and prove it without a key in the back of the book?

Paper link: arXiv 2508.17580 (PDF)
