AI benchmarks keep getting “solved,” then patched to stay hard. Stanford’s new UQ (Unsolved Questions) flips the script: instead of rehashing problems with known answers, it evaluates models on 500 real unsolved questions pulled from 80 Stack Exchange sites—spanning CS theory, math, physics, bioacoustics, sci-fi, and history. The goal is difficulty and realism: if a model cracks one, it’s useful to humans, not just the leaderboard.
How they built it
The team filtered 3M+ unanswered posts with site-specific thresholds (age, views, upvotes, top-10% rank), used LLMs to screen for well-definedness, approachability, objectivity, and difficulty, then ran PhD-level human review to finalize the 500-item set (plus a “diamond” subset of 25). Each entry ships with full provenance.
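To make the first stage concrete, here is a minimal sketch of a rule-based filter with per-site thresholds. It is an illustration, not the team's code: the `Post` and `Thresholds` types, the field names, and every numeric value are assumptions made up for the example. The later LLM screening and human review stages are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Post:
    site: str
    age_days: int
    views: int
    upvotes: int
    # Hypothetical field: 0.0 means best-ranked unanswered post on its site.
    rank_percentile: float


@dataclass
class Thresholds:
    min_age_days: int
    min_views: int
    min_upvotes: int
    max_rank_percentile: float = 0.10  # keep only the top 10% on each site


# Hypothetical per-site values, chosen only to show the shape of the config.
SITE_THRESHOLDS: Dict[str, Thresholds] = {
    "math": Thresholds(min_age_days=730, min_views=500, min_upvotes=5),
    "cstheory": Thresholds(min_age_days=730, min_views=200, min_upvotes=3),
}
DEFAULT_THRESHOLDS = Thresholds(min_age_days=365, min_views=100, min_upvotes=2)


def passes_rule_filter(post: Post) -> bool:
    """Return True if the post clears every threshold for its site."""
    t = SITE_THRESHOLDS.get(post.site, DEFAULT_THRESHOLDS)
    return (
        post.age_days >= t.min_age_days
        and post.views >= t.min_views
        and post.upvotes >= t.min_upvotes
        and post.rank_percentile <= t.max_rank_percentile
    )


def rule_filter(posts: List[Post]) -> List[Post]:
    """Keep only the posts that survive the rule-based stage."""
    return [p for p in posts if passes_rule_filter(p)]


if __name__ == "__main__":
    candidates = [
        Post("math", age_days=900, views=1200, upvotes=12, rank_percentile=0.04),
        Post("math", age_days=30, views=5000, upvotes=40, rank_percentile=0.01),
        Post("scifi", age_days=400, views=150, upvotes=3, rank_percentile=0.08),
    ]
    for post in rule_filter(candidates):
        print(post.site, post.age_days)
```

Keeping the thresholds in a per-site table matters because a view count that signals real interest on a small site would be noise on a large one.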
Validation without ground truth
Because answers aren’t known, UQ introduces validator pipelines that exploit a growing generator-validator gap: frontier models are better at judging candidate answers than at producing them. The pipeline stacks low-level checks (factual/logical consistency, QA cycle-consistency), mid-level judgment sampling (repeated/iterated reviews), and high-level aggregation (majority/unanimous vote, sequential verification). Since UQ itself has no answer key, the validators are tuned on Humanity’s Last Exam, which does have known answers, as a surrogate, and the tuned pipelines transfer to UQ’s dev set.
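The stacking described above can be sketched in a few lines. This is a toy under stated assumptions, not the paper's pipeline: a "judge" here is any callable that returns accept or reject, standing in for an LLM call, and the names `sample_judgments`, `majority`, `unanimous`, and `validate` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A judge takes (question, answer) and returns True to accept.
# In practice this would wrap a model call; here it is just a callable.
Judge = Callable[[str, str], bool]
Aggregator = Callable[[Sequence[bool]], bool]
Stage = Tuple[Judge, int, Aggregator]


def sample_judgments(judge: Judge, question: str, answer: str, n: int) -> List[bool]:
    """Mid-level step: ask the same judge n times and collect the votes."""
    return [judge(question, answer) for _ in range(n)]


def majority(votes: Sequence[bool]) -> bool:
    """Accept if strictly more than half of the votes accept."""
    return sum(votes) * 2 > len(votes)


def unanimous(votes: Sequence[bool]) -> bool:
    """Accept only if there is at least one vote and every vote accepts."""
    return len(votes) > 0 and all(votes)


def validate(question: str, answer: str, stages: Sequence[Stage]) -> bool:
    """Sequential verification: run the stages in order, stop at the first reject."""
    for judge, n_samples, aggregate in stages:
        votes = sample_judgments(judge, question, answer, n_samples)
        if not aggregate(votes):
            return False
    return True


if __name__ == "__main__":
    # Stub judges standing in for consistency and correctness checks.
    def consistency_check(question: str, answer: str) -> bool:
        return len(answer.strip()) > 0

    def strict_review(question: str, answer: str) -> bool:
        return "proof" in answer.lower()

    pipeline = [(consistency_check, 3, majority), (strict_review, 2, unanimous)]
    print(validate("Is X true?", "Yes, and here is a proof sketch.", pipeline))
    print(validate("Is X true?", "Probably.", pipeline))
```

Putting a lenient majority stage before a strict unanimous stage makes the pipeline conservative: an answer has to survive every stage, so false accepts get rarer at the cost of rejecting some correct answers.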
Early results: humbling
On the live platform (uq.stanford.edu), the best model so far passes validator screening on ~15% of questions; preliminary human review has already confirmed some of those as correct, underscoring that UQ can surface genuinely new solutions.
Why this matters
- Hard and real. UQ avoids contrived exam tricks and low-value FAQ-style prompts; progress here should generalize to messy, valuable queries.
- Scalable evaluation. Validator pipelines give conservative, human-helpful signals until experts weigh in, and they generalize across datasets.
- Open, ongoing. A community platform lets researchers submit questions, answers, and reviews, keeping the benchmark fresh as models improve.
If your model claims “reasoning,” UQ is a reality check: can it contribute to questions that no one has answered yet—and prove it without a key in the back of the book?
Paper link: arXiv 2508.17580 (PDF)