
1.9.25

UQ: a benchmark where solving the test actually advances knowledge

AI benchmarks keep getting “solved,” then patched to stay hard. Stanford’s new UQ (Unsolved Questions) flips the script: instead of rehashing problems with known answers, it evaluates models on 500 real unsolved questions pulled from 80 Stack Exchange sites—spanning CS theory, math, physics, bioacoustics, sci-fi, and history. The goal is difficulty and realism: if a model cracks one, it’s useful to humans, not just the leaderboard.

How they built it

The team filtered 3M+ unanswered posts with site-specific thresholds (age, views, upvotes, top-10% rank), used LLMs to screen for well-definedness, approachability, objectivity, and difficulty, then ran PhD-level human review to finalize the 500-item set (plus a “diamond” subset of 25). Each entry ships with full provenance.
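For intuition, here is a minimal sketch of the kind of metadata pre-filter this implies. The field names and thresholds are illustrative assumptions on my part; the paper tunes cutoffs per Stack Exchange site and follows this pass with LLM screening and human review.

```python
from datetime import datetime, timezone

# Illustrative thresholds only; UQ uses site-specific cutoffs, and this
# metadata pass is followed by LLM screening plus PhD-level human review.
def passes_metadata_filter(post: dict,
                           min_age_days: int = 365,
                           min_views: int = 1000,
                           min_score: int = 5,
                           min_score_percentile: float = 0.90) -> bool:
    """Keep old, visible, well-upvoted questions that are still unanswered."""
    if post["answer_count"] > 0 or post.get("accepted_answer_id") is not None:
        return False
    age_days = (datetime.now(timezone.utc) - post["creation_date"]).days
    return (age_days >= min_age_days
            and post["view_count"] >= min_views
            and post["score"] >= min_score
            and post["score_percentile"] >= min_score_percentile)
```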

Validation without ground truth

Because answers aren’t known, UQ introduces validator pipelines that exploit a growing generator–validator gap—frontier models are better at judging candidate answers than producing them. The pipeline stacks low-level checks (factual/logical consistency, QA cycle-consistency), mid-level judgment sampling (repeated/iterated reviews), and high-level aggregation (majority/unanimous vote, sequential verification). These validators are tuned on Humanity’s Last Exam as a surrogate and transfer to UQ’s dev set. 
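A minimal sketch of what such a stacked pipeline can look like, assuming a generic chat-completion client behind a placeholder `call_llm`; the prompts, sample count, and aggregation rule below are illustrative, not the paper's exact validators.

```python
def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to a frontier model and return its reply."""
    raise NotImplementedError

def low_level_checks(question: str, answer: str) -> bool:
    # Cheap screens: factual/logical consistency and whether the answer
    # actually addresses the question (a simple cycle-consistency check).
    verdict = call_llm(
        f"Question:\n{question}\n\nCandidate answer:\n{answer}\n\n"
        "Is the answer internally consistent and responsive to the question? "
        "Reply YES or NO."
    )
    return verdict.strip().upper().startswith("YES")

def sampled_judgments(question: str, answer: str, n: int = 5) -> list[bool]:
    # Mid-level: sample several independent reviews of the same answer.
    prompt = (
        f"Question:\n{question}\n\nCandidate answer:\n{answer}\n\n"
        "Could this plausibly be a correct, complete solution? Reply YES or NO."
    )
    return [call_llm(prompt).strip().upper().startswith("YES") for _ in range(n)]

def validate(question: str, answer: str, require_unanimous: bool = True) -> bool:
    # High-level aggregation: unanimous (conservative) or simple majority vote.
    if not low_level_checks(question, answer):
        return False
    votes = sampled_judgments(question, answer)
    return all(votes) if require_unanimous else sum(votes) > len(votes) / 2
```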

Early results: humbling

On the live platform (uq.stanford.edu), the best model so far passes validator screening on ~15% of questions; preliminary human review has already confirmed some of those as correct, underscoring that UQ can surface genuinely new solutions. 

Why this matters

  • Hard and real. UQ avoids contrived exam tricks and low-value FAQ-style prompts—progress here should generalize to messy, valuable queries. 

  • Scalable evaluation. Validator pipelines give conservative, human-helpful signals until experts weigh in, and they generalize across datasets. 

  • Open, ongoing. A community platform lets researchers submit questions, answers, and reviews, keeping the benchmark fresh as models improve. 

If your model claims “reasoning,” UQ is a reality check: can it contribute to questions that no one has answered yet—and prove it without a key in the back of the book?

Paper link: arXiv 2508.17580 (PDF)

16.5.25

Ultra-FineWeb: A Trillion-Token Dataset Elevating LLM Performance Across Benchmarks

In a groundbreaking development for artificial intelligence, researchers from Tsinghua University and ModelBest have unveiled Ultra-FineWeb, a massive, high-quality dataset designed to bolster the training of large language models (LLMs). Comprising approximately 1 trillion English tokens and 120 billion Chinese tokens, Ultra-FineWeb sets a new standard in dataset curation, emphasizing both scale and quality to enhance LLM performance across a spectrum of benchmarks.


Innovative Filtering Methodology

The creation of Ultra-FineWeb addresses two critical challenges in dataset preparation for LLMs: the need for efficient data verification and the selection of high-quality seed data for classifier training.

  1. Efficient Verification Strategy: To assess data quality quickly, the researchers implemented a verification approach that measures the impact of candidate data on LLM training at low computational cost. This gives fast feedback and lets the dataset be refined iteratively (a rough sketch of the idea follows this list).

  2. Optimized Seed Selection: Recognizing the subjectivity in manual seed selection, the team developed a method to systematically choose positive and negative samples. By integrating the verification strategy, they enhanced the robustness and quality of the classifier used for data filtering.
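The snippet below captures only the compare-two-recipes logic behind that verification idea; the training and evaluation callables are placeholders supplied by the caller, and nothing here comes from the paper's actual codebase.

```python
# Hedged sketch: judge a candidate data recipe by training the same small
# setup on it and on a baseline corpus, then comparing benchmark deltas.
# `train_fn` and `eval_fn` are caller-supplied placeholders, not paper code.
def verify_data_recipe(train_fn, eval_fn, candidate_corpus, baseline_corpus,
                       benchmarks=("mmlu", "arc_c", "cmmlu")):
    model_candidate = train_fn(candidate_corpus)
    model_baseline = train_fn(baseline_corpus)
    # Positive deltas suggest the candidate recipe helps on that benchmark.
    return {b: eval_fn(model_candidate, b) - eval_fn(model_baseline, b)
            for b in benchmarks}
```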

A lightweight classifier based on fastText was employed to efficiently filter the dataset. This choice significantly reduced inference costs while maintaining high filtering precision, ensuring that only the most relevant and high-quality data were included in Ultra-FineWeb.
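To make the classifier step concrete, here is a minimal fastText sketch. The seed-file format, label names, hyperparameters, and the 0.5 keep-threshold are illustrative assumptions, not the paper's settings.

```python
import fasttext  # pip install fasttext

# seed.train is assumed to contain one labeled example per line, e.g.
#   __label__hq <text of a high-quality seed document>
#   __label__lq <text of a low-quality seed document>
model = fasttext.train_supervised(
    input="seed.train", epoch=5, lr=0.5, wordNgrams=2, dim=100
)

def keep(document: str, threshold: float = 0.5) -> bool:
    """Keep a document only if the classifier rates it high quality with enough confidence."""
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold
```

The appeal of a fastText-style classifier here is that scoring is cheap enough to run over billions of documents, which is what makes trillion-token-scale filtering practical.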


Benchmark Performance

LLMs trained on Ultra-FineWeb demonstrated remarkable improvements across various benchmarks:

  • English Benchmarks: Models exhibited substantial gains in tasks such as MMLU, ARC-C, ARC-E, and OpenbookQA, with average score increases of over 3% compared to those trained on previous datasets like FineWeb and FineWeb-Edu.

  • Chinese Benchmarks: On evaluations like C-Eval and CMMLU, models trained with Ultra-FineWeb-zh outperformed counterparts, indicating enhanced comprehension and reasoning in Chinese language tasks.

These improvements underscore the dataset's effectiveness in enhancing LLM capabilities across multiple languages and domains.


Implications for AI Development

Ultra-FineWeb's introduction marks a significant advancement in the field of AI, particularly in the training of LLMs. By addressing key challenges in data verification and seed selection, and by employing efficient filtering techniques, the dataset provides a robust foundation for developing more accurate and versatile language models.

The methodologies applied in creating Ultra-FineWeb offer a blueprint for future dataset curation efforts, emphasizing the importance of quality and efficiency in data preparation.


Access and Availability

Ultra-FineWeb is available for the research community through Hugging Face, promoting transparency and collaboration in AI development. Researchers and developers are encouraged to utilize this resource to further advance the capabilities of LLMs.


Takeaway

Ultra-FineWeb represents a pivotal resource in the evolution of large language models, combining extensive scale with meticulous quality control. Its innovative filtering methodologies and demonstrable performance enhancements across benchmarks position it as an essential tool for researchers and developers aiming to push the boundaries of AI language understanding.
