11.9.25

The Majority Isn’t Always Right: AggLM Learns to Aggregate Better Than Voting

When logic is tricky, the most common answer isn’t always the correct one. A new Meta FAIR & CMU paper titled “The Majority is not always right: RL training for solution aggregation” challenges the standard practice of combining LLM outputs via voting or reward-scored selection. Their method—AggLM—trains a dedicated aggregator model to review, reconcile, and synthesize multiple LLM-generated candidate solutions via reinforcement learning from verifiable rewards (RLVR), yielding big gains over majority voting and reward-model baselines.


Solving it: learned reconciliation vs. counting

Standard aggregation in LLM reasoning often works like this: sample many candidate solutions, then pick the answer that's most frequent (majority voting) or highest scored by some reward model. While effective in many settings, these methods have a blind spot—when correct answers exist only among minority solutions. In contrast, AggLM treats aggregation itself as a reasoning task. It takes a set of candidate solutions, analyzes them, spots mistakes or partial correctness, then combines ideas or fills in missing steps to produce a final solution. Importantly, it is trained with verifiable rewards: the aggregator is rewarded only when its final answer matches the known correct solution.


Key ingredients & experiments

  • Dataset & training: Using Qwen3-1.7B as the solution generator, AggLM-1.7B is trained on ~446,000 examples drawn from a mixture of “easy” and “hard” sets. Hard sets are those where the majority answer among candidates is actually incorrect; the mix helps the model learn both to follow the majority and to rescue correctness from minority solutions. 

  • Aggregation via RLVR: The model uses Group-Relative Policy Optimization (GRPO), with a binary reward (1 for matching the ground truth, 0 otherwise). The aggregator is initialized from the Qwen3-1.7B model but is tuned via this RL signal (a minimal sketch of this loop follows the list).

  • Benchmarks: Evaluated on four math contest datasets: AIME24, AIME25, HMMT24, HMMT25. AggLM was tested aggregating candidate solutions from both the same generator model (Qwen3-1.7B) and stronger ones (Qwen3-8B), in both thinking and non-thinking modes. 
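
To make the “Aggregation via RLVR” bullet concrete, here is a minimal sketch of one training rollout under GRPO with a binary verifiable reward. The `aggregator.sample` and `verifier` interfaces are assumptions for illustration, not the paper’s code:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: each reward minus the group mean,
    normalized by the group standard deviation."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def aggregation_rollout(aggregator, verifier, question, candidates, group_size=8):
    """One RLVR rollout for a learned aggregator (illustrative sketch).

    The aggregator reads every candidate solution and writes a single final
    solution; the verifiable reward is 1 only if that answer is correct."""
    prompt = question + "\n\nCandidate solutions:\n" + "\n---\n".join(candidates)
    outputs = [aggregator.sample(prompt) for _ in range(group_size)]
    rewards = [1.0 if verifier(question, out) else 0.0 for out in outputs]
    return list(zip(outputs, grpo_advantages(rewards)))  # consumed by the policy update
```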


Results & token-efficiency

  • On solutions from Qwen3-1.7B in thinking mode, AggLM-1.7B lifts accuracy significantly beyond majority voting over the same candidate sets. More striking, when aggregating solutions from the stronger 8B model, AggLM still outperforms majority voting, weighted voting, and reward-model selection baselines. 

  • In non-thinking mode (i.e., when the candidate-generating model does not use chain-of-thought reasoning), AggLM retains its lead—showing that it generalizes beyond strong or specifically formatted inputs. 

  • Regarding cost, AggLM is more token efficient: instead of needing large numbers of candidate solutions (i.e. very large k) for majority voting to reach high accuracy, AggLM achieves similar or better accuracy with fewer candidate solutions, saving both inference time and compute. 


Implications & what’s next

AggLM shifts thinking in several ways:

  1. Aggregation as reasoning. Aggregation isn’t just picking among options—it’s an opportunity to correct, synthesize, and integrate partial truths. Models that can do that perform better, especially in instances where majority answers mislead.

  2. Balancing examples is key. Training on a mix of easy and hard cases was essential. If you train only on “easy” majority-correct groups, or only on “hard” ones, performance suffers. 

  3. Generalization beyond training generators. AggLM works well even when aggregating from stronger models than those used during training—implying aggregation skills are transferable, not just overfitted to particular output distributions. 

  4. Efficiency trade-off. Instead of scaling k (the number of candidate solutions) to very high values, a learned aggregator yields larger gains per additional candidate, so the same accuracy is reachable with a smaller token and time budget.


Bottom line: AggLM demonstrates that “the majority vote” should not be the default in reasoning aggregation. Models that are trained to look across candidate solutions—identify hidden truth, correct errors, and combine the best ideas—do better than simple heuristics. Especially in math and logic tasks where minority correct answers exist, learned aggregation via RL with verifiable reward is a strong lever. If you’re designing agents or reasoning pipelines, integrating an aggregator like AggLM can be a powerful performance boost with reasonable cost.

Paper link: arXiv 2509.06870 (PDF)

ParaThinker: parallel minds beat longer monologues

 LLMs have ridden test-time compute—“think longer” chains of thought—but returns taper as early tokens lock models into bad trajectories. Tsinghua’s ParaThinker calls this Tunnel Vision and proposes native thought parallelism: generate several independent reasoning paths simultaneously, then fuse them into one answer. 

Instead of external voting, ParaThinker trains the model itself to branch and merge: specialized control tokens (<think i>) trigger distinct trajectories, path-specific positional embeddings keep streams separate, and a two-phase attention mask enforces independence during thinking and controlled integration during summarization. The KV cache from the thinking stage is reused, avoiding re-prefill costs. 
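
A hedged sketch of the two-phase attention mask described above, leaving out the <think i> control tokens and path-specific positional embeddings; sizes and names are illustrative, not the released implementation:

```python
import torch

def parathinker_attention_mask(prompt_len, path_lens, summary_len):
    """Two-phase attention mask sketch. Thinking phase: each reasoning path may
    attend to the prompt and to its own tokens only, keeping paths independent.
    Summarization phase: summary tokens may attend to the prompt, every path,
    and earlier summary tokens."""
    total = prompt_len + sum(path_lens) + summary_len
    allowed = torch.zeros(total, total, dtype=torch.bool)
    causal = torch.ones(total, total).tril().bool()

    allowed[:prompt_len, :prompt_len] = True                       # prompt attends to itself
    start = prompt_len
    for length in path_lens:
        allowed[start:start + length, :prompt_len] = True          # path sees the prompt
        allowed[start:start + length, start:start + length] = True # path sees only itself
        start += length
    allowed[start:, :] = True                                      # summary sees everything prior

    return allowed & causal                                        # keep causal ordering throughout

# e.g. mask = parathinker_attention_mask(prompt_len=16, path_lens=[64, 64, 64], summary_len=32)
```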

On AIME-24/25, AMC-23 and MATH-500, ParaThinker with 8 parallel paths boosts accuracy by +12.3 pts (1.5B) and +7.5 pts (7B) over sequential baselines under the same token budget, and still beats majority voting by +4.3/+2.0 pts—with only ~7.1% latency overhead. Generating up to 16 paths costs <2× single-path latency, thanks to better arithmetic intensity on GPUs. 

The takeaway: scale width, not just depth. ParaThinker shows that orchestrating compute across diverse, parallel thoughts unlocks latent reasoning ability and makes smaller models out-punch larger sequential ones. Code is available on GitHub. 

Paper link: arXiv 2509.04475 (PDF)

10.9.25

TraceRL puts diffusion LLMs on the reasoning map

 Autoregressive (AR) giants have dominated reasoning benchmarks, while diffusion language models (DLMs) were seen as “fast samplers” with limited logic chops. A new paper from Princeton and UChicago argues that’s mostly a training-objective problem—and offers TraceRL, a trajectory-aware reinforcement learning framework that aligns what a DLM learns with how it actually samples. The team also releases code and ready-to-run models under the TraDo banner. 

What’s new

  • Trajectory-aware RL for DLMs. Instead of scoring randomly masked sequences, TraceRL optimizes against the model’s intermediate inference traces, matching the left-to-right / blockwise behavior used at decode time. A diffusion-based value model stabilizes training by reducing variance. Crucially, the method works for full-attention and block-attention DLMs (a schematic sketch follows this list). 

  • Open stack. The release includes a framework to build/train/deploy DLMs across architectures, with KV-cache acceleration, inference engines, SFT + RL recipes for math and code, and links to TraDo-4B/8B checkpoints. 
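
As flagged in the first bullet, here is a schematic of what a trajectory-aware policy-gradient loss could look like, assuming the decode-time unmasking order has been recorded and a per-step value baseline is available. Names, shapes, and the exact weighting are assumptions, not the TraceRL implementation:

```python
import torch

def trace_rl_loss(token_logps, trace, reward, step_baselines):
    """Schematic trajectory-aware policy-gradient loss (illustrative only).

    token_logps:    (T,) log-probs the DLM assigns to the finally committed tokens.
    trace:          list of LongTensors; trace[s] holds the positions unmasked at
                    decoding step s, in the blockwise order used at inference.
    reward:         scalar verifiable reward for the finished sample.
    step_baselines: (S,) per-step predictions from the diffusion value model,
                    used purely as a variance-reducing baseline here."""
    loss = torch.zeros(())
    for s, positions in enumerate(trace):
        advantage = reward - step_baselines[s].detach()
        loss = loss - advantage * token_logps[positions].sum()
    return loss
```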

The receipts

On headline benchmarks (dynamic vs. static sampling shown in the paper), the TraDo models post the strongest DLM numbers to date and overtake AR peers at similar scale on math:

  • TraDo-8B-Instruct: MATH500 78.5, AIME’24 13.3, LCB-V2 25.9—a +6.1% relative lift over Qwen2.5-7B-Instruct and +51.3% over Llama-3.1-8B-Instruct on math reasoning. 

  • TraDo-4B-Instruct: MATH500 75.6, AIME’24 10.3, LCB-V2 18.7, consistently edging 7B AR baselines on math. 

  • TraDo-8B-Thinking (long-CoT): first long chain-of-thought diffusion LLM, hitting MATH500 87.4, AIME’24 35.5, LCB-V2 34.6 with very long answers. 

The authors attribute gains to objective/trajectory alignment and show smoother curves with the value model vs. policy-only RL. They also document a speed/accuracy trade-off: dynamic sampling is faster; static top-1 decoding squeezes out extra points. 

Why it matters

  1. DLMs aren’t just “fast”—they can reason. With the right RL target, parallel generation stacks clear long-form math and coding hurdles previously ceded to AR.

  2. Unifies the zoo. One RL recipe spans full-attention and block-diffusion, and even helps enlarge block size for more flexible sampling.

  3. Practical path. The open framework + KV-cache tricks make DLM post-training and deployment feel product-ready, not just a lab exercise. 

Setup notes

Math RL uses 8k hard MATH tasks; coding RL uses 6k verified problems from PrimeIntellect. Long-CoT training mixes TraceRL with long-form SFT as a curriculum. 

Bottom line: TraceRL reframes diffusion LLMs as credible reasoners, not just fast generators—and TraDo-8B-Thinking plants the first long-CoT flag on the DLM side of the field. 

Paper link: arXiv 2509.06949 (PDF)

Language Self-Play: training an LLM without adding data actually works

 LLMs keep getting better by eating more data—until the data well runs dry. A new paper from Meta Superintelligence Labs proposes Language Self-Play (LSP): turn training into a game where a single model plays both sides—a Challenger that generates tougher prompts and a Solver that answers them—so the system improves without ingesting new datasets. In tests on AlpacaEval using Llama-3.2-3B-Instruct, LSP matches a strong data-driven RL baseline and even pushes beyond it when used as a follow-on stage. 

How it works: one model, two roles

LSP frames training as a minimax game: Challenger tries to minimize reward by making hard queries; Solver tries to maximize reward by answering them. Crucially, both roles are instantiated by the same LLM via a role-selecting prompt (e.g., a special challenger prompt), avoiding the instability and memory overhead of training an external adversary. KL regularization keeps the Challenger from devolving into nonsense prompts. 

Under the hood, LSP borrows group-relative baselines from GRPO: Challenger generates N queries, Solver samples G answers per query, and the average reward defines both a per-answer advantage (for Solver) and a “difficulty” signal (for Challenger). A practical variant, LSP-Zero, runs as a pure zero-sum game; the full LSP adds a quality self-reward scored by a reference model to prevent reward-hacking (e.g., answering everything in Python). 
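
A minimal sketch of the group-relative bookkeeping described above, assuming a precomputed (N, G) matrix of reward-model scores; the KL term and the quality self-reward of full LSP are omitted:

```python
import numpy as np

def lsp_signals(reward_matrix):
    """Group-relative signals for Language Self-Play (illustrative sketch).

    reward_matrix: (N, G) array of reward-model scores for G Solver samples on
    each of the N queries the Challenger generated this round."""
    R = np.asarray(reward_matrix, dtype=float)
    query_mean = R.mean(axis=1, keepdims=True)   # average reward per query
    solver_advantage = R - query_mean            # Solver: do better than the group
    challenger_signal = -query_mean.squeeze(1)   # Challenger: low mean reward = hard query
    return solver_advantage, challenger_signal
```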

Results: data-free ≈ data-driven—and sometimes better

Using GPT-4o as judge on AlpacaEval, the team compares models trained from the same base:

  • From base (no data): Overall win rates vs. the base model—GRPO (with data) 40.9%, LSP-Zero 40.1%, LSP 40.6%. Translation: self-play without any RL data keeps pace with standard RL. 

  • From RL (as a next stage): Starting from the GRPO model and continuing with self-play, LSP lifts overall win rate to 43.1%, with large gains on Vicuna-style conversational tasks (28.7% → 46.3%). 

The setup uses Skywork-Reward-V2-Llama-3.2-3B as the reward model; the authors note that LSP (with the added quality reward) avoids the degradation seen with LSP-Zero in some splits, and acknowledge dips on “chatbot-y” Koala prompts—likely because Challenger skews toward structured, orderly instructions. 

Why this matters

  • Data bottleneck relief. If you can translate “more practice data” into a self-generated curriculum, you can keep improving without chasing new corpora. 

  • A clean follow-on stage. Even after data-based RL, self-play adds headroom—useful when further high-quality preference data is scarce. 

  • Single-model simplicity. One backbone serves both roles, avoiding adversary models and the instability they bring. 

Caveats and open questions

Self-play can degenerate without the quality self-reward; reward choice caps the ceiling (a weak reward model means weak training signal); and Challenger diversity remains an open knob to broaden beyond the structured style seen in examples. Still, the authors argue the method should work even better on tasks with verifiable rewards (e.g., code tests), not just preferences. 

If your roadmap hits a data wall, Language Self-Play is a compelling new leg in the post-training pipeline: spin up a Challenger inside your own model, let it stress-test itself, and learn—no fresh dataset required.

Paper link: arXiv 2509.07414 (PDF)

An AI that writes expert-level scientific software—and often beats the leaderboard

 A large Google team is pushing past “chatty copilot” and into AI that authors working scientific code. Their system pairs a large language model with tree search to iteratively write, run, and score programs for scorable research problems—then learns to recombine ideas from papers and prior algorithms. In benchmarks, it discovered 40 new single-cell RNA-seq methods that outperformed the top human-made entries on OpenProblems, and produced 14 COVID-19 hospitalization forecasters that beat the CDC’s ensemble and every individual competitor during the study window. 

How it works. Researchers frame a scientific task as “maximize a quality metric,” let the LLM generate code variants, and use tree search to expand promising branches while pruning the rest. The agent can ingest research ideas from literature (summarized with Gemini 2.5 Pro) and also tries automatic recombinations of methods, plus proposals from Gemini Deep Research and AI co-scientist tools. In head-to-head tests on nine published algorithms, the system’s implementations beat eight of nine baselines; its best run—BBKNN(TS)—improved the bioinformatics leaderboard by 14% over the long-standing ComBat approach. 
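A hedged sketch of the core loop—the LLM proposes code variants, tree search expands the promising ones—where `propose_variants` and `run_and_score` are hypothetical stand-ins for the paper’s components; low-scoring branches are pruned implicitly by never being expanded within the budget:

```python
import heapq

def program_tree_search(propose_variants, run_and_score, seed_code, budget=500, branching=4):
    """Best-first tree search over candidate programs (illustrative sketch).

    propose_variants(code, k): asks the LLM for k rewrites of a program.
    run_and_score(code): executes the program and returns the task's quality
    metric (higher is better). Both are assumed interfaces."""
    best_code, best_score = seed_code, run_and_score(seed_code)
    counter = 0                                   # tie-breaker for the heap
    frontier = [(-best_score, counter, seed_code)]
    for _ in range(budget):
        if not frontier:
            break
        _, _, code = heapq.heappop(frontier)      # expand the most promising node
        for child in propose_variants(code, branching):
            score = run_and_score(child)
            if score > best_score:
                best_code, best_score = child, score
            counter += 1
            heapq.heappush(frontier, (-score, counter, child))
    return best_code, best_score
```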

Bioinformatics at scale. The team evaluates on OpenProblems v2.0.0, spanning 1,747,937 cells and 13 metrics across six datasets. Beyond re-implementing published methods, recombination mattered: among 55 pairwise hybrids, 24 outperformed both parents and most of the rest beat at least one parent—evidence that the search can synthesize competitive, novel ideas rather than just tune hyperparameters. 

Public-health forecasting. For U.S. COVID-19 hospitalization forecasting (the CDC’s Forecast Hub), the system generated models that were consistently lower-error (better WIS) than the official ensemble in most jurisdictions; in an aggregate comparison, 14 strategies (10 recombinations, plus two Deep Research, one AI co-scientist, and one replicated baseline) surpassed the ensemble across the three-week hold-out period. 

Not just biology. The abstract lists additional wins in geospatial image segmentation, zebrafish neural activity prediction, general time-series, and numerical integration, arguing the approach generalizes to diverse “empirical software” problems where code can be scored automatically. 

Engineering notes—and guardrails. To avoid overfitting, bio experiments hill-climb on a separate CELLxGENE dataset and report on the held-out OpenProblems benchmark; metrics that fail to compute are clamped to worst-case—making robustness part of the score. The team also ran multiple replicates to show stability, and reports practical budgets: ≈500 nodes (~7 hours) per scRNA-seq search and ≈2000 nodes per COVID run on their infra. 

Why it matters. Rather than waiting for domain-specific code to be hand-crafted over months, this “AI co-scientist” produces working software, tests it against public leaderboards, and composes new hybrids from the literature. If those patterns hold beyond the reported tasks, the future of scientific computing looks less like prompt engineering—and more like searching the space of programs.

Paper link: arXiv 2509.06503 (PDF)

Embedding retrievers hit a math wall—and DeepMind just mapped it

Vector embeddings power everything from RAG to enterprise search. But a new DeepMind paper argues there’s a theoretical ceiling baked into single-vector retrieval: for any embedding dimension d, there exist query-document relevance patterns that no embedding model can represent—no matter the data or training tricks. The authors connect learning-theory and geometric results to IR and then build a deliberately simple dataset, LIMIT, where leading embedders struggle. 

The core result, in plain English

Treat each query’s relevant docs as a row in a binary matrix (“qrels”). The paper introduces row-wise thresholdable rank and lower-bounds it via sign-rank to show a fundamental limit: once the number of documents n crosses a critical threshold for a given d, there exist top-k sets that cannot be realized by any single-vector embedding retriever. That’s a property of geometry, not optimization. 

LIMIT: a toy task that breaks real systems

To make the math bite, the team instantiates LIMIT with natural-language facts (“Jon Durben likes quokkas and apples…”) that encode all combinations of relevance over a small doc pool. Despite its simplicity, SoTA MTEB models score <20 recall@100, while classic BM25 is near-perfect—underscoring that the failure is specific to single-vector embedding retrieval. 
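
The construction is simple enough to sketch in a few lines; the snippet below builds a toy LIMIT-style instance in the spirit of the example above (the real datasets are the released ones on Hugging Face, and all names here are made up):

```python
from itertools import combinations

def build_limit_style_instance(people, attributes, relevant_per_query=2):
    """Tiny LIMIT-style construction (illustrative sketch). Each attribute is
    assigned to a distinct subset of people, so the queries' relevant sets
    sweep over combinations of the small doc pool."""
    liked = {person: [] for person in people}
    qrels = {}
    subsets = combinations(people, relevant_per_query)
    for attribute, subset in zip(attributes, subsets):
        for person in subset:
            liked[person].append(attribute)
        qrels[f"Who likes {attribute}?"] = set(subset)
    docs = {person: f"{person} likes " + " and ".join(attrs or ["nothing"]) + "."
            for person, attrs in liked.items()}
    return list(qrels), docs, qrels

# e.g. queries, docs, qrels = build_limit_style_instance(
#     ["Jon Durben", "Ada Li", "Maya Ortiz"], ["quokkas", "apples", "kayaking"])
```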

In a “small” LIMIT (N≈46) sweep, ramping dimensions up to 4096 lifts recall but still doesn’t solve the task; BM25 cruises to 100% at @10/@20. Fine-tuning on in-domain LIMIT data barely helps, indicating intrinsic hardness, not domain shift. 

How this differs from usual benchmark talk

LIMIT’s structure—dense overlap of query relevances—looks nothing like BEIR or typical web QA. Compared across datasets, LIMIT shows far higher “graph density” and query-similarity strength than NQ, HotpotQA, or SciFact, approximating instruction-following IR where prompts combine unrelated items with logical operators. 

Numbers that sting

A table of critical document counts shows how quickly trouble arrives as d grows (e.g., d = 4 ⇒ n ≈ 10; d = 16 ⇒ n ≈ 79; d = 32 ⇒ n ≈ 296). Put differently: long before you reach enterprise-scale corpora, some seemingly trivial “return docs X and Y, not Z” requests fall outside what an embedder can express. 

What to do about it (and what not to)

  • Don’t only crank up dimension. Bigger d delays but doesn’t remove the wall. 

  • Consider alternative architectures. Multi-vector approaches (e.g., ColBERT-style), sparse methods, or hybrid stacks escape parts of the limit that bind single-vector embedders. The paper’s head-to-heads hint why BM25 and multi-vector models fare better. 

  • Test against LIMIT-style stressors. The team released datasets on Hugging Face and code on GitHub to reproduce results and probe your own models. 

Why this matters for RAG and instruction-following IR

Modern agents increasingly ask retrieval systems to honor combinational and logical constraints (“find papers that mention A and B but not C”). The paper shows there’s a mathematical point where single-vector embedders must fail such patterns—explaining why teams often paper over issues with rerankers and handcrafted filters. As instruction-following IR grows, expect more LIMIT-like cases in the wild. 

Bottom line: embedding-only retrieval won’t scale to every notion of relevance. If your roadmap leans on expressive, compositional queries, plan for hybrid retrieval and reranking—and add LIMIT to your eval suite.

Paper link: arXiv 2508.21038 (PDF)

9.9.25

UDR turns “deep research” into a programmable product feature

 Most “deep research” agents hard-code their plan and lock you into one LLM. Universal Deep Research (UDR) proposes a different deal: you supply the model and the method. UDR wraps around any LLM and lets users create, edit, and refine fully custom research strategies—no extra training required. Think of it as a general-purpose chassis for web-scale and enterprise research that you can rewire on the fly. 

Why this matters

Today’s tools (Gemini, Perplexity, OpenAI/Grok deep research, and enterprise stacks like NVIDIA AI-Q, SambaNova, ERP-AI) ship opinionated pipelines that work—but are hard to reshape, mix, or upgrade with a different backbone. UDR targets three pain points: (P1) limited control over sources/costs, (P2) no way to encode specialized industry workflows, and (P3) inability to swap in the newest model independently of the agent.


How UDR works (in plain English)

1) Strategy → code.
You write a numbered strategy in natural language. UDR compiles it into a single callable function that emits structured progress updates via yield and constrains tool use to what you allow. The paper found “one-shot, end-to-end” code generation—annotated step-by-step—was far more reliable than fragmentary orchestration. 

2) Isolated execution with small contexts.
Instead of stuffing a giant context window, UDR stores interim artifacts as named variables in the execution state. In experiments, 8k tokens was enough for full workflows, because the controller code (CPU-side) keeps state while the LLM is invoked only for local tasks (summarize, rank, extract). Tools are synchronous function calls for deterministic behavior. 

3) Transparent progress + auditable outputs.
Your strategy defines notifications (type, timestamp, description) that stream to the UI during the run, and a final “research report” built from the accumulated state—with citations and formatting under your control. 

4) Safety by design.
Because UDR executes generated code, it’s meant to run inside a sandbox (e.g., Piston) so strategies can’t touch the host system—mandatory for anything beyond a trusted demo. 
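
Putting points 1–3 together, a UDR-compiled strategy might look roughly like the generator below. `tools`, `llm`, and the notification fields are assumptions for illustration; UDR emits the actual code from your numbered strategy and executes it inside the sandbox:

```python
import time

def compiled_strategy(topic, tools, llm):
    """Illustrative sketch of a compiled research strategy. `tools` exposes only
    the calls the strategy is allowed to make; `llm` is invoked on small,
    focused snippets rather than one giant context."""
    def note(kind, text):
        return {"type": kind, "timestamp": time.time(), "description": text}

    yield note("progress", f"Step 1: searching sources for '{topic}'")
    hits = tools.search(topic, max_results=10)

    yield note("progress", "Step 2: summarizing each source")
    summaries = [llm(f"Summarize for a research brief:\n{hit.text}") for hit in hits]

    yield note("progress", "Step 3: ranking summaries by relevance")
    ranking = llm(f"Rank these summaries by relevance to '{topic}', best first:\n\n"
                  + "\n\n".join(summaries))

    # Interim artifacts (hits, summaries, ranking) live in ordinary local
    # variables, so no single LLM call needs the whole accumulated context.
    yield note("report", llm(f"Write a cited report on '{topic}' using:\n{ranking}"))
```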


What you can build with it

The authors ship minimal, expansive, and intensive example strategies plus a simple UI: search bar for prompts, a strategy picker, and an editor to tweak steps—handy for teams iterating on domain-specific research recipes (finance, legal, healthcare). 


The headline advantages

  • BYO model, BYO strategy. Pair the strongest available LLM with your best research recipe—no re-training loops. 

  • Latency & cost discipline. Orchestration runs as code on CPU; the LLM is called sparingly on focused snippets, reducing GPU churn and token spend. 

  • Deterministic tool use. Explicit, synchronous calls and stateful variables curb flaky agent behaviors like skipping steps or re-scraping needlessly. 


Big picture

Deep research tools are already popular, but strategy rigidity and model lock-in limit how far they go inside enterprises. UDR reframes the agent as a compiler/runtime: you specify the plan, the system turns it into constrained code, and any LLM can power the reasoning. For builders eyeing compliance-friendly, auditable research automation, that’s a compelling foundation. 

Paper link: arXiv 2509.00244 (PDF)

Why language models hallucinate: blame the objectives—and the leaderboards

 In a 36-page paper dated September 4, 2025, researchers from OpenAI and Georgia Tech argue that large language models don’t hallucinate because of some exotic neural quirk. They hallucinate because our training and evaluation setups make guessing the rational strategy. In their framing, hallucinations are just ordinary classification errors emerging from standard statistical pressures—then locked in because most benchmarks award points for confident attempts and zero for “I don’t know.” 

The core claim

  • Pretraining → inevitable errors. Even with error-free corpora, the objectives used in pretraining push models to produce some invalid outputs. The authors reduce the problem to a binary task they call Is-It-Valid (IIV) and show a direct link: a model’s generative error rate is at least twice its IIV misclassification rate. Translation: some hallucination is baked in by statistics alone (the bound is restated in symbols after this list). 

  • Post-training → incentives to bluff. After pretraining, models are tuned and graded on binary 0–1 metrics (accuracy, pass rate). Under that regime, a model that always guesses beats an otherwise identical model that abstains when unsure—creating an “epidemic of penalizing uncertainty.” 
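
In symbols, the pretraining claim in the first bullet reads as follows (notation mine; see the paper for the formal statement and its conditions):

```latex
\mathrm{err}_{\text{generation}} \;\ge\; 2\,\mathrm{err}_{\text{IIV}}
```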

Concrete failure cases drive the point home (e.g., counting letters in “DEEPSEEK” where models confidently answer 2, 3, even 7) and “fact with no pattern” queries like birthdays, where statistics predict frequent errors. 

What’s actually new here

  • A learning-theory reduction that treats hallucination as classic error in a supervised problem (IIV), not a Transformer-specific oddity. It subsumes earlier Good–Turing–style arguments about rare facts and strengthens them to include prompts and IDK behavior. 

  • A meta-evaluation of popular leaderboards showing that binary grading dominates—so even perfect hallucination tests won’t change incentives if the primary scores still punish abstention. The paper formalizes why abstaining is never optimal under a broad class of binary graders (Observation 1). 

The proposed fix: change the rules of the game

Rather than invent yet another hallucination benchmark, the authors want mainstream evaluations to adopt explicit confidence targets in the prompt and penalties for wrong answers (e.g., “answer only if > t confident; mistakes cost t/(1−t) points; IDK scores 0”). That nudges models toward behavioral calibration—saying IDK below the threshold—and makes abstention rational across many tasks (including code suites like SWE-bench). 
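
The incentive flip is easy to check with a few lines of arithmetic. Under the quoted rule, answering has positive expected value only when the model's confidence exceeds t (a minimal sketch, with the scoring taken directly from the quote above):

```python
def expected_score(p_correct, t):
    """Expected score for answering vs. abstaining under the proposed rule:
    +1 for a correct answer, -t/(1-t) for a wrong one, 0 for "I don't know".
    Answering beats abstaining exactly when p_correct > t."""
    answer = p_correct * 1.0 + (1.0 - p_correct) * (-t / (1.0 - t))
    abstain = 0.0
    return answer, abstain

# With t = 0.75, a 60%-confident guess has negative expected value,
# so a behaviorally calibrated model should answer "I don't know".
print(expected_score(0.60, 0.75))   # ≈ (-0.6, 0.0)
```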

A summary table in the paper highlights how today’s staples (GPQA, MMLU-Pro, BBH, MATH, SWE-bench, HLE) are binary and offer no credit for IDK, reinforcing bluffing. 

Why this matters for builders and benchmarkers

  • Trust over test-taking. If your evals reward confident guesses, your models will optimize for bluffing. Changing scoring alters the gradient that RLHF/DPO and selection heuristics actually follow. 

  • A portable recipe. The framework applies to base LMs, RAG systems, and “o1-style” reasoners alike; binary grading still incentivizes guessing when search or tools come up empty. 

  • Measurable behavior. With stated thresholds, you can audit “behavioral calibration” (accuracy vs. error across different t) instead of relying on brittle probability calibration. 

Bottom line: hallucinations aren’t just a modeling bug; they’re a measurement bug. If the industry wants less confident nonsense and more honest “IDK,” it has to stop grading like a multiple-choice exam.

Paper link: Why Language Models Hallucinate

2.9.25

Memento: teach agents to learn on the fly—no LLM fine-tune required

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep the LLM frozen and adapt the agent with a memory that learns from every episode. The team formalizes this as a Memory-augmented MDP and shows it can lift real-world “deep research” performance—without gradient updates to the underlying model. 

The recipe in one diagram

Memento is a planner–executor architecture wired to a growing Case Bank of episodic traces (state, action, reward). At each step, the planner retrieves similar past cases to guide the next action; after acting, the trajectory (success or failure) is written back—so the memory rewrites itself with environmental feedback. Retrieval can be non-parametric (Top-K by similarity) or parametric via a lightweight Q(s, c) scorer trained online to prefer high-utility cases. Tools are accessed through an MCP-style interface so the executor can browse, run code, or call APIs inside the same loop. 
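
A minimal sketch of the non-parametric variant—store episodic cases, retrieve the Top-K most similar, write outcomes back—where the class name and fields are illustrative rather than Memento's API:

```python
import numpy as np

class CaseBank:
    """Non-parametric episodic memory sketch: cosine Top-K retrieval over past
    cases, with outcomes written back after each episode. A parametric variant
    would rescore the retrieved cases with a learned Q(s, c)."""

    def __init__(self):
        self.embeddings, self.cases = [], []

    def write(self, state_emb, action, reward, outcome):
        """Store the trace of an episode—successes and failures alike."""
        self.embeddings.append(np.asarray(state_emb, dtype=float))
        self.cases.append({"action": action, "reward": reward, "outcome": outcome})

    def retrieve(self, state_emb, k=4):
        """Return the k most similar past cases to guide the planner."""
        if not self.cases:
            return []
        q = np.asarray(state_emb, dtype=float)
        E = np.stack(self.embeddings)
        sims = E @ q / (np.linalg.norm(E, axis=1) * np.linalg.norm(q) + 1e-8)
        top = np.argsort(-sims)[:k]
        return [self.cases[i] for i in top]
```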

Why this beats “prompt more” and “train more”

Unlike static RAG or handcrafted reflections, case-based reasoning (CBR) selectively reuses successful and failed traces; unlike RL-fine-tuning, it avoids catastrophic forgetting and heavy compute. In ablations, adding CBR memory yields +4.7 to +9.6 absolute points on out-of-distribution QA sets (MuSiQue, Bamboogle, PopQA). 

The receipts

  • GAIA (long-horizon tool use): Top-1 on validation (87.88% Pass@3) and 79.40% on the private test leaderboard. 

  • DeepResearcher (live web research): 66.6 F1 / 80.4 PM, outperforming training-based systems under the paper’s setup. 

  • SimpleQA (single-hop factual): 95.0 PM, the highest among reported baselines. 

  • Humanity’s Last Exam (HLE): 24.4 PM, second overall and within 0.92 of GPT-5 in the authors’ evaluation. 

What this means for builders

  • Ship updates without re-training. Treat memory as the learning substrate; leave your production LLM untouched. 

  • Choose your memory: start with non-parametric retrieval; add the parametric Q-head when you need sharper case selection. 

  • Tooling that scales. MCP-based execution keeps multi-tool orchestration inside one protocol, making traces coherent and reusable. 

The upshot: Memento reframes “agent improvement” as memory engineering. If your research agent gets better the more it works—without touching base weights—you’ve got a path to continual learning that’s practical outside the lab.

Paper link: arXiv 2508.16153 (PDF)
