Showing posts with label test-time scaling.

11.9.25

The Majority Isn’t Always Right: AggLM Learns to Aggregate Better Than Voting

When logic is tricky, the most common answer isn’t always the correct one. A new Meta FAIR & CMU paper titled “The Majority is not always right: RL training for solution aggregation” challenges the standard practice of combining LLM outputs via voting or reward-scored selection. Their method, AggLM, trains a dedicated aggregator model to review, correct, and synthesize multiple LLM-generated candidate solutions via reinforcement learning from verifiable rewards (RLVR), yielding big gains over majority-voting and reward-model baselines.


Solving it: learned reconciliation vs. counting

Standard aggregation in LLM reasoning often works like this: sample many candidate solutions, then pick the answer that is most frequent (majority voting) or scored highest by some reward model. While effective in many settings, these methods have a blind spot: cases where the correct answer exists only among minority solutions. In contrast, AggLM treats aggregation itself as a reasoning task. It takes a set of candidate solutions, analyzes them, spots mistakes or partial correctness, then combines ideas or fills in missing steps to produce a final solution. Importantly, it is trained with verifiable rewards: the aggregator is rewarded only when its final output matches the known correct answer.
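To make that setup concrete, here is a minimal sketch of an aggregation prompt and its verifiable reward, assuming final answers are marked with an "Answer:" line; the prompt wording and the `extract_final_answer` helper are illustrative stand-ins, not the paper's exact implementation.

```python
# Minimal sketch of aggregation-as-a-task with a verifiable reward.
# The prompt template and the answer-extraction helper are illustrative only.
import re


def build_aggregation_prompt(question: str, candidates: list[str]) -> str:
    """Pack k candidate solutions into a single prompt for the aggregator."""
    blocks = "\n\n".join(
        f"Candidate {i + 1}:\n{sol}" for i, sol in enumerate(candidates)
    )
    return (
        f"Problem:\n{question}\n\n{blocks}\n\n"
        "Review the candidates, correct any mistakes, and write one final "
        "solution ending with 'Answer: <value>'."
    )


def extract_final_answer(solution: str) -> str | None:
    """Pull the value after the last 'Answer:' marker, if present."""
    matches = re.findall(r"Answer:\s*(.+)", solution)
    return matches[-1].strip() if matches else None


def verifiable_reward(aggregated_solution: str, ground_truth: str) -> float:
    """Binary reward: 1.0 iff the aggregated answer matches the known answer."""
    answer = extract_final_answer(aggregated_solution)
    return 1.0 if answer == ground_truth.strip() else 0.0
```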


Key ingredients & experiments

  • Dataset & training: Using Qwen3-1.7B as the solution generator, AggLM-1.7B is trained on ~446,000 examples drawn from a mixture of “easy” and “hard” sets. Hard sets are those where the majority answer among candidates is actually incorrect; the mix helps the model learn both to follow the majority and to rescue correctness from minority solutions (a sketch of this split follows the list).

  • Aggregation via RLVR: The model is trained with Group-Relative Policy Optimization (GRPO) and a binary reward (1 for matching the ground truth, 0 otherwise). The aggregator is initialized from the Qwen3-1.7B model and tuned via this RL signal (a sketch of the group-relative advantage computation follows the list).

  • Benchmarks: Evaluated on four math contest datasets: AIME24, AIME25, HMMT24, HMMT25. AggLM was tested aggregating candidate solutions from both the same generator model (Qwen3-1.7B) and stronger ones (Qwen3-8B), in both thinking and non-thinking modes. 
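The easy/hard split from the dataset bullet above can be approximated by checking whether the plurality answer in each candidate group matches the ground truth. The sketch below assumes candidate answers have already been extracted as strings, and the group dictionary format is hypothetical.

```python
# Sketch of splitting candidate groups into "easy" (majority answer correct)
# and "hard" (majority answer incorrect) training examples.
from collections import Counter


def majority_answer(answers: list[str]) -> str:
    """Most common extracted answer in the group (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]


def split_easy_hard(groups: list[dict]) -> tuple[list[dict], list[dict]]:
    """Each group is a dict like {'answers': [...], 'ground_truth': '...'} (hypothetical format)."""
    easy, hard = [], []
    for group in groups:
        if majority_answer(group["answers"]) == group["ground_truth"]:
            easy.append(group)
        else:
            hard.append(group)
    return easy, hard


# The majority answer here is wrong, so the group lands in the "hard" set.
easy, hard = split_easy_hard([{"answers": ["12", "12", "7"], "ground_truth": "7"}])
print(len(easy), len(hard))  # 0 1
```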
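For the RLVR bullet, here is a minimal sketch of the group-relative advantage GRPO computes from the binary verifiable reward: each sampled aggregation in a group gets its reward minus the group mean, normalized by the group's standard deviation. The epsilon and exact normalization details are assumptions, not taken from the paper.

```python
# Sketch of GRPO-style group-relative advantages over one group of sampled
# aggregations, each scored with the binary verifiable reward (1.0 or 0.0).
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """advantage_i = (r_i - mean(r)) / (std(r) + eps), computed within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled aggregations for one problem; only the first matched the ground truth.
print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
# The correct sample gets a positive advantage, the rest get negative ones.
```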


Results & token-efficiency

  • On solutions from Qwen3-1.7B in thinking mode, AggLM-1.7B lifts accuracy significantly over majority voting with the same number of candidates, including on AIME25. More striking, when aggregating solutions from the stronger Qwen3-8B model, AggLM still outperforms majority voting, weighted voting, and reward-model selection baselines.

  • In non-thinking mode (i.e. when the candidate-generating model does not use chain-of-thought reasoning), AggLM retains its lead, showing that it generalizes beyond strong or specifically formatted inputs.

  • Regarding cost, AggLM is more token efficient: instead of needing large numbers of candidate solutions (i.e. very large k) for majority voting to reach high accuracy, AggLM achieves similar or better accuracy with fewer candidate solutions, saving both inference time and compute. 
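As a rough illustration of that token accounting, the sketch below compares a large-k majority-voting budget with a smaller-k aggregation budget; all token counts are made-up placeholders, not measurements from the paper.

```python
# Rough token accounting: majority voting over many candidates vs. a learned
# aggregator reading fewer candidates plus one aggregation pass.
# All token counts below are made-up placeholders.

def voting_cost(k: int, tokens_per_solution: int) -> int:
    """Majority voting only pays for generating the k candidate solutions."""
    return k * tokens_per_solution


def aggregation_cost(k: int, tokens_per_solution: int, tokens_per_aggregation: int) -> int:
    """Learned aggregation pays for k candidates plus one reconciled solution."""
    return k * tokens_per_solution + tokens_per_aggregation


# Example: 8 candidates + one aggregation pass vs. 32 candidates for voting.
print(aggregation_cost(8, 2_000, 4_000))  # 20000
print(voting_cost(32, 2_000))             # 64000
```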


Implications & what’s next

AggLM shifts thinking in several ways:

  1. Aggregation as reasoning. Aggregation isn’t just picking among options—it’s an opportunity to correct, synthesize, and integrate partial truths. Models that can do that perform better, especially in instances where majority answers mislead.

  2. Balancing examples is key. Training on a mix of easy and hard cases was essential. If you train only on “easy” majority-correct groups, or only on “hard” ones, performance suffers. 

  3. Generalization beyond training generators. AggLM works well even when aggregating from stronger models than those used during training—implying aggregation skills are transferable, not just overfitted to particular output distributions. 

  4. Efficiency trade-off. Instead of scaling k (the number of candidate solutions) to very high values, a learned aggregator yields larger gains per additional candidate, raising the accuracy achievable within a fixed token/time budget.


Bottom line: AggLM demonstrates that “the majority vote” should not be the default in reasoning aggregation. Models that are trained to look across candidate solutions—identify hidden truth, correct errors, and combine the best ideas—do better than simple heuristics. Especially in math and logic tasks where minority correct answers exist, learned aggregation via RL with verifiable reward is a strong lever. If you’re designing agents or reasoning pipelines, integrating an aggregator like AggLM can be a powerful performance boost with reasonable cost.

Paper link: arXiv 2509.06870 (PDF)

1.9.25

MIRAGE: parallel GraphRAG turns test-time scaling into a team sport

 Most test-time scaling schemes still walk a single, linear chain of thought—great until an early mistake snowballs. MIRAGE (Multi-chain Inference with Retrieval-Augmented Graph Exploration) swaps that for many chains in parallel, each grounded in a medical knowledge graph and then cross-checked before answering. Think of it as ToT’s breadth, Search-o1’s retrieval, and GraphRAG’s structure—rolled into one pipeline. 

How it works (and why it’s different)

  • Entity-grounded decomposition. The system splits a clinical question into sub-questions tied to concrete entities (symptoms, diseases, treatments). Each sub-question spawns its own reasoning chain.

  • Graph-based retrieval, two modes.

    • Anchor mode: query the KG around a single entity (local neighborhood).

    • Bridge mode: search paths between entity pairs to surface multi-hop relations (a minimal sketch of both modes follows this list).

  • Adaptive evidence streaming. Chains iteratively expand neighbors/multi-hop trails, keeping only deduplicated, directionally relevant facts. 

  • Cross-chain verification. An answer synthesizer reconciles sub-answers, prefers explanations backed by broader, independent chains, and normalizes clinical terms—cutting contradictions and hallucinations. Outputs are serialized with full provenance traces for audit. 
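For intuition, here is a minimal sketch of the two retrieval modes using networkx as a stand-in for the medical knowledge graph; the entities, relations, and hop limits are toy assumptions, not taken from MIRAGE's actual KGs.

```python
# Sketch of the two retrieval modes over a toy knowledge graph, with networkx
# standing in for the medical KG. Entities and relations are illustrative only.
import networkx as nx

kg = nx.Graph()
kg.add_edge("fever", "influenza", relation="symptom_of")
kg.add_edge("cough", "influenza", relation="symptom_of")
kg.add_edge("influenza", "oseltamivir", relation="treated_by")


def anchor_retrieve(graph: nx.Graph, entity: str, radius: int = 1) -> list[tuple]:
    """Anchor mode: collect the local neighborhood around a single entity."""
    neighborhood = nx.ego_graph(graph, entity, radius=radius)
    return list(neighborhood.edges(data="relation"))


def bridge_retrieve(graph: nx.Graph, source: str, target: str, max_hops: int = 3) -> list[list[str]]:
    """Bridge mode: enumerate multi-hop paths between a pair of entities."""
    return list(nx.all_simple_paths(graph, source, target, cutoff=max_hops))


print(anchor_retrieve(kg, "influenza"))             # edges around one entity
print(bridge_retrieve(kg, "fever", "oseltamivir"))  # multi-hop path between a pair
```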

Benchmarks: consistent wins over strong baselines

Evaluated on GenMedGPT-5k, CMCQA, and ExplainCPE (with paired medical KGs), MIRAGE tops GPT-4o, GPT-4o+ToT, QWQ-32B, MindMap (GraphRAG), and Search-o1 across GPT-4o ranking and/or accuracy. Highlights:

  • GenMedGPT-5k: best GPT-4o rank 1.8 (lower is better). 

  • CMCQA: rank 2.8, edging ToT, MindMap, and Search-o1. 

  • ExplainCPE: 84.8% accuracy vs GPT-4o 77.8%, Search-o1 80.7%, MindMap 84.6%.

Swapping the backbone to DeepSeek-R1-32B preserves the lift (ExplainCPE 84.4%), suggesting MIRAGE is model-agnostic. A human study on GenMedGPT-5k prefers MIRAGE over all baselines, mirroring GPT-4o’s ranking. 

What moved the needle

  • Structured retrieval beats flat text. Graph-aware exploration is more stable than BM25/dense retrieval and less noisy than web-first Search-o1 on medical tasks. 

  • Right-sizing the knobs. Increasing the decomposition threshold (Nq) and retrieval depth (Nr) improves rank/accuracy up to a point—useful guidance for real deployments. 

  • Ablations matter. Removing the Question Decomposer or Answer Synthesizer drops win rates in GPT-4o pairwise tests, confirming both stages carry weight. 

Why it matters

Linear chains waste compute on dead ends; MIRAGE parallelizes exploration, grounds every claim in KG paths, and verifies across chains before speaking—exactly the traits clinicians and auditors want. The approach is plug-and-play with modern LRMs (QWQ-32B, DeepSeek-R1) and slots cleanly into safety-critical, knowledge-heavy domains beyond medicine.

Paper link: arXiv 2508.18260 (PDF)

2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

 Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently. 

Three “reasoning effort” modes, one model

Because the verifier is built in, MetaStone-S1 exposes controllable thinking lengths (low, medium, high), implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick.
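Functionally, the three modes reduce to best-of-k selection with the built-in scorer. The sketch below assumes hypothetical `generate_candidate` and `score_trajectory` callables standing in for the shared policy and its scoring head; only the 2/8/32 candidate counts come from the post.

```python
# Sketch of the low/medium/high modes as best-of-k selection with a built-in
# scorer. `generate_candidate` and `score_trajectory` are hypothetical
# stand-ins for the shared policy and its scoring head.
import random
from typing import Callable

MODES = {"low": 2, "medium": 8, "high": 32}  # candidate counts per mode


def solve(
    question: str,
    mode: str,
    generate_candidate: Callable[[str], str],
    score_trajectory: Callable[[str, str], float],
) -> str:
    """Sample k chains of thought and return the highest-scoring one."""
    k = MODES[mode]
    candidates = [generate_candidate(question) for _ in range(k)]
    return max(candidates, key=lambda c: score_trajectory(question, c))


# Toy usage with dummy generator/scorer callables.
dummy_generate = lambda q: f"solution-{random.randint(0, 999)}"
dummy_score = lambda q, c: random.random()
print(solve("1 + 1 = ?", "medium", dummy_generate, dummy_score))
```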

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land comparable to OpenAI o3-mini (medium), with the high mode leading math by a sizable margin. Example table slice (Pass@1): MetaStone-S1-32B-high scores 85.2 on AIME’24, 73.6 on AIME’25, 64.2 on LiveCodeBench, and 89.7 on C-Eval, vs. 79.6 / 74.8 / 67.4 / 75.9 for o3-mini-medium.

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack. 

Why this matters

  • Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong external TTS gains. 

  • Label-free verifier training. The SPRM head learns step scoring from outcome signals, sidestepping costly, noisy process annotations. 

  • Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers. 

  • Open release. Code and checkpoints are public, inviting replication and adaptation. 

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)

14.7.25

MetaStone-S1 shows how to scale ‘thinking time’ instead of parameter count

 For the past year, the mantra in large-language-model land has been simple: bigger weights, better brains. A new paper from the University of Science and Technology of China, Nanjing University and collaborators argues there’s another dial to turn—reasoning time at inference—and it introduces a purpose-built architecture called MetaStone-S1 to prove the point. 

A reflective twist on the policy-reward combo

Standard alignment pipelines bolt a separate process-reward model (PRM) onto a frozen policy network, adding hundreds of millions of parameters and extra latency. MetaStone-S1 bundles both roles into one backbone and adds two task-specific heads: one for next-token prediction, the other for step-level scoring. The resulting Self-supervised Process Reward Model (SPRM) weighs in at just 53M parameters, roughly 99% smaller than conventional PRMs.
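A minimal PyTorch sketch of that shared-backbone idea: one trunk produces hidden states that feed both a next-token head and a small step-scoring head. The tiny GRU trunk and layer sizes here are placeholders for illustration only; the real model shares a full transformer backbone with a 53M-parameter scoring head.

```python
# Sketch of a policy and process-reward model sharing one backbone: a
# next-token head plus a small step-scoring head on the same hidden states.
# The GRU trunk and layer sizes are placeholders, not the paper's architecture.
import torch
import torch.nn as nn


class ReflectivePolicy(nn.Module):
    def __init__(self, vocab_size: int = 1000, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.backbone = nn.GRU(hidden, hidden, batch_first=True)  # shared trunk
        self.lm_head = nn.Linear(hidden, vocab_size)              # next-token prediction
        self.score_head = nn.Linear(hidden, 1)                    # lightweight step scoring

    def forward(self, token_ids: torch.Tensor):
        hidden_states, _ = self.backbone(self.embed(token_ids))
        logits = self.lm_head(hidden_states)          # (batch, seq, vocab)
        step_scores = self.score_head(hidden_states)  # (batch, seq, 1)
        return logits, step_scores.squeeze(-1)


model = ReflectivePolicy()
tokens = torch.randint(0, 1000, (2, 16))
logits, scores = model(tokens)
print(logits.shape, scores.shape)  # torch.Size([2, 16, 1000]) torch.Size([2, 16])
```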

Dial-a-brain at test time

Because reward scoring lives inside the model, MetaStone-S1 can stretch or shrink its chain-of-thought on the fly:

Mode   | Avg. reasoning steps | Typical use
Low    | ~8 steps             | latency-sensitive chat
Medium | ~24 steps            | balanced Q&A
High   | up to 64 steps       | Olympiad math, code generation

The team coins this knob Test-Time Scaling (TTS) and backs it with an empirical scaling law linking “thinking FLOPs” to quality gains. 
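The post doesn't spell out the law's functional form; purely as an illustration of what fitting such a law looks like, the sketch below fits accuracy as a linear function of log10 "thinking FLOPs" on made-up data points.

```python
# Illustration only: fitting a log-linear relation between "thinking FLOPs"
# and accuracy on made-up points. The paper's actual functional form and
# measurements may differ.
import numpy as np

flops = np.array([1e12, 4e12, 1.6e13, 6.4e13])  # placeholder compute budgets
accuracy = np.array([0.55, 0.63, 0.70, 0.74])   # placeholder quality scores

slope, intercept = np.polyfit(np.log10(flops), accuracy, deg=1)
print(f"accuracy ~ {slope:.3f} * log10(FLOPs) + {intercept:.3f}")
```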

Benchmark bump without parameter bloat

Running in high mode, the 32B-parameter MetaStone-S1 matches or beats OpenAI o3-mini across AIME ’24/’25, LiveCodeBench and C-Eval, despite using roughly half the weights.

Why it matters

  • Cheaper alignment. Folding the PRM inside the policy cuts training and inference costs.

  • User-controllable latency. Products can trade speed for depth without model swaps.

  • Open playground. All code, checkpoints (1.5B → 32B) and the reasoning-length scheduler are on GitHub under an Apache-2 license.

MetaStone-S1 won’t end the parameter-scaling race, but it offers a reminder that when and how long a model thinks can count as much as how big it is. Expect TTS dials and reflective reward heads to surface quickly in next-gen open-source stacks.

Paper link: arXiv 2507.01951 (PDF)
