Showing posts with label test-time scaling.

1.9.25

MIRAGE: parallel GraphRAG turns test-time scaling into a team sport

 Most test-time scaling schemes still walk a single, linear chain of thought—great until an early mistake snowballs. MIRAGE (Multi-chain Inference with Retrieval-Augmented Graph Exploration) swaps that for many chains in parallel, each grounded in a medical knowledge graph and then cross-checked before answering. Think of it as ToT’s breadth, Search-o1’s retrieval, and GraphRAG’s structure—rolled into one pipeline. 

How it works (and why it’s different)

  • Entity-grounded decomposition. The system splits a clinical question into sub-questions tied to concrete entities (symptoms, diseases, treatments). Each sub-question spawns its own reasoning chain (a code sketch of the full loop follows this list).

  • Graph-based retrieval, two modes.

    • Anchor mode: query the KG around a single entity (local neighborhood).

    • Bridge mode: search paths between entity pairs to surface multi-hop relations. 

  • Adaptive evidence streaming. Chains iteratively expand neighbors/multi-hop trails, keeping only deduplicated, directionally relevant facts. 

  • Cross-chain verification. An answer synthesizer reconciles sub-answers, prefers explanations backed by broader, independent chains, and normalizes clinical terms—cutting contradictions and hallucinations. Outputs are serialized with full provenance traces for audit. 
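The sketch below pulls those four stages into one loop. It is a minimal illustration, not the authors' code: the llm and kg interfaces, function names, and the relevance filter are all assumptions.

```python
# Illustrative MIRAGE-style loop; every interface here (llm.*, kg.*) is a
# hypothetical stand-in, not the paper's implementation.
from itertools import combinations

def mirage_answer(question, llm, kg, max_hops=2):
    # 1) Entity-grounded decomposition: entity-tied sub-questions, one chain each.
    sub_questions = llm.decompose(question)            # [(sub_q, [entities]), ...]

    chains = []
    for sub_q, entities in sub_questions:
        # 2) Graph-based retrieval in two modes.
        evidence = []
        for e in entities:                              # anchor mode: local neighborhood
            evidence += kg.neighbors(e, hops=1)
        for a, b in combinations(entities, 2):          # bridge mode: multi-hop paths
            evidence += kg.paths(a, b, max_hops=max_hops)

        # 3) Adaptive evidence streaming: deduplicate, keep directionally relevant facts.
        evidence = [f for f in dict.fromkeys(evidence) if llm.is_relevant(sub_q, f)]
        chains.append(llm.reason(sub_q, evidence))      # one grounded reasoning chain

    # 4) Cross-chain verification: reconcile sub-answers, preferring claims
    #    supported by multiple independent chains, with provenance kept.
    return llm.synthesize(question, chains)
```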

Benchmarks: consistent wins over strong baselines

Evaluated on GenMedGPT-5k, CMCQA, and ExplainCPE (with paired medical KGs), MIRAGE tops GPT-4o, GPT-4o+ToT, QWQ-32B, MindMap (GraphRAG), and Search-o1 on GPT-4o-judged ranking and/or accuracy. Highlights:

  • GenMedGPT-5k: best GPT-4o rank 1.8 (lower is better). 

  • CMCQA: rank 2.8, edging out ToT, MindMap, and Search-o1.

  • ExplainCPE: 84.8% accuracy vs GPT-4o 77.8%, Search-o1 80.7%, MindMap 84.6%.

Swapping the backbone to DeepSeek-R1-32B preserves the lift (ExplainCPE 84.4%), suggesting MIRAGE is model-agnostic. A human study on GenMedGPT-5k prefers MIRAGE over all baselines, mirroring GPT-4o’s ranking. 

What moved the needle

  • Structured retrieval beats flat text. Graph-aware exploration is more stable than BM25/dense retrieval and less noisy than web-first Search-o1 on medical tasks. 

  • Right-sizing the knobs. Increasing the decomposition threshold (Nq) and retrieval depth (Nr) improves rank/accuracy up to a point—useful guidance for real deployments (see the snippet after this list).

  • Ablations matter. Removing the Question Decomposer or Answer Synthesizer drops win rates in GPT-4o pairwise tests, confirming both stages carry weight. 
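If the sketch above were real, those two knobs would wire in roughly as follows; the names Nq and Nr come from the paper's ablations, while the values and plumbing are placeholders.

```python
# Placeholder deployment knobs; Nq/Nr follow the paper's naming, values are illustrative.
MIRAGE_KNOBS = {
    "Nq": 4,  # decomposition threshold: max sub-questions (parallel chains) per query
    "Nr": 3,  # retrieval depth: max KG hops each chain may expand
}

# Wiring into the earlier sketch: cap the chain count, pass the hop budget.
# sub_questions = llm.decompose(question)[: MIRAGE_KNOBS["Nq"]]
# answer = mirage_answer(question, llm, kg, max_hops=MIRAGE_KNOBS["Nr"])
```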

Why it matters

Linear chains waste compute on dead ends; MIRAGE parallelizes exploration, grounds every claim in KG paths, and verifies across chains before speaking—exactly the traits clinicians and auditors want. The approach is plug-and-play with modern LRMs (QWQ-32B, DeepSeek-R1) and slots cleanly into safety-critical, knowledge-heavy domains beyond medicine.

Paper link: arXiv 2508.18260 (PDF)

2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

 Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently. 
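One way to read that training signal, sketched below: every step score produced by the shared head is supervised with the chain's final outcome as a pseudo-label. This is a minimal sketch under that assumption, not the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

score_head = nn.Linear(4096, 1)  # toy stand-in for the ~53M-parameter scoring head

def sprm_loss(step_hidden_states, outcome_correct):
    """Self-supervised step scoring sketch: each step in a sampled chain inherits
    the chain's outcome reward (1.0 if the final answer checked out, else 0.0)."""
    scores = score_head(step_hidden_states).squeeze(-1)        # (num_steps,)
    targets = torch.full_like(scores, float(outcome_correct))  # outcome as pseudo-label
    return F.binary_cross_entropy_with_logits(scores, targets)
```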

Three “reasoning effort” modes, one model

Because the verifier is built-in, MetaStone-S1 exposes controllable thinking lengths (low, medium, high), implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick.
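Operationally, the dial amounts to best-of-k selection with the built-in scorer. A minimal sketch, assuming hypothetical generate_chain / score_chain methods rather than the released API:

```python
# Best-of-k selection with an internal scorer; `generate_chain` and `score_chain`
# are assumed interfaces standing in for the policy and its shared SPRM head.
EFFORT_TO_K = {"low": 2, "medium": 8, "high": 32}

def answer_with_effort(model, prompt, effort="medium"):
    k = EFFORT_TO_K[effort]
    chains = [model.generate_chain(prompt) for _ in range(k)]
    best = max(chains, key=model.score_chain)   # keep the highest-scored trajectory
    return best.final_answer
```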

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land in the same range as OpenAI o3-mini (medium)—with the high mode leading math by a sizable margin. Example table slice (Pass@1): AIME’24 85.2, AIME’25 73.6, LiveCodeBench 64.2, C-Eval 89.7 for MetaStone-S1-32B-high vs. o3-mini-medium 79.6 / 74.8 / 67.4 / 75.9.

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack. 

Why this matters

  • Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong external TTS gains. 

  • Label-free verifier training. The SPRM head learns step scoring from outcome signals, sidestepping costly, noisy process annotations. 

  • Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers. 

  • Open release. Code and checkpoints are public, inviting replication and adaptation. 

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)

14.7.25

MetaStone-S1 shows how to scale ‘thinking time’ instead of parameter count

 For the past year, the mantra in large-language-model land has been simple: bigger weights, better brains. A new paper from the University of Science and Technology of China, Nanjing University and collaborators argues there’s another dial to turn—reasoning time at inference—and it introduces a purpose-built architecture called MetaStone-S1 to prove the point. 

A reflective twist on the policy-reward combo

Standard alignment pipelines bolt a separate process-reward model (PRM) onto a frozen policy network, adding hundreds of millions of parameters and latency. MetaStone-S1 bundles both roles into one backbone and sprinkles in two task-specific heads: one for next-token prediction, the other for step-level scoring. The resulting Self-supervised Process Reward Model (SPRM) weighs in at just 53 M parameters—99 % smaller than conventional PRMs. 
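A toy PyTorch layout of that idea, one shared trunk feeding a next-token head and a small step-scoring head; the dimensions and module structure here are illustrative, not the released architecture.

```python
import torch.nn as nn

class ReflectiveGenerativeModel(nn.Module):
    """Toy sketch: one shared backbone, two heads (next-token + step-level scoring)."""
    def __init__(self, backbone, hidden_dim, vocab_size):
        super().__init__()
        self.backbone = backbone                          # shared transformer trunk
        self.lm_head = nn.Linear(hidden_dim, vocab_size)  # policy: next-token logits
        self.score_head = nn.Sequential(                  # lightweight SPRM head
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, input_ids):
        h = self.backbone(input_ids)                  # (batch, seq_len, hidden_dim)
        return self.lm_head(h), self.score_head(h)    # token logits, per-step scores
```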

Dial-a-brain at test time

Because reward scoring lives inside the model, MetaStone-S1 can stretch or shrink its chain-of-thought on the fly:

Mode      Avg. reasoning steps    Typical use
Low       ~8 steps                latency-sensitive chat
Medium    ~24 steps               balanced Q&A
High      up to 64 steps          Olympiad math, code generation

The team frames this knob as Test-Time Scaling (TTS) and backs it with an empirical scaling law linking “thinking FLOPs” to quality gains.
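Translated into configuration, the table above is just a reasoning-budget map a serving layer can select from; the structure below is an assumed illustration, not the released scheduler.

```python
# Illustrative reasoning-budget map derived from the table above;
# the scheduler interface is an assumption, not the released code.
REASONING_BUDGET = {
    "low":    {"max_steps": 8,  "use_case": "latency-sensitive chat"},
    "medium": {"max_steps": 24, "use_case": "balanced Q&A"},
    "high":   {"max_steps": 64, "use_case": "Olympiad math, code generation"},
}

def step_budget(mode: str) -> int:
    return REASONING_BUDGET[mode]["max_steps"]
```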

Benchmark bump without parameter bloat

Running in high mode, the 32 B-parameter MetaStone-S1 matches or beats OpenAI o3-mini across AIME ’24/’25, LiveCodeBench and C-EVAL—despite using roughly half the weights. 

Why it matters

  • Cheaper alignment. Folding the PRM inside the policy cuts training and inference costs.

  • User-controllable latency. Products can trade speed for depth without model swaps.

  • Open playground. All code, checkpoints (1.5 B→32 B) and the reasoning-length scheduler are on GitHub under an Apache-2 license. 

MetaStone-S1 won’t end the parameter-scaling race, but it offers a reminder that when and how long a model thinks can count as much as how big it is. Expect TTS dials and reflective reward heads to surface quickly in next-gen open-source stacks.

Paper link: arXiv 2507.01951 (PDF)

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...