
16.8.25

GPT-5 tops multimodal medical QA—and even edges human experts on a new benchmark

 If you’ve wondered whether general-purpose LLMs can truly reason across medical text and images, a new study out of Emory University says GPT-5 can—and then some. In “Capabilities of GPT-5 on Multimodal Medical Reasoning,” the team treats GPT-5 as a generalist decision-support engine and runs it through a unified, zero-shot chain-of-thought (CoT) protocol spanning text-only and vision-augmented tasks. The short version: GPT-5 outperforms GPT-4o across the board and surpasses pre-licensed human experts on the toughest multimodal benchmark they tested. 

A cleaner test: one prompting recipe, many tasks

Prior medical LLM papers often mix datasets and prompting tricks, muddying comparisons. Here, the authors standardize splits and use the same two-turn CoT prompt for every dataset—first elicit reasoning, then force a single-letter answer—so differences reflect the model, not prompt engineering. Visual items attach image URLs in the first turn; the convergence step stays textual. 
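The paper's exact prompt text isn't reproduced here, but the protocol is simple to picture. Below is a minimal sketch of such a two-turn CoT loop, assuming an OpenAI-style chat API; the model name, prompt wording, and function names are illustrative, not the authors' released code.

```python
# Sketch of a two-turn zero-shot CoT protocol: turn 1 elicits free-form reasoning
# (with image URLs attached for visual items), turn 2 forces a single-letter answer.
# Model name and prompt wording are assumptions, not the paper's exact strings.
from openai import OpenAI

client = OpenAI()

def two_turn_cot(question: str, options: str, image_urls: list[str] | None = None):
    content = [{"type": "text",
                "text": f"{question}\n\nOptions:\n{options}\n\nLet's think step by step."}]
    for url in image_urls or []:              # visual items: attach images in turn 1
        content.append({"type": "image_url", "image_url": {"url": url}})
    messages = [{"role": "user", "content": content}]

    reasoning = client.chat.completions.create(
        model="gpt-5", messages=messages).choices[0].message.content

    # Turn 2: the convergence step stays textual and forces a single answer letter.
    messages += [{"role": "assistant", "content": reasoning},
                 {"role": "user", "content": "Therefore, among the options, the single answer letter is:"}]
    answer = client.chat.completions.create(
        model="gpt-5", messages=messages).choices[0].message.content
    return reasoning, answer.strip()
```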

The numbers

  • Text QA: On MedQA (US, 4-option), GPT-5 hits 95.84%—a +4.80% absolute gain over GPT-4o. MMLU medical subsets also tick up, including a perfect score in Medical Genetics. 

  • USMLE samples: Averaged across Steps 1–3, GPT-5 reaches 95.22% (+2.88% vs. GPT-4o), with the biggest lift on Step 2’s management-heavy items.

  • Multimodal QA: On MedXpertQA-MM, GPT-5’s reasoning and understanding jump +29.26% and +26.18% over GPT-4o. A case study shows the model integrating CT findings, labs and symptoms to recommend a Gastrografin swallow for suspected esophageal perforation. 

  • Radiology VQA: On VQA-RAD, GPT-5 posts 70.92%—slightly below GPT-5-mini (74.90%), which the authors attribute to small-set quirks and calibration. 

Above pre-licensed human experts—at least on MedXpertQA

Compared against pre-licensed clinicians, GPT-5 clears the bar decisively on MedXpertQA: +15.22% (text reasoning), +9.40% (text understanding), +24.23% (multimodal reasoning), +29.40% (multimodal understanding). GPT-4o, by contrast, trails humans on most of these dimensions. 

Why it matters

  • From recall to reasoning. Gains concentrate on reasoning-intensive tasks (MedXpertQA, USMLE Step 2), suggesting internal upgrades beyond raw fact lookup.

  • Designing safer tools. The same unified protocol that boosts accuracy also produces structured rationales—useful for audit trails in clinical decision support. 

  • Open evals. The authors say they’ve made code public (GPT-5-Evaluation), inviting replication and deeper probing of failure modes. 

Mind the caveats

This is still benchmark-world: standardized items, time-limited settings, and no messy clinic realities. The paper itself cautions that real deployments will need calibration, domain-adapted fine-tuning and prospective trials. 

If those steps pan out, GPT-5 looks less like a better test-taker and more like a multimodal reasoner—one that can fuse text and images to recommend plausible next actions.

Paper link: arXiv 2508.08224 (PDF)

2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

 Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently. 
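To make that architecture concrete, here is a rough sketch of the shared-backbone idea under stated assumptions: one trunk feeds both the language-model head and a small scoring head, and trajectory selection reuses the same network. Class and function names are illustrative, not MetaStone-S1's actual implementation.

```python
# Minimal sketch of a shared policy + process-scoring head: the backbone produces
# hidden states used both for next-token logits and for per-step scores that rank
# sampled chains of thought. Names and layer sizes are illustrative.
import torch
import torch.nn as nn

class ReflectivePolicy(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                      # shared trunk (the 1.5B/7B/32B LLM)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.score_head = nn.Sequential(              # lightweight PRM-style head
            nn.Linear(hidden_size, hidden_size), nn.SiLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids: torch.Tensor):
        h = self.backbone(input_ids)                  # assumed to return (batch, seq, hidden)
        return self.lm_head(h), self.score_head(h).squeeze(-1)

def select_best(model: ReflectivePolicy, candidate_ids: list[torch.Tensor]) -> int:
    """Score k sampled chains of thought with the shared head; return the best index."""
    scores = []
    with torch.no_grad():
        for ids in candidate_ids:
            _, step_scores = model(ids.unsqueeze(0))
            scores.append(step_scores.mean().item())  # pool step scores per trajectory
    return max(range(len(scores)), key=scores.__getitem__)
```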

Three “reasoning effort” modes, one model

Because the verifier is built-in, MetaStone-S1 exposes controllable thinking lengths (low, medium, high), implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick.
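As a usage-level illustration, the dial could be as simple as the snippet below; only the k = 2/8/32 mapping comes from the paper, while the function and parameter names are hypothetical.

```python
# Hypothetical "reasoning effort" dial: map low/medium/high to candidate counts
# and pick the best-scoring chain of thought. The sampler and scorer are passed
# in as callables so the sketch stays self-contained.
from typing import Callable

EFFORT_TO_K = {"low": 2, "medium": 8, "high": 32}

def answer_with_effort(prompt: str,
                       sample: Callable[[str], str],   # draws one chain of thought
                       score: Callable[[str], float],  # SPRM-style trajectory score
                       effort: str = "medium") -> str:
    k = EFFORT_TO_K[effort]
    candidates = [sample(prompt) for _ in range(k)]
    return max(candidates, key=score)
```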

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land comparable to OpenAI o3-mini (medium)—with the high mode leading math by a sizable margin. Example table slice (Pass@1): MetaStone-S1-32B-high scores 85.2 on AIME’24, 73.6 on AIME’25, 64.2 on LiveCodeBench, and 89.7 on C-Eval, versus 79.6 / 74.8 / 67.4 / 75.9 for o3-mini-medium.

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack. 

Why this matters

  • Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong external TTS gains. 

  • Label-free verifier training. The SPRM head learns step scoring from outcome signals, sidestepping costly, noisy process annotations (see the sketch after this list).

  • Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers. 

  • Open release. Code and checkpoints are public, inviting replication and adaptation. 
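On the label-free verifier point: the paper trains the SPRM head from outcome rewards rather than per-step annotations, but the exact objective isn't reproduced here. The following is a heavily hedged sketch of one way such outcome-supervised scoring could look, with per-step scores pooled into a trajectory score and pushed toward the binary outcome label; the loss shape and pooling are assumptions, not the paper's formula.

```python
# Hedged sketch of outcome-supervised scorer training: pool per-step score logits
# into a trajectory logit and supervise it with the outcome reward (1.0 if the
# final answer was correct), so no per-step human labels are needed.
import torch
import torch.nn.functional as F

def outcome_supervised_loss(step_scores: torch.Tensor,    # (batch, n_steps) raw logits
                            outcome_reward: torch.Tensor  # (batch,) 1.0 correct / 0.0 wrong
                            ) -> torch.Tensor:
    traj_logit = step_scores.mean(dim=1)                  # pool step scores per trajectory
    return F.binary_cross_entropy_with_logits(traj_logit, outcome_reward)

# Usage with dummy rollout data: four trajectories of twelve scored steps each.
loss = outcome_supervised_loss(torch.randn(4, 12), torch.tensor([1., 0., 1., 1.]))
```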

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...