
16.8.25

GPT-5 tops multimodal medical QA—and even edges human experts on a new benchmark

 If you’ve wondered whether general-purpose LLMs can truly reason across medical text and images, a new study out of Emory University says GPT-5 can—and then some. In “Capabilities of GPT-5 on Multimodal Medical Reasoning,” the team treats GPT-5 as a generalist decision-support engine and runs it through a unified, zero-shot chain-of-thought (CoT) protocol spanning text-only and vision-augmented tasks. The short version: GPT-5 outperforms GPT-4o across the board and surpasses pre-licensed human experts on the toughest multimodal benchmark they tested. 

A cleaner test: one prompting recipe, many tasks

Prior medical LLM papers often mix datasets and prompting tricks, muddying comparisons. Here, the authors standardize splits and use the same two-turn CoT prompt for every dataset—first elicit reasoning, then force a single-letter answer—so differences reflect the model, not prompt engineering. Visual items attach image URLs in the first turn; the convergence step stays textual. 
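The paper's exact prompt text isn't reproduced here, but the protocol is simple to picture. Below is a minimal sketch of such a two-turn CoT loop, assuming an OpenAI-style chat API; the model name, prompt wording, and function names are illustrative, not the authors' released code.

```python
# Sketch of a two-turn zero-shot CoT protocol: turn 1 elicits free-form reasoning
# (with image URLs attached for visual items), turn 2 forces a single-letter answer.
# Model name and prompt wording are assumptions, not the paper's exact strings.
from openai import OpenAI

client = OpenAI()

def two_turn_cot(question: str, options: str, image_urls: list[str] | None = None):
    content = [{"type": "text",
                "text": f"{question}\n\nOptions:\n{options}\n\nLet's think step by step."}]
    for url in image_urls or []:              # visual items: attach images in turn 1
        content.append({"type": "image_url", "image_url": {"url": url}})
    messages = [{"role": "user", "content": content}]

    reasoning = client.chat.completions.create(
        model="gpt-5", messages=messages).choices[0].message.content

    # Turn 2: the convergence step stays textual and forces a single answer letter.
    messages += [{"role": "assistant", "content": reasoning},
                 {"role": "user", "content": "Therefore, among the options, the single answer letter is:"}]
    answer = client.chat.completions.create(
        model="gpt-5", messages=messages).choices[0].message.content
    return reasoning, answer.strip()
```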

The numbers

  • Text QA: On MedQA (US, 4-option), GPT-5 hits 95.84%—a +4.80% absolute gain over GPT-4o. MMLU medical subsets also tick up, including a perfect score in Medical Genetics. 

  • USMLE samples: Averaged across Steps 1–3, GPT-5 reaches 95.22% (+2.88% vs. GPT-4o), with the biggest lift on Step 2’s management-heavy items.

  • Multimodal QA: On MedXpertQA-MM, GPT-5’s reasoning and understanding jump +29.26% and +26.18% over GPT-4o. A case study shows the model integrating CT findings, labs and symptoms to recommend a Gastrografin swallow for suspected esophageal perforation. 

  • Radiology VQA: On VQA-RAD, GPT-5 posts 70.92%—slightly below GPT-5-mini (74.90%), which the authors attribute to small-set quirks and calibration. 

Above pre-licensed human experts—at least on MedXpertQA

Compared against pre-licensed clinicians, GPT-5 clears the bar decisively on MedXpertQA: +15.22% (text reasoning), +9.40% (text understanding), +24.23% (multimodal reasoning), +29.40% (multimodal understanding). GPT-4o, by contrast, trails humans on most of these dimensions. 

Why it matters

  • From recall to reasoning. Gains concentrate on reasoning-intensive tasks (MedXpertQA, USMLE Step 2), suggesting internal upgrades beyond raw fact lookup.

  • Designing safer tools. The same unified protocol that boosts accuracy also produces structured rationales—useful for audit trails in clinical decision support. 

  • Open evals. The authors say they’ve made code public (GPT-5-Evaluation), inviting replication and deeper probing of failure modes. 

Mind the caveats

This is still benchmark-world: standardized items, time-limited settings, and no messy clinic realities. The paper itself cautions that real deployments will need calibration, domain-adapted fine-tuning and prospective trials. 

If those steps pan out, GPT-5 looks less like a better test-taker and more like a multimodal reasoner—one that can fuse text and images to recommend plausible next actions.

Paper link: arXiv 2508.08224 (PDF)

2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

 Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently. 
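To make that architecture concrete, here is a rough sketch of the shared-backbone idea under stated assumptions: one trunk feeds both the language-model head and a small scoring head, and trajectory selection reuses the same network. Class and function names are illustrative, not MetaStone-S1's actual implementation.

```python
# Minimal sketch of a shared policy + process-scoring head: the backbone produces
# hidden states used both for next-token logits and for per-step scores that rank
# sampled chains of thought. Names and layer sizes are illustrative.
import torch
import torch.nn as nn

class ReflectivePolicy(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                      # shared trunk (the 1.5B/7B/32B LLM)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)
        self.score_head = nn.Sequential(              # lightweight PRM-style head
            nn.Linear(hidden_size, hidden_size), nn.SiLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids: torch.Tensor):
        h = self.backbone(input_ids)                  # assumed to return (batch, seq, hidden)
        return self.lm_head(h), self.score_head(h).squeeze(-1)

def select_best(model: ReflectivePolicy, candidate_ids: list[torch.Tensor]) -> int:
    """Score k sampled chains of thought with the shared head; return the best index."""
    scores = []
    with torch.no_grad():
        for ids in candidate_ids:
            _, step_scores = model(ids.unsqueeze(0))
            scores.append(step_scores.mean().item())  # pool step scores per trajectory
    return max(range(len(scores)), key=scores.__getitem__)
```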

Three “reasoning effort” modes, one model

Because the verifier is built-in, MetaStone-S1 exposes controllable thinking lengths (low, medium, high), implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick.
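As a usage-level illustration, the dial could be as simple as the snippet below; only the k = 2/8/32 mapping comes from the paper, while the function and parameter names are hypothetical.

```python
# Hypothetical "reasoning effort" dial: map low/medium/high to candidate counts
# and pick the best-scoring chain of thought. The sampler and scorer are passed
# in as callables so the sketch stays self-contained.
from typing import Callable

EFFORT_TO_K = {"low": 2, "medium": 8, "high": 32}

def answer_with_effort(prompt: str,
                       sample: Callable[[str], str],   # draws one chain of thought
                       score: Callable[[str], float],  # SPRM-style trajectory score
                       effort: str = "medium") -> str:
    k = EFFORT_TO_K[effort]
    candidates = [sample(prompt) for _ in range(k)]
    return max(candidates, key=score)
```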

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land comparable to OpenAI o3-mini (medium)—with the high mode leading math by a sizable margin. Example table slice (Pass@1): MetaStone-S1-32B-high scores 85.2 on AIME’24, 73.6 on AIME’25, 64.2 on LiveCodeBench, and 89.7 on C-Eval, versus 79.6 / 74.8 / 67.4 / 75.9 for o3-mini-medium.

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack. 

Why this matters

  • Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong external TTS gains. 

  • Label-free verifier training. The SPRM head learns step scoring from outcome signals, sidestepping costly, noisy process annotations (see the sketch after this list).

  • Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers. 

  • Open release. Code and checkpoints are public, inviting replication and adaptation. 
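On the label-free verifier point: the paper trains the SPRM head from outcome rewards rather than per-step annotations, but the exact objective isn't reproduced here. The following is a heavily hedged sketch of one way such outcome-supervised scoring could look, with per-step scores pooled into a trajectory score and pushed toward the binary outcome label; the loss shape and pooling are assumptions, not the paper's formula.

```python
# Hedged sketch of outcome-supervised scorer training: pool per-step score logits
# into a trajectory logit and supervise it with the outcome reward (1.0 if the
# final answer was correct), so no per-step human labels are needed.
import torch
import torch.nn.functional as F

def outcome_supervised_loss(step_scores: torch.Tensor,    # (batch, n_steps) raw logits
                            outcome_reward: torch.Tensor  # (batch,) 1.0 correct / 0.0 wrong
                            ) -> torch.Tensor:
    traj_logit = step_scores.mean(dim=1)                  # pool step scores per trajectory
    return F.binary_cross_entropy_with_logits(traj_logit, outcome_reward)

# Usage with dummy rollout data: four trajectories of twelve scored steps each.
loss = outcome_supervised_loss(torch.randn(4, 12), torch.tensor([1., 0., 1., 1.]))
```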

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...