Wandering Nomad: GPT-5

Showing posts with label GPT-5. Show all posts

16.8.25

GPT-5 tops multimodal medical QA—and even edges human experts on a new benchmark

If you’ve wondered whether general-purpose LLMs can truly reason across medical text and images, a new study out of Emory University says GPT-5 can—and then some. In “Capabilities of GPT-5 on Multimodal Medical Reasoning,” the team treats GPT-5 as a generalist decision-support engine and runs it through a unified, zero-shot chain-of-thought (CoT) protocol spanning text-only and vision-augmented tasks. The short version: GPT-5 outperforms GPT-4o across the board and surpasses pre-licensed human experts on the toughest multimodal benchmark they tested.

A cleaner test: one prompting recipe, many tasks

Prior medical LLM papers often mix datasets and prompting tricks, muddying comparisons. Here, the authors standardize splits and use the same two-turn CoT prompt for every dataset—first elicit reasoning, then force a single-letter answer—so differences reflect the model, not prompt engineering. Visual items attach image URLs in the first turn; the convergence step stays textual.

The numbers

Text QA: On MedQA (US, 4-option), GPT-5 hits 95.84%—a +4.80% absolute gain over GPT-4o. MMLU medical subsets also tick up, including a perfect score in Medical Genetics.
USMLE samples: Averaged across Steps 1–3, GPT-5 reaches 95.22% (+2.88 vs. GPT-4o), with the biggest lift on Step 2’s management-heavy items.
Multimodal QA: On MedXpertQA-MM, GPT-5’s reasoning and understanding jump +29.26% and +26.18% over GPT-4o. A case study shows the model integrating CT findings, labs and symptoms to recommend a Gastrografin swallow for suspected esophageal perforation.
Radiology VQA: On VQA-RAD, GPT-5 posts 70.92%—slightly below GPT-5-mini (74.90%), which the authors attribute to small-set quirks and calibration.

Above pre-licensed human experts—at least on MedXpertQA

Compared against pre-licensed clinicians, GPT-5 clears the bar decisively on MedXpertQA: +15.22% (text reasoning), +9.40% (text understanding), +24.23% (multimodal reasoning), +29.40% (multimodal understanding). GPT-4o, by contrast, trails humans on most of these dimensions.

Why it matters

From recall to reasoning. Gains concentrate on reasoning-intensive tasks (MedXpertQA, USMLE Step 2), suggesting internal upgrades beyond raw fact lookup.
Designing safer tools. The same unified protocol that boosts accuracy also produces structured rationales—useful for audit trails in clinical decision support.
Open evals. The authors say they’ve made code public (GPT-5-Evaluation), inviting replication and deeper probing of failure modes.

Mind the caveats

This is still benchmark-world: standardized items, time-limited settings, and no messy clinic realities. The paper itself cautions that real deployments will need calibration, domain-adapted fine-tuning and prospective trials.

If those steps pan out, GPT-5 looks less like a better test-taker and more like a multimodal reasoner—one that can fuse text and images to recommend plausible next actions.

Paper link: arXiv 2508.08224 (PDF)

GPT-5 nails ophthalmology board questions—and shows how to buy accuracy wisely

OpenAI’s newest reasoning line just aced a specialty test. In a cross-sectional benchmark of 260 closed-access AAO BCSC multiple-choice questions, GPT-5-high scored 96.5%—beating GPT-4o and OpenAI’s earlier o1, and statistically edging most GPT-5 variants, while tying o3-high within confidence intervals. Beyond raw accuracy, the paper grades rationale quality and runs a cost-accuracy analysis, surfacing Pareto-efficient configs for budget-sensitive deployments.

What they tested—and how

Researchers evaluated 12 GPT-5 configurations (three model sizes × four reasoning_effort settings) alongside o1-high, o3-high, and GPT-4o. Prompts enforced strict JSON with a single letter answer + one-sentence rationale, zero-shot. A Bradley-Terry arena ranked head-to-head wins; an LLM-as-a-judge autograder compared rationales to reference explanations.

Key results

Top score: GPT-5-high 0.965 accuracy (95% CI 0.942–0.985); > GPT-4o and o1-high; comparable to o3-high (0.958).
Rationale quality: GPT-5-high ranked #1 in pairwise judging.
Cost–accuracy frontier: Multiple efficient picks identified; GPT-5-mini-low emerges as the best low-cost, high-performance option.
Reasoning effort matters: Minimal-effort variants underperform; higher effort boosts accuracy but costs more tokens/time.

Why it matters

Hospitals and ed-tech teams rarely buy “max accuracy at any price.” This paper provides a menu of GPT-5 settings that trade pennies for percentage points, plus an autograder recipe others can adapt to scale specialty QA beyond ophthalmology. arXiv

Paper link: arXiv 2508.09956 (PDF)

8.8.25

GPT-5 Arrives: A Quantum Leap or an Incremental Step Toward Everyday AGI?

OpenAI CEO Sam Altman opened the launch keynote with a statistic that still jolts me: 700 million weekly ChatGPT users. If accurate, that is the fastest adoption curve of any software platform in history. Altman framed GPT-5 as the model that finally feels like “talking to a PhD-level expert in anything,” capable of planning a birthday party, writing a full software stack, or parsing biopsy results in seconds. As someone who has lived through GPT-3’s flashes of brilliance and GPT-4o’s solid utility, I’m impressed by the live demos—particularly the on-the-fly 3-D castle game and the finance dashboard spun up in minutes. Yet part of me wonders how often real-world edge-cases will still trip the model, PhD metaphors aside.

Reasoning + Speed = Default
One genuine breakthrough is that GPT-5 merges OpenAI’s slow “reasoning models” and fast “standard models” into a single pipeline. The system decides—dynamically—how much chain-of-thought to spend on each request. As a developer, I love the promise of no more model-picker gymnastics. But the skeptic in me notes that latency remains physics-bound; the keynote glossed over how much extra compute the “perfect amount of thinking” really burns.

Safer, but Still a Work in Progress
Safety lead Saachi emphasized safe completions: instead of the binary comply/refuse we’ve grown used to, GPT-5 offers partial, contextual answers plus policy pointers. I applaud the nuance (the potassium perchlorate fireworks example was spot-on), and early physician-audited benchmarks suggest lower hallucination rates. Still, bi-modal safety often fails at scale. Until we see longitudinal data from millions of prompts, I reserve judgment on whether “significantly less deceptive” translates into materially fewer bad outcomes.

Coding Superpowers—and Benchmarks That May Be Peaking
On SWEBench, GPT-5 posts 74.9 %—state-of-the-art by a wide margin—and Cursor’s integration shows real autonomy: the model searches code, patches errors after compiling, and writes explanatory READMEs. That’s developer candy. Yet I can’t ignore Michael Truell’s aside that models are saturating classic evals. When a leaderboard hits 99 %, the next delta in usefulness won’t come from marginal accuracy boosts; it will come from deeper tool integration, live debugging, and sustained multi-day agent runs—areas GPT-5 only begins to address.

Health and Personalization
The on-stage story of Carolina using GPT-5 to weigh radiation options was moving and highlights the model’s strength as a patient advocate. Free-tier voice chat, Gmail/calendar integration, and memory all point toward a more personal assistant future. My worry is data consent and provenance: when GPT-5 merges personal email with medical queries, the privacy surface expands dramatically. OpenAI’s policies will need the same iterative care the model architecture received.

What I’m Excited About—and Watching Carefully
I love the 400 K context window, the new “minimal reasoning” knob for latency-sensitive tasks, and regular-expression-constrained outputs. Those are practical, developer-driven wins. I’m less convinced by the AGI framing; Altman downplayed compute bottlenecks and energy costs, and benchmark fatigue is real. GPT-5 feels like the best general-purpose model we’ve seen—but whether it inaugurates a “team of experts in your pocket” or reveals the limits of current scaling will depend on how it behaves over the next billion prompts.

Overall, GPT-5 is a thrilling upgrade—smarter, faster, and more context-aware. Just remember: even PhD-level experts can be confidently wrong, and the same will be true for the most intuitive model yet.