16.8.25

GPT-5 nails ophthalmology board questions—and shows how to buy accuracy wisely

OpenAI’s newest reasoning line just aced a specialty test. In a cross-sectional benchmark of 260 closed-access AAO BCSC multiple-choice questions, GPT-5-high scored 96.5%, beating GPT-4o and OpenAI’s earlier o1, significantly outperforming most other GPT-5 variants, and remaining statistically tied with o3-high within confidence intervals. Beyond raw accuracy, the paper grades rationale quality and runs a cost-accuracy analysis, surfacing Pareto-efficient configurations for budget-sensitive deployments.

What they tested—and how

Researchers evaluated 12 GPT-5 configurations (three model sizes × four reasoning_effort settings) alongside o1-high, o3-high, and GPT-4o. All prompts were zero-shot and enforced strict JSON output: a single answer letter plus a one-sentence rationale. A Bradley-Terry arena ranked head-to-head wins between models, and an LLM-as-a-judge autograder compared each rationale to the reference explanation.
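For context, a single zero-shot query under this protocol might look like the sketch below. This is a minimal illustration assuming the OpenAI Chat Completions API with its reasoning_effort and JSON-mode parameters; the paper's exact prompts and model identifiers are not reproduced, and the question is a generic placeholder (BCSC items are closed-access).

```python
from openai import OpenAI

client = OpenAI()

# Placeholder item; the actual BCSC questions are closed-access.
question = (
    "A patient presents with sudden painless monocular vision loss. "
    "Which test is most appropriate first?\n"
    "A) ... B) ... C) ... D) ..."
)

response = client.chat.completions.create(
    model="gpt-5",                # assumed identifier; the study varied three model sizes
    reasoning_effort="high",      # one of: minimal, low, medium, high
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Answer the multiple-choice question. Respond only with JSON of the form "
                '{"answer": "<A-D>", "rationale": "<one sentence>"}.'
            ),
        },
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)  # e.g. {"answer": "B", "rationale": "..."}
```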

Key results

  • Top score: GPT-5-high reached 0.965 accuracy (95% CI 0.942–0.985), higher than GPT-4o and o1-high and comparable to o3-high (0.958).

  • Rationale quality: GPT-5-high ranked #1 in pairwise judging. 

  • Cost–accuracy frontier: Multiple efficient picks identified; GPT-5-mini-low emerges as the best low-cost, high-performance option (see the Pareto sketch after this list).

  • Reasoning effort matters: Minimal-effort variants underperform; higher effort boosts accuracy but costs more tokens/time. 
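To make the frontier idea concrete, here is a small sketch of how Pareto-efficient configurations can be identified from (cost, accuracy) pairs. The cost values and all accuracies other than the reported 0.965 are illustrative placeholders, not the paper's measurements.

```python
# A config is Pareto-efficient if no other config is at least as cheap and at
# least as accurate, with a strict improvement in at least one of the two.

configs = {
    # Illustrative placeholder numbers (per-run cost in USD, accuracy on the benchmark)
    "gpt-5-high":         {"cost_usd": 9.00, "accuracy": 0.965},
    "gpt-5-mini-low":     {"cost_usd": 0.40, "accuracy": 0.930},
    "gpt-5-nano-minimal": {"cost_usd": 0.05, "accuracy": 0.830},
    "gpt-4o":             {"cost_usd": 1.20, "accuracy": 0.880},
}

def pareto_frontier(items):
    """Return names of configs not dominated on (lower cost, higher accuracy)."""
    frontier = []
    for name, a in items.items():
        dominated = any(
            b["cost_usd"] <= a["cost_usd"]
            and b["accuracy"] >= a["accuracy"]
            and (b["cost_usd"] < a["cost_usd"] or b["accuracy"] > a["accuracy"])
            for other, b in items.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# With these placeholder numbers, gpt-4o is dominated by gpt-5-mini-low,
# while the other three configs sit on the frontier.
print(pareto_frontier(configs))
```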

Why it matters

Hospitals and ed-tech teams rarely buy “max accuracy at any price.” This paper provides a menu of GPT-5 settings that trade pennies for percentage points, plus an autograder recipe others can adapt to scale specialty QA beyond ophthalmology.
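As a starting point for such adaptation, the sketch below shows one way an LLM-as-a-judge autograder could compare a candidate rationale to a reference explanation. The judge model, rubric, and verdict labels are assumptions for illustration; the paper's actual autograder prompt is not reproduced here.

```python
from openai import OpenAI

client = OpenAI()

def grade_rationale(question: str, reference: str, candidate: str) -> str:
    """LLM-as-a-judge: compare a model's rationale to the reference explanation.

    The judge model, rubric, and verdict labels below are illustrative
    assumptions; the paper's exact autograder setup may differ.
    """
    prompt = (
        "You are grading an answer rationale for a multiple-choice question.\n"
        f"Question: {question}\n"
        f"Reference explanation: {reference}\n"
        f"Candidate rationale: {candidate}\n"
        'Reply only with JSON: {"verdict": "correct" | "partially_correct" | "incorrect"}'
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical judge model choice
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```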

Paper link: arXiv:2508.09956 (PDF)
