Showing posts with label Medical AI. Show all posts

16.8.25

GPT-5 nails ophthalmology board questions—and shows how to buy accuracy wisely

 OpenAI’s newest reasoning line just aced a specialty test. In a cross-sectional benchmark of 260 closed-access AAO BCSC multiple-choice questions, GPT-5-high scored 96.5%—beating GPT-4o and OpenAI’s earlier o1, and statistically edging most GPT-5 variants, while tying o3-high within confidence intervals. Beyond raw accuracy, the paper grades rationale quality and runs a cost-accuracy analysis, surfacing Pareto-efficient configs for budget-sensitive deployments. 

What they tested—and how

Researchers evaluated 12 GPT-5 configurations (three model sizes × four reasoning_effort settings) alongside o1-high, o3-high, and GPT-4o. Prompts were zero-shot and enforced strict JSON output: a single-letter answer plus a one-sentence rationale. A Bradley-Terry arena ranked head-to-head wins, and an LLM-as-a-judge autograder compared each rationale to the reference explanation.
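The Bradley-Terry step can be sketched in a few lines. Given pairwise win counts between model configurations, the classic minorization-maximization (MM) update fits a latent "strength" per model; sorting by strength gives the arena ranking. The win counts below are illustrative toy numbers, not the paper's data.

```python
def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths. wins[i][j] = times model i beat model j."""
    n = len(wins)
    p = [1.0] * n  # initial strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of model i
            # Standard MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x * n / s for x in new_p]  # renormalize for numerical stability
    return p

# Toy arena: model 0 dominates, model 2 loses most matchups.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
scores = bradley_terry(wins)
ranking = sorted(range(3), key=lambda i: -scores[i])
```

With these toy counts the fitted ranking is `[0, 1, 2]`, matching the raw win totals; on real arena data the fit also accounts for *who* each model beat, not just how often it won.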

Key results

  • Top score: GPT-5-high reached 0.965 accuracy (95% CI 0.942–0.985), higher than GPT-4o and o1-high and comparable to o3-high (0.958)

  • Rationale quality: GPT-5-high ranked #1 in pairwise judging. 

  • Cost–accuracy frontier: Multiple efficient picks identified; GPT-5-mini-low emerges as the best low-cost, high-performance option. 

  • Reasoning effort matters: Minimal-effort variants underperform; higher effort boosts accuracy but costs more tokens/time. 
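The "Pareto-efficient configs" idea from the cost-accuracy analysis is easy to make concrete: a configuration is kept only if no other configuration is at least as cheap *and* at least as accurate (with a strict improvement on one axis). The costs and accuracies below are hypothetical placeholders for illustration, not the paper's measurements.

```python
def pareto_frontier(configs):
    """configs: dict name -> (cost_usd, accuracy). Returns efficient names."""
    efficient = []
    for name, (cost, acc) in configs.items():
        dominated = any(
            (c2 <= cost and a2 >= acc) and (c2 < cost or a2 > acc)
            for n2, (c2, a2) in configs.items() if n2 != name
        )
        if not dominated:
            efficient.append(name)
    return sorted(efficient)

configs = {  # hypothetical numbers for illustration only
    "gpt-5-high":     (4.00, 0.965),
    "gpt-5-mini-low": (0.20, 0.910),
    "gpt-5-minimal":  (0.25, 0.850),  # dominated by gpt-5-mini-low
    "o3-high":        (5.00, 0.958),  # dominated by gpt-5-high
}
frontier = pareto_frontier(configs)  # -> ["gpt-5-high", "gpt-5-mini-low"]
```

In this toy menu, a budget-sensitive team would pick the cheap frontier point and only pay for the expensive one when the extra percentage points matter.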

Why it matters

Hospitals and ed-tech teams rarely buy “max accuracy at any price.” This paper provides a menu of GPT-5 settings that trade pennies for percentage points, plus an autograder recipe others can adapt to scale specialty QA beyond ophthalmology.

Paper link: arXiv 2508.09956 (PDF)

22.5.25

Google Unveils MedGemma: Advanced Open-Source AI Models for Medical Text and Image Comprehension

 At Google I/O 2025, Google announced the release of MedGemma, a collection of open-source AI models tailored for medical text and image comprehension. Built upon the Gemma 3 architecture, MedGemma aims to assist developers in creating advanced healthcare applications by providing robust tools for analyzing medical data. 

MedGemma Model Variants

MedGemma is available in two distinct versions, each catering to specific needs in medical AI development:

  • MedGemma 4B (Multimodal Model): This 4-billion-parameter model integrates both text and image processing. It employs a SigLIP image encoder pre-trained on diverse de-identified medical images, including chest X-rays, dermatology images, ophthalmology images, and histopathology slides. This variant is suitable for tasks like medical image classification and interpretation.

  • MedGemma 27B (Text-Only Model): A larger, 27-billion parameter model focused exclusively on medical text comprehension. It's optimized for tasks requiring deep clinical reasoning and analysis of complex medical literature. 

Key Features and Use Cases

MedGemma offers several features that make it a valuable asset for medical AI development:

  • Medical Image Classification: The 4B model can be adapted for classifying various medical images, aiding in diagnostics and research. 

  • Text-Based Medical Question Answering: Both models can be utilized to develop systems that answer medical questions based on extensive medical literature and data.

  • Integration with Development Tools: MedGemma models are accessible through platforms like Google Cloud Model Garden and Hugging Face, and are supported by resources such as GitHub repositories and Colab notebooks for ease of use and customization. 
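Since the models are published on Hugging Face, querying MedGemma 4B can be sketched with the standard transformers chat format. The model id `google/medgemma-4b-it` and the exact message schema below are assumptions based on the usual image-text-to-text API; check the official model card before relying on them.

```python
def build_messages(image_url: str, question: str) -> list:
    """Build a multimodal chat message in the transformers chat format."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

def ask_medgemma(image_url: str, question: str) -> str:
    """Run the (gated) model; needs `pip install transformers` and HF access."""
    from transformers import pipeline  # heavy import kept local
    pipe = pipeline("image-text-to-text", model="google/medgemma-4b-it")
    out = pipe(text=build_messages(image_url, question), max_new_tokens=128)
    return out[0]["generated_text"][-1]["content"]

# Assemble a request without invoking the model (no download needed here).
messages = build_messages(
    "https://example.org/chest_xray.png",  # placeholder image URL
    "Does this chest X-ray show signs of pneumothorax?",
)
```

The message-building step is separated from the pipeline call so the prompt format can be unit-tested without downloading the weights.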

Access and Licensing

Developers interested in leveraging MedGemma can access the models and related resources through Google Cloud Model Garden and Hugging Face.

The use of MedGemma is governed by the Health AI Developer Foundations terms of use, ensuring responsible deployment in healthcare settings.
