9.7.25

GPT-4o aces its multimodal classmates—but still can’t dethrone specialist vision models

OpenAI’s GPT-4o may be the first flagship model to unify text, image and audio in a single stack, but a new EPFL benchmarking effort shows just how far even the best “everything” model still lags behind purpose-built computer-vision networks. In “How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks,” researchers tested GPT-4o alongside six other foundation models—o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL and Llama 3.2—on six bread-and-butter CV tasks that every undergrad knows: ImageNet-style classification, MS-COCO object detection, semantic segmentation, instance grouping, monocular depth and surface-normal prediction.

Turning text-only giants into pixel workers

Most API-level models can’t output polygons or depth maps, so the team invented a prompt-chaining framework that decomposes each vision problem into a sequence of classification subtasks that any chatty LLM can answer. A recursive “zoom-and-vote” routine localises objects, SLIC superpixels stand in for pixels in segmentation, and pairwise ranking lets the models infer relative depth.
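To make the idea concrete, here is a minimal sketch (not the authors’ actual code): semantic segmentation becomes per-superpixel classification and depth becomes pairwise ordering, with a hypothetical ask_model wrapper standing in for whichever multimodal chat API is being evaluated.

```python
# Sketch of the prompt-chaining idea: reduce dense vision tasks to
# classification-style questions a chat model can answer.
# ask_model() is a hypothetical stand-in for any multimodal API call.

import numpy as np
from skimage.segmentation import slic

CLASSES = ["person", "car", "dog", "background"]  # toy label set

def ask_model(image_patch, question):
    """Placeholder for a multimodal API call (assumption, wire up your own)."""
    raise NotImplementedError("connect this to the MFM you want to test")

def segment_by_superpixels(image, n_segments=200):
    """Approximate semantic segmentation by classifying each SLIC superpixel."""
    labels = slic(image, n_segments=n_segments, compactness=10)
    seg = np.zeros(labels.shape, dtype=int)
    for sp_id in np.unique(labels):
        mask = labels == sp_id
        ys, xs = np.where(mask)
        patch = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        answer = ask_model(patch, f"Which of {CLASSES} best describes this region?")
        seg[mask] = CLASSES.index(answer) if answer in CLASSES else CLASSES.index("background")
    return seg

def is_closer(image, point_a, point_b):
    """Pairwise ranking: ask which of two marked points is nearer the camera."""
    answer = ask_model(
        image,
        f"Is the point at {point_a} closer to the camera than the point at {point_b}? Answer yes or no.",
    )
    return answer.strip().lower().startswith("yes")
```

Many such pairwise answers can then be aggregated into a relative depth ordering, which is why the framework never needs the model to emit a dense depth map directly.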

Key takeaways

| Finding | What happened | Why it matters |
| --- | --- | --- |
| Generalist, not specialist | All MFMs landed well below state-of-the-art CV models on every benchmark. | Even massive cross-modal pre-training doesn’t yet replace task-specific supervision. |
| Semantic > geometric | Scores on classification, detection and segmentation were much higher than on depth or normals. | MFMs learn semantics from caption data but have little innate 3-D understanding. |
| GPT-4o still best of breed | GPT-4o topped the non-reasoning field in four of six tasks. | Its larger context window and image-generation head translate into better pixel comprehension. |
| Reasoning helps with 3-D | Smaller “o3” reasoning models edged GPT-4o on depth and normals. | Structured chain-of-thought may compensate for weaker raw vision priors. |
| Prompt sensitivity drops with quality | Higher-capacity models varied less when the researchers tweaked prompt chains. | Robustness could become a practical proxy for measuring model quality without labels (see the sketch below). |
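As a rough illustration of that last point (not code from the paper), one could run the same images through several prompt-chain variants and use cross-variant agreement as a label-free robustness signal:

```python
# Label-free robustness proxy: how often do different prompt-chain variants
# produce the same prediction on the same image?

from itertools import combinations

def agreement_rate(predictions_per_variant):
    """predictions_per_variant: list of prediction lists, one per prompt variant."""
    pairs = list(combinations(range(len(predictions_per_variant)), 2))
    total, agree = 0, 0
    for i, j in pairs:
        for a, b in zip(predictions_per_variant[i], predictions_per_variant[j]):
            total += 1
            agree += int(a == b)
    return agree / total if total else 1.0

# Example: three prompt variants, five images each
variants = [
    ["dog", "car", "car", "person", "dog"],
    ["dog", "car", "dog", "person", "dog"],
    ["dog", "car", "car", "person", "cat"],
]
print(f"cross-prompt agreement: {agreement_rate(variants):.2f}")  # ~0.73
```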

The bigger picture

For product builders eyeing GPT-4o as a drop-in object detector, the study is a sobering reality check: you’ll still need a Mask R-CNN or SAM in the loop for pixel-perfect jobs. But the results also highlight the upside of super-general models: with zero fine-tuning and only clever prompting, GPT-4o can solve half a dozen vision tasks “well enough”—a compelling baseline for multimodal agents that prefer breadth over razor-sharp accuracy.

The authors have open-sourced their fm-vision-evals framework so future models can be dropped into the same gauntlet—no weight access required. Expect the next wave of Gemini, Claude and Llama releases to cite these scores the way language-model papers brag about MMLU.

Paper link: arXiv 2507.01955 (PDF)

