9.7.25

GPT-4o aces its multimodal classmates—but still can’t dethrone specialist vision models

OpenAI’s GPT-4o may be the first flagship model to unify text, image and audio in a single stack, but a new EPFL benchmarking effort shows just how far even the best “everything” model still lags behind purpose-built computer-vision networks. In “How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks,” researchers tested GPT-4o alongside six other foundation models—o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL and Llama 3.2—on six bread-and-butter CV tasks that every undergrad knows: ImageNet-style classification, MS-COCO object detection, semantic segmentation, instance grouping, monocular depth and surface-normal prediction.

Turning text-only giants into pixel workers

Most API-level models can’t output polygons or depth maps, so the team invented a prompt-chaining framework that decomposes each vision problem into a sequence of classification subtasks that any chatty LLM can answer. A recursive “zoom-and-vote” routine localises objects, SLIC superpixels stand in for pixels in segmentation, and pairwise ranking lets the models infer relative depth.
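To make the idea concrete, here is a minimal sketch (not the authors’ actual code): semantic segmentation becomes per-superpixel classification and depth becomes pairwise ordering, with a hypothetical ask_model wrapper standing in for whichever multimodal chat API is being evaluated.

```python
# Sketch of the prompt-chaining idea: reduce dense vision tasks to
# classification-style questions a chat model can answer.
# ask_model() is a hypothetical stand-in for any multimodal API call.

import numpy as np
from skimage.segmentation import slic

CLASSES = ["person", "car", "dog", "background"]  # toy label set

def ask_model(image_patch, question):
    """Placeholder for a multimodal API call (assumption, wire up your own)."""
    raise NotImplementedError("connect this to the MFM you want to test")

def segment_by_superpixels(image, n_segments=200):
    """Approximate semantic segmentation by classifying each SLIC superpixel."""
    labels = slic(image, n_segments=n_segments, compactness=10)
    seg = np.zeros(labels.shape, dtype=int)
    for sp_id in np.unique(labels):
        mask = labels == sp_id
        ys, xs = np.where(mask)
        patch = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
        answer = ask_model(patch, f"Which of {CLASSES} best describes this region?")
        seg[mask] = CLASSES.index(answer) if answer in CLASSES else CLASSES.index("background")
    return seg

def is_closer(image, point_a, point_b):
    """Pairwise ranking: ask which of two marked points is nearer the camera."""
    answer = ask_model(
        image,
        f"Is the point at {point_a} closer to the camera than the point at {point_b}? Answer yes or no.",
    )
    return answer.strip().lower().startswith("yes")
```

Many such pairwise answers can then be aggregated into a relative depth ordering, which is why the framework never needs the model to emit a dense depth map directly.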

Key takeaways

| Finding | What happened | Why it matters |
| --- | --- | --- |
| Generalist, not specialist | All MFMs landed well below state-of-the-art CV models on every benchmark. | Even massive cross-modal pre-training doesn’t yet replace task-specific supervision. |
| Semantic > geometric | Scores on classification, detection and segmentation were much higher than on depth or normals. | MFMs learn semantics from caption data but have little innate 3-D understanding. |
| GPT-4o still best of breed | GPT-4o topped the non-reasoning field in four of six tasks. | Its larger context window and image-generation head translate into better pixel comprehension. |
| Reasoning helps with 3-D | Smaller “o3” reasoning models edged GPT-4o on depth and normals. | Structured chain-of-thought may compensate for weaker raw vision priors. |
| Prompt sensitivity drops with quality | Higher-capacity models varied less when the researchers tweaked prompt chains. | Robustness could become a practical proxy for measuring model quality without labels (see the sketch below). |
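As a rough illustration of that last point (not code from the paper), one could run the same images through several prompt-chain variants and use cross-variant agreement as a label-free robustness signal:

```python
# Label-free robustness proxy: how often do different prompt-chain variants
# produce the same prediction on the same image?

from itertools import combinations

def agreement_rate(predictions_per_variant):
    """predictions_per_variant: list of prediction lists, one per prompt variant."""
    pairs = list(combinations(range(len(predictions_per_variant)), 2))
    total, agree = 0, 0
    for i, j in pairs:
        for a, b in zip(predictions_per_variant[i], predictions_per_variant[j]):
            total += 1
            agree += int(a == b)
    return agree / total if total else 1.0

# Example: three prompt variants, five images each
variants = [
    ["dog", "car", "car", "person", "dog"],
    ["dog", "car", "dog", "person", "dog"],
    ["dog", "car", "car", "person", "cat"],
]
print(f"cross-prompt agreement: {agreement_rate(variants):.2f}")  # ~0.73
```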

The bigger picture

For product builders eyeing GPT-4o as a drop-in object detector, the study is a sobering reality check: you’ll still need a Mask R-CNN or SAM in the loop for pixel-perfect jobs. But the results also highlight the upside of super-general models: with zero fine-tuning and only clever prompting, GPT-4o can solve half a dozen vision tasks “well enough”—a compelling baseline for multimodal agents that prefer breadth over razor-sharp accuracy.

The authors have open-sourced their fm-vision-evals framework so future models can be dropped into the same gauntlet—no weight access required. Expect the next wave of Gemini, Claude and Llama releases to cite these scores the way language-model papers brag about MMLU.

Paper link: arXiv 2507.01955 (PDF)

