Showing posts with label Foundation Models.

9.7.25

GPT-4o aces its multimodal classmates—but still can’t dethrone specialist vision models

OpenAI’s GPT-4o may be the first flagship model to unify text, image and audio in a single stack, but a new EPFL benchmarking effort shows just how far even the best “everything” model still lags behind purpose-built computer-vision networks. In “How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks,” researchers tested GPT-4o alongside six other foundation models (o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL and Llama 3.2) on six bread-and-butter CV tasks that every undergrad knows: ImageNet-style classification, MS-COCO object detection, semantic segmentation, instance grouping, monocular depth and surface-normal prediction.

Turning text-only giants into pixel workers

Most API-level models can’t output polygons or depth maps, so the team invented a prompt-chaining framework that decomposes each vision problem into a sequence of classification subtasks that any chatty LLM can answer. A recursive “zoom-and-vote” routine localises objects, SLIC superpixels stand in for pixels in segmentation, and pairwise ranking lets the models infer relative depth.
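To make the recipe concrete, here is a minimal sketch of the zoom-and-vote idea for localisation. It is written against a hypothetical `ask_model` wrapper rather than the paper’s actual fm-vision-evals code: the current box is split into quadrants, the model votes yes/no on each quadrant, and the search recurses whenever exactly one quadrant claims the object.

```python
"""Hedged sketch of recursive zoom-and-vote localisation.
`ask_model` is a placeholder for any multimodal API (GPT-4o, Gemini, ...),
not the paper's actual interface."""

from typing import Tuple
from PIL import Image

Box = Tuple[float, float, float, float]  # normalised (x0, y0, x1, y1)

def ask_model(image: Image.Image, question: str) -> str:
    """Placeholder: send an image crop plus a question to an MFM API
    and return its short textual answer, e.g. 'yes' or 'no'."""
    raise NotImplementedError("wire up your multimodal client here")

def crop(image: Image.Image, box: Box) -> Image.Image:
    """Crop a normalised box out of a PIL image."""
    w, h = image.size
    x0, y0, x1, y1 = box
    return image.crop((int(x0 * w), int(y0 * h), int(x1 * w), int(y1 * h)))

def zoom_and_vote(image: Image.Image, label: str,
                  box: Box = (0.0, 0.0, 1.0, 1.0),
                  depth: int = 0, max_depth: int = 3) -> Box:
    """Narrow a bounding box via a chain of yes/no classification votes."""
    if depth == max_depth:
        return box
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    quadrants = [(x0, y0, mx, my), (mx, y0, x1, my),
                 (x0, my, mx, y1), (mx, my, x1, y1)]
    positives = [
        q for q in quadrants
        if ask_model(crop(image, q),
                     f"Is any part of a {label} visible here? Answer yes or no.")
           .strip().lower().startswith("yes")
    ]
    if len(positives) == 1:
        # Unambiguous vote: zoom into the single positive quadrant.
        return zoom_and_vote(image, label, positives[0], depth + 1, max_depth)
    # Object spans quadrants, or no quadrant voted yes: keep the current box.
    return box
```

In the paper’s full framework, segmentation labels SLIC superpixels instead of raw pixels, and depth comes from aggregating pairwise comparisons; a sketch of that half follows the key-takeaways table below.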

Key takeaways

| Finding | What happened | Why it matters |
|---|---|---|
| Generalist, not specialist | All MFMs landed well below state-of-the-art CV models on every benchmark. | Even massive cross-modal pre-training doesn’t yet replace task-specific supervision. |
| Semantic > geometric | Scores on classification, detection and segmentation were much higher than on depth or normals. | MFMs learn semantics from caption data but have little innate 3-D understanding. |
| GPT-4o still best of breed | GPT-4o topped the non-reasoning field in four of six tasks. | Its larger context window and image-generation head translate into better pixel comprehension. |
| Reasoning helps with 3-D | Smaller “o3” reasoning models edged GPT-4o on depth and normals. | Structured chain-of-thought may compensate for weaker raw vision priors. |
| Prompt sensitivity drops with quality | Higher-capacity models varied less when the researchers tweaked prompt chains. | Robustness could become a practical proxy for measuring model quality without labels. |
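The pairwise-ranking chain for depth can be sketched in the same spirit: instead of asking for metric depth, the model answers a series of “which marked point is closer?” questions, and the answers are aggregated into an ordinal depth ordering. Below is a minimal, hypothetical version; `ask_closer` stands in for a real MFM query, and the paper’s actual prompts and aggregation differ.

```python
"""Hedged sketch: recovering a relative depth ordering from pairwise
'which point is closer?' queries. `ask_closer` is a hypothetical MFM call."""

from itertools import combinations
from typing import List, Tuple

Point = Tuple[int, int]  # pixel coordinates

def ask_closer(image, p: Point, q: Point) -> bool:
    """Placeholder: ask the model whether marked point p looks closer
    to the camera than marked point q. Returns True if p is closer."""
    raise NotImplementedError("wire up your multimodal client here")

def rank_depths(image, points: List[Point]) -> List[Point]:
    """Score each point by how many pairwise comparisons it wins.
    Sorting by wins yields an ordinal (not metric) depth ranking.
    Note the O(n^2) query cost over the sampled points."""
    wins = {p: 0 for p in points}
    for p, q in combinations(points, 2):
        if ask_closer(image, p, q):
            wins[p] += 1
        else:
            wins[q] += 1
    # Closest points first; ties keep input order (stable sort).
    return sorted(points, key=lambda p: -wins[p])
```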

The bigger picture

For product builders eyeing GPT-4o as a drop-in object detector, the study is a sobering reality check; you’ll still need a Mask R-CNN or SAM in the loop for pixel-perfect jobs. But the results also highlight the upside of super-general models: with zero fine-tuning and only clever prompting, GPT-4o can solve half a dozen vision tasks “well enough”—a compelling baseline for multimodal agents that prefer breadth over razor-edge accuracy.
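In practice, that “specialist in the loop” pattern might look like the sketch below: a coarse box from the MFM (for instance, from the zoom-and-vote chain above) is handed to SAM as a box prompt and refined into a pixel-accurate mask. The image path, box values and checkpoint filename are illustrative assumptions.

```python
"""Hedged sketch: refine an MFM's coarse box into a precise mask with SAM."""

import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

image = Image.open("scene.jpg").convert("RGB")  # hypothetical input image

# Coarse box in XYXY pixel coordinates, e.g. produced by prompt chaining.
coarse_box = np.array([120, 80, 420, 360])  # placeholder values

# Load SAM and refine the box prompt into a mask
# (checkpoint filename per Meta's release; path is an assumption).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))
masks, scores, _ = predictor.predict(box=coarse_box, multimask_output=False)
best_mask = masks[0]  # boolean H x W array
```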

The authors have open-sourced their fm-vision-evals framework so future models can be dropped into the same gauntlet—no weight access required. Expect the next wave of Gemini, Claude and Llama releases to cite these scores the way language-model papers brag about MMLU.

Paper link: arXiv 2507.01955 (PDF)

27.5.25

Microsoft's Aurora AI Revolutionizes Environmental Forecasting with High-Speed, Accurate Predictions

Microsoft has introduced Aurora, an advanced AI foundation model designed to enhance environmental forecasting. Trained on more than a million hours of diverse atmospheric data, including satellite imagery, radar readings, and weather-station reports, Aurora delivers rapid and accurate predictions for a range of environmental phenomena.

Key Features and Achievements

• High-Speed Forecasting: Aurora generates forecasts in seconds, a significant improvement over the hours required by traditional supercomputer-based systems.

• Enhanced Accuracy: In tests, Aurora outperformed the National Hurricane Center in forecasting five-day tropical cyclone tracks for the 2022–2023 season and accurately predicted the landfall of Typhoon Doksuri in the Philippines four days in advance.

• Versatile Environmental Predictions: Beyond weather forecasting, Aurora has been fine-tuned to predict air quality, ocean wave heights, and other atmospheric events, demonstrating its adaptability to various environmental forecasting tasks.

• Public Accessibility: Microsoft has made Aurora's source code and model weights publicly available, promoting transparency and collaboration within the scientific community (a minimal usage sketch follows this list).
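For readers who want to try the released model, the snippet below follows the usage pattern shown in the public microsoft/aurora repository at the time of writing: load pretrained weights, build a `Batch` of surface, static and atmospheric variables, and roll the model forward autoregressively. Treat the exact class, argument and checkpoint names as assumptions that may drift between releases; the batch here is filled with random tensors purely to illustrate the expected shapes.

```python
"""Hedged sketch of running the open-sourced Aurora model,
after the usage pattern in the microsoft/aurora repo."""

from datetime import datetime

import torch
from aurora import Aurora, Batch, Metadata, rollout  # pip install microsoft-aurora

# Load the pretrained 0.25-degree model (checkpoint name per the repo docs;
# treat it as an assumption that may change between releases).
model = Aurora(use_lora=False)
model.load_checkpoint("microsoft/aurora", "aurora-0.25-pretrained.ckpt")
model.eval()

# Toy batch with random data, only to show shapes:
# surface vars are (batch, history=2, lat, lon); atmospheric vars add a
# pressure-level axis: (batch, history=2, levels, lat, lon).
batch = Batch(
    surf_vars={k: torch.randn(1, 2, 17, 32) for k in ("2t", "10u", "10v", "msl")},
    static_vars={k: torch.randn(17, 32) for k in ("lsm", "z", "slt")},
    atmos_vars={k: torch.randn(1, 2, 4, 17, 32) for k in ("z", "u", "v", "t", "q")},
    metadata=Metadata(
        lat=torch.linspace(90, -90, 17),
        lon=torch.linspace(0, 360, 32 + 1)[:-1],
        time=(datetime(2020, 6, 1, 12, 0),),
        atmos_levels=(100, 250, 500, 850),
    ),
)

# Two autoregressive steps; each prediction comes back as a new Batch.
with torch.inference_mode():
    preds = [pred.to("cpu") for pred in rollout(model, batch, steps=2)]
```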

Implications for the Future

Aurora represents a significant advancement in the field of meteorology and environmental science. Its ability to provide rapid, accurate forecasts can aid in disaster preparedness, environmental monitoring, and climate research. By making the model publicly accessible, Microsoft encourages further innovation and application of AI in understanding and responding to environmental challenges.

If large language models have one redeeming feature for safety researchers, it’s that many of them think out loud. Ask GPT-4o or Claude 3....