9.7.25

GPT-4o aces its multimodal classmates—but still can’t dethrone specialist vision models

 OpenAI’s GPT-4o may be the first flagship model to unify text, image and audio in a single stack, but a new EPFL benchmarking effort shows just how far even the best “everything” model still lags behind purpose-built computer-vision networks. In “How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks,” researchers tested GPT-4o alongside six other foundation models—o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL and Llama 3.2—on six bread-and-butter CV tasks that every undergrad knows: ImageNet-style classification, MS-COCO object detection, semantic segmentation, instance grouping, monocular depth and surface-normal prediction.

Turning text-only giants into pixel workers

Most API-level models can’t output polygons or depth maps, so the team invented a prompt-chaining framework that decomposes each vision problem into a sequence of classification subtasks that any chatty LLM can answer. A recursive “zoom-and-vote” routine localises objects, SLIC superpixels stand in for pixels in segmentation, and pairwise ranking lets the models infer relative depth.
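
To make the decomposition concrete, here is a minimal sketch of the "zoom" half of such a routine: localisation reduced to a series of yes/no questions. It is not the authors' implementation. The `contains` callable is a hypothetical stand-in for "crop the image to this box and ask the model whether the whole object is still visible", and the voting step (asking several times and taking the majority) is left out for brevity.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def covers(outer: Box, inner: Box) -> bool:
    """True if `outer` fully contains `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def localise(box: Box, contains: Callable[[Box], bool],
             shrink: float = 0.25) -> Box:
    """Recursively trim `box` while the oracle says the object is still inside.

    Each step proposes four smaller crops (trim left, right, top, bottom)
    and recurses into the first one the oracle accepts.
    """
    x0, y0, x1, y1 = box
    dx, dy = int((x1 - x0) * shrink), int((y1 - y0) * shrink)
    candidates = []
    if dx > 0:
        candidates += [(x0 + dx, y0, x1, y1), (x0, y0, x1 - dx, y1)]
    if dy > 0:
        candidates += [(x0, y0 + dy, x1, y1), (x0, y0, x1, y1 - dy)]
    for crop in candidates:
        if contains(crop):  # one classification query per crop
            return localise(crop, contains, shrink)
    return box  # no smaller crop keeps the object: stop zooming


# Mock oracle: pretend the object occupies a known box in a 100x100 image.
target = (40, 30, 60, 50)
found = localise((0, 0, 100, 100), lambda crop: covers(crop, target))
print(found)
```

In a real pipeline every `contains` call would be an API request, so the shrink factor trades localisation precision against query cost.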

Key takeaways

| Finding | What happened | Why it matters |
| --- | --- | --- |
| Generalist, not specialist | All MFMs landed well below state-of-the-art CV models on every benchmark. | Even massive cross-modal pre-training doesn’t yet replace task-specific supervision. |
| Semantic > geometric | Scores on classification, detection and segmentation were much higher than on depth or normals. | MFMs learn semantics from caption data but have little innate 3-D understanding. |
| GPT-4o still best of breed | GPT-4o topped the non-reasoning field in four of six tasks. | Its larger context window and image-generation head translate into better pixel comprehension. |
| Reasoning helps with 3-D | Smaller “o3” reasoning models edged GPT-4o on depth and normals. | Structured chain-of-thought may compensate for weaker raw vision priors. |
| Prompt sensitivity drops with quality | Higher-capacity models varied less when the researchers tweaked prompt chains. | Robustness could become a practical proxy for measuring model quality without labels. |

The bigger picture

For product builders eyeing GPT-4o as a drop-in object detector, the study is a sobering reality check; you’ll still need a Mask R-CNN or SAM in the loop for pixel-perfect jobs. But the results also highlight the upside of super-general models: with zero fine-tuning and only clever prompting, GPT-4o can solve half a dozen vision tasks “well enough”—a compelling baseline for multimodal agents that prefer breadth over razor-edge accuracy.

The authors have open-sourced their fm-vision-evals framework so future models can be dropped into the same gauntlet—no weight access required. Expect the next wave of Gemini, Claude and Llama releases to cite these scores the way language-model papers brag about MMLU.

Paper link: arXiv 2507.01955 (PDF)

8.7.25

Context Engineering in AI: Designing the Right Inputs for Smarter, Safer Large-Language Models

 

What Is Context Engineering?

In classic software, developers write deterministic code; in today’s AI systems, we compose contexts. Context engineering is the systematic craft of designing, organizing and manipulating every token fed into a large language model (LLM) at inference time: instructions, examples, retrieved documents, API results, user profiles, safety policies, even intermediate chain-of-thought. Well-engineered context turns a general model into a domain expert; poor context produces hallucinations, leakage or policy violations.


Core Techniques

| Technique | Goal | Typical Tools / Patterns |
| --- | --- | --- |
| Prompt Design & Templates | Give the model clear role, task, format and constraints | System + user role prompts; XML / JSON schemas; function-calling specs |
| Retrieval-Augmented Generation (RAG) | Supply fresh, external knowledge just-in-time | Vector search, hybrid BM25+embedding, GraphRAG |
| Context Compression | Fit more signal into limited tokens | Summarisation, saliency ranking, LLM-powered “short-former” rewriters |
| Chunking & Windowing | Preserve locality in extra-long inputs | Hierarchical windows, sliding attention, FlashMask / Ring Attention |
| Scratchpads & CoT Scaffolds | Expose model reasoning for better accuracy and debuggability | Self-consistency, tree-of-thought, DST (Directed Self-Testing) |
| Memory & Profiles | Personalise without retraining | Vector memories, episodic caches, preference embeddings |
| Tool / API Context | Let models call and interpret external systems | Model Context Protocol (MCP), JSON-schema function calls, structured tool output |
| Policy & Guardrails | Enforce safety and brand style | Content filters, regex validators, policy adapters, YAML instruction blocks |
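
As an illustration of the lexical half of the RAG row, the sketch below scores a few documents against a query with the standard Okapi BM25 formula. It is a toy under stated assumptions: whitespace tokenisation, the common defaults `k1=1.5` and `b=0.75`, and three made-up policy snippets. A production system would normally rely on a search engine or library rather than hand-rolled scoring.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Return one BM25 score per document for a whitespace-tokenised query."""
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avg_len = sum(len(d) for d in tokenised) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for d in tokenised for term in set(d))

    scores = []
    for doc in tokenised:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Made-up snippets standing in for a policy knowledge base.
docs = [
    "refunds are issued within 14 days of purchase",
    "shipping takes three to five business days",
    "contact support to reset your password",
]
scores = bm25_scores("refunds purchase", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

A hybrid setup would combine these scores with embedding similarity, which this sketch does not attempt.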

Why It Matters

  1. Accuracy & Trust – Fact-filled, well-structured context slashes hallucination rates and citation errors.

  2. Privacy & Governance – Explicit control over what leaves the organisation or reaches the model helps meet GDPR, HIPAA and the EU AI Act.

  3. Cost Efficiency – Compressing or caching context can cut token bills by 50-80 %.

  4. Scalability – Multi-step agent systems live or die by fast, machine-readable context routing; good design tames complexity.


High-Impact Use Cases

| Sector | How Context Engineering Delivers Value |
| --- | --- |
| Customer Support | RAG surfaces the exact policy paragraph and recent ticket history, enabling a single prompt to draft compliant replies. |
| Coding Agents | Function-calling + repository retrieval feed IDE paths, diffs and test logs, letting models patch bugs autonomously. |
| Healthcare Q&A | Context filters strip PHI before retrieval; clinically-approved guidelines injected to guide safe advice. |
| Legal Analysis | Long-context models read entire case bundles; chunk ranking highlights precedent sections for argument drafting. |
| Manufacturing IoT | Streaming sensor data is summarised every minute and appended to a rolling window for predictive-maintenance agents. |
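
The rolling-window pattern in the manufacturing row can be sketched in a few lines. Everything here is illustrative: the `RollingContext` class, the min/mean/max summary format and the sample readings are assumptions, not part of any particular product.

```python
from collections import deque
from statistics import mean
from typing import List


class RollingContext:
    """Keeps the most recent per-minute summaries for use as prompt context."""

    def __init__(self, max_minutes: int = 60) -> None:
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.window = deque(maxlen=max_minutes)

    def add_minute(self, label: str, readings: List[float]) -> None:
        """Compress one minute of raw readings into a single summary line."""
        self.window.append(
            f"{label}: min={min(readings):.1f} "
            f"mean={mean(readings):.1f} max={max(readings):.1f}"
        )

    def render(self) -> str:
        """Return the window as text ready to append to a prompt."""
        return "\n".join(self.window)


ctx = RollingContext(max_minutes=3)
for i in range(5):
    ctx.add_minute(f"12:0{i}", [70.0 + i, 71.0 + i, 72.0 + i])
print(ctx.render())
```

With a window of three minutes, the two oldest summaries fall out automatically, so the context size stays bounded however long the stream runs.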

Designing a Context Pipeline: Four Practical Steps

  1. Map the Task Surface
    • What knowledge is static vs. dynamic?
    • Which external tools or databases are authoritative?

  2. Define Context Layers
    • Base prompt: role, format, policy
    • Ephemeral layer: user query, tool results
    • Memory layer: user or session history
    • Safety layer: filters, refusal templates

  3. Choose Retrieval & Compression Strategies
    • Exact text (BM25) for short policies; dense vectors for semantic match
    • Summaries or selective quoting for large PDFs

  4. Instrument & Iterate
    • Log token mixes, latency, cost
    • A/B test different ordering, chunking, or reasoning scaffolds
    • Use self-reflection or eval suites (e.g., TruthfulQA-Context) to measure gains
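
The layering and budgeting ideas in steps 2 to 4 can be sketched as follows. This is one possible arrangement, not a standard: the `Layer` class, the priority scheme and the whitespace-based `rough_tokens` counter are all assumptions made for illustration (a real pipeline would use the model's own tokenizer).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Layer:
    name: str
    text: str
    priority: int  # lower number = kept first when the budget is tight


def rough_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def assemble(layers: List[Layer], budget: int) -> Tuple[str, int]:
    """Admit layers by priority until the budget is spent, then emit the
    survivors in their declared order so the prompt layout stays stable."""
    kept, used = [], 0
    for layer in sorted(layers, key=lambda l: l.priority):
        cost = rough_tokens(layer.text)
        if used + cost <= budget:
            kept.append(layer)
            used += cost
    position = {id(l): i for i, l in enumerate(layers)}
    kept.sort(key=lambda l: position[id(l)])
    prompt = "\n\n".join(f"[{l.name}]\n{l.text}" for l in kept)
    return prompt, used


layers = [
    Layer("base", "Answer as a support agent.", priority=0),
    Layer("ephemeral", "How do I get a refund?", priority=1),
    Layer("memory", "User previously asked about invoices, shipping delays "
                    "and password resets last week.", priority=2),
    Layer("safety", "Refuse requests for medical advice.", priority=0),
]
prompt, used = assemble(layers, budget=20)
print(prompt)
print("tokens used:", used)
```

Logging `used` per request is a cheap first step toward the "token mix" instrumentation in step 4.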


Emerging Tools & Standards

  • MCP (Model Context Protocol) – an open, JSON-RPC-based protocol for exposing tools, data sources and their structured output to LLM applications, adopted by Claude Code, Gemini CLI and IBM MCP Gateway.

  • Context-Aware Runtimes – vLLM, FlashInfer and Infinity Lite stream 128 K to 1 M tokens with optimized KV caches.

  • Context Observability Dashboards – Startups like ContextHub show token-level diff, attribution and cost per layer.
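
For a feel of what travels over the wire with MCP, the sketch below builds a tool-invocation request. MCP messages are JSON-RPC 2.0, and to my understanding of the specification a tool is invoked with the `tools/call` method; check the current spec before relying on the exact shape. The tool name `lookup_order` and its arguments are hypothetical.

```python
import json
from typing import Any, Dict


def make_tool_call(request_id: int, tool_name: str,
                   arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request asking an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool and arguments, for illustration only.
request = make_tool_call(1, "lookup_order", {"order_id": "A-1001"})
wire = json.dumps(request)
print(wire)
```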


The Road Ahead

As context windows expand to a million tokens and multi-agent systems proliferate, context engineering will sit alongside model training and fine-tuning as a first-class AI discipline. Teams that master it will ship assistants that feel domain-expert-smart, honest and cost-efficient—while everyone else will chase unpredictable black boxes.

Whether you’re building a retrieval chatbot, a self-healing codebase or an autonomous research agent, remember: the model is only as good as the context you feed it.

AIRA shows how better operators — not just bigger models — turbo-charge AI research agents

 Large language models that write code have already stormed GitHub, but turning them into full-blown research agents—systems that iterate on entire ML pipelines until they medal on Kaggle—has proved trickier. The latest state-of-the-art, AIDE, could grab a medal on roughly 40 % of MLE-bench tasks. Now Meta AI and UCL push that rate to 47.7 % with AIRA, a rethink that says the secret isn’t a flashier LLM, it’s the operators and search policy you wrap around it. 

From one-shot “Draft, Debug, Improve” to a toolbox of surgical edits

AIRA introduces OAIRA, a new operator set that goes beyond AIDE’s three blunt actions. Scoped memory keeps prompts lean, “think tokens” force structured reasoning, and a prompt-adaptive complexity cue decides whether the agent should sketch a quick baseline or engineer a deep ensemble. The result: twice the reasoning tokens per call and far less mode collapse. 

Search policies finally get room to shine

When AIDE’s old operators were plugged into greedy, MCTS and evolutionary searches, the fancier algorithms gained zero ground—operator bottlenecks were that severe. Swap in OAIRA and those same policies leapfrog greedy search, proving that exploration muscle only pays off once edits are expressive enough. 
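
The split between operators and search policy is easier to see in code. The toy below is not AIRA's implementation: it runs a greedy policy (always expand the best node found so far) over a one-dimensional problem, with two hypothetical operators, `small_step` and `big_step`, standing in for edits of different expressiveness.

```python
import random
from typing import Callable, List, Tuple

Solution = float
Operator = Callable[[Solution, random.Random], Solution]


def greedy_search(root: Solution, operators: List[Operator],
                  fitness: Callable[[Solution], float],
                  steps: int, seed: int = 0) -> Tuple[float, Solution]:
    """Greedy policy: repeatedly expand the best-scoring node so far."""
    rng = random.Random(seed)
    nodes = [(fitness(root), root)]
    for _ in range(steps):
        _, parent = max(nodes, key=lambda n: n[0])  # policy: pick a node
        op = rng.choice(operators)                  # pick an operator
        child = op(parent, rng)                     # operator: make an edit
        nodes.append((fitness(child), child))
    return max(nodes, key=lambda n: n[0])


def small_step(x: Solution, rng: random.Random) -> Solution:
    """Hypothetical fine-grained edit."""
    return x + rng.uniform(-0.1, 0.1)


def big_step(x: Solution, rng: random.Random) -> Solution:
    """Hypothetical coarse edit."""
    return x + rng.uniform(-2.0, 2.0)


def fitness(x: Solution) -> float:
    """Toy objective with its optimum at x = 3."""
    return -(x - 3.0) ** 2


best_score, best_x = greedy_search(0.0, [small_step, big_step], fitness, steps=50)
print(best_score, best_x)
```

Swapping the `max(...)` selection line for an MCTS or evolutionary rule changes the policy without touching the operators, which is the separation the paragraph above describes.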

The scoreboard (MLE-bench Lite, 22 Kaggle tasks)

  • AIDE (o1-preview, greedy): 39.6 % medal rate

  • AIRA (greedy + OAIRA): 45.5 %

  • AIRA (MCTS + OAIRA): 47.7 %

  • AIRA (Evolutionary + OAIRA): 47.3 %

All agents ran under identical 24-hour, single-GPU budgets inside AIRA-dojo, a new sandbox that hands each run a root-privileged H200 container yet isolates filesystem side effects.

Mind the generalization gap

The study also spotlights a pitfall for auto-ML agents: validation scores routinely over-estimate test-set gains, steering greedy searches into dead ends. By examining thousands of runs, the team quantifies that “proxy-test gap” and urges future benchmarks to track it explicitly. 
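
One simple way to track such a gap is sketched below. The definitions are assumptions for illustration rather than the paper's exact metric: "gap" is the validation score minus the test score of the run selected by validation, and "regret" (my term) is how much test score was lost relative to the best run in hindsight. The scores are made up.

```python
from typing import Dict, List, Tuple


def proxy_test_gap(runs: List[Tuple[float, float]]) -> Dict[str, float]:
    """Compare the run chosen by validation score with the best run on test.

    `runs` holds (validation_score, test_score) pairs, higher is better.
    """
    picked = max(runs, key=lambda r: r[0])   # what the agent would select
    oracle = max(runs, key=lambda r: r[1])   # what it should have selected
    return {
        "selected_val": picked[0],
        "selected_test": picked[1],
        "gap": picked[0] - picked[1],        # validation optimism
        "regret": oracle[1] - picked[1],     # test score left on the table
    }


# Made-up scores: the run with the best validation score is not the best on test.
runs = [(0.80, 0.78), (0.90, 0.75), (0.85, 0.82)]
report = proxy_test_gap(runs)
print(report)
```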

Why it matters

  • Agent design ≠ model scale. The leap came without touching the underlying LLM (DeepSeek-R1 or GPT-4o). That’s good news for teams capped by API limits.

  • Composable recipe. OAIRA operators, MCTS search and the open-source aira-dojo testbed (GitHub link in the paper) can bolt onto any ReAct-style coding agent.

  • Toward autonomous ML ops. AIRA’s 24-hour, single-GPU constraint mirrors real-world hack-day budgets, making the findings immediately useful for startups chasing continuous Kaggle pipelines or internal model tuning bots.

Auto-ML agents are no longer judged solely by the size of their LLM brains; the tools they wield and the ways they explore the search space may count just as much. AIRA’s 8-point jump on MLE-bench suggests that the next frontier in agentic ML will be won with sharper scalpels, not bigger hammers.

Paper link: arXiv 2507.02554 (PDF)
