9.7.25

GPT-4o aces its multimodal classmates—but still can’t dethrone specialist vision models

 OpenAI’s GPT-4o may be the first flagship model to unify text, image and audio in a single stack, but a new EPFL benchmarking effort shows just how far even the best “everything” model still lags behind purpose-built computer-vision networks. In “How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks,” researchers tested GPT-4o alongside six other foundation models—o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL and Llama 3.2—on six bread-and-butter CV tasks that every undergrad knows: ImageNet-style classification, MS-COCO object detection, semantic segmentation, instance grouping, monocular depth and surface-normal prediction.

Turning text-only giants into pixel workers

Most API-level models can’t output polygons or depth maps, so the team invented a prompt-chaining framework that decomposes each vision problem into a sequence of classification subtasks that any chatty LLM can answer. A recursive “zoom-and-vote” routine localises objects, SLIC superpixels stand in for pixels in segmentation, and pairwise ranking lets the models infer relative depth.
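The localisation step is the easiest to picture. Below is a toy sketch of the recursive zoom-and-vote idea, assuming a hypothetical `ask_yes_no()` wrapper around a vision-chat API; the paper’s actual prompt chains (and its superpixel and pairwise-ranking counterparts) are considerably more elaborate.

```python
# Toy sketch of "zoom-and-vote": recursively split the image into quadrants and
# keep only the regions where the model still "sees" the object.
# ask_yes_no is a hypothetical wrapper around a multimodal chat API.
from PIL import Image

def ask_yes_no(crop: Image.Image, label: str) -> bool:
    """Ask 'Is there a {label} in this image?' and parse the yes/no answer."""
    raise NotImplementedError  # plug in your vision-chat API call here

def localize(image: Image.Image, label: str, box=None, depth: int = 3):
    """Return boxes (in image coordinates) where the model keeps voting yes."""
    box = box or (0, 0, image.width, image.height)
    if not ask_yes_no(image.crop(box), label):
        return []                       # object not seen in this region
    if depth == 0:
        return [box]                    # smallest region that still gets a yes
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    quadrants = [(x0, y0, mx, my), (mx, y0, x1, my),
                 (x0, my, mx, y1), (mx, my, x1, y1)]
    hits = [b for q in quadrants for b in localize(image, label, q, depth - 1)]
    return hits or [box]                # fall back to the parent box if children disagree
```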

Key takeaways

| Finding | What happened | Why it matters |
|---|---|---|
| Generalist, not specialist | All MFMs landed well below state-of-the-art CV models on every benchmark. | Even massive cross-modal pre-training doesn’t yet replace task-specific supervision. |
| Semantic > geometric | Scores on classification, detection and segmentation were much higher than on depth or normals. | MFMs learn semantics from caption data but have little innate 3-D understanding. |
| GPT-4o still best of breed | GPT-4o topped the non-reasoning field in four of six tasks. | Its larger context window and image-generation head translate into better pixel comprehension. |
| Reasoning helps with 3-D | Smaller “o3” reasoning models edged GPT-4o on depth and normals. | Structured chain-of-thought may compensate for weaker raw vision priors. |
| Prompt sensitivity drops with quality | Higher-capacity models varied less when the researchers tweaked prompt chains. | Robustness could become a practical proxy for measuring model quality without labels. |

The bigger picture

For product builders eyeing GPT-4o as a drop-in object detector, the study is a sobering reality check; you’ll still need a Mask R-CNN or SAM in the loop for pixel-perfect jobs. But the results also highlight the upside of super-general models: with zero fine-tuning and only clever prompting, GPT-4o can solve half a dozen vision tasks “well enough”—a compelling baseline for multimodal agents that prefer breadth over razor-edge accuracy.

The authors have open-sourced their fm-vision-evals framework so future models can be dropped into the same gauntlet—no weight access required. Expect the next wave of Gemini, Claude and Llama releases to cite these scores the way language-model papers brag about MMLU.

Paper link: arXiv 2507.01955 (PDF)

8.7.25

Context Engineering in AI: Designing the Right Inputs for Smarter, Safer Large-Language Models

 

What Is Context Engineering?

In classic software, developers write deterministic code; in today’s AI systems, we compose contexts. Context engineering is the systematic craft of designing, organizing and manipulating every token fed into a large-language model (LLM) at inference time—instructions, examples, retrieved documents, API results, user profiles, safety policies, even intermediate chain-of-thought. Well-engineered context turns a general model into a domain expert; poor context produces hallucinations, leakage or policy violations. 


Core Techniques

| Technique | Goal | Typical Tools / Patterns |
|---|---|---|
| Prompt Design & Templates | Give the model clear role, task, format and constraints | System + user role prompts; XML / JSON schemas; function-calling specs |
| Retrieval-Augmented Generation (RAG) | Supply fresh, external knowledge just-in-time | Vector search, hybrid BM25+embedding, GraphRAG |
| Context Compression | Fit more signal into limited tokens | Summarisation, saliency ranking, LLM-powered “short-former” rewriters |
| Chunking & Windowing | Preserve locality in extra-long inputs | Hierarchical windows, sliding attention, FlashMask / Ring Attention |
| Scratchpads & CoT Scaffolds | Expose model reasoning for better accuracy and debuggability | Self-consistency, tree-of-thought, DST (Directed Self-Testing) |
| Memory & Profiles | Personalise without retraining | Vector memories, episodic caches, preference embeddings |
| Tool / API Context | Let models call and interpret external systems | Model Context Protocol (MCP), JSON-schema function calls, structured tool output |
| Policy & Guardrails | Enforce safety and brand style | Content filters, regex validators, policy adapters, YAML instruction blocks |
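To make the first row concrete, here is a minimal sketch of a role-scoped prompt plus a JSON-schema function spec in the style popularised by chat-completions APIs; the tool name, fields and policy text are illustrative placeholders, not tied to any particular vendor.

```python
# Minimal sketch: a system/user prompt pair plus a JSON-schema tool spec.
# The tool name, parameters and policy text are illustrative placeholders.
messages = [
    {"role": "system",
     "content": "You are a billing-support assistant. Answer only from the "
                "provided policy excerpts. Reply in JSON: {\"answer\", \"citations\"}."},
    {"role": "user", "content": "Can I get a refund after 45 days?"},
]

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_policy",
        "description": "Retrieve the policy paragraph that matches a query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]
```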

Why It Matters

  1. Accuracy & Trust – Fact-filled, well-structured context slashes hallucination rates and citation errors.

  2. Privacy & Governance – Explicit control over what leaves the organisation or reaches the model helps meet GDPR, HIPAA and the EU AI Act.

  3. Cost Efficiency – Compressing or caching context can cut token bills by 50-80 %.

  4. Scalability – Multi-step agent systems live or die by fast, machine-readable context routing; good design tames complexity.


High-Impact Use Cases

| Sector | How Context Engineering Delivers Value |
|---|---|
| Customer Support | RAG surfaces the exact policy paragraph and recent ticket history, enabling a single prompt to draft compliant replies. |
| Coding Agents | Function-calling + repository retrieval feed IDE paths, diffs and test logs, letting models patch bugs autonomously. |
| Healthcare Q&A | Context filters strip PHI before retrieval; clinically-approved guidelines are injected to guide safe advice. |
| Legal Analysis | Long-context models read entire case bundles; chunk ranking highlights precedent sections for argument drafting. |
| Manufacturing IoT | Streaming sensor data is summarised every minute and appended to a rolling window for predictive-maintenance agents. |

Designing a Context Pipeline: Four Practical Steps

  1. Map the Task Surface
    • What knowledge is static vs. dynamic?
    • Which external tools or databases are authoritative?

  2. Define Context Layers (a minimal assembly sketch follows this list)
    • Base prompt: role, format, policy
    • Ephemeral layer: user query, tool results
    • Memory layer: user or session history
    • Safety layer: filters, refusal templates

  3. Choose Retrieval & Compression Strategies
    • Exact text (BM25) for short policies; dense vectors for semantic match
    • Summaries or selective quoting for large PDFs

  4. Instrument & Iterate
    • Log token mixes, latency, cost
    • A/B test different ordering, chunking, or reasoning scaffolds
    • Use self-reflection or eval suites (e.g., TruthfulQA-Context) to measure gains
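As promised in step 2, here is a compact sketch of layered context assembly. The `retrieve`, `summarise` and `load_profile` helpers are hypothetical stand-ins; a production pipeline would add token budgeting, caching and the instrumentation from step 4.

```python
# Toy sketch of assembling the four context layers from step 2.
# retrieve, summarise and load_profile are hypothetical stand-ins.
def retrieve(query: str, k: int = 5) -> list[str]:
    return ["<policy excerpt>", "<recent ticket>"]   # stand-in for BM25/vector search

def summarise(docs: list[str]) -> str:
    return " ".join(docs)                            # stand-in for an LLM compressor

def load_profile(session_id: str) -> str:
    return "prefers concise answers"                 # stand-in for a memory store

def build_context(user_query: str, session_id: str) -> list[dict]:
    base = {"role": "system", "content": "You are a support assistant. Follow policy v3."}
    safety = {"role": "system", "content": "Never reveal personal data. Refuse out-of-scope requests."}
    memory = {"role": "system", "content": f"User profile: {load_profile(session_id)}"}
    ephemeral = {"role": "user",
                 "content": f"Context:\n{summarise(retrieve(user_query))}\n\nQuestion: {user_query}"}
    return [base, safety, memory, ephemeral]
```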


Emerging Tools & Standards

  • MCP (Model Context Protocol) – open JSON schema for passing tool output and trace metadata to any LLM, adopted by Claude Code, Gemini CLI and IBM MCP Gateway (a simplified exchange is sketched after this list).

  • Context-Aware Runtimes – vLLM, FlashInfer and Infinity Lite stream 128 K–1 M tokens with optimized KV caches.

  • Context Observability Dashboards – Startups like ContextHub show token-level diff, attribution and cost per layer.
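For flavour, here is a simplified sketch (as Python dicts) of the kind of JSON-RPC tools/call exchange MCP standardises; the method and result shape follow the public spec, but the tool name and payload are invented for illustration.

```python
# Simplified sketch of an MCP-style tools/call round trip (JSON-RPC 2.0).
# The tool name and arguments are invented; consult the MCP spec for the full schema.
request = {
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "get_ticket", "arguments": {"ticket_id": "T-4821"}},
}
response = {
    "jsonrpc": "2.0", "id": 1,
    "result": {"content": [{"type": "text", "text": "Ticket T-4821: refund pending"}]},
}
```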


The Road Ahead

As context windows expand to a million tokens and multi-agent systems proliferate, context engineering will sit alongside model training and fine-tuning as a first-class AI discipline. Teams that master it will ship assistants that feel domain-expert-smart, honest and cost-efficient—while everyone else will chase unpredictable black boxes.

Whether you’re building a retrieval chatbot, a self-healing codebase or an autonomous research agent, remember: the model is only as good as the context you feed it.

AIRA shows how better operators — not just bigger models — turbo-charge AI research agents

 Large language models that write code have already stormed GitHub, but turning them into full-blown research agents—systems that iterate on entire ML pipelines until they medal on Kaggle—has proved trickier. The latest state-of-the-art, AIDE, could grab a medal on roughly 40 % of MLE-bench tasks. Now Meta AI and UCL push that rate to 47.7 % with AIRA, a rethink that says the secret isn’t a flashier LLM, it’s the operators and search policy you wrap around it. 

From one-shot “Draft, Debug, Improve” to a toolbox of surgical edits

AIRA introduces OAIRA, a new operator set that goes beyond AIDE’s three blunt actions. Scoped memory keeps prompts lean, “think tokens” force structured reasoning, and a prompt-adaptive complexity cue decides whether the agent should sketch a quick baseline or engineer a deep ensemble. The result: twice the reasoning tokens per call and far less mode collapse. 

Search policies finally get room to shine

When AIDE’s old operators were plugged into greedy, MCTS and evolutionary searches, the fancier algorithms gained zero ground—operator bottlenecks were that severe. Swap in OAIRA and those same policies leapfrog greedy search, proving that exploration muscle only pays off once edits are expressive enough. 
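The operator/search split is easy to picture in code. The sketch below is deliberately generic — the `Operator` type, the random operator choice and the `validation_score` callback are hypothetical stand-ins, not AIRA’s implementation — but it shows why a search policy can only be as good as the edits it is allowed to make.

```python
# Generic sketch of the operator vs. search-policy split: a policy repeatedly
# applies operators (draft, debug, improve, ...) and keeps what validates best.
# Everything here is a hypothetical stand-in, not AIRA's code.
import random
from typing import Callable

Operator = Callable[[str], str]          # edits a candidate solution, e.g. a training script

def greedy_search(seed: str,
                  operators: list[Operator],
                  validation_score: Callable[[str], float],
                  budget: int = 20) -> str:
    best, best_score = seed, validation_score(seed)
    for _ in range(budget):
        op = random.choice(operators)    # a richer policy (MCTS, evolutionary) chooses here
        candidate = op(best)
        score = validation_score(candidate)
        if score > best_score:           # greedy: keep only strict improvements
            best, best_score = candidate, score
    return best
```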

The scoreboard (MLE-bench Lite, 22 Kaggle tasks)

  • AIDE (o1-preview, greedy): 39.6 % medal rate

  • AIRA (greedy + OAIRA): 45.5 %

  • AIRA (MCTS + OAIRA): 47.7 %

  • AIRA (Evolutionary + OAIRA): 47.3 %
    All agents ran under identical 24-hour, single-GPU budgets inside AIRA-dojo, a new sandbox that hands each run a root-privileged H200 container yet isolates filesystem side effects. 

Mind the generalization gap

The study also spotlights a pitfall for auto-ML agents: validation scores routinely over-estimate test-set gains, steering greedy searches into dead ends. By examining thousands of runs, the team quantifies that “proxy-test gap” and urges future benchmarks to track it explicitly. 

Why it matters

  • Agent design ≠ model scale. The leap came without touching the underlying LLM (DeepSeek-R1 or GPT-4o). That’s good news for teams capped by API limits.

  • Composable recipe. OAIRA operators, MCTS search and the open-source aira-dojo testbed (GitHub link in the paper) can bolt onto any ReAct-style coding agent.

  • Toward autonomous ML ops. AIRA’s 24-hour, single-GPU constraint mirrors real-world hack-day budgets, making the findings immediately useful for startups chasing continuous Kaggle pipelines or internal model tuning bots.

Auto-ML agents are no longer judged solely by the size of their LLM brains; the tools they wield and the ways they explore the search space may count just as much. AIRA’s 8-point jump on MLE-bench suggests that the next frontier in agentic ML will be won with sharper scalpels, not bigger hammers.

Paper link: arXiv 2507.02554 (PDF)

DeepMesh makes artist-quality 3D meshes a one-click affair

 Triangle-mesh modelling is the CAD world’s equivalent of hand-drawn in-betweens: essential, mind-numbing and painfully slow. A new paper out of Tsinghua University, NTU and ShengShu AI says it can hand that job to an LLM-sized transformer without melting your GPU.

The team’s framework, DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning, marries a clever compression trick with a dose of RLHF to crank out clean, editable topology directly from point clouds or images. 


Why previous mesh LLMs hit the wall

Most auto-regressive mesh generators treat every vertex coordinate as a token. Feed them a high-poly model and the sequence balloons into tens of thousands of steps, torpedoing training stability and inference speed. Worse, their loss functions optimise geometry alone, so outputs pass numeric checks yet still look like Swiss cheese to artists.


Two upgrades, one big leap

| Pillar | What they did | Why it matters |
|---|---|---|
| 72 % shorter sequences | A hierarchical patch-based tokenization merges duplicate offsets and encodes connectivity inline, shrinking vertex strings by nearly three-quarters without dropping detail. | Cuts pre-training FLOPs and lets the model scale to 30 k-face meshes on a single A100. |
| Human-aligned RL | Collected 5 000 preference pairs scored with a hybrid of human rating and 3D metrics, then ran Direct Preference Optimization (DPO) on the base model. | Removes holes and stray faces while nudging topology toward “artist-grade” layouts. |
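The DPO stage itself is the now-standard preference objective; a minimal PyTorch sketch of the loss, with the hybrid human/metric scores deciding which mesh counts as “chosen”, might look like the following (illustrative, not the authors’ code).

```python
# Minimal sketch of the DPO loss over mesh token sequences. The logp_* inputs
# are summed log-probabilities of the chosen/rejected mesh under the policy
# and the frozen reference model; beta is the usual DPO temperature.
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_margin = logp_chosen - ref_logp_chosen        # how much the policy prefers the good mesh
    rejected_margin = logp_rejected - ref_logp_rejected  # ...and the bad one
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```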

The researchers also trimmed an 800 k-mesh corpus to a cleaner 500 k set, tamping down the loss spikes that plague raw WebGL scrapes. 

Results: fewer faces, better faces

  • Up to 1 B parameters: two Hourglass-style transformer variants (500 M & 1 B) both converge in 100 k steps thanks to shorter sequences. 

  • Topology wins: DeepMesh’s large model eliminates 90 % of non-manifold edges that slip through MeshGPT and Nautilus, according to the authors’ “topology-valid” metric.

  • Visual quality: crowd-sourced raters picked DeepMesh over MeshGPT by 68 % on identical point-cloud prompts (exact numbers in paper’s Sec. 4.3).

  • Speed: a full 30 k-face generation takes ≈10 min, versus 20–25 min for LoRA-fine-tuned diffusion baselines reported in prior work.

A public demo gallery already shows clean, watertight dragons, furniture and stylised characters rendered straight from sparse point clouds.


Why this is bigger than 3D fan art

Game studios, AR platforms and online-creator tools alike are sitting on troves of unoptimised 3D scans. A transformer that understands connectivity as well as shape could batch-convert those scans into lightweight, animation-ready assets—no retopology pass required. And because DeepMesh’s DPO loop is “just” another RLHF recipe, the same pipeline could teach a mesh LLM brand-specific style or IP-safe anatomy without touching the base weights.

The authors hint at scaling past one billion parameters and adding text-conditioned generation. Given how fast 3D GenAI is snowballing, don’t bet against DeepMesh—or its tokenization trick—showing up in the next wave of text-to-world engines.

Paper link: arXiv 2503.15265 (PDF)

7.7.25

ARAG puts a multi-agent brain inside your RAG stack — and Walmart’s numbers look eye-popping

 Retrieval-augmented generation (RAG) has become the go-to recipe for giving large language models real-world context, but most deployments still treat retrieval as a dumb, one-shot lookup. Researchers at Walmart Global Tech think that leaves serious money on the table — especially in e-commerce, where user intent shifts by the minute. Their new framework, ARAG (Agentic Retrieval-Augmented Generation), adds a four-agent reasoning layer on top of vanilla RAG and reports double-digit gains across every metric that matters.

Four specialists, one conversation

  1. User-Understanding Agent distills long-term history and the current session into a natural-language profile.

  2. NLI Agent performs sentence-level entailment to see whether each candidate item actually supports that intent.

  3. Context-Summary Agent compresses only the NLI-approved evidence into a focused prompt.

  4. Item-Ranker Agent fuses all signals and produces the final ranked list.

Each agent writes to — and reads from — a shared blackboard-style memory, so later agents can reason over earlier rationales rather than raw text alone.
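A schematic sketch of that flow — the `llm()` call and the prompts below are placeholders, not Walmart’s implementation — shows how the blackboard doubles as an audit trail.

```python
# Schematic sketch of the four-agent ARAG flow over a shared blackboard.
# llm() is a placeholder for any chat-completion client; prompts are abbreviated.
def llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your LLM client here

def arag_rank(history: str, session: str, candidates: list[str]):
    memory = {}  # shared blackboard every agent reads from and writes to
    memory["profile"] = llm(f"Summarise this user's intent:\n{history}\n{session}")
    memory["supported"] = [  # NLI agent: sentence-level entailment as a yes/no check
        item for item in candidates
        if "yes" in llm(f"Does '{item}' support this intent?\n{memory['profile']}").lower()
    ]
    memory["evidence"] = llm(f"Summarise why these items fit:\n{memory['supported']}")
    memory["ranking"] = llm(f"Rank the items.\nProfile: {memory['profile']}\n"
                            f"Evidence: {memory['evidence']}\nItems: {memory['supported']}")
    return memory["ranking"], memory      # memory is also the explainability trace
```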

How much better? Try 42 %

On three Amazon Review subsets (Clothing, Electronics, Home), ARAG beats both a recency heuristic and a strong cosine-similarity RAG baseline:

| Dataset | NDCG@5 ↑ | Hit@5 ↑ |
|---|---|---|
| Clothing | +42.1 % | +35.5 % |
| Electronics | +37.9 % | +30.9 % |
| Home & Kitchen | +25.6 % | +22.7 % |

An ablation test shows that yanking either the NLI or context-summary modules knocks as much as 14 points off NDCG, underlining how critical cross-agent reasoning is to the win.

Why it matters

  • Personalization that actually reasons. By turning retrieval and ranking into cooperative LLM agents, ARAG captures the nuance of why an item fits, not just whether embeddings are close.

  • No model surgery required. The team wraps any existing RAG stack; there’s no need to fine-tune the base LLM, making the upgrade cloud-budget friendly.

  • Explainability for free. Each agent logs its own JSON-structured evidence, giving product managers a breadcrumb trail for every recommendation.

The bigger picture

Agentic pipelines have taken off in code generation and web browsing; ARAG shows the same trick pays dividends in recommender systems, a multi-billion-dollar battleground where percent-level lifts translate into real revenue. Expect retailers and streaming platforms to test-drive multi-agent RAG as they chase post-cookie personalization.

Paper link: arXiv 2506.21931 (PDF)

6.7.25

LangGraph Rollout: how VeRL leveled up multi-turn Agent RL

 

Why this matters

If you’ve ever tried to train an LLM-powered agent with many tool calls spread across a genuine back-and-forth conversation, you’ve probably discovered that “multi-turn” means different things to different frameworks. Yanbin Jiang’s latest post shows how the VeRL team punched through that ceiling by grafting LangGraph directly onto VeRL’s reinforcement-learning rollout engine. The result is a training loop that speaks the same language as production code. 


1. Where they started

  • Native VeRL multi-turn – great for quick experiments. You enable multi_turn: True, write a YAML schema for each tool, implement an async Python class, and you’re off; their GSM8K benchmark ran in two days. 

  • Pain points

    1. Double bookkeeping: every tool had to be declared twice (YAML + Python).

    2. Drift: schema and code fell out of sync, and prod tools (written for LangChain/LangGraph) diverged from the “training” clones. 


2. A quick stop-gap: automatic tool wrapping

Yanbin added BaseTool.from_callable(), which introspects any plain Python function with transformers.utils.get_json_schema, then fabricates a VeRL-compatible wrapper on the fly. One list of callables (tool_list = [multiply, add, …]) now powers both training and prod. 
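A minimal sketch of that introspection step, using the real transformers.utils.get_json_schema helper (the VeRL-side wrapper that BaseTool.from_callable builds around it is omitted here):

```python
# One plain Python function now yields the JSON schema that previously had to
# be hand-written in YAML; the same callable also serves as the prod tool.
from transformers.utils import get_json_schema

def multiply(a: int, b: int) -> int:
    """Multiply two integers.

    Args:
        a: The first factor.
        b: The second factor.
    """
    return a * b

schema = get_json_schema(multiply)
# -> {"type": "function", "function": {"name": "multiply", "description": ..., "parameters": {...}}}
tool_list = [multiply]  # one list of callables drives both training and production
```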

My dev take: this is the same pattern I use in LangChain when I decorate business logic with @tool. Nice to see VeRL admit “if you can’t beat reflection, join it.”


3. The real blocker: orchestration power

Research quickly outgrew VeRL’s built-in rollout:

| Need | Why VeRL fell short |
|---|---|
| Dynamic branches & backtracking | Native graph was too rigid. |
| True multi-turn dialogue (user follow-ups) | Any assistant message without tool calls ended the convo. |
| Per-node sampling / chat-template tweaks | Global settings only. |

Enter LangGraph: a lightweight DAG engine already shipping in production.

4. Architectural insight: separation of concerns

“Let VeRL manage actor weights & hardware; let LangGraph drive the conversation.” 

So they built a LangChain-compatible chat-model client for VeRL’s SGLang server. Training now works like this:

  1. VeRL hands the initial messages + model handle to the user’s LangGraph.

  2. The graph does its thing—branching, retrying, invoking tools—using the exact actor weights being optimized.

  3. When the graph stops, VeRL collects the message history and rewards. 

The PR shows a seven-line YAML snippet that swaps the old rollout for:

```yaml
multi_turn:
  chat_template_kwargs: {enable_thinking: false}
  langgraph:
    path: /path/to/graph.py
    graph_config: {recursion_limit: 100}
```

…and a 60-line example graph that binds tools, counts turns, and lets you vary temperature node-by-node. 
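Not the actual 60-line script, but here is a minimal LangGraph graph in the same spirit — one tool, and an agent ↔ tools loop that ends when the model stops calling tools. `actor_chat_model` is assumed to be the LangChain-compatible client VeRL hands the graph (see step 1 above); it is not defined here.

```python
# Minimal illustrative LangGraph graph: bind a tool, loop until no tool calls remain.
# actor_chat_model stands in for the chat model VeRL passes in; it is assumed, not defined.
from langchain_core.tools import tool
from langgraph.graph import StateGraph, MessagesState, START
from langgraph.prebuilt import ToolNode, tools_condition

@tool
def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

def agent(state: MessagesState):
    # One assistant turn with the actor weights currently being optimized.
    reply = actor_chat_model.bind_tools([multiply]).invoke(state["messages"])
    return {"messages": [reply]}

builder = StateGraph(MessagesState)
builder.add_node("agent", agent)
builder.add_node("tools", ToolNode([multiply]))
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", tools_condition)  # route to "tools" or END
builder.add_edge("tools", "agent")
graph = builder.compile()
```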


5. Why I’m excited

  • One graph to rule them all – deployment and training share code; no more “but it worked in prod!”

  • Easier ablations – want to test a new branch strategy? Edit the graph script; RL pipeline stays untouched.

  • Framework-agnostic future – the same bridge pattern could plug VeRL into OpenAI Function Calling, Microsoft’s AutoGen, or whatever framework wins next year.


My takeaway

VeRL just became a lot more attractive for serious agent RL work. By leaning on LangGraph instead of extending an in-house orchestration DSL, the team keeps VeRL laser-focused on fast rollouts, leaves graph logic to a dedicated library, and—crucially—lets devs iterate on one codebase. If you’re juggling duplicate tool definitions or fighting mismatch between training and production, clone Yanbin’s PR and breathe easier.

Explore it more here: https://jybsuper.github.io/posts/langgraph_rollout/ 

FreeMorph turns Stable Diffusion into a one-click image-morphing engine

 Image morphing has been around since Michael Jackson’s Black or White video, but most modern AI pipelines still demand per-pair fine-tuning or laborious warping to keep shapes and textures coherent. A new paper from NTU, Nanjing University and CUHK drops that baggage. FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model repurposes an off-the-shelf Stable Diffusion 2.1 checkpoint to generate frame-perfect transitions between any two images—faces, cars, even cat-to-dog mash-ups—without touching a single weight. 

Two tricks make the magic happen

  1. Guidance-aware spherical interpolation (GaSI). Instead of naive latent mixing, FreeMorph blends the key-value pairs inside Stable Diffusion’s self-attention, injecting “identity anchors” from both source images so the morph stays on course (a generic interpolation sketch follows this list). 

  2. Step-oriented variation trend (SoVT). A second module dials in how much of each image shows up at every denoising step, taming the non-linear chaos that usually derails tuning-free edits. 
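The spherical half of GaSI is ordinary slerp applied to attention states rather than raw latents. A generic version is sketched below; it is not the paper’s exact module, which also injects the identity anchors described above.

```python
# Textbook spherical interpolation (slerp) between two feature tensors,
# e.g. flattened self-attention keys/values of the two source images.
import torch

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    a, b = v0.flatten(), v1.flatten()
    cos_theta = torch.clamp(torch.dot(a, b) / (a.norm() * b.norm() + eps), -1 + eps, 1 - eps)
    theta = torch.arccos(cos_theta)
    w0 = torch.sin((1 - t) * theta) / torch.sin(theta)  # weight on the first image's features
    w1 = torch.sin(t * theta) / torch.sin(theta)        # weight on the second image's features
    return w0 * v0 + w1 * v1
```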

Faster and smoother than the competition

Running on a single NVIDIA A100, FreeMorph spits out a full transition sequence in under 30 seconds, beating DiffMorpher and IMPUS—which both require minutes of LoRA fine-tuning—while delivering sharper edges and fewer identity slips.

A new benchmark to prove it

Because existing datasets skew toward near-identical pairs, the authors collected Morph4Data, a benchmark spanning four classes of image pairs ranging from “same layout, different semantics” to “totally unrelated.” On this tougher mix, FreeMorph tops every published method in quantitative metrics and user studies alike. 

Why this matters

For creative-tool startups, FreeMorph means morphing features can ship as a call to Stable Diffusion rather than a 30-minute fine-tune. For researchers, GaSI + SoVT point to a broader lesson: you can co-opt diffusion attention layers for structural edits without sacrificing model generality.

The code, demo video and ready-to-run Colab notebook are already live on GitHub, so expect FreeMorph-powered GIF makers to surface on your timeline before summer’s out.

Paper link: arXiv 2507.01953 (PDF)

WebSailor charts an open-source course to super-human web reasoning

 For the past year, open-source web agents have looked like dinghies chasing aircraft carriers: even 70-billion-parameter models scraped single-digit accuracy on BrowseComp-en, the field’s toughest information-seeking benchmark, while closed systems such as DeepResearch and Grok-3 cruised far ahead. Tongyi Lab, Alibaba’s applied-AI skunkworks, says it has all but closed that gap with WebSailor, a post-training recipe that rewires large language models to “think like uncertainty-slayers.” 

Turning the web into a maze on purpose

At the heart of WebSailor is SailorFog-QA, a synthetic dataset that bombards the model with “Level-3” problems—questions whose answers hide behind tangled entity graphs and deliberately obfuscated clues (“a musician later honored in the early 21st century,” “a chronology that ends the same year a late-antique poet died”). Random walks over real web pages build those graphs; masking, vagueness and partial names turn each query into a fog bank the agent must burn off through multi-step reasoning. 
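A toy illustration of that recipe — the graph, the walk and the obfuscation table below are invented placeholders, meant only to show how exact facts become “fog” that forces multi-step search:

```python
# Toy sketch of SailorFog-style question synthesis: walk an entity graph,
# then swap exact names and dates for vague paraphrases. All data is invented.
import random

graph = {  # entity -> (relation, neighbour) edges harvested from web pages
    "musician_X": [("received", "award_Y")],
    "award_Y": [("established_in", "2003")],
}
vague = {  # obfuscation table: exact fact -> foggy description
    "award_Y": "a national honour",
    "2003": "the early 21st century",
}

def random_walk(start: str, steps: int = 2) -> list[str]:
    path, node = [start], start
    for _ in range(steps):
        edges = graph.get(node, [])
        if not edges:
            break
        _, node = random.choice(edges)
        path.append(node)
    return path

walk = random_walk("musician_X")                     # ["musician_X", "award_Y", "2003"]
question = (f"Which musician received {vague[walk[1]]} "
            f"established in {vague[walk[2]]}?")
answer = walk[0]                                     # only recoverable via multi-step search
```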

DUPO: reinforcement learning that isn’t painfully slow

Tool-using agents learn painfully slowly because every step calls a browser, but Tongyi Lab’s Duplicating Sampling Policy Optimization (DUPO) makes each RL batch pull double duty: one pass samples harder trajectories, the next re-samples mid-episode to squeeze more signal from sparse rewards. A small rejection-sampling fine-tuning (RFT) “cold start” of just 2 k expert traces primes the model so DUPO has something to optimize. 

Four sizes, one giant leap

WebSailor comes in 3B, 7B, 32B and 72B flavors. Even the 7-billion-parameter version hits 6.7 % pass@1 on BrowseComp-en, trouncing agents built on 32 B backbones that manage barely 2 – 3 %. The 32 B and 72 B models push further, outscoring open-source peers on BrowseComp-en/zh, GAIA and XBench and edging past proprietary offerings like Grok-3 and Doubao-Search when those systems add browsing tools. 

Why it matters

  • Democratizing deep search. BrowseComp-level tasks—ask a question, navigate dozen-plus pages, synthesize an answer—are what corporate knowledge-bases and vertical search startups need. WebSailor shows you no longer need a closed-source giant to play.

  • A recipe, not a model. The RFT cold start, uncertainty-first data and DUPO optimizer are architecture-agnostic; any ReAct-style agent with tool APIs can adopt them.

  • Downward compatibility. Despite training only on headache-grade puzzles, WebSailor’s 72 B model scores >90 % pass@1 on the single-hop SimpleQA benchmark, proving that hard-first curricula don’t break easy tasks. 

Open weights, open benchmark

Code, data-generation scripts and checkpoints live in Tongyi Lab’s GitHub repo, alongside a dockerized evaluator so outside teams can reproduce—or dispute—the numbers. 

With WebSailor, the open-source fleet finally has a flagship capable of keeping proprietary juggernauts in sight. The real question now: how long before someone splices SailorFog-style data and DUPO into a general-purpose agent that can shop, schedule and navigate enterprise wikis with the same super-human calm?

Paper link: arXiv 2507.02592 (PDF)

4.7.25

MoCa turns your favorite VLM into a bidirectional embedding powerhouse

 Causal-attention vision–language models (VLMs) are great storytellers, but they’re not ideal when you just need a single, rock-solid vector that fuses pixels and prose. A joint team from Renmin University of China, Stanford and Microsoft Research Asia thinks it has a fix. In a paper released this week, the researchers introduce MoCa — Modality-aware Continual Pre-training, a plug-and-play recipe that transforms any off-the-shelf VLM into a bidirectional, retrieval-grade multimodal embedder.

Two stages, three big problems solved

  1. Modality-aware Continual Pre-training (CPT)
    Joint reconstruction denoises interleaved text tokens via masked-language modeling and masked image patches via a lightweight decoder in one go. The tweak injects bidirectional attention and lets the model learn from billions of unlabeled, mixed-modality tokens.

  2. Heterogeneous Contrastive Fine-tuning (HCF)
    Moving beyond garden-variety image-caption pairs, MoCa mixes long-form query-document sets, curated visual-text pairs and plain text-only examples. Task-aware batching throws all three into every mini-batch, forcing deeper cross-modal reasoning instead of surface-level matching (a toy batching sketch follows this list).
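Task-aware batching is the easiest piece to picture. The toy sketch below keeps every mini-batch mixed across the three sources; the source names and the fixed mixing ratio are assumptions, not the paper’s configuration.

```python
# Toy sketch of task-aware batching: every mini-batch contains examples from
# all three heterogeneous sources. Names and ratio are illustrative only.
import itertools
import random

def task_aware_batches(long_docs, visual_pairs, text_only, batch_size=24, ratio=(2, 1, 1)):
    """Yield mini-batches that always mix all three data sources."""
    sources = [itertools.cycle(long_docs), itertools.cycle(visual_pairs), itertools.cycle(text_only)]
    per_source = [batch_size * r // sum(ratio) for r in ratio]
    while True:
        batch = [next(src) for src, n in zip(sources, per_source) for _ in range(n)]
        random.shuffle(batch)          # avoid ordering bias within the batch
        yield batch
```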

Together, the stages tackle the trio of headaches plaguing existing embedding retrofits: causal attention, dependence on labeled pairs and narrow training objectives.

Numbers that matter

| Model | Params | MMEB (overall ↑) | ViDoRe-v2 (avg ↑) |
|---|---|---|---|
| mmE5 | 11 B | 69.8 | 50.5 |
| VLM2Vec | 7 B | 62.9 | 38.7 |
| MoCa-3B | 3 B | 67.5 | 59.8 |
| MoCa-7B | 7 B | 71.5 | 58.8 |

A 7-billion-parameter MoCa variant tops all published baselines across MMEB’s 36 tasks, while the lighter 3-B version jumps almost 10 points on ViDoRe-v2’s document-level retrieval suite. Even more telling: a 3-B MoCa with CPT beats 7-B models trained only with contrastive learning.

Ablations spotlight CPT’s punch

Yank out either the masked-language (MLM) or masked-autoencoding (MAE) objectives during CPT, and MMEB scores slide by up to 1.3 points. Drop the entire CPT stage and you lose nearly 2 points—proof that modality-aware reconstruction, not just more contrastive data, drives the gains.

Why it matters

  • Retrieval is eating the multimodal world. Search, RAG pipelines and recommender systems need embeddings, not prose. A bidirectional retrofit averts the cost of training from scratch.

  • Scales with unlabeled data. By exploiting noisy Web corpora, MoCa sidesteps the image-caption bottleneck hobbling many CLIP-style updates.

  • Open VLM agnostic. The authors demo on Qwen-2.5-VL backbones, but the training recipe is architecture-neutral—anything with a ViT and Transformer decoder should drop in.

What’s next

The paper hints at a public GitHub release with checkpoints, data loaders and task-aware batching helpers. If the repo ships soon, expect MoCa-style CPT to become a default step for teams building multimodal RAG or e-commerce search engines on lightweight hardware.

Paper link: arXiv 2506.23115 (PDF)
