4.7.25

MoCa turns your favorite VLM into a bidirectional embedding powerhouse

 Causal-attention vision–language models (VLMs) are great storytellers, but they’re not ideal when you just need a single, rock-solid vector that fuses pixels and prose. A joint team from Renmin University of China, Stanford and Microsoft Research Asia thinks it has a fix. In a paper released this week, the researchers introduce MoCa — Modality-aware Continual Pre-training, a plug-and-play recipe that transforms any off-the-shelf VLM into a bidirectional, retrieval-grade multimodal embedder.

Two stages, three big problems solved

  1. Modality-aware Continual Pre-training (CPT)
    Joint reconstruction denoises interleaved text tokens via masked-language modeling and masked image patches via a lightweight decoder in one go. The tweak injects bidirectional attention and lets the model learn from billions of unlabeled, mixed-modality tokens (a code sketch follows this list).

  2. Heterogeneous Contrastive Fine-tuning (HCF)
    Moving beyond garden-variety image-caption pairs, MoCa mixes long-form query-document sets, curated visual-text pairs and plain text-only examples. Task-aware batching throws all three into every mini-batch, forcing deeper cross-modal reasoning instead of surface-level matching (a batching sketch appears below).
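
To make stage 1 concrete, here is a minimal PyTorch sketch of joint reconstruction over interleaved text tokens and image patches under full bidirectional attention. The toy backbone, module names, mask handling and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of modality-aware continual pre-training (CPT).
# Module names, shapes and mask handling are assumptions for exposition,
# not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBidirectionalBackbone(nn.Module):
    """Stand-in for a VLM trunk: embeds text tokens and image patches,
    then runs a Transformer encoder, i.e. full bidirectional attention."""
    def __init__(self, vocab=1000, patch_dim=768, hidden=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, hidden)
        self.patch_emb = nn.Linear(patch_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, img_patches):
        x = torch.cat([self.tok_emb(text_ids), self.patch_emb(img_patches)], dim=1)
        h = self.encoder(x)          # no causal mask: every position attends everywhere
        t = text_ids.shape[1]
        return h[:, :t], h[:, t:]    # split back into text / image states

class JointReconstructionCPT(nn.Module):
    """Stage 1: denoise masked text tokens (MLM) and masked image patches
    (MAE-style regression) with a single joint objective."""
    def __init__(self, backbone, hidden=256, vocab=1000, patch_dim=768, mask_id=0):
        super().__init__()
        self.backbone = backbone
        self.mask_id = mask_id
        self.mlm_head = nn.Linear(hidden, vocab)        # predicts masked tokens
        self.patch_decoder = nn.Sequential(             # lightweight patch decoder
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, patch_dim))

    def forward(self, text_ids, text_mask, img_patches, patch_mask):
        corrupted_ids = text_ids.masked_fill(text_mask, self.mask_id)
        corrupted_patches = img_patches.masked_fill(patch_mask.unsqueeze(-1), 0.0)
        text_h, img_h = self.backbone(corrupted_ids, corrupted_patches)
        mlm_loss = F.cross_entropy(self.mlm_head(text_h[text_mask]), text_ids[text_mask])
        mae_loss = F.mse_loss(self.patch_decoder(img_h[patch_mask]), img_patches[patch_mask])
        return mlm_loss + mae_loss                      # joint reconstruction loss

# Toy forward/backward pass on random data.
model = JointReconstructionCPT(ToyBidirectionalBackbone())
text_ids = torch.randint(1, 1000, (2, 16))
img_patches = torch.randn(2, 8, 768)
text_mask = torch.zeros(2, 16, dtype=torch.bool)
text_mask[:, ::4] = True
patch_mask = torch.zeros(2, 8, dtype=torch.bool)
patch_mask[:, ::2] = True
loss = model(text_ids, text_mask, img_patches, patch_mask)
loss.backward()
```

The point to notice is that a single forward pass yields both the MLM and the patch-reconstruction losses, so the two modalities are denoised jointly rather than by separate objectives.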

Together, the stages tackle the trio of headaches plaguing existing embedding retrofits: causal attention, dependence on labeled pairs and narrow training objectives.
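
To picture the second stage, here is a rough sketch of task-aware batching plus a standard in-batch InfoNCE loss; the source names, mixing ratios and temperature are placeholders rather than the paper's exact recipe.

```python
# Illustrative task-aware batching and in-batch contrastive loss for
# heterogeneous contrastive fine-tuning (HCF). Source names, mixing
# ratios and the temperature are placeholders, not the paper's values.
import itertools
import random
import torch
import torch.nn.functional as F

def task_aware_batches(sources, batch_size, mix):
    """Yield mini-batches in which every data type appears, drawn
    according to the mixing ratios in `mix`."""
    iters = {name: iter(ds) for name, ds in sources.items()}
    while True:
        batch = []
        for name, frac in mix.items():
            take = max(1, round(frac * batch_size))   # at least one sample per type
            batch.extend((name, next(iters[name])) for _ in range(take))
        random.shuffle(batch)
        yield batch

def info_nce(query_emb, doc_emb, temperature=0.05):
    """In-batch contrastive loss on L2-normalized embeddings; positives
    sit on the diagonal of the similarity matrix."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature
    labels = torch.arange(q.shape[0])
    return F.cross_entropy(logits, labels)

# Toy run: three heterogeneous sources feed every mini-batch.
mix = {"query_document": 0.5, "visual_text_pairs": 0.3, "text_only": 0.2}
sources = {name: itertools.cycle([f"<{name} example>"]) for name in mix}
first_batch = next(task_aware_batches(sources, batch_size=8, mix=mix))
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```

Because every mini-batch mixes all three data types, the contrastive signal cannot be satisfied by shallow image-caption matching alone.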

Numbers that matter

Model      Params   MMEB (overall ↑)   ViDoRe-v2 (avg ↑)
mmE5       11 B     69.8               50.5
VLM2Vec    7 B      62.9               38.7
MoCa-3B    3 B      67.5               59.8
MoCa-7B    7 B      71.5               58.8

A 7-billion-parameter MoCa variant tops all published baselines across MMEB’s 36 tasks, while the lighter 3-B version jumps almost 10 points on ViDoRe-v2’s document-level retrieval suite. Even more telling: a 3-B MoCa with CPT beats 7-B models trained only with contrastive learning.

Ablations spotlight CPT’s punch

Yank out either the masked-language-modeling (MLM) or masked-autoencoding (MAE) objective during CPT and MMEB scores slide by up to 1.3 points. Drop the entire CPT stage and you lose nearly 2 points, proof that modality-aware reconstruction, not just more contrastive data, drives the gains.

Why it matters

  • Retrieval is eating the multimodal world. Search, RAG pipelines and recommender systems need embeddings, not prose. A bidirectional retrofit averts the cost of training from scratch.

  • Scales with unlabeled data. By exploiting noisy Web corpora, MoCa sidesteps the image-caption bottleneck hobbling many CLIP-style updates.

  • Open-VLM agnostic. The authors demo on Qwen2.5-VL backbones, but the training recipe is architecture-neutral: anything with a ViT encoder and a Transformer decoder should drop in (a usage sketch follows this list).
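
For a sense of how such a retrofitted embedder would be used downstream, here is a generic sketch of pooling and cosine retrieval; mean pooling and the function names are common conventions for embedding retrofits, not details confirmed by the MoCa paper.

```python
# Generic usage sketch: pool a bidirectional VLM's hidden states into one
# vector and run cosine-similarity retrieval. Mean pooling and the function
# names are common conventions, not details confirmed by the paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(backbone, text_ids, img_patches):
    """Encode one interleaved (text, image) input into a unit-norm vector.
    Works with any module returning (text_states, image_states), e.g. the
    toy backbone in the CPT sketch above."""
    text_h, img_h = backbone(text_ids, img_patches)         # bidirectional states
    pooled = torch.cat([text_h, img_h], dim=1).mean(dim=1)  # mean-pool every position
    return F.normalize(pooled, dim=-1)

def retrieve(query_vec, doc_vecs, top_k=5):
    """Cosine-similarity search over a pre-embedded document collection."""
    scores = doc_vecs @ query_vec                           # (num_docs,) similarities
    return scores.topk(min(top_k, doc_vecs.shape[0]))
```

In a RAG or product-search pipeline, the document side would be embedded offline with embed and queries matched online with retrieve.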

What’s next

The paper hints at a public GitHub release with checkpoints, data loaders and task-aware batching helpers. If the repo ships soon, expect MoCa-style CPT to become a default step for teams building multimodal RAG or e-commerce search engines on lightweight hardware.

Paper link: arXiv 2506.23115 (PDF)
