4.7.25

MoCa turns your favorite VLM into a bidirectional embedding powerhouse

 Causal-attention vision–language models (VLMs) are great storytellers, but they’re not ideal when you just need a single, rock-solid vector that fuses pixels and prose. A joint team from Renmin University of China, Stanford and Microsoft Research Asia thinks it has a fix. In a paper released this week, the researchers introduce MoCa — Modality-aware Continual Pre-training, a plug-and-play recipe that transforms any off-the-shelf VLM into a bidirectional, retrieval-grade multimodal embedder.

Two stages, three big problems solved

  1. Modality-aware Continual Pre-training (CPT)
    Joint reconstruction denoises interleaved text tokens via masked-language modeling and masked image patches via a lightweight decoder in one go. The tweak injects bidirectional attention and lets the model learn from billions of unlabeled, mixed-modality tokens (a code sketch follows this list).

  2. Heterogeneous Contrastive Fine-tuning (HCF)
    Moving beyond garden-variety image-caption pairs, MoCa mixes long-form query-document sets, curated visual-text pairs and plain text-only examples. Task-aware batching throws all three into every mini-batch, forcing deeper cross-modal reasoning instead of surface-level matching (a batching sketch appears below).
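
To make stage 1 concrete, here is a minimal PyTorch sketch of joint reconstruction over interleaved text tokens and image patches under full bidirectional attention. The toy backbone, module names, mask handling and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of modality-aware continual pre-training (CPT).
# Module names, shapes and mask handling are assumptions for exposition,
# not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBidirectionalBackbone(nn.Module):
    """Stand-in for a VLM trunk: embeds text tokens and image patches,
    then runs a Transformer encoder, i.e. full bidirectional attention."""
    def __init__(self, vocab=1000, patch_dim=768, hidden=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, hidden)
        self.patch_emb = nn.Linear(patch_dim, hidden)
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, text_ids, img_patches):
        x = torch.cat([self.tok_emb(text_ids), self.patch_emb(img_patches)], dim=1)
        h = self.encoder(x)          # no causal mask: every position attends everywhere
        t = text_ids.shape[1]
        return h[:, :t], h[:, t:]    # split back into text / image states

class JointReconstructionCPT(nn.Module):
    """Stage 1: denoise masked text tokens (MLM) and masked image patches
    (MAE-style regression) with a single joint objective."""
    def __init__(self, backbone, hidden=256, vocab=1000, patch_dim=768, mask_id=0):
        super().__init__()
        self.backbone = backbone
        self.mask_id = mask_id
        self.mlm_head = nn.Linear(hidden, vocab)        # predicts masked tokens
        self.patch_decoder = nn.Sequential(             # lightweight patch decoder
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, patch_dim))

    def forward(self, text_ids, text_mask, img_patches, patch_mask):
        corrupted_ids = text_ids.masked_fill(text_mask, self.mask_id)
        corrupted_patches = img_patches.masked_fill(patch_mask.unsqueeze(-1), 0.0)
        text_h, img_h = self.backbone(corrupted_ids, corrupted_patches)
        mlm_loss = F.cross_entropy(self.mlm_head(text_h[text_mask]), text_ids[text_mask])
        mae_loss = F.mse_loss(self.patch_decoder(img_h[patch_mask]), img_patches[patch_mask])
        return mlm_loss + mae_loss                      # joint reconstruction loss

# Toy forward/backward pass on random data.
model = JointReconstructionCPT(ToyBidirectionalBackbone())
text_ids = torch.randint(1, 1000, (2, 16))
img_patches = torch.randn(2, 8, 768)
text_mask = torch.zeros(2, 16, dtype=torch.bool)
text_mask[:, ::4] = True
patch_mask = torch.zeros(2, 8, dtype=torch.bool)
patch_mask[:, ::2] = True
loss = model(text_ids, text_mask, img_patches, patch_mask)
loss.backward()
```

The point to notice is that a single forward pass yields both the MLM and the patch-reconstruction losses, so the two modalities are denoised jointly rather than by separate objectives.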

Together, the stages tackle the trio of headaches plaguing existing embedding retrofits: causal attention, dependence on labeled pairs and narrow training objectives.
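
To picture the second stage, here is a rough sketch of task-aware batching plus a standard in-batch InfoNCE loss; the source names, mixing ratios and temperature are placeholders rather than the paper's exact recipe.

```python
# Illustrative task-aware batching and in-batch contrastive loss for
# heterogeneous contrastive fine-tuning (HCF). Source names, mixing
# ratios and the temperature are placeholders, not the paper's values.
import itertools
import random
import torch
import torch.nn.functional as F

def task_aware_batches(sources, batch_size, mix):
    """Yield mini-batches in which every data type appears, drawn
    according to the mixing ratios in `mix`."""
    iters = {name: iter(ds) for name, ds in sources.items()}
    while True:
        batch = []
        for name, frac in mix.items():
            take = max(1, round(frac * batch_size))   # at least one sample per type
            batch.extend((name, next(iters[name])) for _ in range(take))
        random.shuffle(batch)
        yield batch

def info_nce(query_emb, doc_emb, temperature=0.05):
    """In-batch contrastive loss on L2-normalized embeddings; positives
    sit on the diagonal of the similarity matrix."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature
    labels = torch.arange(q.shape[0])
    return F.cross_entropy(logits, labels)

# Toy run: three heterogeneous sources feed every mini-batch.
mix = {"query_document": 0.5, "visual_text_pairs": 0.3, "text_only": 0.2}
sources = {name: itertools.cycle([f"<{name} example>"]) for name in mix}
first_batch = next(task_aware_batches(sources, batch_size=8, mix=mix))
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
```

Because every mini-batch mixes all three data types, the contrastive signal cannot be satisfied by shallow image-caption matching alone.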

Numbers that matter

Model      Params   MMEB (overall ↑)   ViDoRe-v2 (avg ↑)
mmE5       11 B     69.8               50.5
VLM2Vec    7 B      62.9               38.7
MoCa-3B    3 B      67.5               59.8
MoCa-7B    7 B      71.5               58.8

A 7-billion-parameter MoCa variant tops all published baselines across MMEB’s 36 tasks, while the lighter 3-B version jumps almost 10 points on ViDoRe-v2’s document-level retrieval suite. Even more telling: a 3-B MoCa with CPT beats 7-B models trained only with contrastive learning.

Ablations spotlight CPT’s punch

Yank out either the masked-language-modeling (MLM) or masked-autoencoding (MAE) objective during CPT and MMEB scores slide by up to 1.3 points. Drop the entire CPT stage and you lose nearly 2 points, proof that modality-aware reconstruction, not just more contrastive data, drives the gains.

Why it matters

  • Retrieval is eating the multimodal world. Search, RAG pipelines and recommender systems need embeddings, not prose. A bidirectional retrofit averts the cost of training from scratch.

  • Scales with unlabeled data. By exploiting noisy Web corpora, MoCa sidesteps the image-caption bottleneck hobbling many CLIP-style updates.

  • Open-VLM agnostic. The authors demo on Qwen2.5-VL backbones, but the training recipe is architecture-neutral: anything with a ViT encoder and a Transformer decoder should drop in (a usage sketch follows this list).
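
For a sense of how such a retrofitted embedder would be used downstream, here is a generic sketch of pooling and cosine retrieval; mean pooling and the function names are common conventions for embedding retrofits, not details confirmed by the MoCa paper.

```python
# Generic usage sketch: pool a bidirectional VLM's hidden states into one
# vector and run cosine-similarity retrieval. Mean pooling and the function
# names are common conventions, not details confirmed by the paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def embed(backbone, text_ids, img_patches):
    """Encode one interleaved (text, image) input into a unit-norm vector.
    Works with any module returning (text_states, image_states), e.g. the
    toy backbone in the CPT sketch above."""
    text_h, img_h = backbone(text_ids, img_patches)         # bidirectional states
    pooled = torch.cat([text_h, img_h], dim=1).mean(dim=1)  # mean-pool every position
    return F.normalize(pooled, dim=-1)

def retrieve(query_vec, doc_vecs, top_k=5):
    """Cosine-similarity search over a pre-embedded document collection."""
    scores = doc_vecs @ query_vec                           # (num_docs,) similarities
    return scores.topk(min(top_k, doc_vecs.shape[0]))
```

In a RAG or product-search pipeline, the document side would be embedded offline with embed and queries matched online with retrieve.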

What’s next

The paper hints at a public GitHub release with checkpoints, data loaders and task-aware batching helpers. If the repo ships soon, expect MoCa-style CPT to become a default step for teams building multimodal RAG or e-commerce search engines on lightweight hardware.

Paper link: arXiv 2506.23115 (PDF)
