22.7.25

Mono-InternVL-1.5 makes monolithic multimodal LLMs cheap (and fast) enough for real workloads

Modular multimodal models bolt a vision encoder onto a language model: simple, but memory-hungry. Monolithic MLLMs promise sleeker deployment by folding both roles into one network, yet they struggle with catastrophic forgetting and GPU burn. Mono-InternVL-1.5, unveiled this week by OpenGVLab, Shanghai AI Lab and Tsinghua collaborators, takes a big step toward solving both problems.

How they rebuilt the brain

  • Standalone visual parameter space. Instead of retraining the whole LLM, the team delta-tunes a fresh set of visual parameters, packed as a multimodal Mixture-of-Experts, so the language weights stay frozen and stable (see the routing sketch after this list).

  • EViP → EViP++. Their Endogenous Visual Pre-training pipeline now adds visual-attention experts and a progressive schedule that learns from noisy web data without wiping language skills.

  • Fused CUDA kernel for MoE inference. A custom kernel collapses expert routing into a single GPU launch, trimming real-time latency; a dispatch sketch follows this list.
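
To make the delta-tuning recipe concrete, here is a minimal PyTorch-style sketch of a decoder FFN block that keeps the pretrained language expert frozen and routes image tokens to a new, trainable visual expert. The class and argument names (MultimodalMoEFFN, modality_mask) are illustrative assumptions, not identifiers from the released code.

```python
import torch
import torch.nn as nn

class MultimodalMoEFFN(nn.Module):
    """Illustrative mixture of a frozen language FFN and a trainable
    visual FFN, routed by token modality (True = image token)."""

    def __init__(self, language_ffn: nn.Module, d_model: int, d_hidden: int):
        super().__init__()
        # Frozen language expert: the original LLM FFN weights stay intact.
        self.language_ffn = language_ffn
        for p in self.language_ffn.parameters():
            p.requires_grad = False

        # New visual expert: the only FFN parameters updated during
        # visual pre-training (the delta-tuning idea).
        self.visual_ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor, modality_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); modality_mask: (batch, seq) bool
        text_out = self.language_ffn(x)
        visual_out = self.visual_ffn(x)
        # Route each token to its expert; gradients never touch the frozen path.
        return torch.where(modality_mask.unsqueeze(-1), visual_out, text_out)
```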

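The released fused CUDA kernel itself is not reproduced here, but the scheduling idea it accelerates (gathering each expert's tokens once and running one batched call per expert, rather than looping token by token) can be sketched in plain PyTorch. The function below is a hedged illustration of that general gather-scatter dispatch pattern, not the authors' kernel.

```python
import torch
import torch.nn as nn

def dispatch_two_experts(x: torch.Tensor,
                         modality_mask: torch.Tensor,
                         text_expert: nn.Module,
                         visual_expert: nn.Module) -> torch.Tensor:
    """Gather/scatter dispatch: each expert sees one contiguous batch of its
    tokens, so the MoE layer costs one call per expert instead of a
    token-by-token loop. A fused kernel goes further by folding the routing
    and the expert matmuls into a single GPU launch."""
    flat = x.reshape(-1, x.shape[-1])        # (batch*seq, d_model)
    mask = modality_mask.reshape(-1)         # (batch*seq,) bool

    out = torch.empty_like(flat)
    text_idx = (~mask).nonzero(as_tuple=True)[0]
    vis_idx = mask.nonzero(as_tuple=True)[0]

    if text_idx.numel():
        out[text_idx] = text_expert(flat[text_idx])
    if vis_idx.numel():
        out[vis_idx] = visual_expert(flat[vis_idx])

    return out.reshape_as(x)
```
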
Numbers that matter

Metric              Mono-InternVL    Mono-InternVL-1.5    Δ
Pre-training data   1.1 B tokens     0.5 B tokens         −58 %
Inference speed     61 tok/s         77 tok/s             +26 %
VQA Bench           70.1             70.4                 +0.3
MLLM Bench          53.7             55.6                 +1.9

Across 15 public benchmarks the older Mono-InternVL already led on 12; the new model keeps that edge while slashing first-token latency by up to 69 % against the modular InternVL-1.5 baseline. It even lands a headline-grabbing +114-point jump over Emu-3 on OCRBench.

Why it matters

  1. Design simplicity meets deployment thrift. One model now sees and talks without an external vision tower, fits in fewer VRAM GBs, and spools responses faster—handy for edge boxes or consumer GPUs.

  2. Delta-tuning shows its muscle. Freezing language weights while grafting “visual experts” offers a clean recipe other labs can copy to preserve text quality.

  3. Open weights, real code. Checkpoints, the fused CUDA kernel and training scripts are live on GitHub, inviting startups to fine-tune for retail search, doc-QA or AR glasses.

Mono-InternVL-1.5 won’t end the debate between modular and monolithic designs, but it proves you don’t need billion-token budgets or exotic hardware to get state-of-the-art multimodal accuracy—and you might even gain a few milliseconds back for the user.

Paper link: arXiv 2507.12566 (PDF)

