22.7.25

Mono-InternVL-1.5 makes monolithic multimodal LLMs cheap (and fast) enough for real workloads

Modular multimodal models bolt a vision encoder onto a language model: simple, but memory-hungry. Monolithic MLLMs promise sleeker deployment by folding both roles into one network, yet they struggle with catastrophic forgetting and GPU burn. Mono-InternVL-1.5, unveiled this week by OpenGVLab, Shanghai AI Lab and Tsinghua collaborators, takes a big step toward solving both problems.

How they rebuilt the brain

  • Standalone visual parameter space. Instead of retraining the whole LLM, the team delta-tunes a fresh set of visual parameters, packed as a multimodal Mixture-of-Experts, so the language weights stay frozen and stable (see the routing sketch after this list).

  • EViP → EViP++. Their Endogenous Visual Pre-training pipeline now adds visual-attention experts and a progressive schedule that learns from noisy web data without wiping language skills.

  • Fused CUDA kernel for MoE inference. A custom kernel collapses expert routing into a single GPU launch, trimming real-time latency; a dispatch sketch follows this list.
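
To make the delta-tuning recipe concrete, here is a minimal PyTorch-style sketch of a decoder FFN block that keeps the pretrained language expert frozen and routes image tokens to a new, trainable visual expert. The class and argument names (MultimodalMoEFFN, modality_mask) are illustrative assumptions, not identifiers from the released code.

```python
import torch
import torch.nn as nn

class MultimodalMoEFFN(nn.Module):
    """Illustrative mixture of a frozen language FFN and a trainable
    visual FFN, routed by token modality (True = image token)."""

    def __init__(self, language_ffn: nn.Module, d_model: int, d_hidden: int):
        super().__init__()
        # Frozen language expert: the original LLM FFN weights stay intact.
        self.language_ffn = language_ffn
        for p in self.language_ffn.parameters():
            p.requires_grad = False

        # New visual expert: the only FFN parameters updated during
        # visual pre-training (the delta-tuning idea).
        self.visual_ffn = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor, modality_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); modality_mask: (batch, seq) bool
        text_out = self.language_ffn(x)
        visual_out = self.visual_ffn(x)
        # Route each token to its expert; gradients never touch the frozen path.
        return torch.where(modality_mask.unsqueeze(-1), visual_out, text_out)
```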

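The released fused CUDA kernel itself is not reproduced here, but the scheduling idea it accelerates (gathering each expert's tokens once and running one batched call per expert, rather than looping token by token) can be sketched in plain PyTorch. The function below is a hedged illustration of that general gather-scatter dispatch pattern, not the authors' kernel.

```python
import torch
import torch.nn as nn

def dispatch_two_experts(x: torch.Tensor,
                         modality_mask: torch.Tensor,
                         text_expert: nn.Module,
                         visual_expert: nn.Module) -> torch.Tensor:
    """Gather/scatter dispatch: each expert sees one contiguous batch of its
    tokens, so the MoE layer costs one call per expert instead of a
    token-by-token loop. A fused kernel goes further by folding the routing
    and the expert matmuls into a single GPU launch."""
    flat = x.reshape(-1, x.shape[-1])        # (batch*seq, d_model)
    mask = modality_mask.reshape(-1)         # (batch*seq,) bool

    out = torch.empty_like(flat)
    text_idx = (~mask).nonzero(as_tuple=True)[0]
    vis_idx = mask.nonzero(as_tuple=True)[0]

    if text_idx.numel():
        out[text_idx] = text_expert(flat[text_idx])
    if vis_idx.numel():
        out[vis_idx] = visual_expert(flat[vis_idx])

    return out.reshape_as(x)
```
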
Numbers that matter

Metric              Mono-InternVL    Mono-InternVL-1.5    Δ
Pre-training data   1.1 B tokens     0.5 B tokens         −58 %
Inference speed     61 tok/s         77 tok/s             +26 %
VQA Bench           70.1             70.4                 +0.3
MLLM Bench          53.7             55.6                 +1.9

Across 15 public benchmarks the older Mono-InternVL already led on 12; the new model keeps that edge while slashing first-token latency by up to 69 % against the modular InternVL-1.5 baseline. It even lands a headline-grabbing +114-point jump over Emu-3 on OCRBench.

Why it matters

  1. Design simplicity meets deployment thrift. One model now sees and talks without an external vision tower, fits in fewer VRAM GBs, and spools responses faster—handy for edge boxes or consumer GPUs.

  2. Delta-tuning shows its muscle. Freezing language weights while grafting “visual experts” offers a clean recipe other labs can copy to preserve text quality.

  3. Open weights, real code. Checkpoints, the fused CUDA kernel and training scripts are live on GitHub, inviting startups to fine-tune for retail search, doc-QA or AR glasses.

Mono-InternVL-1.5 won’t end the debate between modular and monolithic designs, but it proves you don’t need billion-token budgets or exotic hardware to get state-of-the-art multimodal accuracy—and you might even gain a few milliseconds back for the user.

Paper link: arXiv 2507.12566 (PDF)

