Modular multimodal models bolt a vision encoder onto a language model—simple but memory-hungry. Monolithic MLLMs promise sleeker deployment by folding both roles into one network, yet they struggle with catastrophic forgetting and GPU burn. Mono-InternVL-1.5—unveiled this week by OpenGVLab, Shanghai AI Lab and Tsinghua collaborators—takes a big step toward solving both problems.
How they rebuilt the brain
- Standalone visual parameter space. Instead of retraining the whole LLM, the team delta-tunes a fresh set of visual parameters, packaged as a multimodal Mixture-of-Experts, so the language weights stay frozen and stable.
- EViP → EViP++. The Endogenous Visual Pre-training pipeline now adds visual-attention experts and a progressive schedule that learns from noisy web data without eroding language skills.
- Fused CUDA kernel for MoE inference. A custom kernel collapses expert routing into a single GPU call, trimming real-time latency.
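The frozen-language / trainable-vision split can be sketched as a modality-routed feed-forward layer: each token is dispatched either to the frozen language expert or to the freshly added visual expert, based on a modality tag. This is a minimal NumPy sketch of the idea, not the paper's actual architecture; all names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size (illustrative)

# Frozen language FFN weights: pretrained, never updated during visual training.
W_lang = rng.standard_normal((D, D))
# Fresh visual expert weights: the only parameters touched by delta-tuning.
W_vis = rng.standard_normal((D, D))

def moe_ffn(tokens, modality):
    """Dispatch each token to the language or visual expert by modality tag.

    tokens:   (N, D) hidden states
    modality: (N,) ints, 0 = text token, 1 = image token
    """
    out = np.empty_like(tokens)
    is_text = modality == 0
    out[is_text] = tokens[is_text] @ W_lang   # frozen path: text behavior preserved
    out[~is_text] = tokens[~is_text] @ W_vis  # trainable path: new visual capacity
    return out

x = rng.standard_normal((4, D))
m = np.array([0, 1, 0, 1])  # interleaved text and image tokens
y = moe_ffn(x, m)
```

Because text tokens only ever pass through the untouched `W_lang` path, the language model's outputs on pure text are bit-for-bit unchanged, which is the intuition behind why this setup resists catastrophic forgetting.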
Numbers that matter
| Metric | Mono-InternVL | Mono-InternVL-1.5 | Δ |
|---|---|---|---|
| Pre-training data | 1.1 B tokens | 0.5 B tokens | −55 % |
| Inference speed | 61 tok/s | 77 tok/s | +26 % |
| VQA Bench | 70.1 | 70.4 | +0.3 |
| MLLM Bench | 53.7 | 55.6 | +1.9 |
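The percentage deltas follow directly from the raw columns; a quick check with the table's own values:

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Raw values from the table above.
speed = pct_change(61, 77)    # inference speed, tok/s
data = pct_change(1.1, 0.5)   # pre-training tokens, billions

print(f"speed: {speed:+.0f} %")
print(f"data:  {data:+.0f} %")
```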
Why it matters
- Design simplicity meets deployment thrift. One model now sees and talks without an external vision tower, fits in less VRAM, and streams responses faster, which is handy for edge boxes or consumer GPUs.
- Delta-tuning shows its muscle. Freezing language weights while grafting on “visual experts” is a clean recipe other labs can copy to preserve text quality.
- Open weights, real code. Checkpoints, the fused CUDA kernel, and training scripts are live on GitHub, inviting startups to fine-tune for retail search, doc-QA, or AR glasses.
Mono-InternVL-1.5 won’t end the debate between modular and monolithic designs, but it proves you don’t need billion-token budgets or exotic hardware to get state-of-the-art multimodal accuracy—and you might even gain a few milliseconds back for the user.
Paper link: arXiv 2507.12566 (PDF)