
3.6.25

MiMo-VL-7B: Xiaomi's Advanced Vision-Language Model Elevating Multimodal AI Reasoning

Xiaomi has unveiled MiMo-VL-7B, a vision-language model (VLM) that pairs a compact 7B-parameter architecture with strong performance on multimodal reasoning tasks. Designed to process and jointly understand visual and textual inputs, MiMo-VL-7B sets a new benchmark among open-source models of its size.

Innovative Architecture and Training

MiMo-VL-7B comprises three key components (a minimal sketch of the pipeline follows the list):

  • A native-resolution Vision Transformer (ViT) encoder that preserves fine-grained visual details.

  • A Multi-Layer Perceptron (MLP) projector for efficient cross-modal alignment.

  • The MiMo-7B language model, specifically optimized for complex reasoning tasks.
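
To make the three-part layout concrete, below is a minimal PyTorch sketch of an encoder-projector-LM pipeline. Every module, dimension, and name here is an illustrative placeholder, not MiMo-VL-7B's actual implementation.

    import torch
    import torch.nn as nn

    class MLPProjector(nn.Module):
        """Projects vision-encoder patch features into the LM's embedding space."""
        def __init__(self, vision_dim: int, lm_dim: int):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(vision_dim, lm_dim),
                nn.GELU(),
                nn.Linear(lm_dim, lm_dim),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    class ToyVLM(nn.Module):
        """Three-part pipeline: vision encoder -> projector -> language model."""
        def __init__(self, vision_encoder, projector, lm):
            super().__init__()
            self.vision_encoder = vision_encoder
            self.projector = projector
            self.lm = lm

        def forward(self, patches: torch.Tensor, text_embeds: torch.Tensor):
            patch_feats = self.vision_encoder(patches)    # (B, P, vision_dim)
            vision_tokens = self.projector(patch_feats)   # (B, P, lm_dim)
            # Prepend projected visual tokens to the text-embedding sequence.
            return self.lm(torch.cat([vision_tokens, text_embeds], dim=1))

    # Toy stand-ins so the sketch runs end to end.
    vision_dim, lm_dim = 1024, 2048
    model = ToyVLM(
        vision_encoder=nn.Linear(3 * 14 * 14, vision_dim),  # stands in for the ViT
        projector=MLPProjector(vision_dim, lm_dim),
        lm=nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True),
    )
    patches = torch.randn(1, 256, 3 * 14 * 14)  # 256 flattened 14x14 RGB patches
    text = torch.randn(1, 32, lm_dim)           # 32 text-token embeddings
    out = model(patches, text)                  # -> (1, 288, lm_dim)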

The model undergoes a two-phase training process:

  1. Four-Stage Pre-Training: This phase includes projector warmup, vision-language alignment, general multimodal pre-training, and long-context supervised fine-tuning (SFT), resulting in the MiMo-VL-7B-SFT model.

  2. Mixed On-Policy Reinforcement Learning (MORL): In this phase, diverse reward signals (perception accuracy, visual grounding precision, logical reasoning correctness, and human preference) are blended into a single training objective to produce the MiMo-VL-7B-RL model; a simplified reward-mixing sketch follows the list.
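
Conceptually, MORL collapses several heterogeneous reward signals into one scalar used for on-policy updates. Here is one simple way such mixing could look; the signal names, scoring functions, and weights are assumptions made up for illustration, not Xiaomi's published recipe.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class RewardSignal:
        name: str
        score: Callable[[str, dict], float]  # (response, sample metadata) -> [0, 1]
        weight: float

    def mixed_reward(response: str, sample: dict, signals: list[RewardSignal]) -> float:
        """Weighted blend of reward signals; a stand-in for MORL's reward mixing."""
        total_weight = sum(s.weight for s in signals)
        return sum(s.weight * s.score(response, sample) for s in signals) / total_weight

    # Hypothetical scorers -- placeholders, not the actual MORL reward models.
    signals = [
        RewardSignal("perception_accuracy", lambda r, s: float(s["answer"] in r), 1.0),
        RewardSignal("grounding_precision", lambda r, s: s.get("box_iou", 0.0), 1.0),
        RewardSignal("reasoning_verified",  lambda r, s: s.get("verifier_pass", 0.0), 1.0),
        RewardSignal("human_preference",    lambda r, s: s.get("rm_score", 0.0), 0.5),
    ]

    sample = {"answer": "two cats", "box_iou": 0.8, "verifier_pass": 1.0, "rm_score": 0.7}
    print(mixed_reward("I count two cats on the sofa.", sample, signals))  # 0.9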

Performance Highlights

MiMo-VL-7B posts state-of-the-art results across a range of benchmarks:

  • Excels at general vision-language understanding tasks.

  • Outperforms existing open-source models in multimodal reasoning tasks.

  • Exhibits exceptional GUI understanding and grounding capabilities, rivaling specialized models.

Notably, MiMo-VL-7B-RL achieves the highest Elo rating among all evaluated open-source vision-language models, ranking first among models spanning 7B to 72B parameters.

Accessibility and Deployment

Xiaomi has open-sourced the MiMo-VL-7B series, including both the SFT and RL checkpoints, making them available to researchers and developers. The models are compatible with the Qwen2_5_VLForConditionalGeneration architecture, so they slot into existing deployment and inference pipelines, as shown below.
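
Because the checkpoints expose a Qwen2.5-VL-compatible interface, they load through the standard Hugging Face transformers workflow. A minimal sketch, assuming a recent transformers release and that the checkpoint lives under the XiaomiMiMo organization on the Hub (verify the exact repo id before use):

    import torch
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model_id = "XiaomiMiMo/MiMo-VL-7B-RL"  # assumed repo id; check the Hub

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/chart.png"},  # placeholder URL
            {"type": "text", "text": "Describe this chart and summarize its trend."},
        ],
    }]

    # The processor tokenizes the text and prepares image tensors in one call.
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=256)
    answer = processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]
    print(answer)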

Conclusion

MiMo-VL-7B represents a significant advancement in vision-language modeling, combining compact design with high performance. Through innovative training methodologies and open-source availability, Xiaomi contributes to the broader AI community's efforts in developing sophisticated multimodal systems.
