Wandering Nomad: MLLM

20.6.25

ReVisual‑R1: A New Open‑Source 7B Multimodal LLM with Deep, Verbose Reasoning

ReVisual‑R1: A New Open‑Source 7B Multimodal LLM with Deep, Thoughtful Reasoning

Researchers from Tsinghua University, Shanghai Jiao Tong University, and the Shanghai Artificial Intelligence Laboratory have released ReVisual‑R1, a pioneering 7 billion‑parameter multimodal large language model (MLLM) open‑sourced for public use. It offers advanced, context‑rich reasoning across both vision and text—unveiling new possibilities for explainable AI.

🧠 Why ReVisual‑R1 Matters

Training multimodal models to reason—not just perceive—poses a significant challenge. Previous efforts in multimodal chain‑of‑thought (CoT) reasoning were limited by training instability and superficial outputs. ReVisual‑R1 addresses these issues by blending text‑only and multimodal reinforcement learning (RL), yielding deeper and more accurate analysis.

🚀 Innovative Three‑Stage Training Pipeline

Cold‑Start Pretraining (Text Only)
Leveraging carefully curated text datasets to build strong reasoning foundations that outperform many zero‑shot models, even before RL is applied.
Multimodal RL with Prioritized Advantage Distillation (PAD)
Enhances visual–text reasoning through progressive RL, avoiding gradient stagnation typical in previous GRPO approaches.
Final Text‑Only RL Refinement
Further improves reasoning fluency and depth, producing coherent and context‑aware multimodal outputs.

📚 The GRAMMAR Dataset: Key to Quality Reasoning

ReVisual‑R1 is trained on GRAMMAR, a meticulously curated dataset combining text and multimodal data. It offers nuanced reasoning tasks with coherent logic—unlike shallow, noisy alternatives—ensuring the model learns quality thinking patterns.

🏆 Benchmark‑Topping Performance

On nine out of ten benchmarks—including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME 2024, and AIME 2025—ReVisual‑R1 outperforms open‑source peers and competes with commercial models, emerging as a top-performing open‑source 7B MLLM.

🔍 What This Means for AI Research

Staged Training Works: Combining text-based pretraining with multimodal RL produces better reasoning than one-step methods.
PAD Innovation: Stabilizes multimodal learning by focusing on high‑quality signals.
Model Accessibility: At 7B parameters and fully open-source, ReVisual‑R1 drives multimodal AI research beyond large-scale labs.

✅ Final Takeaway

ReVisual‑R1 delivers long‑form, image‑grounded reasoning at the open‑source level—transforming the landscape for explainable AI. Its innovative training pipeline, multi-modal fluency, and benchmark dominance make it a new foundation for small, intelligent agents across education, robotics, and data analysis.

3.6.25

LLaDA-V: A Diffusion-Based Multimodal Language Model Redefining Visual Instruction Tuning

In a significant advancement in artificial intelligence, researchers from Renmin University of China and Ant Group have introduced LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning. This model represents a departure from the prevalent autoregressive paradigms in current multimodal approaches, offering a fresh perspective on how AI can process and understand combined textual and visual data.

A Novel Approach to Multimodal Learning

Traditional MLLMs often rely on autoregressive methods, predicting the next token in a sequence based on previous tokens. LLaDA-V, however, employs a diffusion-based approach, constructing outputs through iterative denoising processes. This method allows for more flexible and potentially more accurate modeling of complex data distributions, especially when integrating multiple modalities like text and images.

Architectural Highlights

Built upon the foundation of LLaDA, a large language diffusion model, LLaDA-V incorporates a vision encoder and a Multi-Layer Perceptron (MLP) connector. This design projects visual features into the language embedding space, enabling effective multimodal alignment. The integration facilitates the model's ability to process and generate responses based on combined textual and visual inputs, enhancing its applicability in tasks requiring comprehensive understanding.

Performance and Comparisons

Despite its language model being weaker on purely textual tasks compared to counterparts like LLaMA3-8B and Qwen2-7B, LLaDA-V demonstrates promising multimodal performance. When trained on the same instruction data, it is highly competitive with LLaMA3-V across multimodal tasks and exhibits better data scalability. Additionally, LLaDA-V narrows the performance gap with Qwen2-VL, suggesting the effectiveness of its architecture for multimodal applications.

Implications for Future Research

The introduction of LLaDA-V underscores the potential of diffusion-based models in the realm of multimodal AI. Its success challenges the dominance of autoregressive models and opens avenues for further exploration into diffusion-based approaches for complex AI tasks. As the field progresses, such innovations may lead to more robust and versatile AI systems capable of nuanced understanding and generation across diverse data types.

Access and Further Information

For those interested in exploring LLaDA-V further, the research paper is available on arX iv, and the project's code and demos can be accessed via the official project page.