
13.7.25

PyVision lets multimodal models write their own vision tools—and the accuracy jump is eye-opening

Large language models have learned to call external tools, but in computer vision they still walk a narrow, hand-coded path: crop the image, run a captioner, answer the question—done. PyVision breaks out of that rut. The 26-page technical report shows GPT-4.1 and Claude-4 Sonnet writing Python code mid-conversation, executing it, checking the output and iterating until they solve the task. The result is an agent that treats PIL, NumPy and Matplotlib as an expandable toolbox rather than a fixed pipeline.

From static workflows to dynamic “code-as-tool”

A traditional vision agent might have 10 pre-defined ops; PyVision can spawn hundreds. The authors catalogue the emergent tools into four buckets—basic image processing, advanced processing, visual sketching and numerical analysis—plus a long tail of creative, task-specific snippets. On perception-heavy problems the model leans on cropping and contrast boosts; on math puzzles it sketches diagrams or counts pixels.
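To make the "code-as-tool" idea concrete, here is a minimal sketch—not taken from the paper—of the kind of crop-and-enhance snippet such an agent might write for a perception-heavy question. The file name, crop box and contrast factor are placeholder assumptions.

```python
# Illustrative only: the sort of ad-hoc tool a PyVision-style agent might emit,
# built from off-the-shelf PIL calls rather than a pre-wired pipeline op.
from PIL import Image, ImageEnhance

def crop_and_boost(image_path, box, contrast=2.0):
    """Crop a region of interest and boost its contrast before re-inspection."""
    img = Image.open(image_path)
    region = img.crop(box)                                   # box = (left, upper, right, lower)
    return ImageEnhance.Contrast(region).enhance(contrast)   # factor > 1.0 increases contrast

# Hypothetical usage: zoom in on a hard-to-read label in the top-left corner.
patch = crop_and_boost("input.jpg", box=(0, 0, 256, 256), contrast=2.5)
patch.save("patch_for_reinspection.png")
```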

Multi-turn loop under the hood

  1. System prompt primes the LLM to plan, code, run and reflect.

  2. Python sandbox executes each snippet and streams results back.

  3. Reflection step lets the model critique outputs, revise code or answer.

The dance repeats until the agent is confident—or it times out. Crucially, there’s no fixed library list; the model imports what it thinks it needs. 
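A minimal sketch of that loop, assuming a generic chat-style llm client, a sandbox runner and a <python>...</python> tagging convention (all hypothetical stand-ins rather than the paper's released interface), could look like this:

```python
# Minimal sketch of the plan -> code -> run -> reflect loop described above.
# llm and sandbox are hypothetical stand-ins, not the paper's actual API;
# the <python>...</python> convention is assumed for illustration.
import re

CODE_TAG = re.compile(r"<python>(.*?)</python>", re.S)

def solve(llm, sandbox, image, question, max_turns=8):
    history = [
        {"role": "system",
         "content": "Plan, write code inside <python>...</python>, inspect the output, then answer."},
        {"role": "user", "content": [image, question]},
    ]
    for _ in range(max_turns):
        reply = llm.chat(history)               # model plans and may emit a code snippet
        match = CODE_TAG.search(reply)
        if match is None:                       # no code means the model has committed to an answer
            return reply
        result = sandbox.run(match.group(1))    # execute the snippet; capture stdout, images, errors
        history += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": "Execution output:\n" + str(result)},  # fed back for reflection
        ]
    return reply                                # fall back to the last reply when the turn budget runs out
```

The design point this sketch tries to capture is the one the paper stresses: the loop ends when the model stops emitting code, not when a fixed pipeline runs out of stages.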

Benchmarks: big wins, bigger where it hurts most

Backend            MathVista ↑   Visual-Puzzles ↑   V* ↑    VLMsAreBlind-mini ↑
GPT-4.1            +1.8          +2.5               +7.8    +2.6
Claude-4 Sonnet    +3.3          +8.3               +0.3    +31.1

Claude-4’s massive jump on VLMsAreBlind-mini—a dataset designed to fool pattern-matchers—suggests PyVision’s code probes puncture spurious visual shortcuts. GPT-4.1, already strong at fine-grained perception, gains most on the V* visual-search test. 

Why this matters

  • Grounded answers, verifiable steps. The agent surfaces intermediate plots, masks and arrays, giving product teams a check-pointable audit trail.

  • Amplifier, not crutch. PyVision “dials up” whatever the base model is best at—perception for GPT-4.1, abstract reasoning for Claude-4—rather than papering over weaknesses.

  • Tool invention is the new frontier. Instead of waiting for human engineers to wire in functions, the LLM autogenerates them, inching closer to Benjamin Franklin’s “tool-making animal.”

What’s next

The paper’s GitHub repo ships inference code, a dockerised Python sandbox and demo notebooks. The authors hint at plugging reinforcement learning into the loop and expanding beyond vision to 3-D geometry and web interaction tooling. Expect startups to wrap this framework into agents that can diagnose X-ray anomalies, audit engineering schematics or spot product-label defects—without a human ever defining “defect detector.”

Paper link: arXiv 2507.07998 (PDF)

20.6.25

ReVisual‑R1: A New Open‑Source 7B Multimodal LLM with Deep, Thoughtful Reasoning

 


Researchers from Tsinghua University, Shanghai Jiao Tong University, and the Shanghai Artificial Intelligence Laboratory have released ReVisual‑R1, an open‑source 7‑billion‑parameter multimodal large language model (MLLM). It offers advanced, context‑rich reasoning across both vision and text, opening new possibilities for explainable AI.


🧠 Why ReVisual‑R1 Matters

Training multimodal models to reason—not just perceive—poses a significant challenge. Previous efforts in multimodal chain‑of‑thought (CoT) reasoning were limited by training instability and superficial outputs. ReVisual‑R1 addresses these issues by blending text‑only and multimodal reinforcement learning (RL), yielding deeper and more accurate analysis.


🚀 Innovative Three‑Stage Training Pipeline

  1. Cold‑Start Pretraining (Text Only)
    Leverages carefully curated text‑only datasets to build a strong reasoning foundation; even before RL is applied, the cold‑started model outperforms many zero‑shot baselines.

  2. Multimodal RL with Prioritized Advantage Distillation (PAD)
    Enhances visual–text reasoning through progressive RL, avoiding the gradient stagnation typical of earlier GRPO approaches (see the sketch after this list).

  3. Final Text‑Only RL Refinement
    Further improves reasoning fluency and depth, producing coherent and context‑aware multimodal outputs.
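
The post does not spell out PAD's mechanics. One hedged reading of "prioritizing high‑quality signals" is a GRPO‑style group‑relative advantage with low‑signal rollouts filtered out; the sketch below is an illustrative assumption (the function name, top_k cutoff and selection rule are invented for illustration), not the authors' implementation.

```python
# Hedged sketch of the intuition behind Prioritized Advantage Distillation (PAD):
# compute group-relative (GRPO-style) advantages for one prompt's rollouts, then
# keep only the rollouts carrying the strongest signal, so near-zero advantages
# do not stall the gradient. Illustrative reading only, not the authors' code.
import numpy as np

def pad_select(rewards, top_k=4, eps=1e-6):
    """rewards: scalar rewards for one prompt's group of sampled rollouts."""
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)  # group-relative advantage
    keep = np.argsort(-np.abs(adv))[:top_k]                   # prioritize informative rollouts
    return keep, adv[keep]

# Example: 8 rollouts where most rewards tie and thus carry little learning signal.
selected, advantages = pad_select([1, 1, 1, 0, 1, 1, 1, 1], top_k=4)
```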


📚 The GRAMMAR Dataset: Key to Quality Reasoning

ReVisual‑R1 is trained on GRAMMAR, a meticulously curated dataset combining text and multimodal data. It offers nuanced reasoning tasks with coherent logic—unlike shallow, noisy alternatives—ensuring the model learns quality thinking patterns.


🏆 Benchmark‑Topping Performance

On nine out of ten benchmarks—including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME 2024, and AIME 2025—ReVisual‑R1 outperforms open‑source peers and competes with commercial models, emerging as a top-performing open‑source 7B MLLM.


🔍 What This Means for AI Research

  • Staged Training Works: Combining text‑based pretraining with multimodal RL produces better reasoning than single‑stage methods.

  • PAD Innovation: Stabilizes multimodal learning by focusing on high‑quality signals.

  • Model Accessibility: At 7B parameters and fully open-source, ReVisual‑R1 drives multimodal AI research beyond large-scale labs.


✅ Final Takeaway

ReVisual‑R1 brings long‑form, image‑grounded reasoning to the open‑source world—reshaping the landscape for explainable AI. Its staged training pipeline, multimodal fluency, and strong benchmark results make it a new foundation for small, intelligent agents across education, robotics, and data analysis.
