
13.7.25

PyVision lets multimodal models write their own vision tools—and the accuracy jump is eye-opening

Large language models have learned to call external tools, but in computer vision they still walk a narrow, hand-coded path: crop the image, run a captioner, answer the question—done. PyVision breaks out of that rut. The 26-page technical report shows GPT-4.1 and Claude-4 Sonnet writing Python code mid-conversation, executing it, checking the output and iterating until they solve the task. The result is an agent that treats PIL, NumPy and Matplotlib as an expandable toolbox rather than a fixed pipeline.

From static workflows to dynamic “code-as-tool”

A traditional vision agent might have 10 pre-defined ops; PyVision can spawn hundreds. The authors catalogue the emergent tools into four buckets—basic image processing, advanced processing, visual sketching and numerical analysis—plus a long tail of creative, task-specific snippets. On perception-heavy problems the model leans on cropping and contrast boosts; on math puzzles it sketches diagrams or counts pixels.
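To make the "code-as-tool" idea concrete, here is a minimal sketch—not taken from the paper—of the kind of crop-and-enhance snippet such an agent might write for a perception-heavy question. The file name, crop box and contrast factor are placeholder assumptions.

```python
# Illustrative only: the sort of ad-hoc tool a PyVision-style agent might emit,
# built from off-the-shelf PIL calls rather than a pre-wired pipeline op.
from PIL import Image, ImageEnhance

def crop_and_boost(image_path, box, contrast=2.0):
    """Crop a region of interest and boost its contrast before re-inspection."""
    img = Image.open(image_path)
    region = img.crop(box)                                   # box = (left, upper, right, lower)
    return ImageEnhance.Contrast(region).enhance(contrast)   # factor > 1.0 increases contrast

# Hypothetical usage: zoom in on a hard-to-read label in the top-left corner.
patch = crop_and_boost("input.jpg", box=(0, 0, 256, 256), contrast=2.5)
patch.save("patch_for_reinspection.png")
```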

Multi-turn loop under the hood

  1. System prompt primes the LLM to plan, code, run and reflect.

  2. Python sandbox executes each snippet and streams results back.

  3. Reflection step lets the model critique outputs, revise code or answer.

The dance repeats until the agent is confident—or it times out. Crucially, there’s no fixed library list; the model imports what it thinks it needs. 
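A minimal sketch of that loop, assuming a generic chat-style llm client, a sandbox runner and a <python>...</python> tagging convention (all hypothetical stand-ins rather than the paper's released interface), could look like this:

```python
# Minimal sketch of the plan -> code -> run -> reflect loop described above.
# llm and sandbox are hypothetical stand-ins, not the paper's actual API;
# the <python>...</python> convention is assumed for illustration.
import re

CODE_TAG = re.compile(r"<python>(.*?)</python>", re.S)

def solve(llm, sandbox, image, question, max_turns=8):
    history = [
        {"role": "system",
         "content": "Plan, write code inside <python>...</python>, inspect the output, then answer."},
        {"role": "user", "content": [image, question]},
    ]
    for _ in range(max_turns):
        reply = llm.chat(history)               # model plans and may emit a code snippet
        match = CODE_TAG.search(reply)
        if match is None:                       # no code means the model has committed to an answer
            return reply
        result = sandbox.run(match.group(1))    # execute the snippet; capture stdout, images, errors
        history += [
            {"role": "assistant", "content": reply},
            {"role": "user", "content": "Execution output:\n" + str(result)},  # fed back for reflection
        ]
    return reply                                # fall back to the last reply when the turn budget runs out
```

The design point this sketch tries to capture is the one the paper stresses: the loop ends when the model stops emitting code, not when a fixed pipeline runs out of stages.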

Benchmarks: big wins, bigger where it hurts most

Backend            MathVista ↑   Visual-Puzzles ↑   V* ↑    VLMsAreBlind-mini ↑
GPT-4.1            +1.8          +2.5               +7.8    +2.6
Claude-4 Sonnet    +3.3          +8.3               +0.3    +31.1

Claude-4’s massive jump on VLMsAreBlind-mini—a dataset designed to fool pattern-matchers—suggests PyVision’s code probes puncture spurious visual shortcuts. GPT-4.1, already strong at fine-grained perception, gains most on the V* visual-search test. 

Why this matters

  • Grounded answers, verifiable steps. The agent surfaces intermediate plots, masks and arrays, giving product teams a check-pointable audit trail.

  • Amplifier, not crutch. PyVision “dials up” whatever the base model is best at—perception for GPT-4.1, abstract reasoning for Claude-4—rather than papering over weaknesses.

  • Tool invention is the new frontier. Instead of waiting for human engineers to wire in functions, the LLM autogenerates them, inching closer to Benjamin Franklin’s “tool-making animal.”

What’s next

The paper’s GitHub repo ships inference code, a dockerised Python sandbox and demo notebooks. The authors hint at plugging reinforcement learning into the loop and expanding beyond vision to 3-D geometry and web interaction tooling. Expect startups to wrap this framework into agents that can diagnose X-ray anomalies, audit engineering schematics or spot product-label defects—without a human ever defining “defect detector.”

Paper link: arXiv 2507.07998 (PDF)

20.6.25

ReVisual‑R1: A New Open‑Source 7B Multimodal LLM with Deep, Thoughtful Reasoning

 


Researchers from Tsinghua University, Shanghai Jiao Tong University, and the Shanghai Artificial Intelligence Laboratory have released ReVisual‑R1, an open‑source 7‑billion‑parameter multimodal large language model (MLLM). It offers advanced, context‑rich reasoning across both vision and text, opening new possibilities for explainable AI.


🧠 Why ReVisual‑R1 Matters

Training multimodal models to reason—not just perceive—poses a significant challenge. Previous efforts in multimodal chain‑of‑thought (CoT) reasoning were limited by training instability and superficial outputs. ReVisual‑R1 addresses these issues by blending text‑only and multimodal reinforcement learning (RL), yielding deeper and more accurate analysis.


🚀 Innovative Three‑Stage Training Pipeline

  1. Cold‑Start Pretraining (Text Only)
    Leverages carefully curated text‑only datasets to build a strong reasoning foundation; even before RL is applied, the cold‑started model outperforms many zero‑shot baselines.

  2. Multimodal RL with Prioritized Advantage Distillation (PAD)
    Enhances visual–text reasoning through progressive RL, avoiding the gradient stagnation typical of earlier GRPO approaches (see the sketch after this list).

  3. Final Text‑Only RL Refinement
    Further improves reasoning fluency and depth, producing coherent and context‑aware multimodal outputs.
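
The post does not spell out PAD's mechanics. One hedged reading of "prioritizing high‑quality signals" is a GRPO‑style group‑relative advantage with low‑signal rollouts filtered out; the sketch below is an illustrative assumption (the function name, top_k cutoff and selection rule are invented for illustration), not the authors' implementation.

```python
# Hedged sketch of the intuition behind Prioritized Advantage Distillation (PAD):
# compute group-relative (GRPO-style) advantages for one prompt's rollouts, then
# keep only the rollouts carrying the strongest signal, so near-zero advantages
# do not stall the gradient. Illustrative reading only, not the authors' code.
import numpy as np

def pad_select(rewards, top_k=4, eps=1e-6):
    """rewards: scalar rewards for one prompt's group of sampled rollouts."""
    rewards = np.asarray(rewards, dtype=np.float64)
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)  # group-relative advantage
    keep = np.argsort(-np.abs(adv))[:top_k]                   # prioritize informative rollouts
    return keep, adv[keep]

# Example: 8 rollouts where most rewards tie and thus carry little learning signal.
selected, advantages = pad_select([1, 1, 1, 0, 1, 1, 1, 1], top_k=4)
```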


📚 The GRAMMAR Dataset: Key to Quality Reasoning

ReVisual‑R1 is trained on GRAMMAR, a meticulously curated dataset combining text and multimodal data. It offers nuanced reasoning tasks with coherent logic—unlike shallow, noisy alternatives—ensuring the model learns quality thinking patterns.


🏆 Benchmark‑Topping Performance

On nine out of ten benchmarks—including MathVerse, MathVision, WeMath, LogicVista, DynaMath, AIME 2024, and AIME 2025—ReVisual‑R1 outperforms open‑source peers and competes with commercial models, emerging as a top-performing open‑source 7B MLLM.


🔍 What This Means for AI Research

  • Staged Training Works: Combining text‑based pretraining with multimodal RL produces better reasoning than single‑stage methods.

  • PAD Innovation: Stabilizes multimodal learning by focusing on high‑quality signals.

  • Model Accessibility: At 7B parameters and fully open-source, ReVisual‑R1 drives multimodal AI research beyond large-scale labs.


✅ Final Takeaway

ReVisual‑R1 brings long‑form, image‑grounded reasoning to the open‑source world—reshaping the landscape for explainable AI. Its staged training pipeline, multimodal fluency, and strong benchmark results make it a new foundation for small, intelligent agents across education, robotics, and data analysis.
