28.8.25

Anemoi: a semi-centralized agent system that lets bots talk to each other—literally

 Most generalist multi-agent stacks still look like a relay race: a central planner prompts specialist workers, who pass back long context blobs for the planner to stitch together. It works—until you downsize the planner or hit token limits. Anemoi proposes a different wiring: keep a light planner, but let agents communicate directly over an Agent-to-Agent (A2A) MCP server so everyone can see progress, flag bottlenecks, and propose fixes in real time. 

What’s actually new

Anemoi replaces unidirectional prompt passing with a threaded A2A server (built on the Model Context Protocol) that exposes primitives like list_agents, create_thread, send_message, and wait_for_mentions. Any agent can join a thread, address peers, and update plans mid-flight—reducing redundant context stuffing and information loss.
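
To make that concrete, here is a hedged sketch of how a worker agent might use those primitives. `mcp` is a hypothetical async client wrapper around the A2A server, and the payload fields are illustrative, not Anemoi's actual schema.

```python
# Minimal sketch of a worker joining an A2A thread over MCP.
# `mcp` is a hypothetical async client wrapper; field names are illustrative.
import asyncio

async def worker_loop(mcp, agent_name="web_agent"):
    # Discover peers registered on the A2A server.
    peers = await mcp.call_tool("list_agents", {})

    while True:
        # Block until another agent @-mentions us in a thread.
        mention = await mcp.call_tool("wait_for_mentions", {"agent": agent_name})
        thread_id = mention["thread_id"]

        result = do_subtask(mention["content"])     # placeholder tool work

        # Report back in-thread and flag the critique agent for review.
        await mcp.call_tool("send_message", {
            "thread_id": thread_id,
            "sender": agent_name,
            "content": f"@critique partial result: {result}",
        })

def do_subtask(task: str) -> str:
    return f"processed: {task}"

# asyncio.run(worker_loop(mcp_client))  # client wiring omitted
```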

The cast of agents (and why it matters)

  • Planner: drafts the initial plan and spins up a thread.

  • Critique: continuously audits intermediate results.

  • Answer-Finder: compiles the final submission.

  • Workers: Web, Document Processing, and Reasoning & Coding—mirroring OWL’s tool set for a fair head-to-head. All are MCP-enabled so they can monitor progress and coordinate directly. 

This design reduces reliance on one overpowered planner, supports adaptive plan updates, and cuts token overhead from repeated context injection.

Numbers that move the needle (GAIA validation)

Framework | Planner / Workers | Avg. Acc.
OWL-rep (pass@3) | GPT-4.1-mini / GPT-4o | 43.64%
OWL (paper, pass@3) | GPT-4o-mini / GPT-4o | 47.27%
Anemoi (pass@3) | GPT-4.1-mini / GPT-4o | 52.73%

With a small planner (GPT-4.1-mini), Anemoi tops a strong open-source baseline by +9.09 points under identical tools and models—and is competitive with several proprietary systems that rely on larger planners. 

How the A2A workflow runs

  1) Discover agents → 2) Create a thread with participants → 3) Workers execute subtasks; Critique labels outputs accept/uncertain while any agent can contribute revisions → 4) Consensus vote before finalization → 5) Answer-Finder submits. All via MCP messaging in a single conversation context. 

Where it wins—and where it trips

  • Wins: Of the tasks Anemoi solved that OWL missed, 52% were due to collaborative refinement enabled by A2A; another 8% came from less context redundancy. 

  • Failures: Remaining errors skew to LLM/tool limits (≈46%/21%), incorrect plans (≈12%), and some communication latency (≈10%)—notably when the web agent is busy and can’t respond to peers. 

Why this matters

If your agent system juggles web search, file I/O, and coding, direct inter-agent communication can deliver better results without upgrading to an expensive planner. Anemoi shows a practical blueprint: keep the planner lightweight, move coordination into an A2A layer, and let specialists negotiate in-thread instead of bloating prompts. 

Paper link: arXiv 2508.17068 (PDF)

Vision-SR1: a self-rewarding recipe that makes VLMs “see” before they “think”

 Most reinforcement-learning recipes for vision-language models (VLMs) grade only the final answer—so models learn to lean on text priors and hallucinate what isn’t in the image. Vision-SR1 flips that: it decomposes reasoning into visual perception → language reasoning, and rewards the model for producing a self-contained visual description that alone suffices to solve the task. No external teacher, no human labels—just the model validating its own perception. 

How the self-reward works

Vision-SR1 runs two rollouts of the same policy per example:

  1. Standard pass: image + question → visual perception + CoT + answer → reward on answer (and format).

  2. Self-reward pass: question + the model’s visual perception (no image) → CoT + answer → reward if correct, signalling that the perception captured what mattered. Rewards are combined under GRPO for stable updates. 
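
To make the mechanics concrete, here is a minimal sketch of how the two rewards might be combined per rollout before a GRPO-style update. The weighting coefficients and the group-normalization step are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: combine the standard answer/format reward with the self-reward from
# the perception-only pass, then form GRPO-style group-normalized advantages.
# Coefficients and helpers are illustrative, not the paper's exact ones.
import numpy as np

def combined_reward(answer_ok_pass1, format_ok, answer_ok_pass2,
                    w_answer=1.0, w_format=0.1, w_vision=0.5):
    r = w_answer * float(answer_ok_pass1) + w_format * float(format_ok)
    r += w_vision * float(answer_ok_pass2)   # perception alone sufficed
    return r

def grpo_advantages(rewards):
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# Example: a group of 4 rollouts for one image-question pair.
rewards = [combined_reward(a1, f, a2) for a1, f, a2 in
           [(1, 1, 1), (1, 1, 0), (0, 1, 0), (1, 0, 1)]]
print(grpo_advantages(rewards))
```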

Training setup

The team builds a 47K-example RL set spanning math (30.5%), science/commonsense (30%), and general visual reasoning (39.5%). A 9K “cold-start” SFT subset teaches the output format before the short RL run (1 epoch). Backbones: Qwen-2.5-VL-3B and 7B. Code is public on GitHub. 

Benchmarks: fewer shortcuts, fewer hallucinations

On a broad suite—MMMU, MMMU-Pro, MM-Vet, RealWorldQA, VisNumBench, MathVerse, MATH-Vision, HallusionBench—Vision-SR1 consistently edges strong Vision-R1 baselines trained on the same 47K. With the 7B backbone, Vision-SR1 averages 58.8 vs 57.4 for Vision-R1; at 3B it's 52.9 vs 50.6.

The paper also introduces Language Shortcut Rate (LSR)—how often a model answers correctly with an insufficient perception. SR1 lowers LSR across datasets, indicating less “answering from priors.” 

Not just vision: text-only reasoning stays solid

On textual suites (MMLU-Pro, SuperGPQA, GSM8K, MATH-500), SR1 keeps or improves accuracy relative to Vision-R1—evidence that strengthening perception doesn’t degrade language-side reasoning. 

Why it matters

  • Balances see vs. think. Adding a perception reward raises dependence on pixels, not just prompts—curbing hallucinations without expensive human labels or external teachers. 

  • Simple to adopt. The “see-think-answer” format and two-pass self-reward bolt onto standard GRPO pipelines. 

  • Open and reproducible. Data recipe, SFT cold-start, and code are released for quick replication. 

Paper link: arXiv 2508.19652 (PDF)

HunyuanVideo-Foley brings studio-grade Foley to AI-generated video

 Text-to-video has gone cinematic, but most clips still sound like a silent movie. HunyuanVideo-Foley aims to fix that: it’s an end-to-end text-video-to-audio (TV2A) system that generates synchronized, high-quality Foley from pixels and prompts—no sound library or manual sound design required. The team marries a multimodal diffusion transformer with representation alignment and a large, purpose-built dataset, and reports state-of-the-art results on fidelity and sync. 

What’s new

  • 100k-hour TV2A dataset. A scalable pipeline filters web video into 8-second segments, drops silent/low-bandwidth clips, scores audio aesthetics/SNR, and checks both semantic (ImageBind) and temporal (AV-align) match before tagging and captioning with GenAU. 

  • Dual-phase multimodal attention. Video and audio are fused with joint self-attention (interleaved RoPE) for frame-level sync; text cues are injected later via cross-attention to avoid text dominating the mix. 

  • REPA loss for audio. A Representation Alignment (REPA) objective pulls internal DiT features toward self-supervised audio embeddings (ATST-Frame), stabilizing training and improving timbre/semantics. 

  • Continuous-latent DAC-VAE. Replaces RVQ with a VAE (128-dim latents @48 kHz, 50 Hz latent rate) for cleaner reconstructions and fewer artifacts. 
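
To illustrate the REPA objective listed above, here is a minimal PyTorch sketch of an alignment term that pulls intermediate DiT features toward frozen self-supervised audio embeddings. The projection size, feature shapes, and loss form are assumptions, not the paper's exact recipe.

```python
# Sketch of a REPA-style alignment term: align intermediate DiT features with
# embeddings from a frozen ATST-Frame-style encoder via cosine similarity.
import torch
import torch.nn.functional as F

def repa_loss(dit_features, ssl_features, proj):
    # dit_features: (B, T, D_dit) from an intermediate DiT block
    # ssl_features: (B, T, D_ssl) from the frozen self-supervised encoder
    h = proj(dit_features)                          # map DiT dims -> SSL dims
    return 1.0 - F.cosine_similarity(h, ssl_features, dim=-1).mean()

proj = torch.nn.Linear(1536, 768)                   # assumed dimensions
loss = repa_loss(torch.randn(2, 400, 1536), torch.randn(2, 400, 768), proj)
```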

How it’s built

HunyuanVideo-Foley stacks N₁ multimodal (audio-video) DiT blocks followed by N₂ audio-only blocks, modulated by Synchformer-derived sync features. The model used 18 MMDiT + 36 audio DiT layers (1536 hidden, 12 heads) and was trained 200k steps on the 100k-hour corpus; autoencoder pretraining ran 700k steps. The main run used 128 H20 GPUs with an effective batch size of 2048. 

The receipts

Across three testbeds—Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench—the paper reports new SOTA on multiple axes, including audio quality (PQ), visual-semantic alignment (IB) and temporal sync (DeSync), plus higher human MOS scores on MovieGen-Audio-Bench. A sample from Kling-Audio-Eval: the model improves FD (PANNs) and KL vs. prior systems and lifts IB while keeping DeSync low. 

Example objective results (Kling-Audio-Eval)

Metric | Best prior (sample) | HunyuanVideo-Foley
FD (PANNs) ↓ | 9.01 (MMAudio) | 6.07
PQ ↑ | 6.05 (FoleyCrafter) | 6.12
IB ↑ | 0.30 (MMAudio) | 0.38
DeSync ↓ | 0.56 (MMAudio) | 0.54

Why it matters

  • Sound that matches the shot. By separating frame-sync (video↔audio) from semantic guidance (text↔audio), the model avoids the classic failure where captions drown out visual cues. 

  • Production-friendly fidelity. REPA and the continuous-latent DAC-VAE cut hiss, mushy transients, and texture mismatch—key for believable footsteps, doors, and crowd beds. 

  • Built to scale. A reproducible data pipeline and a demo page suggest this is more than a lab toy; it’s an audio stack teams can evaluate today. 

If generative video is to replace B-roll and animatics, it needs audio that lands. HunyuanVideo-Foley offers a blueprint: curate better multimodal data, align internal representations to robust audio features, and architect attention so text helps—without hijacking—the soundscape.

Paper link: arXiv 2508.16930 (PDF)

Gemini Now Runs Anywhere: Deploy Google’s AI Models on Your On‑Premises Infrastructure with Full Confidence

Google has taken a major step in enterprise AI by announcing that Gemini is now available anywhere—including your on-premises data centers via Google Distributed Cloud (GDC). After months of previews, Gemini on GDC is now generally available (GA) for air-gapped environments, with an ongoing preview for connected deployments.


Why This Matters — AI, Sovereignty, No Compromise

For organizations operating under stringent data governance, compliance rules, or data sovereignty requirements, Gemini on GDC lets you deploy Google's most capable AI models—like Gemini 2.5 Flash or Pro—directly within your secure infrastructure. Now, there's no longer a trade-off between AI innovation and enterprise control.

Key capabilities unlocked for on-prem deployments include:

  • Multimodal reasoning across text, images, audio, and video

  • Automated intelligence for insights, summarization, and analysis

  • AI-enhanced productivity—from code generation to virtual agents

  • Embedded safety features, like content filters and policy enforcement


Enterprise-Grade Infrastructure & Security Stack

Google’s solution is more than just AI—we're talking enterprise-ready infrastructure:

  • High-performance GPU clusters, built on NVIDIA Hopper and Blackwell hardware

  • Zero-touch managed endpoints, complete with auto-scaling and L7 load balancing

  • Full audit logs, access control, and Confidential Computing for both CPU (Intel TDX) and GPU

Together, these foundations support secure, compliant, and scalable AI across air-gapped or hybrid environments.


Customer Endorsements — Early Adoption & Trust

Several government and enterprise organizations are already leveraging Gemini on GDC:

  • GovTech Singapore (CSIT) appreciates the combo of generative AI and compliance controls

  • HTX (Home Team Science & Technology) credits the deployment framework for bridging their AI roadmap with sovereign data

  • KDDI (Japan) and Liquid C2 similarly highlight the AI-local, governance-first advantage


Getting Started & What it Enables

Actions you can take today:

  1. Request a strategy session via Google Cloud to plan deployment architecture

  2. Access Gemini 2.5 Flash/Pro endpoints as managed services inside your infrastructure

  3. Build enterprise AI agents over on-prem data with Vertex AI APIs
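
For step 3, a minimal sketch using the public Vertex AI Python SDK is shown below; the project ID, location, and model name are placeholders, and on-prem GDC deployments will route to local managed endpoints rather than these public defaults.

```python
# Minimal sketch: calling a managed Gemini endpoint via the Vertex AI Python SDK.
# Project, location, and model name are placeholders; GDC air-gapped setups use
# their own local endpoint configuration (see the GDC documentation).
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-2.5-flash")
response = model.generate_content(
    "Summarize the attached incident report and list follow-up actions."
)
print(response.text)
```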

Use cases include:

  • Secure document summarization or sentiment analysis on internal or classified datasets

  • Intelligent chatbots and virtual agents that stay within corporate networks

  • AI-powered CI/CD workflows—code generation, testing, bug triage—all without calling home


Final Takeaway

With Gemini now available anywhere, Google is giving organizations the power to scale AI ambition without sacrificing security or compliance. This move removes a long-standing blocker for enterprise and public-sector AI adoption. Whether you’re a government agency, regulated financial group, or global manufacturer, deploying AI inside your walls is no longer hypothetical—it’s fully real and ready.

Want help evaluating on-prem AI options or building trusted agentic workflows? I’d love to walk you through the integration path with Vertex AI and GDC. 

27.8.25

From Helicopters to Google Brain: What I Learned About AI as a Noob Listening to Andrew Ng

 I’ll be honest: I’m still a total beginner when it comes to AI. Most of the time I hear people talk about things like “neural networks,” “transformers,” or “TPUs,” it sounds like another language. But I recently listened to Andrew Ng on the Moonshot Podcast, and it gave me a way to see AI not as something intimidating, but as something that could change everyday life—even for people like me.

Here are the biggest lessons I picked up.


1. AI as a Great Equalizer

One of the first things Andrew said struck me right away: intelligence is expensive. Hiring a doctor, a tutor, or even a consultant costs a lot because human expertise takes years to develop. But AI has the potential to make that kind of intelligence cheap and accessible.

Imagine everyone having their own team of “digital staff”—a tutor for your child, a health advisor, or even a personal coach. Right now, only the wealthy can afford that kind of help. But in the future, AI could democratize it. As someone who’s just trying to figure this whole AI thing out, that idea excites me. AI might not just be about flashy tech—it could really level the playing field.


2. Scale Matters (Even When People Doubt You)

I didn’t realize that when Andrew Ng and others were pushing for bigger and bigger neural networks in the late 2000s, people thought they were wasting their time. Senior researchers told him not to do it, that it was bad for his career.

But Andrew had data showing that the bigger the models, the better they performed. He stuck with it, even when people literally yelled at him at conferences. That persistence eventually led to the creation of Google Brain and a major shift in AI research.

For me, the lesson is clear: sometimes the thing that seems “too simple” or “too obvious” is actually the breakthrough. If the data shows promise, don’t ignore it just because experts frown at it.


3. One Algorithm to Learn Them All

Another mind-blowing takeaway was Andrew’s idea of the “one learning algorithm.” Instead of inventing separate algorithms for vision, speech, and text, maybe there could be one system that learns to handle different types of data.

That sounded crazy back then—but it’s basically what we see today with large models like Gemini or ChatGPT. You give them text, audio, or images, and they adapt. To me, this shows how powerful it is to think in terms of general solutions rather than endless one-off fixes.


4. People Using AI Will Replace People Who Don’t

Andrew made a simple but scary point: AI won’t replace people, but people who use AI will replace people who don’t.

It’s kind of like Google Search. Imagine hiring someone today who doesn’t know how to use it—it just wouldn’t make sense. Soon, knowing how to use AI will be just as basic. That’s a wake-up call for me personally. If I don’t learn to use these tools, I’ll fall behind.


Final Reflection

Listening to Andrew Ng, I realized that AI history isn’t just about algorithms and hardware—it’s about people who dared to think differently and stick to their vision. Even as a noob, I can see that the future of AI isn’t only in giant labs—it’s in how we, ordinary people, learn to use it in our daily lives.

Maybe I won’t be building neural networks anytime soon, but I can start by being curious, experimenting with AI tools, and seeing where that curiosity leads me. If AI really is going to democratize intelligence, then even beginners like me have a place in this story.

DALL·E 3 vs. Nano Banana: Which AI Image Generator Leads the Future of Creativity?

The rapid evolution of AI image generation has brought incredible tools into the hands of creators. Two of the most talked-about models today are DALL·E 3 by OpenAI and Nano Banana, a newly released AI image editor that’s taking the community by storm. Both are reshaping digital art, but they differ in performance, flexibility, and target use cases.

In this blog, we’ll compare DALL·E 3 vs. Nano Banana, highlight their key features, and help you decide which one suits your creative workflow.


DALL·E 3: Context-Aware and Seamlessly Integrated

DALL·E 3 is the latest evolution of OpenAI’s generative art family, deeply integrated into ChatGPT. Its strength lies in contextual understanding—meaning it follows detailed prompts with high accuracy, even when generating complex scenes with multiple characters or objects.

Key Features of DALL·E 3:

  • Deep integration with ChatGPT for conversational prompt refinement

  • Ability to generate illustrations with coherent detail

  • Inpainting support for editing portions of an image

  • Robust safety filters for responsible use

DALL·E 3 is best for illustrators, marketers, and storytellers who want to generate consistent, context-aware imagery with minimal prompt engineering.


Nano Banana: Precision Editing with Next-Level Control

While DALL·E 3 excels at storytelling, Nano Banana shines in precision editing. First discovered on LM Arena under its code name, this new model has gained traction because of its uncanny ability to handle image editing like never before.

Key Features of Nano Banana:

  • Add or remove elements within existing images with pixel-level precision

  • Unmatched character and object consistency across edits

  • Faster turnaround for design iterations

  • High-quality outputs suitable for marketing, product design, and concept art

Nano Banana is ideal for graphic designers, product teams, and digital artists who need control and flexibility rather than just prompt-to-image creativity.


Head-to-Head: Which One Wins?

Feature | DALL·E 3 | Nano Banana
Strength | Contextual storytelling | Precision editing & object control
Integration | ChatGPT ecosystem | Standalone editor (LM Arena roots)
Best Use Case | Marketing visuals, comics, books | Design workflows, product mockups
Learning Curve | Beginner-friendly | Requires hands-on experimenting

If your goal is to create narrative-rich visuals, DALL·E 3 is the natural choice. But if you need fine-grained image editing and creative flexibility, Nano Banana is the rising star.


The Future of AI Image Generation

Both tools reflect a broader trend in AI-powered creativity—a move from simply generating images to intelligently editing, refining, and contextualizing them. It’s no longer about asking AI to draw something new; it’s about co-creating with AI at every stage of the design process.

For most creators, the real power may lie in using both: DALL·E 3 for initial storytelling and Nano Banana for polishing and refining outputs.


Takeaway:
The debate of DALL·E 3 vs. Nano Banana isn’t about which one replaces the other—it’s about how they complement each other in shaping the future of AI image generation. Together, they point toward a creative ecosystem where AI becomes a true collaborator.

Introducing Gemini 2.5 Flash Image — Fast, Consistent, and Context‑Aware Image Generation from Google

 Google has launched Gemini 2.5 Flash Image (codenamed nano‑banana), a powerful update to its image model offering fast generation, precise editing, and content-aware intelligence. The release builds on Gemini’s low-latency image generation, adding rich storytelling, character fidelity, and template reusability. The model is available now via the Gemini API, Google AI Studio, and Vertex AI for developers and enterprises. 

Key Features & Capabilities

  • Character Consistency: Maintain appearance across prompts—ideal for branding, storytelling, and product mockups.
    Example: Swap a character’s environment while preserving their look using Google AI Studio templates. 

  • Prompt-Based Image Edits: Perform fine-grained edits using text, like blurring backgrounds, removing objects, changing poses, or applying color to B&W photos—all with a single prompt. 

  • World Knowledge Integration: Understand diagrams, answer questions, and follow complex instructions seamlessly by combining vision with conceptual reasoning. 

  • Multi-Image Fusion: Merge multiple inputs—objects into scenes, room restyling, texture adjustments—using drag-and-drop via Google AI Studio templates.

  • Vibe‑Coding Experience: Pre-built template apps in AI Studio enable fast prototyping—build image editors by prompts and deploy or export as code. 

  • Invisible SynthID Watermark: All generated or edited images include a non-intrusive watermark for AI provenance. 
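
As a rough illustration of the prompt-based editing flow described above, here is a sketch using the Gemini API's Python SDK. The model identifier and file names are assumptions, so check the current API documentation before relying on them.

```python
# Sketch: prompt-based image editing through the Gemini API (google-genai SDK).
# Model id and file names are assumptions; verify against the current docs.
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")

prompt = "Blur the background and keep the product in sharp focus."
source = Image.open("product_shot.png")

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",   # assumed model identifier
    contents=[prompt, source],
)

# Save any returned image parts (SynthID watermark is embedded server-side).
for part in response.candidates[0].content.parts:
    if part.inline_data is not None:
        with open("edited.png", "wb") as f:
            f.write(part.inline_data.data)
```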


Where to Try It

Gemini 2.5 Flash Image is offered through:

  • Gemini API — ready for integration into apps.

  • Google AI Studio — experiment with visual templates and exportable builds.

  • Vertex AI — enterprise-grade deployment and scalability.
    It’s priced at $30 per 1 million output tokens (~$0.039 per image) and supports input/output pricing consistent with Gemini 2.5 Flash. 


Why It Matters

  • Seamless creative iterations — Designers save time when characters, layouts, and templates stay consistent across edits.

  • Smart editing with intuition — Natural-language edits reduce the complexity of pixel-level manipulation.

  • Use-case versatility — From education to real estate mockups, creative marketing, and diagram analysis.

  • Responsible AI use — Embedded watermarking helps with transparency and traceability.

22.8.25

Chain-of-Agents turns a whole agent swarm into a single end-to-end model

 Multi-agent frameworks can crush complex tasks—but they’re brittle, hand-engineered, and expensive to run. OPPO’s AI Agent team proposes a cleaner path: Chain-of-Agents (CoA), where a single model dynamically “plays” multiple roles and tools, simulating agent collaboration end-to-end without external orchestration. The team trains Agent Foundation Models (AFMs) with a two-step recipe: multi-agent distillation (learning from the best existing agent systems) followed by agentic RL on verifiable tasks. Result: a compact, data-trainable alternative to sprawling agent stacks. 

How it works

  • CoA paradigm: the model can activate role-specific and tool-specific “agents” inside its own prompt scaffolding, supporting multi-turn, multi-tool problem solving in one pass. 

  • Multi-agent distillation: successful trajectories from SOTA frameworks (e.g., OAgents) are converted into CoA-compatible traces, then used for supervised tuning so the AFM internalizes collaboration patterns. 

  • Agentic RL: verifiable tasks (search, code, math) provide reward signals that sharpen when to plan, call tools, and switch roles. 
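
As a toy illustration of the multi-agent distillation step above, here is a sketch that flattens a framework trajectory into a single CoA-style training trace. The role tags and trace schema are invented for illustration and are not the paper's actual format.

```python
# Sketch: flatten a multi-agent trajectory into one CoA-style SFT trace.
# Tags and schema are illustrative; the AFM paper defines its own format.
def to_coa_trace(task, steps, final_answer):
    parts = [f"<task>{task}</task>"]
    for step in steps:
        parts.append(f"<agent role='{step['role']}'>")
        parts.append(f"  <think>{step['thought']}</think>")
        if step.get("tool"):
            parts.append(f"  <tool name='{step['tool']}'>{step['tool_input']}</tool>")
            parts.append(f"  <observation>{step['tool_output']}</observation>")
        parts.append("</agent>")
    parts.append(f"<answer>{final_answer}</answer>")
    return "\n".join(parts)

trace = to_coa_trace(
    task="Find the 2023 revenue of ACME Corp.",
    steps=[
        {"role": "planner", "thought": "Search the web, then extract the figure."},
        {"role": "web", "thought": "Query a search tool.",
         "tool": "search", "tool_input": "ACME Corp 2023 revenue",
         "tool_output": "ACME reported $1.2B revenue in 2023."},
    ],
    final_answer="$1.2B",
)
```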

The scoreboard

A 32B AFM posts new highs across web and code agents—and strong math gains: 55.3% GAIA, 11.1% BrowseComp, 18.0% HLE, 47.9% LiveCodeBench-v5, 32.7% CodeContests, and 59.8% AIME’25, surpassing recent tool-integrated reasoning baselines like ReTool and SimpleTIR. 

Beyond accuracy, CoA slashes runtime waste: the paper reports an 84.6% reduction in inference token cost versus traditional multi-agent frameworks while keeping performance competitive—thanks to fewer round-trips and no inter-agent chatter. 

Why it matters

  • From frameworks to foundations. Distilling orchestration into the model itself turns agent systems into trainable objects, not just prompt graphs. 

  • Generalization & scaling knobs. Analyses show transfer to unseen agents/tools and test-time scaling behaviors (think “try more plans” without changing weights). 

  • Open everything. OPPO releases weights, code, and training data, giving startups a reproducible base to study agentic RL beyond ReAct-style pipelines. 

CoA’s pitch is simple: keep the multi-tool, multi-role superpowers—but train them into one model. If the reported GAIA/BrowseComp gains hold up, expect more teams to swap brittle agent graphs for AFMs that plan, act, and coordinate natively.

Paper link: arXiv 2508.13167 (PDF)

ComputerRL scales online RL for “desktop agents,” unifying APIs and GUIs

 The next wave of computer-use agents won’t just click around UIs—they’ll mix API calls and GUI interaction in one policy. That’s the bet behind ComputerRL, a new framework that treats desktop work as an end-to-end reinforcement learning problem and introduces an API-GUI paradigm so agents can call services and operate human-oriented interfaces within the same loop. 

The missing infrastructure for scale

Training desktop agents with online RL has been hamstrung by slow, brittle environments. ComputerRL ships a distributed RL stack that orchestrates thousands of parallel virtual desktops, making long-horizon, on-policy training runs practical for general computer use. 

Stabilizing long runs: Entropulse

Pure RL on complex desktops tends to collapse exploration entropy over time. The authors propose Entropulse, a simple but effective schedule that alternates RL with supervised fine-tuning, restoring healthy entropy while retaining the gains from policy improvement. 
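
A minimal sketch of an Entropulse-style schedule, assuming caller-supplied RL, SFT, and rollout hooks; the phase lengths are illustrative, not the paper's settings.

```python
# Sketch of an Entropulse-style schedule: alternate online RL with SFT on
# successful rollouts to restore policy entropy. Phase lengths are assumptions.
import random

def entropulse_schedule(rl_step, sft_step, rollout, total_rounds=6,
                        rl_steps=10_000, sft_steps=2_000):
    """rl_step, sft_step, rollout are caller-supplied callables.

    rollout() is assumed to return a dict with a boolean 'success' flag.
    """
    successes = []
    for _ in range(total_rounds):
        for _ in range(rl_steps):                  # Phase A: on-policy RL
            traj = rollout()
            rl_step(traj)
            if traj.get("success"):
                successes.append(traj)
        for _ in range(sft_steps):                 # Phase B: SFT restores entropy
            if successes:
                sft_step(random.choice(successes))
```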

Results & models

Using open backbones (GLM-4-9B-0414 and Qwen2.5-14B), the team evaluates on OSWorld and reports 48.1% accuracy with AutoGLM-OS-9B, a new state of the art for general desktop automation in their setup. The framework underpins the group’s AutoGLM system.

Why it matters

  • Bridging the modality gap: Real workflows mix API calls with UI operations; ComputerRL trains a single policy to do both. 

  • Throughput for RL: Parallelized desktops unlock the scale online RL has needed for computer agents. 

  • Simple stability trick: Entropulse offers a practical recipe any lab can try to keep long runs from collapsing. 

If your roadmap includes agents that file expenses, reconcile sheets, or run web apps end-to-end, ComputerRL reads like a blueprint for turning brittle demos into trainable, scalable systems.

Paper link: arXiv 2508.14040 (PDF)

20.8.25

DINOv3: Meta’s Self-Supervised Vision Backbone Built to Scale—and Transfer

 Meta has unveiled DINOv3, the latest in its family of self-supervised vision models aimed at learning from raw images—no labels required—and transferring those features cleanly across tasks. The release pairs a readable training recipe with open implementations and model suites, positioning DINOv3 as a practical foundation for detection, segmentation, retrieval, and zero-shot classification in real products. 

What’s new in DINOv3

Scale without supervision. The core idea remains simple: pretrain on massive, diverse image data using self-distillation and augmentation, then reuse the frozen backbone downstream. DINOv3 pushes this further with careful data prep, optimization, and—crucially—two new strategies to keep features robust at large scale. 

1) Gram anchoring for dense features. Long training runs can erode fine local details that dense tasks (e.g., segmentation, depth) depend on. DINOv3 introduces gram anchoring, a constraint that preserves local feature structure so dense predictions stay sharp even as the backbone learns global invariances. This noticeably lifts dense-task scores relative to prior SSL baselines. 
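
A minimal sketch of what a gram-anchoring-style regularizer could look like in PyTorch: constrain the pairwise patch-similarity (Gram) matrix of the student to stay close to that of a frozen anchor network. The normalization and loss form are assumptions, not Meta's exact objective.

```python
# Sketch of a gram-anchoring-style term: keep the patch-feature Gram matrix
# close to an earlier anchor's so dense structure survives long pretraining.
import torch
import torch.nn.functional as F

def gram_anchor_loss(student_patches, anchor_patches):
    # *_patches: (B, N, D) patch tokens from the student and a frozen anchor
    s = F.normalize(student_patches, dim=-1)
    a = F.normalize(anchor_patches, dim=-1)
    gram_s = s @ s.transpose(1, 2)      # (B, N, N) pairwise patch similarities
    gram_a = a @ a.transpose(1, 2)
    return F.mse_loss(gram_s, gram_a)

loss = gram_anchor_loss(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
```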

2) Post-hoc high-resolution adaptation. After pretraining, DINOv3 applies a light-touch adaptation to handle higher input resolutions and different model sizes without retraining from scratch—useful when you need 1024-px inputs for instance or semantic segmentation. 

3) Optional text alignment. For open-vocabulary or zero-shot use, DINOv3 supports a compact text-alignment step, enabling image-text matching and classification without full supervised fine-tuning of the vision backbone. 

Why it matters

DINOv3 is pitched as a universal vision backbone: a single, frozen model that outperforms specialized systems across a broad set of benchmarks—often without task-specific fine-tuning—by producing high-quality dense and global features alike. For teams, this means fewer bespoke models to train and a clearer path from pretraining to deployment. 

What you can build today

  • Object detection & instance/semantic segmentation. Drop DINOv3 into your detector or segmentor head to improve transfer, especially at higher resolutions. 

  • Zero-shot and open-vocabulary classification. Pair the frozen backbone with the text alignment step to classify new categories without labels. 

  • Image retrieval and similarity search. Use embeddings from the backbone for robust retrieval in e-commerce, media, or industrial archives. 

Developer on-ramp

Meta has released a reference PyTorch implementation with pretrained checkpoints, scripts, and configs, along with a public paper and model cards. If you’re migrating from DINO/DINOv2, the training and evaluation stacks are familiar; adding gram anchoring and the post-hoc adapters is straightforward. 

  • Blog & overview: how the method scales and where it shines. 

  • Paper (arXiv): full method, ablations, and benchmark details. 

  • Code & weights (GitHub): ready-to-run training/eval pipelines. 

  • Model hub page: consolidated resources and model suite. 

Practical tips

  • Choose resolution by task. Start with the default pretraining size; enable the high-res adapter for dense tasks that benefit from finer detail. 

  • Freeze first, tune later. Many gains show up with a frozen backbone and light heads; reserve end-to-end tuning for domain shifts that remain stubborn. 

  • Mind augmentation & data mix. DINOv3’s results rely on carefully designed augmentations and large, diverse pretraining data—replicate that discipline in your own pipelines. 

The takeaway

DINOv3 turns self-supervised pretraining into a dependable, production-minded recipe for vision. With gram anchoring to protect dense signals, post-hoc adaptation for resolution and scale, and optional text alignment for zero-shot scenarios, it offers one backbone you can reuse across many tasks—supported by open code and clear documentation. For teams balancing accuracy, versatility, and engineering simplicity, DINOv3 is a strong default choice for 2025-era computer vision.

19.8.25

AutoCodeBench turns LLMs into benchmark factories — and today’s coders sweat at ~52%

 Code benchmarks have a scaling problem: hand-written tasks don’t keep up with fast-improving models, and multilingual coverage is thin. Tencent Hunyuan’s new paper proposes a fix: AutoCodeGen, an automated workflow that inverts dataset creation—generate solutions and tests first, then ask the LLM to write the problem—validated by a multilingual execution sandbox. The result is AutoCodeBench (ACB), a 3,920-problem suite evenly spread over 20 languages, with ~9.6 tests per problem and a deliberate bias toward hard tasks. Even frontier “think-mode” models top out around ~52% Pass@1, signaling real headroom. 

How they build hard, correct problems

AutoCodeGen runs in four steps: (1) LLMs evolve self-contained code solutions from real multilingual snippets; (2) LLMs propose public and private test inputs, which the sandbox executes to compute ground-truth outputs; (3) the LLM then writes the problem description constrained by strict specs (language, entry points, naming); (4) a three-stage filter (multi-sampling for difficulty, LLM-as-critic for quality, diversity tagging) trims the set. This “reverse-order” recipe yields correct, executable tests without humans in the loop. 
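
Schematically, the reverse-order recipe might look like the sketch below, where the helper calls stand in for LLM prompts and sandbox execution rather than the paper's actual API.

```python
# Schematic of AutoCodeGen's reverse-order recipe: solution and tests first,
# problem statement last, all checked in a sandbox. `llm` and `sandbox` are
# stand-ins; `llm` is assumed to return already-parsed outputs.
def autocodegen(snippet, llm, sandbox, language="python"):
    solution = llm(f"Evolve this {language} snippet into a self-contained solution:\n{snippet}")

    test_inputs = llm(f"Propose public and private test inputs for:\n{solution}")
    expected = [sandbox.run(solution, inp) for inp in test_inputs]   # ground truth

    problem = llm(
        "Write a problem description matching this solution and these tests. "
        f"Language: {language}. Solution:\n{solution}"
    )

    candidate = {"problem": problem, "solution": solution,
                 "tests": list(zip(test_inputs, expected))}
    return candidate if passes_filters(candidate, llm) else None

def passes_filters(candidate, llm):
    # Stage filters: difficulty via multi-sampling, LLM-as-critic quality check,
    # and diversity tagging would go here.
    return True
```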

What’s inside ACB

  • Scale & spread: 3,920 problems, 37,777 tests, 20 languages (Python→TypeScript), 14 task categories from data structures to systems programming. >60% are “hard.” 

  • Sandbox: open-sourced, 20+ languages, high-concurrency, request-based calls—usable for eval and data synthesis. 

  • Lite & Complete: ACB-Lite (≈1,586 problems) for faster evals; ACB-Complete (1,000 completion-style tasks, 3-shot) targets base models rather than chat-tuned ones. 

The scoreboard: even elites struggle

Averaged across 20 languages, the leaderboard’s top tier lands ~50–52% Pass@1, led by Claude Opus 4 (Think) at 52.4%, with o3-high, Grok-4, Claude Sonnet 4 (Think), and DeepSeek-R1-0528 clustered close behind. Mid-tier open models sit in the 30s–40s; smaller coders drop to the 20s. Translation: the multilingual + multi-logical mix is punishing. 

Iterating with sandbox feedback helps

Across three refinement turns using execution error messages, models like DeepSeek-V3-0324 and Qwen2.5-Coder-32B-Instruct gain ~8–12 points, with the biggest jump on turn one—evidence that automated error signals materially improve code generation. 
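
The repair loop is simple to sketch: feed execution errors back to the model for a bounded number of turns. `generate` and `sandbox.run_tests` are stand-ins, not the released tooling's actual interface.

```python
# Sketch of sandbox-feedback refinement: re-prompt with execution errors
# for up to three turns, mirroring the setup measured in the paper.
def refine_with_feedback(problem, generate, sandbox, max_turns=3):
    code = generate(problem)
    for _ in range(max_turns):
        ok, error_msg = sandbox.run_tests(code)
        if ok:
            return code
        code = generate(f"{problem}\n\nYour previous attempt failed:\n{error_msg}\nFix it.")
    return code
```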

Base-model check: ACB-Complete

On the 1,000-item, 3-shot ACB-Complete, Seed-Coder-8B-Base leads its class (≤8B) at 31.6% Pass@1, edging OpenCoder-8B-Base and Qwen2.5-Coder-7B—useful signal for pre-instruct comparisons that classic HumanEval/MBPP miss. 

Why it matters

  • Human-free, multilingual, hard. ACB shows you can scale quality and coverage without armies of annotators. 

  • Better evals for code agents. Emphasis on multi-logical tasks (several core functions per problem) aligns with agent workflows like SWE-Bench. 

  • Sandbox as a lever. Open, concurrent execution infra doubles as a training-data factory and an iterative-repair oracle. 

Benchmarks drive progress. If your coding model cruises through Python puzzles but face-plants in Kotlin, Shell, or Elixir, AutoCodeBench will make that obvious—and give you a reproducible path to fix it.

Paper link: arXiv 2508.09101 (PDF)

16.8.25

GPT-5 tops multimodal medical QA—and even edges human experts on a new benchmark

 If you’ve wondered whether general-purpose LLMs can truly reason across medical text and images, a new study out of Emory University says GPT-5 can—and then some. In “Capabilities of GPT-5 on Multimodal Medical Reasoning,” the team treats GPT-5 as a generalist decision-support engine and runs it through a unified, zero-shot chain-of-thought (CoT) protocol spanning text-only and vision-augmented tasks. The short version: GPT-5 outperforms GPT-4o across the board and surpasses pre-licensed human experts on the toughest multimodal benchmark they tested. 

A cleaner test: one prompting recipe, many tasks

Prior medical LLM papers often mix datasets and prompting tricks, muddying comparisons. Here, the authors standardize splits and use the same two-turn CoT prompt for every dataset—first elicit reasoning, then force a single-letter answer—so differences reflect the model, not prompt engineering. Visual items attach image URLs in the first turn; the convergence step stays textual. 
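
A minimal sketch of that two-turn protocol, assuming a generic `chat` callable; the prompt wording is illustrative rather than the paper's exact text.

```python
# Sketch of the unified two-turn protocol: turn 1 elicits free-form reasoning
# (image URLs attached for multimodal items), turn 2 forces a single letter.
def two_turn_cot(chat, question, options, image_urls=None):
    turn1 = [{"type": "text",
              "text": f"{question}\nOptions: {options}\nThink step by step."}]
    for url in image_urls or []:
        turn1.append({"type": "image_url", "image_url": {"url": url}})

    reasoning = chat([{"role": "user", "content": turn1}])

    answer = chat([
        {"role": "user", "content": turn1},
        {"role": "assistant", "content": reasoning},
        {"role": "user", "content": "Reply with the single letter of the correct option."},
    ])
    return reasoning, answer.strip()
```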

The numbers

  • Text QA: On MedQA (US, 4-option), GPT-5 hits 95.84%—a +4.80% absolute gain over GPT-4o. MMLU medical subsets also tick up, including a perfect score in Medical Genetics. 

  • USMLE samples: Averaged across Steps 1–3, GPT-5 reaches 95.22% (+2.88 vs. GPT-4o), with the biggest lift on Step 2’s management-heavy items. 

  • Multimodal QA: On MedXpertQA-MM, GPT-5’s reasoning and understanding jump +29.26% and +26.18% over GPT-4o. A case study shows the model integrating CT findings, labs and symptoms to recommend a Gastrografin swallow for suspected esophageal perforation. 

  • Radiology VQA: On VQA-RAD, GPT-5 posts 70.92%—slightly below GPT-5-mini (74.90%), which the authors attribute to small-set quirks and calibration. 

Above pre-licensed human experts—at least on MedXpertQA

Compared against pre-licensed clinicians, GPT-5 clears the bar decisively on MedXpertQA: +15.22% (text reasoning), +9.40% (text understanding), +24.23% (multimodal reasoning), +29.40% (multimodal understanding). GPT-4o, by contrast, trails humans on most of these dimensions. 

Why it matters

  • From recall to reasoning. Gains concentrate on reasoning-intensive tasks (MedXpertQA, USMLE Step 2), suggesting internal upgrades beyond raw fact lookup.

  • Designing safer tools. The same unified protocol that boosts accuracy also produces structured rationales—useful for audit trails in clinical decision support. 

  • Open evals. The authors say they’ve made code public (GPT-5-Evaluation), inviting replication and deeper probing of failure modes. 

Mind the caveats

This is still benchmark-world: standardized items, time-limited settings, and no messy clinic realities. The paper itself cautions that real deployments will need calibration, domain-adapted fine-tuning and prospective trials. 

If those steps pan out, GPT-5 looks less like a better test-taker and more like a multimodal reasoner—one that can fuse text and images to recommend plausible next actions.

Paper link: arXiv 2508.08224 (PDF)

GPT-5 nails ophthalmology board questions—and shows how to buy accuracy wisely

 OpenAI’s newest reasoning line just aced a specialty test. In a cross-sectional benchmark of 260 closed-access AAO BCSC multiple-choice questions, GPT-5-high scored 96.5%—beating GPT-4o and OpenAI’s earlier o1, and statistically edging most GPT-5 variants, while tying o3-high within confidence intervals. Beyond raw accuracy, the paper grades rationale quality and runs a cost-accuracy analysis, surfacing Pareto-efficient configs for budget-sensitive deployments. 

What they tested—and how

Researchers evaluated 12 GPT-5 configurations (three model sizes × four reasoning_effort settings) alongside o1-high, o3-high, and GPT-4o. Prompts enforced strict JSON with a single letter answer + one-sentence rationale, zero-shot. A Bradley-Terry arena ranked head-to-head wins; an LLM-as-a-judge autograder compared rationales to reference explanations. 

Key results

  • Top score: GPT-5-high reaches 0.965 accuracy (95% CI 0.942–0.985), beating GPT-4o and o1-high and statistically comparable to o3-high (0.958).

  • Rationale quality: GPT-5-high ranked #1 in pairwise judging. 

  • Cost–accuracy frontier: Multiple efficient picks identified; GPT-5-mini-low emerges as the best low-cost, high-performance option. 

  • Reasoning effort matters: Minimal-effort variants underperform; higher effort boosts accuracy but costs more tokens/time. 
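
To make the cost-accuracy analysis above concrete, here is a small sketch of how Pareto-efficient configurations can be identified from (cost, accuracy) pairs; the example values are placeholders, not the paper's measurements.

```python
# Sketch: pick Pareto-efficient (cost, accuracy) configurations, i.e. those
# not dominated by a cheaper-and-at-least-as-accurate alternative.
def pareto_frontier(configs):
    # configs: list of (name, cost, accuracy)
    frontier = []
    for name, cost, acc in sorted(configs, key=lambda c: (c[1], -c[2])):
        if not frontier or acc > frontier[-1][2]:
            frontier.append((name, cost, acc))
    return frontier

# Illustrative placeholders only -- not the paper's measurements.
print(pareto_frontier([("model-A-low", 0.5, 0.90), ("model-A-high", 4.0, 0.96),
                       ("model-B-high", 6.0, 0.95)]))
```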

Why it matters

Hospitals and ed-tech teams rarely buy “max accuracy at any price.” This paper provides a menu of GPT-5 settings that trade pennies for percentage points, plus an autograder recipe others can adapt to scale specialty QA beyond ophthalmology.

Paper link: arXiv 2508.09956 (PDF)

“Speed Always Wins” is the field guide to building faster, cheaper LLMs

 Transformers scaled LLMs to jaw-dropping capabilities—but quadratic attention and ballooning KV caches are throttling real-world deployment. A new survey from Shanghai AI Lab, HKUST(GZ) and collaborators takes stock of what’s next, categorizing the ecosystem of efficient LLM architectures and where each shines. Think of it as a build sheet for teams trying to cut latency and cost without giving up quality. 

The efficiency playbook, in seven parts

  • Linear sequence modeling: from linearized attention to linear RNNs and state-space models that drop the KV cache and push complexity toward O(N) (see the sketch after this list).

  • Sparse sequence modeling: static, dynamic, and training-free sparsity to compute only the most useful token-token interactions. 

  • Efficient full attention: keep softmax attention but make it practical with IO-aware, grouped, mixture, and quantized attention variants. 

  • Sparse Mixture-of-Experts: routing, expert designs and MoE conversion to grow capacity without proportional FLOPs.

  • Hybrid architectures: inter-layer and intra-layer mixes that blend linear blocks with full attention for a better speed/quality trade-off. 

  • Diffusion LLMs: non-autoregressive generation, bridges back to AR, and early steps to extend diffusion approaches to multimodality. 

  • Beyond text: how these efficiency ideas transfer to vision, audio, and multimodal stacks. 
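
As a concrete taste of the linear sequence modeling item above, here is a minimal sketch of kernelized linear attention, where reordering the matrix multiplication turns the O(N²) softmax interaction into an O(N) computation. The ELU+1 feature map is one common choice, not the survey's prescription.

```python
# Sketch of kernelized linear attention: with a feature map phi, attention
# becomes phi(Q) @ (phi(K)^T V), computable in O(N) time without an N-sized
# KV cache. Non-causal variant shown for brevity.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (B, H, N, D)
    q, k = F.elu(q) + 1, F.elu(k) + 1              # positive feature map phi(.)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)     # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)

out = linear_attention(*(torch.randn(1, 8, 1024, 64) for _ in range(3)))
```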

Why this matters now

Long-context patterns—RAG, agentic tool use, deliberate reasoning, and multimodal inputs—are pushing sequence lengths and memory pressure through the roof. The survey frames these usage patterns and argues that architectural efficiency, not just better prompts or hardware, is the lever that scales the next wave of applications. 

A roadmap, not just a reading list

Beyond taxonomy, the paper stitches trends into a blueprint: pick linear/sparse methods to kill KV bloat, use efficient-full-attention where fidelity matters, layer in MoE for capacity, and consider hybrids or diffusion LLMs where generation style allows. There’s also a companion GitHub “Awesome-Efficient-Arch” list to track the space as it moves. 

If you’re building agents that browse, reason and call tools all day—or multimodal systems juggling video and audio—this survey is a timely map of the fastest lanes through today’s LLM bottlenecks.

Paper link: arXiv 2508.09834 (PDF)

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...