28.8.25

Anemoi: a semi-centralized agent system that lets bots talk to each other—literally

 Most generalist multi-agent stacks still look like a relay race: a central planner prompts specialist workers, who pass back long context blobs for the planner to stitch together. It works—until you downsize the planner or hit token limits. Anemoi proposes a different wiring: keep a light planner, but let agents communicate directly over an Agent-to-Agent (A2A) MCP server so everyone can see progress, flag bottlenecks, and propose fixes in real time. 

What’s actually new

Anemoi replaces unidirectional prompt passing with a threaded A2A server (built on the Model Context Protocol) that exposes primitives like list_agents, create_thread, send_message, and wait_for_mentions. Any agent can join a thread, address peers, and update plans mid-flight—reducing redundant context stuffing and information loss.

The cast of agents (and why it matters)

  • Planner: drafts the initial plan and spins up a thread.

  • Critique: continuously audits intermediate results.

  • Answer-Finder: compiles the final submission.

  • Workers: Web, Document Processing, and Reasoning & Coding—mirroring OWL’s tool set for a fair head-to-head. All are MCP-enabled so they can monitor progress and coordinate directly. 

This design reduces reliance on one overpowered planner, supports adaptive plan updates, and cuts token overhead from repeated context injection.

Numbers that move the needle (GAIA validation)

Framework            | Planner / Workers     | Avg. Acc.
OWL-rep (pass@3)     | GPT-4.1-mini / GPT-4o | 43.64%
OWL (paper, pass@3)  | GPT-4o-mini / GPT-4o  | 47.27%
Anemoi (pass@3)      | GPT-4.1-mini / GPT-4o | 52.73%

With a small planner (GPT-4.1-mini), Anemoi tops a strong open-source baseline by +9.09 points under identical tools and models—and is competitive with several proprietary systems that rely on larger planners. 

How the A2A workflow runs

  1. Discover agents.

  2. Create a thread with participants.

  3. Workers execute subtasks; Critique labels outputs accept/uncertain while any agent can contribute revisions.

  4. Consensus vote before finalization.

  5. Answer-Finder submits.

All of this happens via MCP messaging in a single conversation context.
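To make the loop concrete, here is a minimal sketch of one pass through this workflow. It assumes a hypothetical async MCP client object (`a2a`) exposing the primitives the paper names (list_agents, create_thread, send_message, wait_for_mentions); the signatures and the `critique` callable are illustrative, not Anemoi's actual API.

```python
from typing import Any, Callable

async def a2a_round(a2a: Any, critique: Callable[[str], bool], task: str) -> str:
    """One simplified pass through the A2A workflow (illustrative only)."""
    agents = await a2a.list_agents()                                           # 1) discover
    thread = await a2a.create_thread(participants=[a["id"] for a in agents])   # 2) thread
    await a2a.send_message(thread, sender="planner", text=f"Initial plan: {task}")

    while True:
        # 3) a worker posts a result and @-mentions the Critique agent
        result = await a2a.wait_for_mentions(thread, agent="critique")
        label = "accept" if critique(result["text"]) else "uncertain"
        await a2a.send_message(thread, sender="critique",
                               text=f"@{result['sender']} {label}")
        if label == "accept":                                                  # 4) consensus (simplified)
            break

    # 5) Answer-Finder compiles the final submission from the shared thread
    await a2a.send_message(thread, sender="answer-finder", text="Compiling final answer.")
    return result["text"]
```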

Where it wins—and where it trips

  • Wins: Of the tasks Anemoi solved that OWL missed, 52% were due to collaborative refinement enabled by A2A; another 8% came from less context redundancy. 

  • Failures: Remaining errors skew to LLM/tool limits (≈46%/21%), incorrect plans (≈12%), and some communication latency (≈10%)—notably when the web agent is busy and can’t respond to peers. 

Why this matters

If your agent system juggles web search, file I/O, and coding, direct inter-agent communication can deliver better results without upgrading to an expensive planner. Anemoi shows a practical blueprint: keep the planner lightweight, move coordination into an A2A layer, and let specialists negotiate in-thread instead of bloating prompts. 

Paper link: arXiv 2508.17068 (PDF)

Vision-SR1: a self-rewarding recipe that makes VLMs “see” before they “think”

 Most reinforcement-learning recipes for vision-language models (VLMs) grade only the final answer—so models learn to lean on text priors and hallucinate what isn’t in the image. Vision-SR1 flips that: it decomposes reasoning into visual perception → language reasoning, and rewards the model for producing a self-contained visual description that alone suffices to solve the task. No external teacher, no human labels—just the model validating its own perception. 

How the self-reward works

Vision-SR1 runs two rollouts of the same policy per example:

  1. Standard pass: image + question → visual perception + CoT + answer → reward on answer (and format).

  2. Self-reward pass: question + the model’s visual perception (no image) → CoT + answer → reward if correct, signalling that the perception captured what mattered. Rewards are combined under GRPO for stable updates. 
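As a rough illustration of the two-pass reward, here is a minimal sketch. The `rollout` and `checker` callables are assumptions standing in for the policy's generation and the answer verifier; the paper also uses a format reward and combines everything under GRPO, so the equal weighting below is a simplification.

```python
from typing import Callable, Optional, Tuple

def vision_sr1_reward(
    rollout: Callable[[Optional[bytes], str, Optional[str]], Tuple[str, str, str]],
    checker: Callable[[str, str], bool],
    image: bytes,
    question: str,
    gold: str,
) -> float:
    # Pass 1 (standard): image + question -> (perception, CoT, answer)
    perception, _, answer = rollout(image, question, None)
    r_answer = 1.0 if checker(answer, gold) else 0.0

    # Pass 2 (self-reward): question + the model's own perception, image withheld.
    # If the answer is still correct, the perception captured what mattered.
    _, _, answer_blind = rollout(None, question, perception)
    r_vision = 1.0 if checker(answer_blind, gold) else 0.0

    # Equal weighting is an assumption; the paper combines answer, format,
    # and perception rewards under GRPO.
    return r_answer + r_vision
```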

Training setup

The team builds a 47K-example RL set spanning math (30.5%), science/commonsense (30%), and general visual reasoning (39.5%). A 9K “cold-start” SFT subset teaches the output format before the short RL run (1 epoch). Backbones: Qwen-2.5-VL-3B and 7B. Code is public on GitHub. 

Benchmarks: fewer shortcuts, fewer hallucinations

On a broad suite—MMMU, MMMU-Pro, MM-Vet, RealWorldQA, VisNumBench, MathVerse, MATH-Vision, HallusionBench—Vision-SR1 consistently edges strong Vision-R1 baselines trained on the same 47K. With the 7B backbone, Vision-SR1 averages 58.8 vs 57.4 for Vision-R1; at 3B it's 52.9 vs 50.6.

The paper also introduces Language Shortcut Rate (LSR)—how often a model answers correctly with an insufficient perception. SR1 lowers LSR across datasets, indicating less “answering from priors.” 

Not just vision: text-only reasoning stays solid

On textual suites (MMLU-Pro, SuperGPQA, GSM8K, MATH-500), SR1 keeps or improves accuracy relative to Vision-R1—evidence that strengthening perception doesn’t degrade language-side reasoning. 

Why it matters

  • Balances see vs. think. Adding a perception reward raises dependence on pixels, not just prompts—curbing hallucinations without expensive human labels or external teachers. 

  • Simple to adopt. The “see-think-answer” format and two-pass self-reward bolt onto standard GRPO pipelines. 

  • Open and reproducible. Data recipe, SFT cold-start, and code are released for quick replication. 

Paper link: arXiv 2508.19652 (PDF)

HunyuanVideo-Foley brings studio-grade Foley to AI-generated video

 Text-to-video has gone cinematic, but most clips still sound like a silent movie. HunyuanVideo-Foley aims to fix that: it’s an end-to-end text-video-to-audio (TV2A) system that generates synchronized, high-quality Foley from pixels and prompts—no sound library or manual sound design required. The team marries a multimodal diffusion transformer with representation alignment and a large, purpose-built dataset, and reports state-of-the-art results on fidelity and sync. 

What’s new

  • 100k-hour TV2A dataset. A scalable pipeline filters web video into 8-second segments, drops silent/low-bandwidth clips, scores audio aesthetics/SNR, and checks both semantic (ImageBind) and temporal (AV-align) match before tagging and captioning with GenAU. 

  • Dual-phase multimodal attention. Video and audio are fused with joint self-attention (interleaved RoPE) for frame-level sync; text cues are injected later via cross-attention to avoid text dominating the mix. 

  • REPA loss for audio. A Representation Alignment (REPA) objective pulls internal DiT features toward self-supervised audio embeddings (ATST-Frame), stabilizing training and improving timbre/semantics (a minimal sketch of this term follows the list). 

  • Continuous-latent DAC-VAE. Replaces RVQ with a VAE (128-dim latents @48 kHz, 50 Hz latent rate) for cleaner reconstructions and fewer artifacts. 
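For intuition, here is a minimal sketch of a REPA-style alignment term, assuming a learned projection from DiT hidden states into the frozen ATST-Frame feature space and a cosine-similarity objective; the paper's exact projection head, layer choice, and loss weighting are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def repa_loss(dit_hidden: torch.Tensor,      # (B, T, D_dit) intermediate DiT features
              ssl_feats: torch.Tensor,       # (B, T, D_ssl) frozen ATST-Frame features
              proj: nn.Module) -> torch.Tensor:
    """REPA-style alignment: project DiT features and pull them toward the
    self-supervised audio embeddings via cosine similarity (sketch)."""
    pred = proj(dit_hidden)                  # assumes time axes are already aligned
    return 1.0 - F.cosine_similarity(pred, ssl_feats, dim=-1).mean()

# Example projection head (an assumption, not the paper's design):
# proj = nn.Sequential(nn.Linear(1536, 1024), nn.SiLU(), nn.Linear(1024, 768))
```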

How it’s built

HunyuanVideo-Foley stacks N₁ multimodal (audio-video) DiT blocks followed by N₂ audio-only blocks, modulated by Synchformer-derived sync features. The model used 18 MMDiT + 36 audio DiT layers (1536 hidden, 12 heads) and was trained 200k steps on the 100k-hour corpus; autoencoder pretraining ran 700k steps. The main run used 128 H20 GPUs with an effective batch size of 2048. 
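For reference, here are the reported hyperparameters collected in one place; the field names below are my own shorthand, not the authors' configuration keys.

```python
from dataclasses import dataclass

@dataclass
class FoleyDiTConfig:
    n_multimodal_blocks: int = 18    # N1: joint audio-video MMDiT blocks
    n_audio_blocks: int = 36         # N2: audio-only DiT blocks
    hidden_dim: int = 1536
    num_heads: int = 12
    latent_dim: int = 128            # DAC-VAE continuous latent channels
    latent_rate_hz: int = 50         # latent frames per second
    sample_rate_hz: int = 48_000     # audio sample rate
    train_steps: int = 200_000       # main run; autoencoder pretraining ran 700k steps
    effective_batch_size: int = 2048
```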

The receipts

Across three testbeds—Kling-Audio-Eval, VGGSound-Test, and MovieGen-Audio-Bench—the paper reports new SOTA on multiple axes, including audio quality (PQ), visual-semantic alignment (IB) and temporal sync (DeSync), plus higher human MOS scores on MovieGen-Audio-Bench. A sample from Kling-Audio-Eval: the model improves FD (PANNs) and KL vs. prior systems and lifts IB while keeping DeSync low. 

Example objective results (Kling-Audio-Eval)

Metric        | Best prior (sample) | HunyuanVideo-Foley
FD (PANNs) ↓  | 9.01 (MMAudio)      | 6.07
PQ ↑          | 6.05 (FoleyCrafter) | 6.12
IB ↑          | 0.30 (MMAudio)      | 0.38
DeSync ↓      | 0.56 (MMAudio)      | 0.54

Why it matters

  • Sound that matches the shot. By separating frame-sync (video↔audio) from semantic guidance (text↔audio), the model avoids the classic failure where captions drown out visual cues. 

  • Production-friendly fidelity. REPA and the continuous-latent DAC-VAE cut hiss, mushy transients, and texture mismatch—key for believable footsteps, doors, and crowd beds. 

  • Built to scale. A reproducible data pipeline and a demo page suggest this is more than a lab toy; it’s an audio stack teams can evaluate today. 

If generative video is to replace B-roll and animatics, it needs audio that lands. HunyuanVideo-Foley offers a blueprint: curate better multimodal data, align internal representations to robust audio features, and architect attention so text helps—without hijacking—the soundscape.

Paper link: arXiv 2508.16930 (PDF)

Gemini Now Runs Anywhere: Deploy Google’s AI Models on Your On‑Premises Infrastructure with Full Confidence

Google has taken a major step in enterprise AI by announcing that Gemini is now available anywhere—including your on-premises data centers via Google Distributed Cloud (GDC). After months of previews, Gemini on GDC is now generally available (GA) for air-gapped environments, with an ongoing preview for connected deployments.


Why This Matters — AI, Sovereignty, No Compromise

For organizations operating under stringent data governance, compliance rules, or data sovereignty requirements, Gemini on GDC lets you deploy Google's most capable AI models—like Gemini 2.5 Flash or Pro—directly within your secure infrastructure. Now, there's no longer a trade-off between AI innovation and enterprise control.

Key capabilities unlocked for on-prem deployments include:

  • Multimodal reasoning across text, images, audio, and video

  • Automated intelligence for insights, summarization, and analysis

  • AI-enhanced productivity—from code generation to virtual agents

  • Embedded safety features, like content filters and policy enforcement


Enterprise-Grade Infrastructure & Security Stack

Google’s solution is more than just AI—we're talking enterprise-ready infrastructure:

  • High-performance GPU clusters, built on NVIDIA Hopper and Blackwell hardware

  • Zero-touch managed endpoints, complete with auto-scaling and L7 load balancing

  • Full audit logs, access control, and Confidential Computing for both CPU (Intel TDX) and GPU

Together, these foundations support secure, compliant, and scalable AI across air-gapped or hybrid environments.


Customer Endorsements — Early Adoption & Trust

Several government and enterprise organizations are already leveraging Gemini on GDC:

  • GovTech Singapore (CSIT) appreciates the combo of generative AI and compliance controls

  • HTX (Home Team Science & Technology) credits the deployment framework for bridging their AI roadmap with sovereign data

  • KDDI (Japan) and Liquid C2 similarly highlight the AI-local, governance-first advantage


Getting Started & What it Enables

Actions you can take today:

  1. Request a strategy session via Google Cloud to plan deployment architecture

  2. Access Gemini 2.5 Flash/Pro endpoints as managed services inside your infrastructure

  3. Build enterprise AI agents over on-prem data with Vertex AI APIs
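For step 3, a minimal sketch with the standard Vertex AI Python SDK; the project, location, and on-prem endpoint values are placeholders, and the exact configuration for a GDC deployment may differ from what's shown.

```python
# pip install google-cloud-aiplatform
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(
    project="your-project",               # placeholder
    location="your-region",               # placeholder
    api_endpoint="gdc.example.internal",  # assumption: your on-prem GDC endpoint
)

model = GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Summarize this internal incident report: ...")
print(response.text)
```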

Use cases include:

  • Secure document summarization or sentiment analysis on internal or classified datasets

  • Intelligent chatbots and virtual agents that stay within corporate networks

  • AI-powered CI/CD workflows—code generation, testing, bug triage—all without calling home


Final Takeaway

With Gemini now available anywhere, Google is giving organizations the power to scale AI ambition without sacrificing security or compliance. This move removes a long-standing blocker for enterprise and public-sector AI adoption. Whether you're a government agency, regulated financial group, or global manufacturer, deploying AI inside your own walls is no longer hypothetical; it's available today.


27.8.25

From Helicopters to Google Brain: What I Learned About AI as a Noob Listening to Andrew Ng

 I’ll be honest: I’m still a total beginner when it comes to AI. Most of the time I hear people talk about things like “neural networks,” “transformers,” or “TPUs,” it sounds like another language. But I recently listened to Andrew Ng on the Moonshot Podcast, and it gave me a way to see AI not as something intimidating, but as something that could change everyday life—even for people like me.

Here are the biggest lessons I picked up.


1. AI as a Great Equalizer

One of the first things Andrew said struck me right away: intelligence is expensive. Hiring a doctor, a tutor, or even a consultant costs a lot because human expertise takes years to develop. But AI has the potential to make that kind of intelligence cheap and accessible.

Imagine everyone having their own team of “digital staff”—a tutor for your child, a health advisor, or even a personal coach. Right now, only the wealthy can afford that kind of help. But in the future, AI could democratize it. As someone who’s just trying to figure this whole AI thing out, that idea excites me. AI might not just be about flashy tech—it could really level the playing field.


2. Scale Matters (Even When People Doubt You)

I didn’t realize that when Andrew Ng and others were pushing for bigger and bigger neural networks in the late 2000s, people thought they were wasting their time. Senior researchers told him not to do it, that it was bad for his career.

But Andrew had data showing that the bigger the models, the better they performed. He stuck with it, even when people literally yelled at him at conferences. That persistence eventually led to the creation of Google Brain and a major shift in AI research.

For me, the lesson is clear: sometimes the thing that seems “too simple” or “too obvious” is actually the breakthrough. If the data shows promise, don’t ignore it just because experts frown at it.


3. One Algorithm to Learn Them All

Another mind-blowing takeaway was Andrew’s idea of the “one learning algorithm.” Instead of inventing separate algorithms for vision, speech, and text, maybe there could be one system that learns to handle different types of data.

That sounded crazy back then—but it’s basically what we see today with large models like Gemini or ChatGPT. You give them text, audio, or images, and they adapt. To me, this shows how powerful it is to think in terms of general solutions rather than endless one-off fixes.


4. People Using AI Will Replace People Who Don’t

Andrew made a simple but scary point: AI won’t replace people, but people who use AI will replace people who don’t.

It’s kind of like Google Search. Imagine hiring someone today who doesn’t know how to use it—it just wouldn’t make sense. Soon, knowing how to use AI will be just as basic. That’s a wake-up call for me personally. If I don’t learn to use these tools, I’ll fall behind.


Final Reflection

Listening to Andrew Ng, I realized that AI history isn’t just about algorithms and hardware—it’s about people who dared to think differently and stick to their vision. Even as a noob, I can see that the future of AI isn’t only in giant labs—it’s in how we, ordinary people, learn to use it in our daily lives.

Maybe I won’t be building neural networks anytime soon, but I can start by being curious, experimenting with AI tools, and seeing where that curiosity leads me. If AI really is going to democratize intelligence, then even beginners like me have a place in this story.

DALL·E 3 vs. Nano Banana: Which AI Image Generator Leads the Future of Creativity?

The rapid evolution of AI image generation has brought incredible tools into the hands of creators. Two of the most talked-about models today are DALL·E 3 by OpenAI and Nano Banana, a newly released AI image editor that’s taking the community by storm. Both are reshaping digital art, but they differ in performance, flexibility, and target use cases.

In this blog, we’ll compare DALL·E 3 vs. Nano Banana, highlight their key features, and help you decide which one suits your creative workflow.


DALL·E 3: Context-Aware and Seamlessly Integrated

DALL·E 3 is the latest evolution of OpenAI’s generative art family, deeply integrated into ChatGPT. Its strength lies in contextual understanding—meaning it follows detailed prompts with high accuracy, even when generating complex scenes with multiple characters or objects.

Key Features of DALL·E 3:

  • Deep integration with ChatGPT for conversational prompt refinement

  • Ability to generate illustrations with coherent detail

  • Inpainting support for editing portions of an image

  • Robust safety filters for responsible use

DALL·E 3 is best for illustrators, marketers, and storytellers who want to generate consistent, context-aware imagery with minimal prompt engineering.


Nano Banana: Precision Editing with Next-Level Control

While DALL·E 3 excels at storytelling, Nano Banana shines in precision editing. First discovered on LM Arena under its code name, this new model has gained traction because of its uncanny ability to handle image editing like never before.

Key Features of Nano Banana:

  • Add or remove elements within existing images with pixel-level precision

  • Unmatched character and object consistency across edits

  • Faster turnaround for design iterations

  • High-quality outputs suitable for marketing, product design, and concept art

Nano Banana is ideal for graphic designers, product teams, and digital artists who need control and flexibility rather than just prompt-to-image creativity.


Head-to-Head: Which One Wins?

Feature        | DALL·E 3                         | Nano Banana
Strength       | Contextual storytelling          | Precision editing & object control
Integration    | ChatGPT ecosystem                | Standalone editor (LM Arena roots)
Best Use Case  | Marketing visuals, comics, books | Design workflows, product mockups
Learning Curve | Beginner-friendly                | Requires hands-on experimenting

If your goal is to create narrative-rich visuals, DALL·E 3 is the natural choice. But if you need fine-grained image editing and creative flexibility, Nano Banana is the rising star.


The Future of AI Image Generation

Both tools reflect a broader trend in AI-powered creativity—a move from simply generating images to intelligently editing, refining, and contextualizing them. It’s no longer about asking AI to draw something new; it’s about co-creating with AI at every stage of the design process.

For most creators, the real power may lie in using both: DALL·E 3 for initial storytelling and Nano Banana for polishing and refining outputs.


Takeaway:
The debate of DALL·E 3 vs. Nano Banana isn’t about which one replaces the other—it’s about how they complement each other in shaping the future of AI image generation. Together, they point toward a creative ecosystem where AI becomes a true collaborator.

Introducing Gemini 2.5 Flash Image — Fast, Consistent, and Context‑Aware Image Generation from Google

 Google has launched Gemini 2.5 Flash Image (codenamed nano‑banana), a powerful update to its image model offering fast generation, precise editing, and content-aware intelligence. The release builds on Gemini’s low-latency image generation, adding rich storytelling, character fidelity, and template reusability. The model is available now via the Gemini API, Google AI Studio, and Vertex AI for developers and enterprises. 

Key Features & Capabilities

  • Character Consistency: Maintain appearance across prompts—ideal for branding, storytelling, and product mockups.
    Example: Swap a character’s environment while preserving their look using Google AI Studio templates. 

  • Prompt-Based Image Edits: Perform fine-grained edits using text, like blurring backgrounds, removing objects, changing poses, or applying color to B&W photos—all with a single prompt. 

  • World Knowledge Integration: Understand diagrams, answer questions, and follow complex instructions seamlessly by combining vision with conceptual reasoning. 

  • Multi-Image Fusion: Merge multiple inputs—objects into scenes, room restyling, texture adjustments—using drag-and-drop via Google AI Studio templates.

  • Vibe‑Coding Experience: Pre-built template apps in AI Studio enable fast prototyping—build image editors by prompts and deploy or export as code. 

  • Invisible SynthID Watermark: All generated or edited images include a non-intrusive watermark for AI provenance. 


Where to Try It

Gemini 2.5 Flash Image is offered through:

  • Gemini API — ready for integration into apps.

  • Google AI Studio — experiment with visual templates and exportable builds.

  • Vertex AI — enterprise-grade deployment and scalability.
    Image output is priced at $30 per 1 million output tokens (roughly $0.039 per image), with other input/output token pricing consistent with Gemini 2.5 Flash. 
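A minimal sketch of calling the model through the Gemini API with the google-genai Python SDK; the model ID and the inline-data response handling follow the public docs at launch, but verify both against current documentation.

```python
# pip install google-genai ; set GEMINI_API_KEY in the environment
from google import genai

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",   # assumption: model ID at launch
    contents="A product mockup: ceramic mug on a walnut desk, soft morning light",
)

for part in response.candidates[0].content.parts:
    if part.inline_data:                       # generated image bytes
        with open("mug.png", "wb") as f:
            f.write(part.inline_data.data)
    elif part.text:                            # any accompanying text
        print(part.text)
```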


Why It Matters

  • Seamless creative iterations — Designers save time when characters, layouts, and templates stay consistent across edits.

  • Smart editing with intuition — Natural-language edits reduce the complexity of pixel-level manipulation.

  • Use-case versatility — From education to real estate mockups, creative marketing, and diagram analysis.

  • Responsible AI use — Embedded watermarking helps with transparency and traceability.

22.8.25

Chain-of-Agents turns a whole agent swarm into a single end-to-end model

 Multi-agent frameworks can crush complex tasks—but they’re brittle, hand-engineered, and expensive to run. OPPO’s AI Agent team proposes a cleaner path: Chain-of-Agents (CoA), where a single model dynamically “plays” multiple roles and tools, simulating agent collaboration end-to-end without external orchestration. The team trains Agent Foundation Models (AFMs) with a two-step recipe: multi-agent distillation (learning from the best existing agent systems) followed by agentic RL on verifiable tasks. Result: a compact, data-trainable alternative to sprawling agent stacks. 

How it works

  • CoA paradigm: the model can activate role-specific and tool-specific “agents” inside its own prompt scaffolding, supporting multi-turn, multi-tool problem solving in one pass. 

  • Multi-agent distillation: successful trajectories from SOTA frameworks (e.g., OAgents) are converted into CoA-compatible traces, then used for supervised tuning so the AFM internalizes collaboration patterns. 

  • Agentic RL: verifiable tasks (search, code, math) provide reward signals that sharpen when to plan, call tools, and switch roles. 
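As a concrete illustration of what "verifiable tasks" can mean in the RL stage, here is a minimal sketch of two reward checkers (exact-match math answers and hidden unit tests for code); these are generic examples, not the paper's actual reward functions.

```python
import re
import subprocess
import tempfile

def math_reward(model_answer: str, gold: str) -> float:
    """1.0 if the last number in the model's answer matches the gold answer."""
    nums = re.findall(r"[-+]?\d*\.?\d+", model_answer)
    return 1.0 if nums and nums[-1] == gold.strip() else 0.0

def code_reward(program: str, tests: str) -> float:
    """1.0 if the generated program passes the hidden unit tests."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True, timeout=30)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```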

The scoreboard

A 32B AFM posts new highs across web and code agents—and strong math gains: 55.3% GAIA, 11.1% BrowseComp, 18.0% HLE, 47.9% LiveCodeBench-v5, 32.7% CodeContests, and 59.8% AIME’25, surpassing recent tool-integrated reasoning baselines like ReTool and SimpleTIR. 

Beyond accuracy, CoA slashes runtime waste: the paper reports an 84.6% reduction in inference token cost versus traditional multi-agent frameworks while keeping performance competitive—thanks to fewer round-trips and no inter-agent chatter. 

Why it matters

  • From frameworks to foundations. Distilling orchestration into the model itself turns agent systems into trainable objects, not just prompt graphs. 

  • Generalization & scaling knobs. Analyses show transfer to unseen agents/tools and test-time scaling behaviors (think “try more plans” without changing weights). 

  • Open everything. OPPO releases weights, code, and training data, giving startups a reproducible base to study agentic RL beyond ReAct-style pipelines. 

CoA’s pitch is simple: keep the multi-tool, multi-role superpowers—but train them into one model. If the reported GAIA/BrowseComp gains hold up, expect more teams to swap brittle agent graphs for AFMs that plan, act, and coordinate natively.

Paper link: arXiv 2508.13167 (PDF)

ComputerRL scales online RL for “desktop agents,” unifying APIs and GUIs

 The next wave of computer-use agents won’t just click around UIs—they’ll mix API calls and GUI interaction in one policy. That’s the bet behind ComputerRL, a new framework that treats desktop work as an end-to-end reinforcement learning problem and introduces an API-GUI paradigm so agents can call services and operate human-oriented interfaces within the same loop. 
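To make the API-GUI idea concrete, here is a minimal sketch of a unified action space in which a single policy can emit either an API call or a GUI operation; the field names and the desktop interface are assumptions, not the paper's schema.

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass
class ApiCall:
    endpoint: str                    # e.g. a local service or app-specific API
    payload: dict = field(default_factory=dict)

@dataclass
class GuiAction:
    kind: str                        # "click" | "type" | "scroll" ...
    x: int = 0
    y: int = 0
    text: str = ""

Action = Union[ApiCall, GuiAction]

def execute(action: Action, desktop) -> str:
    """Route one policy output to the right execution path (hypothetical env methods)."""
    if isinstance(action, ApiCall):
        return desktop.call_api(action.endpoint, action.payload)
    return desktop.perform_gui(action.kind, x=action.x, y=action.y, text=action.text)
```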

The missing infrastructure for scale

Training desktop agents with online RL has been hamstrung by slow, brittle environments. ComputerRL ships a distributed RL stack that orchestrates thousands of parallel virtual desktops, making long-horizon, on-policy training runs practical for general computer use. 

Stabilizing long runs: Entropulse

Pure RL on complex desktops tends to collapse exploration entropy over time. The authors propose Entropulse, a simple but effective schedule that alternates RL with supervised fine-tuning, restoring healthy entropy while retaining the gains from policy improvement. 
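A minimal sketch of an Entropulse-style schedule, with the RL phase, success filter, and SFT phase passed in as callables since the paper's exact training loop isn't reproduced here.

```python
from typing import Any, Callable, List

def entropulse(
    run_rl_phase: Callable[[], List[Any]],        # returns rollouts from one RL phase
    is_success: Callable[[Any], bool],            # verifiable success check
    run_sft_phase: Callable[[List[Any]], None],   # supervised pass on successful rollouts
    n_cycles: int = 3,
) -> None:
    """Alternate RL with SFT on accumulated successes to restore policy entropy (sketch)."""
    buffer: List[Any] = []
    for _ in range(n_cycles):
        rollouts = run_rl_phase()                 # exploration entropy tends to decay here
        buffer.extend(r for r in rollouts if is_success(r))
        run_sft_phase(buffer)                     # SFT pulse restores entropy, keeping gains
```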

Results & models

Using open backbones (GLM-4-9B-0414 and Qwen2.5-14B), the team evaluates on OSWorld and reports 48.1% accuracy with AutoGLM-OS-9B, a new state of the art for general desktop automation in their setup. The framework underpins the group’s AutoGLM system.

Why it matters

  • Bridging the modality gap: Real workflows mix API calls with UI operations; ComputerRL trains a single policy to do both. 

  • Throughput for RL: Parallelized desktops unlock the scale online RL has needed for computer agents. 

  • Simple stability trick: Entropulse offers a practical recipe any lab can try to keep long runs from collapsing. 

If your roadmap includes agents that file expenses, reconcile sheets, or run web apps end-to-end, ComputerRL reads like a blueprint for turning brittle demos into trainable, scalable systems.

Paper link: arXiv 2508.14040 (PDF)

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...