27.8.25

Introducing Gemini 2.5 Flash Image — Fast, Consistent, and Context‑Aware Image Generation from Google

 Google has launched Gemini 2.5 Flash Image (codenamed nano‑banana), a powerful update to its image model offering fast generation, precise editing, and content-aware intelligence. The release builds on Gemini’s low-latency image generation, adding rich storytelling, character fidelity, and template reusability. The model is available now via the Gemini API, Google AI Studio, and Vertex AI for developers and enterprises. 

Key Features & Capabilities

  • Character Consistency: Maintain appearance across prompts—ideal for branding, storytelling, and product mockups.
    Example: Swap a character’s environment while preserving their look using Google AI Studio templates. 

  • Prompt-Based Image Edits: Perform fine-grained edits using text, like blurring backgrounds, removing objects, changing poses, or applying color to B&W photos—all with a single prompt. 

  • World Knowledge Integration: Understand diagrams, answer questions, and follow complex instructions seamlessly by combining vision with conceptual reasoning. 

  • Multi-Image Fusion: Merge multiple inputs—objects into scenes, room restyling, texture adjustments—using drag-and-drop via Google AI Studio templates.

  • Vibe‑Coding Experience: Pre-built template apps in AI Studio enable fast prototyping—build image editors by prompts and deploy or export as code. 

  • Invisible SynthID Watermark: All generated or edited images include a non-intrusive watermark for AI provenance. 


Where to Try It

Gemini 2.5 Flash Image is offered through:

  • Gemini API — ready for integration into apps (see the code sketch after this list).

  • Google AI Studio — experiment with visual templates and exportable builds.

  • Vertex AI — enterprise-grade deployment and scalability.

Gemini 2.5 Flash Image is priced at $30 per 1 million output tokens (roughly $0.039 per image), with input and output token rates otherwise consistent with Gemini 2.5 Flash.
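
For developers wiring this up, here is a minimal sketch of an image-generation call via the Gemini API, assuming the google-genai Python SDK; the model id shown is the preview name and may change, so confirm it against the current docs.

```python
from google import genai

# Assumes the google-genai SDK (`pip install google-genai`) and an API key from Google AI Studio.
client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # placeholder preview id; confirm against current docs
    contents=["The same mascot from my brand sheet, now standing in a snowy street at dusk"],
)

# Image bytes come back as inline-data parts alongside any text parts.
for part in response.candidates[0].content.parts:
    if getattr(part, "inline_data", None) is not None:
        with open("mockup.png", "wb") as f:
            f.write(part.inline_data.data)
```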


Why It Matters

  • Seamless creative iterations — Designers save time when characters, layouts, and templates stay consistent across edits.

  • Smart editing with intuition — Natural-language edits reduce the complexity of pixel-level manipulation.

  • Use-case versatility — From education to real estate mockups, creative marketing, and diagram analysis.

  • Responsible AI use — Embedded watermarking helps with transparency and traceability.

22.8.25

Chain-of-Agents turns a whole agent swarm into a single end-to-end model

 Multi-agent frameworks can crush complex tasks—but they’re brittle, hand-engineered, and expensive to run. OPPO’s AI Agent team proposes a cleaner path: Chain-of-Agents (CoA), where a single model dynamically “plays” multiple roles and tools, simulating agent collaboration end-to-end without external orchestration. The team trains Agent Foundation Models (AFMs) with a two-step recipe: multi-agent distillation (learning from the best existing agent systems) followed by agentic RL on verifiable tasks. Result: a compact, data-trainable alternative to sprawling agent stacks. 

How it works

  • CoA paradigm: the model can activate role-specific and tool-specific “agents” inside its own prompt scaffolding, supporting multi-turn, multi-tool problem solving in one pass. 

  • Multi-agent distillation: successful trajectories from SOTA frameworks (e.g., OAgents) are converted into CoA-compatible traces, then used for supervised tuning so the AFM internalizes collaboration patterns (see the sketch after this list).

  • Agentic RL: verifiable tasks (search, code, math) provide reward signals that sharpen when to plan, call tools, and switch roles. 
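
To picture the distillation step flagged above, here is a rough sketch of flattening one multi-agent trajectory into a single CoA-style trace for supervised tuning; the tag names and trajectory schema are illustrative, not the paper's exact format.

```python
# Illustrative only: flatten one multi-agent trajectory into a single CoA-style training trace.
# Role/tool tag names and the trajectory schema are assumptions, not the paper's exact spec.

def trajectory_to_coa_trace(task: str, steps: list, final_answer: str) -> str:
    segments = [f"<task>{task}</task>"]
    for step in steps:
        # Each step records which "agent" acted and, optionally, a tool call plus its observation.
        segments.append(f"<{step['role']}>{step['thought']}</{step['role']}>")
        if "tool" in step:
            segments.append(f"<tool:{step['tool']}>{step['tool_input']}</tool:{step['tool']}>")
            segments.append(f"<observation>{step['observation']}</observation>")
    segments.append(f"<answer>{final_answer}</answer>")
    return "\n".join(segments)

# One flattened trace becomes one SFT example: the AFM learns to emit the role switches and
# tool calls itself, so no external orchestrator is needed at inference time.
example = trajectory_to_coa_trace(
    task="What year was the museum in this photo founded?",
    steps=[
        {"role": "planner", "thought": "Identify the museum, then look up its founding year."},
        {"role": "web_agent", "thought": "Search for the landmark.",
         "tool": "search", "tool_input": "glass pyramid museum Paris", "observation": "The Louvre ..."},
    ],
    final_answer="1793",
)
```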

The scoreboard

A 32B AFM posts new highs across web and code agents—and strong math gains: 55.3% GAIA, 11.1% BrowseComp, 18.0% HLE, 47.9% LiveCodeBench-v5, 32.7% CodeContests, and 59.8% AIME’25, surpassing recent tool-integrated reasoning baselines like ReTool and SimpleTIR. 

Beyond accuracy, CoA slashes runtime waste: the paper reports an 84.6% reduction in inference token cost versus traditional multi-agent frameworks while keeping performance competitive—thanks to fewer round-trips and no inter-agent chatter. 

Why it matters

  • From frameworks to foundations. Distilling orchestration into the model itself turns agent systems into trainable objects, not just prompt graphs. 

  • Generalization & scaling knobs. Analyses show transfer to unseen agents/tools and test-time scaling behaviors (think “try more plans” without changing weights). 

  • Open everything. OPPO releases weights, code, and training data, giving startups a reproducible base to study agentic RL beyond ReAct-style pipelines. 

CoA’s pitch is simple: keep the multi-tool, multi-role superpowers—but train them into one model. If the reported GAIA/BrowseComp gains hold up, expect more teams to swap brittle agent graphs for AFMs that plan, act, and coordinate natively.

Paper link: arXiv 2508.13167 (PDF)

ComputerRL scales online RL for “desktop agents,” unifying APIs and GUIs

 The next wave of computer-use agents won’t just click around UIs—they’ll mix API calls and GUI interaction in one policy. That’s the bet behind ComputerRL, a new framework that treats desktop work as an end-to-end reinforcement learning problem and introduces an API-GUI paradigm so agents can call services and operate human-oriented interfaces within the same loop. 
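
As a rough picture of what a unified API-GUI action space might look like (the names below are my own, not the paper's), a single policy can emit either kind of action inside the same loop.

```python
from dataclasses import dataclass, field
from typing import Union

# Illustrative action types only; the real ComputerRL action schema may differ.

@dataclass
class ApiCall:
    endpoint: str          # e.g. a desktop service or OS-level API the agent may call directly
    payload: dict = field(default_factory=dict)

@dataclass
class GuiAction:
    kind: str              # "click", "type", "scroll", ...
    x: int = 0
    y: int = 0
    text: str = ""

Action = Union[ApiCall, GuiAction]

def execute(action: Action, desktop) -> str:
    """Dispatch one action against a virtual desktop and return an observation string.
    `desktop.call_api` / `desktop.perform_gui` are assumed environment methods."""
    if isinstance(action, ApiCall):
        return desktop.call_api(action.endpoint, action.payload)
    return desktop.perform_gui(action.kind, x=action.x, y=action.y, text=action.text)
```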

The missing infrastructure for scale

Training desktop agents with online RL has been hamstrung by slow, brittle environments. ComputerRL ships a distributed RL stack that orchestrates thousands of parallel virtual desktops, making long-horizon, on-policy training runs practical for general computer use. 

Stabilizing long runs: Entropulse

Pure RL on complex desktops tends to collapse exploration entropy over time. The authors propose Entropulse, a simple but effective schedule that alternates RL with supervised fine-tuning, restoring healthy entropy while retaining the gains from policy improvement. 
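
A minimal sketch of the idea as described, alternating RL phases with SFT phases on successful rollouts; the hooks, phase lengths, and success filter are placeholders rather than the authors' recipe.

```python
def entropulse_training(policy, env_pool, collect_rollouts, rl_update, sft_update,
                        num_cycles=4, rl_steps=1000, sft_steps=200):
    """Alternate online RL with SFT on successful rollouts, in the spirit of Entropulse.

    collect_rollouts / rl_update / sft_update are caller-supplied hooks; the phase lengths
    and the reward > 0 success filter are placeholders, not the paper's exact recipe.
    """
    for _ in range(num_cycles):
        # Phase 1: online RL against the pool of parallel virtual desktops.
        for _ in range(rl_steps):
            rl_update(policy, collect_rollouts(policy, env_pool))

        # Phase 2: supervised fine-tuning on successful trajectories gathered by the policy,
        # restoring exploration entropy while keeping the policy-improvement gains.
        successes = [r for r in collect_rollouts(policy, env_pool) if r.reward > 0]
        for _ in range(sft_steps):
            sft_update(policy, successes)
    return policy
```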

Results & models

Using open backbones (GLM-4-9B-0414 and Qwen2.5-14B), the team evaluates on OSWorld and reports a 48.1% success rate with AutoGLM-OS-9B, a new state of the art for general desktop automation in their setup. The framework underpins the group’s AutoGLM system.

Why it matters

  • Bridging the modality gap: Real workflows mix API calls with UI operations; ComputerRL trains a single policy to do both. 

  • Throughput for RL: Parallelized desktops unlock the scale online RL has needed for computer agents. 

  • Simple stability trick: Entropulse offers a practical recipe any lab can try to keep long runs from collapsing. 

If your roadmap includes agents that file expenses, reconcile sheets, or run web apps end-to-end, ComputerRL reads like a blueprint for turning brittle demos into trainable, scalable systems.

Paper link: arXiv 2508.14040 (PDF)

20.8.25

DINOv3: Meta’s Self-Supervised Vision Backbone Built to Scale—and Transfer

 Meta has unveiled DINOv3, the latest in its family of self-supervised vision models aimed at learning from raw images—no labels required—and transferring those features cleanly across tasks. The release pairs a readable training recipe with open implementations and model suites, positioning DINOv3 as a practical foundation for detection, segmentation, retrieval, and zero-shot classification in real products. 

What’s new in DINOv3

Scale without supervision. The core idea remains simple: pretrain on massive, diverse image data using self-distillation and augmentation, then reuse the frozen backbone downstream. DINOv3 pushes this further with careful data prep, optimization, and, crucially, a handful of new strategies (three are outlined below) to keep features robust and transferable at large scale.

1) Gram anchoring for dense features. Long training runs can erode fine local details that dense tasks (e.g., segmentation, depth) depend on. DINOv3 introduces gram anchoring, a constraint that preserves local feature structure so dense predictions stay sharp even as the backbone learns global invariances. This noticeably lifts dense-task scores relative to prior SSL baselines (a loss-level sketch follows item 3 below).

2) Post-hoc high-resolution adaptation. After pretraining, DINOv3 applies a light-touch adaptation to handle higher input resolutions and different model sizes without retraining from scratch—useful when you need 1024-px inputs for instance or semantic segmentation. 

3) Optional text alignment. For open-vocabulary or zero-shot use, DINOv3 supports a compact text-alignment step, enabling image-text matching and classification without full supervised fine-tuning of the vision backbone. 
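
To make item 1 concrete: the core of a Gram-style anchor can be read as a loss that keeps pairwise patch-feature similarities close to those of an earlier "anchor" checkpoint. This is an interpretation in PyTorch, not Meta's implementation.

```python
import torch
import torch.nn.functional as F

def gram_anchor_loss(student_patches: torch.Tensor, anchor_patches: torch.Tensor) -> torch.Tensor:
    """Penalize drift in pairwise patch similarities (Gram matrices).

    student_patches, anchor_patches: (B, N, D) patch features from the current model and
    from a frozen earlier checkpoint (the "anchor"). An interpretation of gram anchoring,
    not the reference implementation.
    """
    s = F.normalize(student_patches, dim=-1)
    a = F.normalize(anchor_patches, dim=-1)
    gram_s = s @ s.transpose(1, 2)   # (B, N, N) cosine similarities between patches
    gram_a = a @ a.transpose(1, 2)
    return F.mse_loss(gram_s, gram_a)
```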

Why it matters

DINOv3 is pitched as a universal vision backbone: a single, frozen model that outperforms specialized systems across a broad set of benchmarks—often without task-specific fine-tuning—by producing high-quality dense and global features alike. For teams, this means fewer bespoke models to train and a clearer path from pretraining to deployment. 

What you can build today

  • Object detection & instance/semantic segmentation. Drop DINOv3 in as the backbone under your detection or segmentation head to improve transfer, especially at higher resolutions.

  • Zero-shot and open-vocabulary classification. Pair the frozen backbone with the text alignment step to classify new categories without labels. 

  • Image retrieval and similarity search. Use embeddings from the backbone for robust retrieval in e-commerce, media, or industrial archives. 
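
For the retrieval use case above, a minimal sketch with a frozen backbone and cosine similarity; the torch.hub entry point is a guess based on prior DINO releases, so take the actual loader and model names from the official repo.

```python
import torch
import torch.nn.functional as F

# Hypothetical loader: earlier DINO releases exposed torch.hub entry points, but the exact
# repo path and model names for DINOv3 should be taken from the official GitHub README.
backbone = torch.hub.load("facebookresearch/dinov3", "dinov3_vitb16")
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, H, W), already resized/normalized -> L2-normalized global embeddings."""
    feats = backbone(images)                    # frozen backbone, no fine-tuning
    return F.normalize(feats, dim=-1)

@torch.no_grad()
def retrieve(query: torch.Tensor, gallery: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Rank gallery images by cosine similarity to the query; returns top-k indices."""
    scores = embed(query) @ embed(gallery).T    # (1, G)
    return scores.topk(k, dim=-1).indices
```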

Developer on-ramp

Meta has released a reference PyTorch implementation with pretrained checkpoints, scripts, and configs, along with a public paper and model cards. If you’re migrating from DINO/DINOv2, the training and evaluation stacks are familiar; adding gram anchoring and the post-hoc adapters is straightforward. 

  • Blog & overview: how the method scales and where it shines. 

  • Paper (arXiv): full method, ablations, and benchmark details. 

  • Code & weights (GitHub): ready-to-run training/eval pipelines. 

  • Model hub page: consolidated resources and model suite. 

Practical tips

  • Choose resolution by task. Start with the default pretraining size; enable the high-res adapter for dense tasks that benefit from finer detail. 

  • Freeze first, tune later. Many gains show up with a frozen backbone and light heads; reserve end-to-end tuning for domain shifts that remain stubborn. 

  • Mind augmentation & data mix. DINOv3’s results rely on carefully designed augmentations and large, diverse pretraining data—replicate that discipline in your own pipelines. 

The takeaway

DINOv3 turns self-supervised pretraining into a dependable, production-minded recipe for vision. With gram anchoring to protect dense signals, post-hoc adaptation for resolution and scale, and optional text alignment for zero-shot scenarios, it offers one backbone you can reuse across many tasks—supported by open code and clear documentation. For teams balancing accuracy, versatility, and engineering simplicity, DINOv3 is a strong default choice for 2025-era computer vision.

19.8.25

AutoCodeBench turns LLMs into benchmark factories — and today’s coders sweat at ~52%

 Code benchmarks have a scaling problem: hand-written tasks don’t keep up with fast-improving models, and multilingual coverage is thin. Tencent Hunyuan’s new paper proposes a fix: AutoCodeGen, an automated workflow that inverts dataset creation—generate solutions and tests first, then ask the LLM to write the problem—validated by a multilingual execution sandbox. The result is AutoCodeBench (ACB), a 3,920-problem suite evenly spread over 20 languages, with ~9.6 tests per problem and a deliberate bias toward hard tasks. Even frontier “think-mode” models top out around ~52% Pass@1, signaling real headroom. 

How they build hard, correct problems

AutoCodeGen runs in four steps: (1) LLMs evolve self-contained code solutions from real multilingual snippets; (2) LLMs propose public and private test inputs, which the sandbox executes to compute ground-truth outputs; (3) the LLM then writes the problem description constrained by strict specs (language, entry points, naming); (4) a three-stage filter (multi-sampling for difficulty, LLM-as-critic for quality, diversity tagging) trims the set. This “reverse-order” recipe yields correct, executable tests without humans in the loop. 
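
A schematic of that reverse-order loop, with the sandbox supplying ground truth; the llm and sandbox interfaces below are stand-ins, not the released code.

```python
# Schematic of the AutoCodeGen-style "reverse order" pipeline. `llm(...)` and `sandbox.run(...)`
# are stand-in interfaces assumed to return parsed text/lists and execution results respectively;
# they are not the released implementation.

def autocodegen_one_problem(llm, sandbox, seed_snippet: str, language: str) -> dict:
    # (1) Evolve a self-contained solution from a real multilingual code snippet.
    solution = llm(f"Evolve this {language} snippet into a self-contained solution:\n{seed_snippet}")

    # (2) Propose public/private test inputs, then execute the solution to get ground-truth outputs.
    test_inputs = llm(f"Propose diverse public and private test inputs for:\n{solution}")
    tests = []
    for test_input in test_inputs:
        result = sandbox.run(language=language, code=solution, test_input=test_input)
        if result.ok:                       # keep only inputs the solution actually handles
            tests.append((test_input, result.output))

    # (3) Only now write the problem statement, constrained by entry-point and naming specs.
    problem = llm(f"Write a {language} problem statement whose reference solution is:\n{solution}")

    # (4) Difficulty (multi-sampling), quality (LLM-as-critic), and diversity filters run downstream.
    return {"problem": problem, "solution": solution, "tests": tests}
```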

What’s inside ACB

  • Scale & spread: 3,920 problems, 37,777 tests, 20 languages (from Python to TypeScript), and 14 task categories from data structures to systems programming; more than 60% are labeled “hard.”

  • Sandbox: open-sourced, 20+ languages, high-concurrency, request-based calls—usable for eval and data synthesis. 

  • Lite & Complete: ACB-Lite (≈1,586 problems) for faster evals; ACB-Complete (1,000 completion-style tasks, 3-shot) targets base models rather than chat-tuned ones. 

The scoreboard: even elites struggle

Averaged across 20 languages, the leaderboard’s top tier lands ~50–52% Pass@1, led by Claude Opus 4 (Think) at 52.4%, with o3-high, Grok-4, Claude Sonnet 4 (Think), and DeepSeek-R1-0528 clustered close behind. Mid-tier open models sit in the 30s–40s; smaller coders drop to the 20s. Translation: the multilingual + multi-logical mix is punishing. 

Iterating with sandbox feedback helps

Across three refinement turns using execution error messages, models like DeepSeek-V3-0324 and Qwen2.5-Coder-32B-Instruct gain ~8–12 points, with the biggest jump on turn one—evidence that automated error signals materially improve code generation. 
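
The refinement loop itself is easy to picture; a sketch under assumed llm and sandbox interfaces.

```python
# Sketch of sandbox-feedback refinement; `llm` and `sandbox` are assumed interfaces.

def refine_with_feedback(llm, sandbox, problem: str, tests, language: str, max_turns: int = 3) -> str:
    code = llm(f"Solve this {language} problem:\n{problem}")
    for _ in range(max_turns):
        result = sandbox.run(language=language, code=code, tests=tests)
        if result.passed:
            break
        # Feed the raw execution error back; the paper sees the largest gain on the first turn.
        code = llm(f"Your solution failed with:\n{result.error}\nReturn a corrected solution:\n{code}")
    return code
```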

Base-model check: ACB-Complete

On the 1,000-item, 3-shot ACB-Complete, Seed-Coder-8B-Base leads its class (≤8B) at 31.6% Pass@1, edging OpenCoder-8B-Base and Qwen2.5-Coder-7B—useful signal for pre-instruct comparisons that classic HumanEval/MBPP miss. 

Why it matters

  • Human-free, multilingual, hard. ACB shows you can scale quality and coverage without armies of annotators. 

  • Better evals for code agents. Emphasis on multi-logical tasks (several core functions per problem) aligns with agent workflows like SWE-Bench. 

  • Sandbox as a lever. Open, concurrent execution infra doubles as a training-data factory and an iterative-repair oracle. 

Benchmarks drive progress. If your coding model cruises through Python puzzles but face-plants in Kotlin, Shell, or Elixir, AutoCodeBench will make that obvious—and give you a reproducible path to fix it.

Paper link: arXiv 2508.09101 (PDF)

16.8.25

GPT-5 tops multimodal medical QA—and even edges human experts on a new benchmark

 If you’ve wondered whether general-purpose LLMs can truly reason across medical text and images, a new study out of Emory University says GPT-5 can—and then some. In “Capabilities of GPT-5 on Multimodal Medical Reasoning,” the team treats GPT-5 as a generalist decision-support engine and runs it through a unified, zero-shot chain-of-thought (CoT) protocol spanning text-only and vision-augmented tasks. The short version: GPT-5 outperforms GPT-4o across the board and surpasses pre-licensed human experts on the toughest multimodal benchmark they tested. 

A cleaner test: one prompting recipe, many tasks

Prior medical LLM papers often mix datasets and prompting tricks, muddying comparisons. Here, the authors standardize splits and use the same two-turn CoT prompt for every dataset—first elicit reasoning, then force a single-letter answer—so differences reflect the model, not prompt engineering. Visual items attach image URLs in the first turn; the convergence step stays textual. 
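
A sketch of that two-turn protocol using the OpenAI Python SDK's chat interface; the model name and prompt wording are placeholders based on the description above, not the authors' released code.

```python
from typing import Optional
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-5"  # placeholder; the paper's exact model/version string may differ

def answer_mcq(question: str, options: str, image_url: Optional[str] = None) -> str:
    # Turn 1: elicit free-form reasoning; visual items attach the image URL in this turn.
    content = [{"type": "text", "text": f"{question}\n{options}\nLet's think step by step."}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    messages = [{"role": "user", "content": content}]
    reasoning = client.chat.completions.create(model=MODEL, messages=messages).choices[0].message.content

    # Turn 2 (text-only convergence): force a single-letter answer conditioned on the reasoning.
    messages += [
        {"role": "assistant", "content": reasoning},
        {"role": "user", "content": "Therefore, answer with a single letter only."},
    ]
    final = client.chat.completions.create(model=MODEL, messages=messages)
    return final.choices[0].message.content.strip()
```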

The numbers

  • Text QA: On MedQA (US, 4-option), GPT-5 hits 95.84%—a +4.80% absolute gain over GPT-4o. MMLU medical subsets also tick up, including a perfect score in Medical Genetics. 

  • USMLE samples: Averaged across Steps 1–3, GPT-5 reaches 95.22% (+2.88 vs. GPT-4o), with the biggest lift on Step 2’s management-heavy items. 

  • Multimodal QA: On MedXpertQA-MM, GPT-5’s reasoning and understanding jump +29.26% and +26.18% over GPT-4o. A case study shows the model integrating CT findings, labs and symptoms to recommend a Gastrografin swallow for suspected esophageal perforation. 

  • Radiology VQA: On VQA-RAD, GPT-5 posts 70.92%—slightly below GPT-5-mini (74.90%), which the authors attribute to small-set quirks and calibration. 

Above pre-licensed human experts—at least on MedXpertQA

Compared against pre-licensed clinicians, GPT-5 clears the bar decisively on MedXpertQA: +15.22% (text reasoning), +9.40% (text understanding), +24.23% (multimodal reasoning), +29.40% (multimodal understanding). GPT-4o, by contrast, trails humans on most of these dimensions. 

Why it matters

  • From recall to reasoning. Gains concentrate on reasoning-intensive tasks (MedXpertQA, USMLE Step 2), suggesting internal upgrades beyond raw fact lookup.

  • Designing safer tools. The same unified protocol that boosts accuracy also produces structured rationales—useful for audit trails in clinical decision support. 

  • Open evals. The authors say they’ve made code public (GPT-5-Evaluation), inviting replication and deeper probing of failure modes. 

Mind the caveats

This is still benchmark-world: standardized items, time-limited settings, and no messy clinic realities. The paper itself cautions that real deployments will need calibration, domain-adapted fine-tuning and prospective trials. 

If those steps pan out, GPT-5 looks less like a better test-taker and more like a multimodal reasoner—one that can fuse text and images to recommend plausible next actions.

Paper link: arXiv 2508.08224 (PDF)

GPT-5 nails ophthalmology board questions—and shows how to buy accuracy wisely

 OpenAI’s newest reasoning line just aced a specialty test. In a cross-sectional benchmark of 260 closed-access AAO BCSC multiple-choice questions, GPT-5-high scored 96.5%—beating GPT-4o and OpenAI’s earlier o1, and statistically edging most GPT-5 variants, while tying o3-high within confidence intervals. Beyond raw accuracy, the paper grades rationale quality and runs a cost-accuracy analysis, surfacing Pareto-efficient configs for budget-sensitive deployments. 

What they tested—and how

Researchers evaluated 12 GPT-5 configurations (three model sizes × four reasoning_effort settings) alongside o1-high, o3-high, and GPT-4o. Prompts enforced strict JSON with a single letter answer + one-sentence rationale, zero-shot. A Bradley-Terry arena ranked head-to-head wins; an LLM-as-a-judge autograder compared rationales to reference explanations. 

Key results

  • Top score: GPT-5-high 0.965 accuracy (95% CI 0.942–0.985); > GPT-4o and o1-high; comparable to o3-high (0.958)

  • Rationale quality: GPT-5-high ranked #1 in pairwise judging. 

  • Cost–accuracy frontier: Multiple Pareto-efficient picks are identified; GPT-5-mini-low emerges as the best low-cost, high-performance option (see the sketch after this list).

  • Reasoning effort matters: Minimal-effort variants underperform; higher effort boosts accuracy but costs more tokens/time. 
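
The cost–accuracy analysis referenced above reduces to a Pareto filter over (cost, accuracy) points; here is a small illustrative helper, with made-up placeholder numbers rather than the paper's measurements.

```python
def pareto_frontier(configs):
    """Return configs not dominated by any other (lower-or-equal cost AND higher-or-equal accuracy)."""
    frontier = []
    for name, cost, acc in configs:
        dominated = any(c <= cost and a >= acc and (c < cost or a > acc)
                        for _, c, a in configs)
        if not dominated:
            frontier.append((name, cost, acc))
    return sorted(frontier, key=lambda t: t[1])

# Placeholder numbers purely for illustration; see the paper for the real measurements.
configs = [("gpt-5-high", 9.0, 0.965), ("gpt-5-mini-low", 0.7, 0.930), ("gpt-4o", 2.5, 0.890)]
print(pareto_frontier(configs))   # keeps the cheap-but-strong and the most accurate configs
```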

Why it matters

Hospitals and ed-tech teams rarely buy “max accuracy at any price.” This paper provides a menu of GPT-5 settings that trade pennies for percentage points, plus an autograder recipe others can adapt to scale specialty QA beyond ophthalmology.

Paper link: arXiv 2508.09956 (PDF)

“Speed Always Wins” is the field guide to building faster, cheaper LLMs

 Transformers scaled LLMs to jaw-dropping capabilities—but quadratic attention and ballooning KV caches are throttling real-world deployment. A new survey from Shanghai AI Lab, HKUST(GZ) and collaborators takes stock of what’s next, categorizing the ecosystem of efficient LLM architectures and where each shines. Think of it as a build sheet for teams trying to cut latency and cost without giving up quality. 

The efficiency playbook, in seven parts

  • Linear sequence modeling: from linearized attention to linear RNNs and state-space models that drop the KV cache and push complexity toward O(N) (see the sketch after this list).

  • Sparse sequence modeling: static, dynamic, and training-free sparsity to compute only the most useful token-token interactions. 

  • Efficient full attention: keep softmax attention but make it practical with IO-aware, grouped, mixture, and quantized attention variants. 

  • Sparse Mixture-of-Experts: routing, expert designs and MoE conversion to grow capacity without proportional FLOPs.

  • Hybrid architectures: inter-layer and intra-layer mixes that blend linear blocks with full attention for a better speed/quality trade-off. 

  • Diffusion LLMs: non-autoregressive generation, bridges back to AR, and early steps to extend diffusion approaches to multimodality. 

  • Beyond text: how these efficiency ideas transfer to vision, audio, and multimodal stacks. 
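
Picking up the linear sequence modeling item referenced above: the core trick is associativity, replacing the N×N attention matrix with a d×d_v summary so cost grows linearly in sequence length. Below is a sketch in PyTorch using the elu(x)+1 feature map (Katharopoulos et al.), as one example of linearized attention rather than any specific paper's kernel.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention: O(N * d * d_v) instead of O(N^2 * d).

    q, k: (B, N, d); v: (B, N, d_v). Uses the elu(x)+1 feature map so all
    similarities stay positive; a sketch of the idea, not a production kernel.
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)          # d x d_v summary of all keys/values
    z = torch.einsum("bnd,bd->bn", q, k.sum(dim=1))  # per-query normalizer
    return torch.einsum("bnd,bde->bne", q, kv) / (z.unsqueeze(-1) + eps)
```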

Why this matters now

Long-context patterns—RAG, agentic tool use, deliberate reasoning, and multimodal inputs—are pushing sequence lengths and memory pressure through the roof. The survey frames these usage patterns and argues that architectural efficiency, not just better prompts or hardware, is the lever that scales the next wave of applications. 

A roadmap, not just a reading list

Beyond taxonomy, the paper stitches trends into a blueprint: pick linear/sparse methods to kill KV bloat, use efficient-full-attention where fidelity matters, layer in MoE for capacity, and consider hybrids or diffusion LLMs where generation style allows. There’s also a companion GitHub “Awesome-Efficient-Arch” list to track the space as it moves. 

If you’re building agents that browse, reason and call tools all day—or multimodal systems juggling video and audio—this survey is a timely map of the fastest lanes through today’s LLM bottlenecks.

Paper link: arXiv 2508.09834 (PDF)

Hunyuan-GameCraft brings “playable” video gen to AAA-style worlds

 Text-to-video systems can paint beautiful clips, but making them playable—reacting smoothly to user inputs over long sequences—has been a brick wall. Tencent Hunyuan’s Hunyuan-GameCraft attacks the problem head-on with a recipe built for game dynamics: unify keyboard/mouse signals into camera-space controls, train with a history-aware objective, and distill the model for real-time latency. The result: long, action-controllable sequences that keep scenes coherent and respond like a game engine—minus the engine. 

The trick: turn WASD into camera math

Instead of treating keystrokes as ad-hoc tokens, GameCraft maps keyboard and mouse inputs to a shared, continuous camera representation (translation/rotation directions plus speeds). A lightweight action encoder injects these signals into an MM-DiT video backbone (HunyuanVideo), enabling fine-grained motion like smooth pans or faster strafes without hacking the generator. 
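
As a toy illustration of the control unification (my own mapping, not the paper's learned action encoder), keyboard and mouse inputs become a continuous translation direction, speed, and rotation.

```python
import numpy as np

# Toy mapping from keyboard/mouse input to a continuous camera-space action; the real
# Hunyuan-GameCraft action encoder is a learned module, not a lookup table like this.

KEY_TO_TRANSLATION = {
    "w": np.array([0.0, 0.0, 1.0]),   # forward
    "s": np.array([0.0, 0.0, -1.0]),  # backward
    "a": np.array([-1.0, 0.0, 0.0]),  # strafe left
    "d": np.array([1.0, 0.0, 0.0]),   # strafe right
}

def to_camera_action(keys_pressed, mouse_dx, mouse_dy, move_speed=1.0, turn_speed=0.05):
    """Return (translation_direction, translation_speed, rotation_angles) for one video chunk."""
    direction = sum((KEY_TO_TRANSLATION[k] for k in keys_pressed if k in KEY_TO_TRANSLATION),
                    np.zeros(3))
    norm = np.linalg.norm(direction)
    if norm > 0:
        direction = direction / norm
    rotation = np.array([mouse_dy * turn_speed, mouse_dx * turn_speed, 0.0])  # pitch, yaw, roll
    return direction, move_speed * min(norm, 1.0), rotation
```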

Stay coherent over minutes, not seconds

To fight the usual “long-video drift,” the team proposes hybrid history-conditioned training: during autoregressive extension, new chunks are denoised while explicitly conditioning on denoised history with a mask indicator. Compared with training-free or streaming add-ons, this keeps geometry and layout stable across extended play. 

Fast enough to feel interactive

A distillation pass (Phased Consistency Model) accelerates inference by 10–20×, cutting latency to <5 s per action in their setup—crucial for anything that calls itself “interactive.” 

Trained on real gameplay, then sharpened in 3-D

The dataset is built from 1M+ gameplay clips across 100+ AAA titles (e.g., Assassin’s Creed, Red Dead Redemption, Cyberpunk 2077), segmented with PySceneDetect and annotated with 6-DoF camera trajectories (Monst3R). A synthetic set of rendered motion sequences adds precise camera priors and balances trajectory distributions. 

Why this matters

  • Input-to-motion fidelity. Unifying controls in camera space yields smoother, more physical responses to typical WASD/arrow inputs. 

  • Long-horizon stability. History conditioning curbs error accumulation that wrecks long, user-driven videos. 

  • Path to production. Distillation pushes latency toward “feels responsive,” a precondition for creator tools and AI-assisted level previews. 

Availability and what’s next

A project page is live, and the team has released inference code and weights under the Hunyuan-GameCraft-1.0 repository. The arXiv record also notes acceptance to RSS 2025, signaling interest from the robotics community. 

Paper link: arXiv 2506.17201 (PDF)

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...