Wandering Nomad

8.7.25

AIRA shows how better operators — not just bigger models — turbo-charge AI research agents

Large language models that write code have already stormed GitHub, but turning them into full-blown research agents—systems that iterate on entire ML pipelines until they medal on Kaggle—has proved trickier. The latest state-of-the-art, AIDE, could grab a medal on roughly 40 % of MLE-bench tasks. Now Meta AI and UCL push that rate to 47.7 % with AIRA, a rethink that says the secret isn’t a flashier LLM, it’s the operators and search policy you wrap around it.

From one-shot “Draft, Debug, Improve” to a toolbox of surgical edits

AIRA introduces OAIRA, a new operator set that goes beyond AIDE’s three blunt actions. Scoped memory keeps prompts lean, “think tokens” force structured reasoning, and a prompt-adaptive complexity cue decides whether the agent should sketch a quick baseline or engineer a deep ensemble. The result: twice the reasoning tokens per call and far less mode collapse.

Search policies finally get room to shine

When AIDE’s old operators were plugged into greedy, MCTS and evolutionary searches, the fancier algorithms gained zero ground—operator bottlenecks were that severe. Swap in OAIRA and those same policies leapfrog greedy search, proving that exploration muscle only pays off once edits are expressive enough.

The scoreboard (MLE-bench Lite, 22 Kaggle tasks)

AIDE (o1-preview, greedy): 39.6 % medal rate
AIRA (greedy + OAIRA): 45.5 %
AIRA (MCTS + OAIRA): 47.7 %
AIRA (Evolutionary + OAIRA): 47.3 %
All agents ran under identical 24-hour, single-GPU budgets inside AIRA-dojo, a new sandbox that hands each run a root-privileged H200 container yet isolates filesystem side effects.

Mind the generalization gap

The study also spotlights a pitfall for auto-ML agents: validation scores routinely over-estimate test-set gains, steering greedy searches into dead ends. By examining thousands of runs, the team quantifies that “proxy-test gap” and urges future benchmarks to track it explicitly.

Why it matters

Agent design ≠ model scale. The leap came without touching the underlying LLM (DeepSeek-R1 or GPT-4o). That’s good news for teams capped by API limits.
Composable recipe. OAIRA operators, MCTS search and the open-source aira-dojo testbed (GitHub link in the paper) can bolt onto any ReAct-style coding agent.
Toward autonomous ML ops. AIRA’s 24-hour, single-GPU constraint mirrors real-world hack-day budgets, making the findings immediately useful for startups chasing continuous Kaggle pipelines or internal model tuning bots.

Auto-ML agents are no longer judged solely by the size of their LLM brains; the tools they wield and the ways they explore the search space may count just as much. AIRA’s 8-point jump on MLE-bench suggests that the next frontier in agentic ML will be won with sharper scalpels, not bigger hammers.

Paper link: arXiv 2507.02554 (PDF)

DeepMesh makes artist-quality 3D meshes a one-click affair

Triangle-mesh modelling is the CAD world’s equivalent of hand-drawn in-betweens: essential, mind-numbing and painfully slow. A new paper out of Tsinghua University, NTU and ShengShu AI says it can hand that job to an LLM-sized transformer without melting your GPU.

The team’s framework, DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning, marries a clever compression trick with a dose of RLHF to crank out clean, editable topology directly from point clouds or images.

Why previous mesh LLMs hit the wall

Most auto-regressive mesh generators treat every vertex coordinate as a token. Feed them a high-poly model and the sequence balloons into tens of thousands of steps, torpedoing training stability and inference speed. Worse, their loss functions optimise geometry alone, so outputs pass numeric checks yet still look like Swiss cheese to artists.

Two upgrades, one big leap

Pillar	What they did	Why it matters
72 % shorter sequences	A hierarchical patch-based tokenization merges duplicate offsets and encodes connectivity inline, shrinking vertex strings by nearly three-quarters without dropping detail.	Cuts pre-training FLOPs and lets the model scale to 30 k-face meshes on a single A100.
Human-aligned RL	Collected 5 000 preference pairs scored with a hybrid of human rating and 3D metrics, then ran Direct Preference Optimization (DPO) on the base model.	Removes holes and stray faces while nudging topology toward “artist-grade” layouts.

The researchers also trimmed an 800 k-mesh corpus to a cleaner 500 k set, tamping down the loss spikes that plague raw WebGL scrapes.

Results: fewer faces, better faces

Up to 1 B parameters: two Hourglass-style transformer variants (500 M & 1 B) both converge in 100 k steps thanks to shorter sequences.
Topology wins: DeepMesh’s large model eliminates 90 % of non-manifold edges that slip through MeshGPT and Nautilus, according to the authors’ “topology-valid” metric.
Visual quality: crowd-sourced raters picked DeepMesh over MeshGPT by 68 % on identical point-cloud prompts (exact numbers in paper’s Sec. 4.3).
Speed: a full 30 k-face generation takes ≈10 min, versus 20–25 min for LoRA-fine-tuned diffusion baselines reported in prior work.

A public demo gallery already shows clean Watertight dragons, furniture and stylised characters rendered straight from sparse point clouds.

Why this is bigger than 3D fan art

Game studios, AR platforms and online-creator tools alike are sitting on troves of unoptimised 3D scans. A transformer that understands connectivity as well as shape could batch-convert those scans into lightweight, animation-ready assets—no retopology pass required. And because DeepMesh’s DPO loop is “just” another RLHF recipe, the same pipeline could teach a mesh LLM brand-specific style or IP-safe anatomy without touching the base weights.

The authors hint at scaling past one billion parameters and adding text-conditioned generation. Given how fast 3D GenAI is snowballing, don’t bet against DeepMesh—or its tokenization trick—showing up in the next wave of text-to-world engines.

Paper link: arXiv 2503.15265 (PDF)

7.7.25

ARAG puts a multi-agent brain inside your RAG stack — and Walmart’s numbers look eye-popping

Retrieval-augmented generation (RAG) has become the go-to recipe for giving large language models real-world context, but most deployments still treat retrieval as a dumb, one-shot lookup. Researchers at Walmart Global Tech think that leaves serious money on the table — especially in e-commerce, where user intent shifts by the minute. Their new framework, ARAG (Agentic Retrieval-Augmented Generation), adds a four-agent reasoning layer on top of vanilla RAG and reports double-digit gains across every metric that matters.

Four specialists, one conversation

User-Understanding Agent distills long-term history and the current session into a natural-language profile.
NLI Agent performs sentence-level entailment to see whether each candidate item actually supports that intent.
Context-Summary Agent compresses only the NLI-approved evidence into a focused prompt.
Item-Ranker Agent fuses all signals and produces the final ranked list.

Each agent writes to — and reads from — a shared blackboard-style memory, so later agents can reason over earlier rationales rather than raw text alone.

How much better? Try 42 %

On three Amazon Review subsets (Clothing, Electronics, Home), ARAG beats both a recency heuristic and a strong cosine-similarity RAG baseline:

Dataset	NDCG@5 ↑	Hit@5 ↑
Clothing	+42.1 %	+35.5 %
Electronics	+37.9 %	+30.9 %
Home & Kitchen	+25.6 %	+22.7 %

An ablation test shows that yanking either the NLI or context-summary modules knocks as much as 14 points off NDCG, underlining how critical cross-agent reasoning is to the win.

Why it matters

Personalization that actually reasons. By turning retrieval and ranking into cooperative LLM agents, ARAG captures the nuance of why an item fits, not just whether embeddings are close.
No model surgery required. The team wraps any existing RAG stack; there’s no need to fine-tune the base LLM, making the upgrade cloud-budget friendly.
Explainability for free. Each agent logs its own JSON-structured evidence, giving product managers a breadcrumb trail for every recommendation.

The bigger picture

Agentic pipelines have taken off in code generation and web browsing; ARAG shows the same trick pays dividends in recommender systems, a multi-billion-dollar battleground where percent-level lifts translate into real revenue. Expect retailers and streaming platforms to test-drive multi-agent RAG as they chase post-cookie personalization.

Paper link: arXiv 2506.21931 (PDF)

6.7.25

LangGraph Rollout: how VeRL leveled-up multi-turn Agent RL

Why this matters

If you’ve ever tried to train an LLM-powered agent with many tool calls spread across a genuine back-and-forth conversation, you’ve probably discovered that “multi-turn” means different things to different frameworks. Yanbin Jiang’s latest post shows how the VeRL team punched through that ceiling by grafting LangGraph directly onto VeRL’s reinforcement-learning rollout engine. The result is a training loop that speaks the same language as production code.

1. Where they started

Native VeRL multi-turn – great for quick experiments. You enable multi_turn: True, write a YAML schema for each tool, implement an async Python class, and you’re off; their GSM8K benchmark ran in two days.
Pain points
1. Double bookkeeping: every tool had to be declared twice (YAML + Python).
2. Drift: schema and code fell out of sync, and prod tools (written for LangChain/LangGraph) diverged from the “training” clones.

2. A quick stop-gap: automatic tool wrapping

Yanbin added BaseTool.from_callable(), which introspects any plain Python function with transformers.utils.get_json_schema, then fabricates a VeRL-compatible wrapper on the fly. One list of callables (tool_list = [multiply, add, …]) now powers both training and prod.

My dev take: this is the same pattern I use in LangChain when I decorate business logic with @tool. Nice to see VeRL admit “if you can’t beat reflection, join it.”

3. The real blocker: orchestration power

Research quickly outgrew VeRL’s built-in rollout:

Need	Why VeRL fell short
Dynamic branches & backtracking	Native graph was too rigid.
True multi-turn dialogue (user follow-ups)	Any assistant message without tool calls ended the convo.
Per-node sampling / chat-template tweaks	Global settings only.

Enter LangGraph: a lightweight DAG engine already shipping in production.

4. Architectural insight: separation of concerns

“Let VeRL manage actor weights & hardware; let LangGraph drive the conversation.”

So they built a LangChain-compatible chat-model client for VeRL’s SGLang server. Training now works like this:

VeRL hands the initial messages + model handle to the user’s LangGraph.
The graph does its thing—branching, retrying, invoking tools—using the exact actor weights being optimized.
When the graph stops, VeRL collects the message history and rewards.

The PR shows a seven-line YAML snippet that swaps the old rollout for:

yaml
multi_turn:
  chat_template_kwargs: {enable_thinking: false}
  langgraph:
    path: /path/to/graph.py
    graph_config: {recursion_limit: 100}

…and a 60-line example graph that binds tools, counts turns, and lets you vary temperature node-by-node.

5. Why I’m excited

One graph to rule them all – deployment and training share code; no more “but it worked in prod!”
Easier ablations – want to test a new branch strategy? Edit the graph script; RL pipeline stays untouched.
Framework-agnostic future – the same bridge pattern could plug VeRL into OpenAI Function Calling, Microsoft’s AutoGen, or whatever framework wins next year.

My takeaway

VeRL just became a lot more attractive for serious agent RL work. By leaning on LangGraph instead of extending an in-house orchestration DSL, the team keeps VeRL laser-focused on fast rollouts, leaves graph logic to a dedicated library, and—crucially—lets devs iterate on one codebase. If you’re juggling duplicate tool definitions or fighting mismatch between training and production, clone Yanbin’s PR and breathe easier.

Explore it more here: https://jybsuper.github.io/posts/langgraph_rollout/

FreeMorph turns Stable Diffusion into a one-click image-morphing engine

Image morphing has been around since Michael Jackson’s Black or White video, but most modern AI pipelines still demand per-pair fine-tuning or laborious warping to keep shapes and textures coherent. A new paper from NTU, Nanjing University and CUHK drops that baggage. FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model repurposes an off-the-shelf Stable Diffusion 2.1 checkpoint to generate frame-perfect transitions between any two images—faces, cars, even cat-to-dog mash-ups—without touching a single weight.

Two tricks make the magic happen

Guidance-aware spherical interpolation (GaSI). Instead of naive latent mixing, FreeMorph blends the key-value pairs inside Stable Diffusion’s self-attention, injecting “identity anchors” from both source images so the morph stays on course.
Step-oriented variation trend (SoVT). A second module dials in how much of each image shows up at every denoising step, taming the non-linear chaos that usually derails tuning-free edits.

Faster and smoother than the competition

Running on a single NVIDIA A100, FreeMorph spits out a full transition sequence in under 30 seconds, beating DiffMorpher and IMPUS—which both require minutes of LoRA fine-tuning—while delivering sharper edges and fewer identity slips.

A new benchmark to prove it

Because existing datasets skew toward near-identical pairs, the authors collected Morph4Data,  four classes of image pairs ranging from “same layout, different semantics” to “totally unrelated.” On this tougher mix, FreeMorph tops every published method in quantitative metrics and user studies alike.

Why this matters

For creative-tool startups, FreeMorph means morphing features can ship as a call to Stable Diffusion rather than a 30-minute fine-tune. For researchers, GaSI + SoVT point to a broader lesson: you can co-opt diffusion attention layers for structural edits without sacrificing model generality.

The code, demo video and ready-to-run Colab notebook are already live on GitHub, so expect FreeMorph-powered GIF makers to surface on your timeline before summer’s out.

Paper link: arXiv 2507.01953 (PDF)

WebSailor charts an open-source course to super-human web reasoning

For the past year, open-source web agents have looked like dinghies chasing aircraft carriers: even 70-billion-parameter models scraped single-digit accuracy on BrowseComp-en, the field’s toughest information-seeking benchmark, while closed systems such as DeepResearch and Grok-3 cruised far ahead. Tongyi Lab, Alibaba’s applied-AI skunkworks, says it has all but closed that gap with WebSailor, a post-training recipe that rewires large language models to “think like uncertainty-slayers.”

Turning the web into a maze on purpose

At the heart of WebSailor is SailorFog-QA, a synthetic dataset that bombards the model with “Level-3” problems—questions whose answers hide behind tangled entity graphs and deliberately obfuscated clues (“a musician later honored in the early 21st century,” “a chronology that ends the same year a late-antique poet died”). Random walks over real web pages build those graphs; masking, vagueness and partial names turn each query into a fog bank the agent must burn off through multi-step reasoning.

DUPO: reinforcement learning that isn’t painfully slow

Tool-using agents learn painfully slowly because every step calls a browser, but Tongyi Lab’s Duplicating Sampling Policy Optimization (DUPO) makes each RL batch pull double duty: one pass samples harder trajectories, the next re-samples mid-episode to squeeze more signal from sparse rewards. A small rejection-sampling fine-tuning (RFT) “cold start” of just 2 k expert traces primes the model so DUPO has something to optimize.

Four sizes, one giant leap

WebSailor comes in 3B, 7B, 32B and 72B flavors. Even the 7-billion-parameter version hits 6.7 % pass@1 on BrowseComp-en, trouncing agents built on 32 B backbones that manage barely 2 – 3 %. The 32 B and 72 B models push further, outscoring open-source peers on BrowseComp-en/zh, GAIA and XBench and edging past proprietary offerings like Grok-3 and Doubao-Search when those systems add browsing tools.

Why it matters

Democratizing deep search. BrowseComp-level tasks—ask a question, navigate dozen-plus pages, synthesize an answer—are what corporate knowledge-bases and vertical search startups need. WebSailor shows you no longer need a closed-source giant to play.
A recipe, not a model. The CPT + HCF routine, uncertainty-first data and DUPO optimizer are architecture-agnostic; any ReAct-style agent with tool APIs can adopt them.
Downward compatibility. Despite training only on headache-grade puzzles, WebSailor’s 72 B model scores >90 % pass@1 on the single-hop SimpleQA benchmark, proving that hard-first curricula don’t break easy tasks.

Open weights, open benchmark

Code, data-generation scripts and checkpoints live in Tongyi Lab’s GitHub repo, alongside a dockerized evaluator so outside teams can reproduce—or dispute—the numbers.

With WebSailor, the open-source fleet finally has a flagship capable of keeping proprietary juggernauts in sight. The real question now: how long before someone splices SailorFog-style data and DUPO into a general-purpose agent that can shop, schedule and navigate enterprise wikis with the same super-human calm?

Paper link: arXiv 2507.02592 (PDF)

4.7.25

MoCa turns your favorite VLM into a bidirectional embedding powerhous

Causal-attention vision–language models (VLMs) are great storytellers, but they’re not ideal when you just need a single, rock-solid vector that fuses pixels and prose. A joint team from Renmin University of China, Stanford and Microsoft Research Asia thinks it has a fix. In a paper released this week, the researchers introduce MoCa — Modality-aware Continual Pre-training, a plug-and-play recipe that transforms any off-the-shelf VLM into a bidirectional, retrieval-grade multimodal embedder.

Two stages, three big problems solved

Modality-aware Continual Pre-training (CPT)
Joint reconstruction denoises interleaved text tokens via masked-language modeling and masked image patches via a lightweight decoder in one go. The tweak injects bidirectional attention and lets the model learn from billions of unlabeled, mixed-modality tokens.
Heterogeneous Contrastive Fine-tuning (HCF)
Moving beyond garden-variety image-caption pairs, MoCa mixes long-form query-document sets, curated visual-text pairs and plain text-only examples. Task-aware batching throws all three into every mini-batch, forcing deeper cross-modal reasoning instead of surface-level matching.

Together, the stages tackle the trio of headaches plaguing existing embedding retrofits: causal attention, dependence on labeled pairs and narrow training objectives.

Numbers that matter

Model	Params	MMEB (overall ↑)	ViDoRe-v2 (avg ↑)
mmE5	11 B	69.8	50.5
VLM2Vec	7 B	62.9	38.7
MoCa-3B	3 B	67.5	59.8
MoCa-7B	7 B	71.5	58.8

A 7-billion-parameter MoCa variant tops all published baselines across MMEB’s 36 tasks, while the lighter 3-B version jumps almost 10 points on ViDoRe-v2’s document-level retrieval suite. Even more telling: a 3-B MoCa with CPT beats 7-B models trained only with contrastive learning.

Ablations spotlight CPT’s punch

Yank out either the masked-language (MLM) or masked-autoencoding (MAE) objectives during CPT, and MMEB scores slide by up to 1.3 points. Drop the entire CPT stage and you lose nearly 2 points—proof that modality-aware reconstruction, not just more contrastive data, drives the gains.

Why it matters

Retrieval is eating the multimodal world. Search, RAG pipelines and recommender systems need embeddings, not prose. A bidirectional retrofit averts the cost of training from scratch.
Scales with unlabeled data. By exploiting noisy Web corpora, MoCa sidesteps the image-caption bottleneck hobbling many CLIP-style updates.
Open VLM agnostic. The authors demo on Qwen-2.5-VL backbones, but the training recipe is architecture-neutral—anything with a ViT and Transformer decoder should drop in.

What’s next

The paper hints at a public GitHub release with checkpoints, data loaders and task-aware batching helpers. If the repo ships soon, expect MoCa-style CPT to become a default step for teams building multimodal RAG or e-commerce search engines on lightweight hardware.

Paper link: arXiv 2506.23115 (PDF)

DiffuCoder rewrites the code-LLM playbook with diffusion and smarter RL

Autoregressive (AR) giants like GPT-4o and Qwen2.5 dominate today’s leaderboard-driven coding scene, but Apple’s research group thinks the next breakthrough may come from an entirely different generation paradigm. In a paper published late last week, the team unveiled DiffuCoder — a 7 B-parameter masked diffusion language model (dLLM) designed specifically for program synthesis and repair. Unlike AR models that predict the next token left-to-right, DiffuCoder iteratively denoises whole sequences, enabling global planning and out-of-order refinement.

What’s new under the hood

Scaled training for code. DiffuCoder is pretrained on 130 billion code tokens, then instruction-tuned and RL-fined on curated problem sets. That makes it one of the largest diffusion-first code models publicly documented.
Decoding insights. The authors introduce local and global AR-ness metrics to quantify how often a diffusion model falls back to sequential generation. They show that raising temperature not only diversifies token choice but also the order in which tokens are filled — a property AR models lack.
Coupled-GRPO. To tame the high-variance log-likelihood estimates that plague diffusion policy gradients, Apple proposes coupled Group Relative Policy Optimization, a two-pass masking strategy that evaluates complementary token subsets in one RL rollout. The technique drops noise without resorting to semi-AR “block decoding,” keeping the model fully diffusion-native.

Benchmark scores that matter

DiffuCoder’s base model already lands in the same ballpark as leading 7/8 B AR coders. After instruction tuning and coupled-GRPO, it posts:

Model	HumanEval+	MBPP+	EvalPlus (avg.)	BigCodeBench C-Full
DiffuCoder-Instruct	72.0	65.2	75.1	61.9
+ coupled-GRPO	73.2	68.3	78.6	67.5

That +4.4-point jump on EvalPlus brings the diffusion model within striking distance of Qwen2.5-Coder-SFT while comfortably outpacing earlier dLLMs like Dream-7B and LLaDA-Instruct.

Why it matters

Diffusion’s parallel denoising lets models “think in drafts,” revisiting earlier lines without paying the quadratic attention tax AR models incur for long contexts. For enterprise dev-ops teams staring down thousand-line files, a diffusion-native coder that no longer needs block-wise hacks could slash latency and memory. And because coupled-GRPO is plug-and-play, the method can in theory retrofit any masked diffusion LLM — not just Apple’s.

Early tooling and ecosystem

A DiffuCoder-7B-Instruct checkpoint is already live on Hugging Face, and the GitHub repo ships with sampling scripts, RL rewards and evaluation harnesses. That means startups building unit-test agents or code-review copilots can kick the tires today on a single A100.

The bigger question is whether diffusion LLMs can climb the performance ladder as fast as their image cousins did in 2022. Apple’s coupled-GRPO shows one path forward: make RL native to diffusion instead of forcing AR habits onto a fundamentally different beast. If follow-up work scales the idea to 34 B or 70 B parameters, AR incumbents may soon find themselves sharing the podium.

Paper link: arXiv 2506.20639 (PDF)

Keye-VL: Kuaishou’s 8-billion-parameter bid to dominate video-first AI

If image-centric multimodal large language models (MLLMs) were last year’s breakout stars, 2025 is shaping up to be all about video. Today Kuaishou’s research arm quietly published the Kwai Keye-VL Technical Report, unveiling an 8-billion-parameter model that claims state-of-the-art results across every major short-video benchmark — all while staying lean enough to fine-tune on a single A100 or RTX 6000.

Built on data — 600 billion tokens of it

Keye-VL’s recipe starts with scale where it matters: data. The team curated a 600 billion-token corpus heavily skewed toward short videos, supplementing it with images and pure text for balance. Training unfolds in a four-stage pre-train pipeline (image-text matching ➜ ViT-LLM alignment ➜ multi-task pre-train ➜ annealing) and a two-phase post-train that injects reasoning skill through a five-mode “cold-start” mixture (think / no-think / auto-think / think-with-image / high-quality video) plus reinforcement-learning alignment to squash repetition and hallucination.

A hybrid SigLIP + Qwen3 backbone

Under the hood, Keye-VL bolts a SigLIP vision encoder onto Qwen3-8B, then unifies text, image and video tokens with 3-D RoPE positional encoding. Dynamic-resolution support keeps aspect ratios intact, while an isomorphic-heterogeneous parameter-fusion trick averages weights from differently mixed data regimes to boost robustness without extra FLOPs.

Crushing the video leaderboards

On Video-MME, Video-MMMU, TempCompass, LongVideoBench and MMVU, Keye-VL outperforms every open-source or proprietary model in its size class, according to the authors. They also introduce KC-MMBench, a purpose-built benchmark of real-world short-video tasks, where Keye-VL “shows a significant advantage” over larger rivals. While the paper withholds exact deltas pending conference review, the accompanying GitHub charts depict double-digit gains on several suites.

Why it matters

Short-form video is the lingua franca of Gen Z commerce and social search — but decoding dozens of rapid cuts, subtitles and visual gags is still a blind spot for many MLLMs. By feeding a video-centric diet into a lightweight backbone, Kuaishou positions Keye-VL as both a production-ready recommendation engine for its 600-million-user platform and a developer-friendly alternative to heavyweight research models like Gemini 1.5 Pro or OpenAI’s rumored VideoGPT.

Open weights, open benchmark

An 8B preview checkpoint is already live on Hugging Face, complete with a keye-vl-utils helper library and Colab demo. KC-MMBench’s evaluation scripts ship in the same repo, inviting outside labs to reproduce — or refute — Kuaishou’s numbers. For startups building shopping stream copilots or automated highlight reels, a smaller, video-savvy foundation could be the missing piece.

Keye-VL still faces unanswered questions — latency under real-time loads, licensing around its internal data, and how well the “think-with-image” mode generalizes beyond curated prompts. But if the benchmarks hold up, Kuaishou just proved you don’t need GPT-sized weights to understand the world in motion.

Paper link: arXiv 2507.01949 (PDF)

3.7.25

LongAnimation promises Tokyo-quality color at indie-studio speed

When you think about the most time-consuming part of anime production, flashy fight scenes or painstaking tweening may spring to mind. In reality, a huge chunk of budget and overtime goes into the unglamorous grind of coloring hundreds of frames so that a heroine’s yellow ribbon doesn’t silently morph into pink halfway through a scene. A new paper out of the University of Science and Technology of China and HKUST wants to make that tedium disappear.

Today the team unveiled LongAnimation: Long Animation Generation with Dynamic Global-Local Memory, a diffusion-transformer pipeline that can propagate colors consistently across 500-frame sequences—roughly 20 seconds at broadcast frame rates—without the dreaded color drift that plagues existing tools. Compared with state-of-the-art video colorization baselines, LongAnimation slashes Frechet Video Distance by 35.1% on short clips and 49.1% on long ones, while cutting perceptual error (LPIPS) by more than half.

How it works

SketchDiT
A customized DiT backbone ingests three control signals—line-art sketches, a single colored keyframe, and optional text prompts—to extract what the authors call a “hybrid reference embedding.” This keeps the model flexible enough to obey textual cues (“sunset sky”) while staying locked onto a character’s palette.
Dynamic Global-Local Memory (DGLM)
Prior systems only merge overlapping windows, so they see at best the last few seconds of footage. LongAnimation pipes every generated segment through Video-XL, a long-video understanding model, compressing thousands of frames into a global cache. During generation, the network adaptively fuses that global context with a short “local” cache, letting it remember that the yellow ribbon was, in fact, yellow back in frame 25.
Color Consistency Reward (CCR)
To train the system without back-propagating through a hefty 3D VAE, the authors bolt on a reinforcement-learning reward that directly scores low-frequency color coherence. A late-stage latent-space fusion trick during inference (their “CCF”) then smooths boundary artifacts between segments.

Why it matters

Traditional colorization assistants like LVCD or ToonCrafter top out at ~100 frames or quietly devolve into noise accumulation if you stitch segments together. LongAnimation’s five-times leap in sequence length pushes automated coloring into territory that covers most dialogue and establishing shots, not just blink-and-you-miss-it gifs.

For mid-tier studios in Seoul or Manila that churn through thousands of outsourced cuts each month, the economics are compelling: one keyframe plus vectorized sketches could drive bulk coloring, leaving human artists to polish hero shots. And because SketchDiT still honors text instructions, directors can tweak backgrounds—“make it dawn instead of dusk”—without round-tripping to compositing.

Under the hood

Model size: Built on top of CogVideoX-1.5 (5 B params).
Training set: ~80 k high-aesthetic clips from Sakuga-42M, filtered for >91 frames.
Hardware: 6 × NVIDIA A100 GPUs, LR = 1e-5, three-stage curriculum (SketchDiT 30 k steps → DGLM 10 k → CCR 10 k).
Code: The repo, demo videos, and Colab notebook are already live on GitHub.

The bigger picture

LongAnimation lands amid a broader rush to extend diffusion transformers beyond blink-length video. Google’s DitCtrl and Meta’s SlowFast-VGen deliver longer shots but rely on window fusion or fine-tuned LoRA weights. By contrast, LongAnimation’s plug-and-play memory module could slot into any DiT-style architecture, making it a tempting drop-in upgrade for text-to-video startups chasing the next One Piece.

Just don’t expect the tech to kill colorists’ jobs overnight. Rendering frames is only half the battle; style supervision, motion cleanup and final compositing still demand human taste. But if the ribbon stays yellow without manual touch-ups, the conversation around AI in animation may shift from “Will it replace us?” to “How much budget does it free for better storytelling?”

Paper link: arXiv:2507.01945 (PDF)

Baidu Open-Sources ERNIE 4.5: A Full LLM Family from 0.3 B to 424 B Parameters

A Flagship Release for the Open-Source Community

On July 1 2025, Baidu announced the open-source launch of ERNIE 4.5, a complete large-language-model family scaling from 0.3 billion to 424 billion parameters. The weights, training code, and evaluation suites are now freely available to researchers and enterprises under the Apache 2.0 license.

Six Sizes, One Architecture

Model	Dense / MoE	Context Window	Target Hardware*	Intended Use
ERNIE-Tiny 0.3B	Dense	16 K	Mobile/Edge	Lightweight chat & IoT
ERNIE-Base 7B	Dense	32 K	1× A10 24 GB	Mainstream apps
ERNIE-Large 34B	Dense	128 K	2× A100 80 GB	RAG & agents
ERNIE-XL 124B	MoE (8 experts)	256 K	4× H100 80 GB	Multimodal research
ERNIE-Mega 276B	MoE (16)	256 K	8× H100 80 GB	Enterprise AI
ERNIE-Ultra 424B	MoE (24)	1 M	TPU v5p / 16× H100	Frontier-level reasoning

*at int8 + FlashAttention-2 settings

Technology Highlights

FlashMask Dynamic Attention – a masking scheme that activates only the most relevant key-value blocks per token, cutting memory by 40 % while retaining context depth.
Heterogeneous Multimodal MoE – vision-audio experts share early layers with text, enabling cross-modal reasoning without separate encoders.
Knowledge-Centric Corpus – Baidu’s in-house “Wenxin KG-2” injects 4 T tokens of curated facts and regulations, boosting compliance answers.
Self-Feedback Post-Training – iterative reflection steps reduce hallucination rate by 28 % vs. ERNIE 4.0.

Benchmark Performance

Benchmark (June 2025)	GPT-4.5*	ERNIE 4.5-Ultra 424B	ERNIE 4.5-Large 34B
MMLU (5-shot)	88.7 %	89.3 %	82.1 %
MathGLUE	55.4 %	57.2 %	48.0 %
VQA-v2 (zero-shot)	83.0 %	84.6 %	78.9 %
Code HumanEval+	93.5 %	94.1 %	87.3 %

*closed model; public leaderboard values. ERNIE 4.5 data from Baidu release notes.

Why It Matters

End-to-End Transparency – full training configs (FlashMask, MoE routing, safety filters) are published, enabling reproducible research.
Scalable Deployment – identical API across sizes lets startups choose Tiny/7B locally and swap to 424B in the cloud without prompt changes.
Multilingual & Multimodal – supports 34 languages and native image, audio, and short-video tokens out of the box.
Cost Innovation – FlashMask and MoE shrink inference FLOPs by up to 55 % versus dense GPT-4-class models, lowering GPU bills for enterprise users.

Access & Tooling

Hugging Face Hub – weights and safetensors for all six checkpoints.
Docker & vLLM Images – ready-to-serve stacks with Triton / TensorRT-LLM.
Agent Starter Kits – sample Model-Context-Protocol (MCP) tools for retrieval, calculators, and code execution.
Chinese & English Docs – prompt templates, fine-tuning scripts, and safety policy examples.

Roadmap

Baidu’s research blog notes upcoming “ERNIE 4.6” experiments with FlashMask-2 and sparse Mixture-of-Experts vision heads, plus a policy-aligned Turbo variant targeting 80 % cheaper inference for chat applications.

Takeaway
With ERNIE 4.5, Baidu throws open the doors to a fully transparent, parameter-scalable, multimodal LLM family—giving practitioners a home-grown alternative to closed giants and pushing the frontier of what open-source models can achieve.

Together AI’s DeepSWE Turns Qwen3-32B into an Open-Source Coding Agent that Tops SWEBench

A New State of the Art for Open-Source Coding Agents

Together AI has unveiled DeepSWE, a software-engineering agent that sets a new open-weight record on the notoriously difficult SWEBench-Verified benchmark with 59 % accuracy and 42.2 % Pass@1. Built on Alibaba’s Qwen3-32B language model and trained purely with reinforcement learning, DeepSWE offers a transparent alternative to closed-source dev assistants like GitHub Copilot and Claude Code.

Inside the Training Pipeline

Stage	Details
Warm-Start	Initializes from base Qwen3-32B weights (dense, 32 B params).
R2E-Gym Curriculum	4,500 real GitHub issues converted into step-by-step repair tasks spanning six languages (Python, Java, JS, Go, Rust, C++).
RLHF Loop	Uses a reward model that scores test-suite pass rates and diff conciseness; policy optimized with PPO across 64 × H100s for six days.
Self-Reflect & Distill	High-reward trajectories distilled back into the policy to improve “first-try” success.

The team openly publishes all training code, reward scripts, and checkpoints under Apache 2.0, enabling independent replication or domain-specific finetuning.

Why DeepSWE Matters

One-Shot Repairs over Multi-Tool Chains
DeepSWE fixes repository-level bugs in a single forward pass, skipping heavyweight agent stacks that juggle search, planning, and external compilers.
Reinforcement Learning at Scale
Proves that RL alone—without supervised trace data—can yield production-grade coding skills when paired with a high-capacity base model.
Transparent & Portable
Enterprises can self-host the model, audit its reward functions, and retrain on private codebases without licensing friction.

Benchmark Highlights

Benchmark	DeepSWE (32 B)	DeepSeek-R1-Synth (67 B)	GPT-4o (closed)
SWEBench-Verified	59 %	46 %	64 %
HumanEval Plus	93.1 %	87.4 %	95 %
CommitPackBench	71.3 %	63.0 %	74 %

DeepSWE closes nearly half of the gap to GPT-4-class tools while running on a single 80 GB H100 GPU in int8 mode.

Real-World Capabilities

Bug Repair & Refactor – Generates minimal diffs that compile and pass project test suites.
Feature Stubs – Adds new endpoints, CLI flags, or unit tests on request.
Context Stretch – Accepts up to 64 K tokens, allowing multi-file reasoning across large repos.

Together AI provides an OpenAI-compatible API plus a VS Code extension that surfaces proposed patches as Git diffs for quick human review.

Roadmap

The team plans to:

Release a 13 B “consumer PC” variant trained on the same reward curriculum.
Add tool-augmented variants that can invoke package managers and linters dynamically.
Expand R2E-Gym to 10 K tasks, covering Android and .NET ecosystems.

Takeaway

DeepSWE demonstrates that meticulous RL on a strong open base (Qwen3-32B) can rival closed commercial coders—while remaining fully inspectable and modifiable. For organizations seeking sovereign AI development stacks, it’s a compelling invitation to “clone the repo, load the weights, and start fixing code.”

Baidu’s “AI Search Paradigm” Unveils a Four-Agent Framework for Next-Generation Information Retrieval

A Blueprint for Smarter Search

Traditional RAG pipelines handle simple fact look-ups well but struggle when queries require multi-step reasoning, tool use, or synthesis. In response, Baidu Research has introduced the AI Search Paradigm, a unified framework in which four specialized LLM-powered agents collaborate to emulate human research workflows.

Agent	Role	Key Skills
Master	Classifies query difficulty & launches a workflow	Meta-reasoning, task routing
Planner	Breaks the problem into ordered sub-tasks	Decomposition, tool selection
Executor	Calls external APIs or web search to gather evidence	Retrieval, browsing, code-run
Writer	Consolidates evidence into fluent, cited answers	Synthesis, style control

The architecture adapts on the fly: trivial queries may bypass planning, while open-ended questions trigger full agent collaboration.

Technical Innovations

Dynamic Workflow Graphs – Agents spawn or skip steps in real time based on intermediate results, avoiding rigid “one-size-fits-all” chains.
Robust Tool Layer – Executor can invoke search APIs, calculators, code sandboxes, and custom enterprise databases, all via a common interface.
Alignment & Safety – Reinforcement learning with human feedback (RLHF) plus retrieval-grounding reduce hallucinations and improve citation accuracy.

Benchmark Results

On a suite of open-web reasoning tasks the system, dubbed Baidu ASP in the paper, surpasses state-of-the-art open-source baselines and even challenges proprietary models that rely on massive context windows alone.

Benchmark	Prior Best (RAG)	Baidu ASP
Complex QA (avg. F1)	46.2	57.8
Multi-hop HotpotQA (Exact Match)	41.5	53.0
ORION Deep-Search	37.1	49.6

Practical Implications

Enterprise Knowledge Portals – Route user tickets through Planner→Executor→Writer to surface compliant, fully referenced answers.
Academic Research Assistants – Decompose literature reviews into sub-queries, fetch PDFs, and synthesize summaries.
E-commerce Assistants – From “Find a laptop under $800 that runs Blender” to a shoppable list with citations in a single interaction.

Because each agent is modular, organisations can fine-tune or swap individual components—e.g., plugging in a domain-specific retrieval tool—without retraining the entire stack.

Looking Ahead

The team plans to open-source a reference implementation and release an evaluation harness so other researchers can benchmark new agent variants under identical conditions. Future work focuses on:

Reducing latency by parallelising Executor calls
Expanding the Writer’s multimodal output (tables, charts, code diffs)
Hardening the Master agent’s self-diagnosis to detect and recover from tool failures

Takeaway
Baidu’s AI Search Paradigm reframes search as a cooperative, multi-agent process, merging planning, tool use, and natural-language synthesis into one adaptable pipeline. For enterprises and researchers seeking deeper, trustable answers—not just blue links—this approach signals how tomorrow’s search engines and internal knowledge bots will be built.

29.6.25

Qwen VLo: Alibaba’s New Multimodal Model That Both Understands and Creates the World

From Perception to Creation

The Alibaba Qwen research team has introduced Qwen VLo, a next-generation multimodal model that fuses visual understanding with image generation in a single framework. Building on earlier Qwen-VL iterations, Qwen VLo not only interprets complex visual scenes but can also re-create or modify them on command—closing the loop between perception and synthesis.

Key Capabilities

Feature	What It Delivers
Unified Architecture	One checkpoint handles both visual comprehension (classification, localization, QA) and high-fidelity image generation.
Progressive Scene Construction	Rather than rendering a picture in a single step, Qwen VLo refines the canvas iteratively, letting users adjust lighting, add elements, or correct details mid-process—similar to non-destructive photo editing.
Multilingual Prompting	Supports 29 languages, enabling global creators to generate and edit images without English-only constraints.
In-Context Editing	Upload a photo, issue a prompt like “add a red cap to the cat,” and receive an updated image that preserves original structure and semantics.

Users can try all of this now in Qwen Chat: type “Generate a picture of a cyberpunk street at dawn,” watch the scene build in real time, then request tweaks—no extra tools required.

Technical Highlights

Dual-Path Transformer Backbone – Merges a vision encoder with a language decoder via cross-modal attention, allowing dense pixel features to condition text generation and vice-versa.
High-Resolution Support – Trained on images up to 1024 × 1024 with adaptive patching, yielding sharper details than its Qwen-VL predecessor.
Consistency-First Training – Loss functions penalize semantic drift, ensuring an edited image keeps key structures (e.g., cars stay cars, buildings remain intact).
Open-Weight Preview – While today’s checkpoint is a “preview” available through Qwen Chat, Alibaba says it will release research weights and evaluation code for the community after internal red-teaming.

How Qwen VLo Stacks Up

Early demos show Qwen VLo competing with proprietary leaders like OpenAI’s DALL·E 3 and Google’s Imagen 3, particularly in iterative editing—a niche where real-time, step-by-step refinement matters more than single-shot quality. Its multilingual reach also outpaces many Western rivals focused on English-centric pipelines.

Metric	Qwen VLo	Qwen-VL-Chat (2023)	DALL·E 3*
Multilingual prompts	29 langs	2 langs	1 lang
Progressive edit loop	Yes	Limited	No (separate calls)
Direct in-chat usage	Yes	Yes	Via API / Bing

*Publicly documented capabilities, not full benchmark numbers.

Early Use-Cases

Product Prototyping – Designers iterate packaging mock-ups in seconds, adjusting colors or features interactively.
E-commerce Localization – Sellers generate region-specific imagery (e.g., text overlays in Arabic or Thai) from the same master prompt.
Education & Media – Teachers create step-wise visual explanations, refining diagrams as students ask follow-up questions.

Limitations & Roadmap

Alibaba notes the preview model still struggles with text rendering inside images and ultra-fine object counts beyond 20 items. Future updates will incorporate a tokenizer specialized for embedded text and larger training batches to mitigate these edge cases. A video-generation extension, Qwen VLo-Motion, is also under internal testing.

Final Takeaway

Qwen VLo signals the next phase of multimodal AI, where understanding and creation converge in one model. By offering progressive editing, broad language support, and immediate access via Qwen Chat, Alibaba is positioning its Qwen series as a practical, open alternative to closed-source image generators—and bringing the world a step closer to seamless, conversational creativity.