6.7.25

FreeMorph turns Stable Diffusion into a one-click image-morphing engine

 Image morphing has been around since Michael Jackson’s Black or White video, but most modern AI pipelines still demand per-pair fine-tuning or laborious warping to keep shapes and textures coherent. A new paper from NTU, Nanjing University and CUHK drops that baggage. FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model repurposes an off-the-shelf Stable Diffusion 2.1 checkpoint to generate frame-perfect transitions between any two images—faces, cars, even cat-to-dog mash-ups—without touching a single weight. 

Two tricks make the magic happen

  1. Guidance-aware spherical interpolation (GaSI). Instead of naive latent mixing, FreeMorph blends the key-value pairs inside Stable Diffusion’s self-attention, injecting “identity anchors” from both source images so the morph stays on course. 

  2. Step-oriented variation trend (SoVT). A second module dials in how much of each image shows up at every denoising step, taming the non-linear chaos that usually derails tuning-free edits. 
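
For readers who want a feel for the "spherical interpolation" part, here is a minimal sketch of plain slerp between two latent tensors. It is not the paper's GaSI, which according to the description above also blends attention key-value pairs. The tensor shapes and the names `z_a` and `z_b` are illustrative assumptions.

```python
import numpy as np

def slerp(z0: np.ndarray, z1: np.ndarray, t: float, eps: float = 1e-7) -> np.ndarray:
    """Spherical linear interpolation between two same-shaped latent tensors."""
    a, b = z0.ravel(), z1.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))   # angle between the latents
    if np.sin(omega) < eps:                            # nearly parallel: use linear mixing
        return (1.0 - t) * z0 + t * z1
    w0 = np.sin((1.0 - t) * omega) / np.sin(omega)
    w1 = np.sin(t * omega) / np.sin(omega)
    return w0 * z0 + w1 * z1

rng = np.random.default_rng(0)
z_a = rng.standard_normal((4, 64, 64))   # hypothetical stand-in for image A's latent
z_b = rng.standard_normal((4, 64, 64))   # hypothetical stand-in for image B's latent
frames = [slerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 5)]
```

Unlike straight linear mixing, slerp keeps intermediate latents at a norm comparable to the endpoints, which is the usual reason diffusion pipelines prefer it for Gaussian-like latents.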

Faster and smoother than the competition

Running on a single NVIDIA A100, FreeMorph spits out a full transition sequence in under 30 seconds, beating DiffMorpher and IMPUS—which both require minutes of LoRA fine-tuning—while delivering sharper edges and fewer identity slips.

A new benchmark to prove it

Because existing datasets skew toward near-identical pairs, the authors collected Morph4Data: four classes of image pairs ranging from “same layout, different semantics” to “totally unrelated.” On this tougher mix, FreeMorph tops every published method in quantitative metrics and user studies alike.

Why this matters

For creative-tool startups, FreeMorph means morphing features can ship as a call to Stable Diffusion rather than a 30-minute fine-tune. For researchers, GaSI + SoVT point to a broader lesson: you can co-opt diffusion attention layers for structural edits without sacrificing model generality.

The code, demo video and ready-to-run Colab notebook are already live on GitHub, so expect FreeMorph-powered GIF makers to surface on your timeline before summer’s out.

Paper link: arXiv 2507.01953 (PDF)

WebSailor charts an open-source course to super-human web reasoning

 For the past year, open-source web agents have looked like dinghies chasing aircraft carriers: even 70-billion-parameter models scraped single-digit accuracy on BrowseComp-en, the field’s toughest information-seeking benchmark, while closed systems such as DeepResearch and Grok-3 cruised far ahead. Tongyi Lab, Alibaba’s applied-AI skunkworks, says it has all but closed that gap with WebSailor, a post-training recipe that rewires large language models to “think like uncertainty-slayers.” 

Turning the web into a maze on purpose

At the heart of WebSailor is SailorFog-QA, a synthetic dataset that bombards the model with “Level-3” problems—questions whose answers hide behind tangled entity graphs and deliberately obfuscated clues (“a musician later honored in the early 21st century,” “a chronology that ends the same year a late-antique poet died”). Random walks over real web pages build those graphs; masking, vagueness and partial names turn each query into a fog bank the agent must burn off through multi-step reasoning. 
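
To make the graph-then-fog idea concrete, here is a toy sketch of a random walk over an entity graph followed by a crude obfuscation step. Every entity, relation and function name below is invented for illustration. The real pipeline walks actual web pages and uses subtler masking (vague dates, partial names) than simple placeholders.

```python
import random

# Toy entity graph: all names and relations are made up for this example.
GRAPH = {
    "Composer A": [("wrote", "Opera B"), ("born_in", "City C")],
    "Opera B": [("premiered_in", "City C"), ("inspired", "Film D")],
    "City C": [("hosts", "Festival E")],
    "Film D": [("directed_by", "Director F")],
    "Festival E": [],
    "Director F": [],
}

def random_walk(graph, start, max_hops, rng):
    """Follow random outgoing edges and return the (head, relation, tail) hops."""
    path, node = [], start
    for _ in range(max_hops):
        edges = graph.get(node, [])
        if not edges:            # dead end: stop the walk early
            break
        relation, nxt = rng.choice(edges)
        path.append((node, relation, nxt))
        node = nxt
    return path

def obfuscate(path):
    """Replace entity names with placeholders; the last entity becomes the answer."""
    clues = [f"X{i} {rel.replace('_', ' ')} X{i + 1}" for i, (_, rel, _) in enumerate(path)]
    question = "; ".join(clues) + f". What is X{len(path)}?"
    return question, path[-1][2]

rng = random.Random(0)
path = random_walk(GRAPH, "Composer A", max_hops=3, rng=rng)
question, answer = obfuscate(path)
```

The longer the walk and the fewer names left visible, the more hops an agent has to resolve before it can even identify what it is looking for.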

DUPO: reinforcement learning that isn’t painfully slow

Tool-using agents learn painfully slowly because every step calls a browser, but Tongyi Lab’s Duplicating Sampling Policy Optimization (DUPO) makes each RL batch pull double duty: one pass samples harder trajectories, the next re-samples mid-episode to squeeze more signal from sparse rewards. A small rejection-sampling fine-tuning (RFT) “cold start” of just 2 k expert traces primes the model so DUPO has something to optimize. 

Four sizes, one giant leap

WebSailor comes in 3B, 7B, 32B and 72B flavors. Even the 7-billion-parameter version hits 6.7 % pass@1 on BrowseComp-en, trouncing agents built on 32 B backbones that manage barely 2 – 3 %. The 32 B and 72 B models push further, outscoring open-source peers on BrowseComp-en/zh, GAIA and XBench and edging past proprietary offerings like Grok-3 and Doubao-Search when those systems add browsing tools. 

Why it matters

  • Democratizing deep search. BrowseComp-level tasks—ask a question, navigate dozen-plus pages, synthesize an answer—are what corporate knowledge-bases and vertical search startups need. WebSailor shows you no longer need a closed-source giant to play.

  • A recipe, not a model. The RFT cold start, uncertainty-first data and DUPO optimizer are architecture-agnostic; any ReAct-style agent with tool APIs can adopt them.

  • Downward compatibility. Despite training only on headache-grade puzzles, WebSailor’s 72 B model scores >90 % pass@1 on the single-hop SimpleQA benchmark, proving that hard-first curricula don’t break easy tasks. 

Open weights, open benchmark

Code, data-generation scripts and checkpoints live in Tongyi Lab’s GitHub repo, alongside a dockerized evaluator so outside teams can reproduce—or dispute—the numbers. 

With WebSailor, the open-source fleet finally has a flagship capable of keeping proprietary juggernauts in sight. The real question now: how long before someone splices SailorFog-style data and DUPO into a general-purpose agent that can shop, schedule and navigate enterprise wikis with the same super-human calm?

Paper link: arXiv 2507.02592 (PDF)

4.7.25

MoCa turns your favorite VLM into a bidirectional embedding powerhouse

 Causal-attention vision–language models (VLMs) are great storytellers, but they’re not ideal when you just need a single, rock-solid vector that fuses pixels and prose. A joint team from Renmin University of China, Stanford and Microsoft Research Asia thinks it has a fix. In a paper released this week, the researchers introduce MoCa — Modality-aware Continual Pre-training, a plug-and-play recipe that transforms any off-the-shelf VLM into a bidirectional, retrieval-grade multimodal embedder.

Two stages, three big problems solved

  1. Modality-aware Continual Pre-training (CPT)
    Joint reconstruction denoises interleaved text tokens via masked-language modeling and masked image patches via a lightweight decoder in one go. The tweak injects bidirectional attention and lets the model learn from billions of unlabeled, mixed-modality tokens.

  2. Heterogeneous Contrastive Fine-tuning (HCF)
    Moving beyond garden-variety image-caption pairs, MoCa mixes long-form query-document sets, curated visual-text pairs and plain text-only examples. Task-aware batching throws all three into every mini-batch, forcing deeper cross-modal reasoning instead of surface-level matching.

Together, the stages tackle the trio of headaches plaguing existing embedding retrofits: causal attention, dependence on labeled pairs and narrow training objectives.
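
The causal-versus-bidirectional distinction is easy to see in a few lines of NumPy. This is a generic attention-mask sketch, not MoCa's training code. The function name and the uniform toy scores are assumptions chosen to make the output readable.

```python
import numpy as np

def attention_weights(scores: np.ndarray, causal: bool) -> np.ndarray:
    """Row-wise softmax over attention scores, optionally hiding future positions."""
    n = scores.shape[-1]
    if causal:
        future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                  # uniform scores keep the example readable
causal = attention_weights(scores, causal=True)
bidirectional = attention_weights(scores, causal=False)
```

Under the causal mask the first token attends only to itself, so an embedding pooled from it never sees the rest of the input. With the mask removed, every position draws on the full sequence, which is the property a retrieval embedding needs.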

Numbers that matter

Model     Params  MMEB (overall ↑)  ViDoRe-v2 (avg ↑)
mmE5      11 B    69.8              50.5
VLM2Vec   7 B     62.9              38.7
MoCa-3B   3 B     67.5              59.8
MoCa-7B   7 B     71.5              58.8

A 7-billion-parameter MoCa variant tops all published baselines across MMEB’s 36 tasks, while the lighter 3-B version jumps almost 10 points on ViDoRe-v2’s document-level retrieval suite. Even more telling: a 3-B MoCa with CPT beats 7-B models trained only with contrastive learning.

Ablations spotlight CPT’s punch

Yank out either the masked-language (MLM) or masked-autoencoding (MAE) objectives during CPT, and MMEB scores slide by up to 1.3 points. Drop the entire CPT stage and you lose nearly 2 points—proof that modality-aware reconstruction, not just more contrastive data, drives the gains.

Why it matters

  • Retrieval is eating the multimodal world. Search, RAG pipelines and recommender systems need embeddings, not prose. A bidirectional retrofit averts the cost of training from scratch.

  • Scales with unlabeled data. By exploiting noisy Web corpora, MoCa sidesteps the image-caption bottleneck hobbling many CLIP-style updates.

  • Open VLM agnostic. The authors demo on Qwen-2.5-VL backbones, but the training recipe is architecture-neutral—anything with a ViT and Transformer decoder should drop in.

What’s next

The paper hints at a public GitHub release with checkpoints, data loaders and task-aware batching helpers. If the repo ships soon, expect MoCa-style CPT to become a default step for teams building multimodal RAG or e-commerce search engines on lightweight hardware.

Paper link: arXiv 2506.23115 (PDF)

DiffuCoder rewrites the code-LLM playbook with diffusion and smarter RL

 Autoregressive (AR) giants like GPT-4o and Qwen2.5 dominate today’s leaderboard-driven coding scene, but Apple’s research group thinks the next breakthrough may come from an entirely different generation paradigm. In a paper published late last week, the team unveiled DiffuCoder — a 7 B-parameter masked diffusion language model (dLLM) designed specifically for program synthesis and repair. Unlike AR models that predict the next token left-to-right, DiffuCoder iteratively denoises whole sequences, enabling global planning and out-of-order refinement.

What’s new under the hood

  • Scaled training for code. DiffuCoder is pretrained on 130 billion code tokens, then instruction-tuned and RL-fine-tuned on curated problem sets. That makes it one of the largest diffusion-first code models publicly documented.

  • Decoding insights. The authors introduce local and global AR-ness metrics to quantify how often a diffusion model falls back to sequential generation. They show that raising temperature diversifies not only which tokens are chosen but also the order in which they are filled, a property AR models lack.

  • Coupled-GRPO. To tame the high-variance log-likelihood estimates that plague diffusion policy gradients, Apple proposes coupled Group Relative Policy Optimization, a two-pass masking strategy that evaluates complementary token subsets in one RL rollout. The technique drops noise without resorting to semi-AR “block decoding,” keeping the model fully diffusion-native.
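
A simplified reading of the two ideas above can be sketched as follows. The AR-ness functions are a single-step approximation of what such metrics might measure, and `complementary_masks` only illustrates the "complementary token subsets" notion. The paper's exact definitions and sampling scheme may differ, and all function names here are hypothetical.

```python
import numpy as np

def local_ar_ness(order):
    """Fraction of decoding steps that fill the slot right after the previous one."""
    steps = list(zip(order, order[1:]))
    return sum(cur == prev + 1 for prev, cur in steps) / len(steps)

def global_ar_ness(order):
    """Fraction of decoding steps that fill the leftmost still-masked slot."""
    remaining, hits = set(order), 0
    for pos in order:
        hits += pos == min(remaining)
        remaining.remove(pos)
    return hits / len(order)

def complementary_masks(seq_len, rng):
    """Sample a random mask and its complement so each token is masked exactly once."""
    mask = rng.random(seq_len) < 0.5
    return mask, ~mask

left_to_right = [0, 1, 2, 3, 4, 5]        # position unmasked at each step
shuffled = [3, 0, 5, 1, 4, 2]

ar_scores = (local_ar_ness(left_to_right), global_ar_ness(left_to_right))
free_scores = (local_ar_ness(shuffled), global_ar_ness(shuffled))
mask_a, mask_b = complementary_masks(8, np.random.default_rng(0))
```

A strictly left-to-right decode scores 1.0 on both measures, while a scattered order scores lower. Pairing a mask with its complement means every token's likelihood is estimated once across the two passes, which is one plausible way to cut estimator variance.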

Benchmark scores that matter

DiffuCoder’s base model already lands in the same ballpark as leading 7/8 B AR coders. After instruction tuning and coupled-GRPO, it posts:

Model                HumanEval+  MBPP+  EvalPlus (avg.)  BigCodeBench C-Full
DiffuCoder-Instruct  72.0        65.2   75.1             61.9
+ coupled-GRPO       73.2        68.3   78.6             67.5

That +4.4-point jump on EvalPlus brings the diffusion model within striking distance of Qwen2.5-Coder-SFT while comfortably outpacing earlier dLLMs like Dream-7B and LLaDA-Instruct.

Why it matters

Diffusion’s parallel denoising lets models “think in drafts,” revisiting earlier lines without paying the quadratic attention tax AR models incur for long contexts. For enterprise dev-ops teams staring down thousand-line files, a diffusion-native coder that no longer needs block-wise hacks could slash latency and memory. And because coupled-GRPO is plug-and-play, the method can in theory retrofit any masked diffusion LLM — not just Apple’s.

Early tooling and ecosystem

A DiffuCoder-7B-Instruct checkpoint is already live on Hugging Face, and the GitHub repo ships with sampling scripts, RL rewards and evaluation harnesses. That means startups building unit-test agents or code-review copilots can kick the tires today on a single A100.

The bigger question is whether diffusion LLMs can climb the performance ladder as fast as their image cousins did in 2022. Apple’s coupled-GRPO shows one path forward: make RL native to diffusion instead of forcing AR habits onto a fundamentally different beast. If follow-up work scales the idea to 34 B or 70 B parameters, AR incumbents may soon find themselves sharing the podium.

Paper link: arXiv 2506.20639 (PDF)

Keye-VL: Kuaishou’s 8-billion-parameter bid to dominate video-first AI

 If image-centric multimodal large language models (MLLMs) were last year’s breakout stars, 2025 is shaping up to be all about video. Today Kuaishou’s research arm quietly published the Kwai Keye-VL Technical Report, unveiling an 8-billion-parameter model that claims state-of-the-art results across every major short-video benchmark — all while staying lean enough to fine-tune on a single A100 or RTX 6000.

Built on data — 600 billion tokens of it

Keye-VL’s recipe starts with scale where it matters: data. The team curated a 600 billion-token corpus heavily skewed toward short videos, supplementing it with images and pure text for balance. Training unfolds in a four-stage pre-train pipeline (image-text matching ➜ ViT-LLM alignment ➜ multi-task pre-train ➜ annealing) and a two-phase post-train that injects reasoning skill through a five-mode “cold-start” mixture (think / no-think / auto-think / think-with-image / high-quality video) plus reinforcement-learning alignment to squash repetition and hallucination.

A hybrid SigLIP + Qwen3 backbone

Under the hood, Keye-VL bolts a SigLIP vision encoder onto Qwen3-8B, then unifies text, image and video tokens with 3-D RoPE positional encoding. Dynamic-resolution support keeps aspect ratios intact, while an isomorphic-heterogeneous parameter-fusion trick averages weights from differently mixed data regimes to boost robustness without extra FLOPs.
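
The parameter-fusion idea, as described, amounts to averaging weights from several training runs. A minimal sketch, assuming checkpoints are dictionaries of same-shaped arrays, looks like this. The checkpoint names and the uniform weighting are illustrative, not Kuaishou's actual recipe.

```python
import numpy as np

def average_checkpoints(checkpoints, weights=None):
    """Merge same-shaped parameter dicts by (optionally weighted) averaging."""
    if weights is None:
        weights = [1.0 / len(checkpoints)] * len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        merged[name] = sum(w * ckpt[name] for w, ckpt in zip(weights, checkpoints))
    return merged

# Hypothetical checkpoints trained on different data mixtures.
ckpt_video_heavy = {"proj.weight": np.full((2, 2), 1.0)}
ckpt_text_heavy = {"proj.weight": np.full((2, 2), 3.0)}
merged = average_checkpoints([ckpt_video_heavy, ckpt_text_heavy])
```

Because the merge happens once, offline, the fused model has the same size and inference cost as any single checkpoint, which is why the technique adds no extra FLOPs.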

Crushing the video leaderboards

On Video-MME, Video-MMMU, TempCompass, LongVideoBench and MMVU, Keye-VL outperforms every open-source or proprietary model in its size class, according to the authors. They also introduce KC-MMBench, a purpose-built benchmark of real-world short-video tasks, where Keye-VL “shows a significant advantage” over larger rivals. While the paper withholds exact deltas pending conference review, the accompanying GitHub charts depict double-digit gains on several suites.

Why it matters

Short-form video is the lingua franca of Gen Z commerce and social search — but decoding dozens of rapid cuts, subtitles and visual gags is still a blind spot for many MLLMs. By feeding a video-centric diet into a lightweight backbone, Kuaishou positions Keye-VL as both a production-ready recommendation engine for its 600-million-user platform and a developer-friendly alternative to heavyweight research models like Gemini 1.5 Pro or OpenAI’s rumored VideoGPT.

Open weights, open benchmark

An 8B preview checkpoint is already live on Hugging Face, complete with a keye-vl-utils helper library and Colab demo. KC-MMBench’s evaluation scripts ship in the same repo, inviting outside labs to reproduce — or refute — Kuaishou’s numbers. For startups building shopping stream copilots or automated highlight reels, a smaller, video-savvy foundation could be the missing piece.

Keye-VL still faces unanswered questions — latency under real-time loads, licensing around its internal data, and how well the “think-with-image” mode generalizes beyond curated prompts. But if the benchmarks hold up, Kuaishou just proved you don’t need GPT-sized weights to understand the world in motion.

Paper link: arXiv 2507.01949 (PDF)

3.7.25

LongAnimation promises Tokyo-quality color at indie-studio speed

 When you think about the most time-consuming part of anime production, flashy fight scenes or painstaking tweening may spring to mind. In reality, a huge chunk of budget and overtime goes into the unglamorous grind of coloring hundreds of frames so that a heroine’s yellow ribbon doesn’t silently morph into pink halfway through a scene. A new paper out of the University of Science and Technology of China and HKUST wants to make that tedium disappear.

Today the team unveiled LongAnimation: Long Animation Generation with Dynamic Global-Local Memory, a diffusion-transformer pipeline that can propagate colors consistently across 500-frame sequences (roughly 20 seconds at broadcast frame rates) without the dreaded color drift that plagues existing tools. Compared with state-of-the-art video colorization baselines, LongAnimation slashes Fréchet Video Distance by 35.1% on short clips and 49.1% on long ones, while cutting perceptual error (LPIPS) by more than half.




How it works

  1. SketchDiT
    A customized DiT backbone ingests three control signals—line-art sketches, a single colored keyframe, and optional text prompts—to extract what the authors call a “hybrid reference embedding.” This keeps the model flexible enough to obey textual cues (“sunset sky”) while staying locked onto a character’s palette.

  2. Dynamic Global-Local Memory (DGLM)
    Prior systems only merge overlapping windows, so they see at best the last few seconds of footage. LongAnimation pipes every generated segment through Video-XL, a long-video understanding model, compressing thousands of frames into a global cache. During generation, the network adaptively fuses that global context with a short “local” cache, letting it remember that the yellow ribbon was, in fact, yellow back in frame 25.

  3. Color Consistency Reward (CCR)
    To train the system without back-propagating through a hefty 3D VAE, the authors bolt on a reinforcement-learning reward that directly scores low-frequency color coherence. A late-stage latent-space fusion trick during inference (their “CCF”) then smooths boundary artifacts between segments.
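
As a rough intuition for scoring "low-frequency color coherence," the toy proxy below average-pools frames into coarse blocks and measures how far each frame's coarse colors drift from a reference. It is an assumption-laden illustration, not the paper's reward. The block size, the reference-frame comparison and both function names are invented here.

```python
import numpy as np

def low_freq(frame: np.ndarray, block: int = 8) -> np.ndarray:
    """Average-pool an (H, W, 3) frame into coarse blocks, discarding fine detail."""
    h, w, c = frame.shape
    h, w = h - h % block, w - w % block                  # crop to a multiple of block
    pooled = frame[:h, :w].reshape(h // block, block, w // block, block, c)
    return pooled.mean(axis=(1, 3))

def color_consistency_reward(frames, reference, block=8):
    """Return 0 for perfectly stable coarse colors, more negative as they drift."""
    ref = low_freq(reference, block)
    errors = [np.abs(low_freq(f, block) - ref).mean() for f in frames]
    return -float(np.mean(errors))

reference = np.full((16, 16, 3), 0.5)                    # flat mid-gray keyframe
stable = [reference.copy() for _ in range(3)]
drifted = [reference + 0.2 for _ in range(3)]            # uniform color shift
stable_reward = color_consistency_reward(stable, reference)
drifted_reward = color_consistency_reward(drifted, reference)
```

A real reward would need to tolerate character motion, which this per-location comparison does not. The sketch only shows why a cheap, pooled color signal can be scored without decoding through a heavy VAE.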


Why it matters

Traditional colorization assistants like LVCD or ToonCrafter top out at ~100 frames, or accumulate noise when segments are stitched together. LongAnimation’s five-times leap in sequence length pushes automated coloring into territory that covers most dialogue and establishing shots, not just blink-and-you-miss-it GIFs.

For mid-tier studios in Seoul or Manila that churn through thousands of outsourced cuts each month, the economics are compelling: one keyframe plus vectorized sketches could drive bulk coloring, leaving human artists to polish hero shots. And because SketchDiT still honors text instructions, directors can tweak backgrounds—“make it dawn instead of dusk”—without round-tripping to compositing.


Under the hood

  • Model size: Built on top of CogVideoX-1.5 (5 B params).

  • Training set: ~80 k high-aesthetic clips from Sakuga-42M, filtered for >91 frames.

  • Hardware: 6 × NVIDIA A100 GPUs, LR = 1e-5, three-stage curriculum (SketchDiT 30 k steps → DGLM 10 k → CCR 10 k).

  • Code: The repo, demo videos, and Colab notebook are already live on GitHub.


The bigger picture

LongAnimation lands amid a broader rush to extend diffusion transformers beyond blink-length video. Google’s DitCtrl and Meta’s SlowFast-VGen deliver longer shots but rely on window fusion or fine-tuned LoRA weights. By contrast, LongAnimation’s plug-and-play memory module could slot into any DiT-style architecture, making it a tempting drop-in upgrade for text-to-video startups chasing the next One Piece.

Just don’t expect the tech to kill colorists’ jobs overnight. Rendering frames is only half the battle; style supervision, motion cleanup and final compositing still demand human taste. But if the ribbon stays yellow without manual touch-ups, the conversation around AI in animation may shift from “Will it replace us?” to “How much budget does it free for better storytelling?”

Paper link: arXiv 2507.01945 (PDF)

Baidu Open-Sources ERNIE 4.5: A Full LLM Family from 0.3 B to 424 B Parameters

 

A Flagship Release for the Open-Source Community

On July 1 2025, Baidu announced the open-source launch of ERNIE 4.5, a complete large-language-model family scaling from 0.3 billion to 424 billion parameters. The weights, training code, and evaluation suites are now freely available to researchers and enterprises under the Apache 2.0 license.

Six Sizes, One Architecture

Model            Dense / MoE      Context Window  Target Hardware*    Intended Use
ERNIE-Tiny 0.3B  Dense            16 K            Mobile/Edge         Lightweight chat & IoT
ERNIE-Base 7B    Dense            32 K            1× A10 24 GB        Mainstream apps
ERNIE-Large 34B  Dense            128 K           2× A100 80 GB       RAG & agents
ERNIE-XL 124B    MoE (8 experts)  256 K           4× H100 80 GB       Multimodal research
ERNIE-Mega 276B  MoE (16)         256 K           8× H100 80 GB       Enterprise AI
ERNIE-Ultra 424B MoE (24)         1 M             TPU v5p / 16× H100  Frontier-level reasoning

*at int8 + FlashAttention-2 settings

Technology Highlights

  • FlashMask Dynamic Attention – a masking scheme that activates only the most relevant key-value blocks per token, cutting memory by 40 % while retaining context depth.

  • Heterogeneous Multimodal MoE – vision-audio experts share early layers with text, enabling cross-modal reasoning without separate encoders.

  • Knowledge-Centric Corpus – Baidu’s in-house “Wenxin KG-2” injects 4 T tokens of curated facts and regulations, boosting compliance answers.

  • Self-Feedback Post-Training – iterative reflection steps reduce hallucination rate by 28 % vs. ERNIE 4.0.

Benchmark Performance

Benchmark (June 2025)  GPT-4.5*  ERNIE 4.5-Ultra 424B  ERNIE 4.5-Large 34B
MMLU (5-shot)          88.7 %    89.3 %                82.1 %
MathGLUE               55.4 %    57.2 %                48.0 %
VQA-v2 (zero-shot)     83.0 %    84.6 %                78.9 %
Code HumanEval+        93.5 %    94.1 %                87.3 %

*closed model; public leaderboard values. ERNIE 4.5 data from Baidu release notes.

Why It Matters

  1. End-to-End Transparency – full training configs (FlashMask, MoE routing, safety filters) are published, enabling reproducible research.

  2. Scalable Deployment – identical API across sizes lets startups choose Tiny/7B locally and swap to 424B in the cloud without prompt changes.

  3. Multilingual & Multimodal – supports 34 languages and native image, audio, and short-video tokens out of the box.

  4. Cost Innovation – FlashMask and MoE shrink inference FLOPs by up to 55 % versus dense GPT-4-class models, lowering GPU bills for enterprise users.

Access & Tooling

  • Hugging Face Hub – weights and safetensors for all six checkpoints.

  • Docker & vLLM Images – ready-to-serve stacks with Triton / TensorRT-LLM.

  • Agent Starter Kits – sample Model-Context-Protocol (MCP) tools for retrieval, calculators, and code execution.

  • Chinese & English Docs – prompt templates, fine-tuning scripts, and safety policy examples.

Roadmap

Baidu’s research blog notes upcoming “ERNIE 4.6” experiments with FlashMask-2 and sparse Mixture-of-Experts vision heads, plus a policy-aligned Turbo variant targeting 80 % cheaper inference for chat applications.


Takeaway
With ERNIE 4.5, Baidu throws open the doors to a fully transparent, parameter-scalable, multimodal LLM family—giving practitioners a home-grown alternative to closed giants and pushing the frontier of what open-source models can achieve.

Together AI’s DeepSWE Turns Qwen3-32B into an Open-Source Coding Agent that Tops SWEBench

 

A New State of the Art for Open-Source Coding Agents

Together AI has unveiled DeepSWE, a software-engineering agent that sets a new open-weight record on the notoriously difficult SWEBench-Verified benchmark with 59 % accuracy and 42.2 % Pass@1. Built on Alibaba’s Qwen3-32B language model and trained purely with reinforcement learning, DeepSWE offers a transparent alternative to closed-source dev assistants like GitHub Copilot and Claude Code. 


Inside the Training Pipeline

Stage                  Details
Warm-Start             Initializes from base Qwen3-32B weights (dense, 32 B params).
R2E-Gym Curriculum     4,500 real GitHub issues converted into step-by-step repair tasks spanning six languages (Python, Java, JS, Go, Rust, C++).
RLHF Loop              Uses a reward model that scores test-suite pass rates and diff conciseness; policy optimized with PPO across 64 × H100s for six days.
Self-Reflect & Distill High-reward trajectories distilled back into the policy to improve “first-try” success.
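
To illustrate how a reward might blend test-suite pass rate with diff conciseness, here is a hand-written scoring function. The table describes a learned reward model, so this is only a stand-in. The function name, the 200-line cap and the 0.2 weighting are all assumptions.

```python
def repair_reward(tests_passed, tests_total, diff_lines,
                  max_diff_lines=200, conciseness_weight=0.2):
    """Blend test pass rate with a bonus for smaller patches; result lies in [0, 1]."""
    if tests_total <= 0:
        raise ValueError("tests_total must be positive")
    pass_rate = tests_passed / tests_total
    # Conciseness is 1.0 for an empty diff and 0.0 at or beyond the line cap.
    conciseness = 1.0 - min(diff_lines, max_diff_lines) / max_diff_lines
    return (1.0 - conciseness_weight) * pass_rate + conciseness_weight * conciseness

small_patch = repair_reward(tests_passed=10, tests_total=10, diff_lines=20)
large_patch = repair_reward(tests_passed=10, tests_total=10, diff_lines=100)
failing_patch = repair_reward(tests_passed=0, tests_total=10, diff_lines=200)
```

Keeping the conciseness weight small means a sprawling patch that passes every test still outranks a tidy one that fails, which matches the priority a repair agent should have.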

The team openly publishes all training code, reward scripts, and checkpoints under Apache 2.0, enabling independent replication or domain-specific finetuning. 

Why DeepSWE Matters

  1. One-Shot Repairs over Multi-Tool Chains
    DeepSWE fixes repository-level bugs in a single forward pass, skipping heavyweight agent stacks that juggle search, planning, and external compilers.

  2. Reinforcement Learning at Scale
    Proves that RL alone—without supervised trace data—can yield production-grade coding skills when paired with a high-capacity base model.

  3. Transparent & Portable
    Enterprises can self-host the model, audit its reward functions, and retrain on private codebases without licensing friction.


Benchmark Highlights

Benchmark          DeepSWE (32 B)  DeepSeek-R1-Synth (67 B)  GPT-4o (closed)
SWEBench-Verified  59 %            46 %                      64 %
HumanEval Plus     93.1 %          87.4 %                    95 %
CommitPackBench    71.3 %          63.0 %                    74 %

DeepSWE closes nearly half of the gap to GPT-4-class tools while running on a single 80 GB H100 GPU in int8 mode.

Real-World Capabilities

  • Bug Repair & Refactor – Generates minimal diffs that compile and pass project test suites.

  • Feature Stubs – Adds new endpoints, CLI flags, or unit tests on request.

  • Context Stretch – Accepts up to 64 K tokens, allowing multi-file reasoning across large repos.

Together AI provides an OpenAI-compatible API plus a VS Code extension that surfaces proposed patches as Git diffs for quick human review.


Roadmap

The team plans to:

  • Release a 13 B “consumer PC” variant trained on the same reward curriculum.

  • Add tool-augmented variants that can invoke package managers and linters dynamically.

  • Expand R2E-Gym to 10 K tasks, covering Android and .NET ecosystems.


Takeaway

DeepSWE demonstrates that meticulous RL on a strong open base (Qwen3-32B) can rival closed commercial coders—while remaining fully inspectable and modifiable. For organizations seeking sovereign AI development stacks, it’s a compelling invitation to “clone the repo, load the weights, and start fixing code.”

Baidu’s “AI Search Paradigm” Unveils a Four-Agent Framework for Next-Generation Information Retrieval

 

A Blueprint for Smarter Search

Traditional RAG pipelines handle simple fact look-ups well but struggle when queries require multi-step reasoning, tool use, or synthesis. In response, Baidu Research has introduced the AI Search Paradigm, a unified framework in which four specialized LLM-powered agents collaborate to emulate human research workflows. 

Agent     Role                                                 Key Skills
Master    Classifies query difficulty & launches a workflow    Meta-reasoning, task routing
Planner   Breaks the problem into ordered sub-tasks            Decomposition, tool selection
Executor  Calls external APIs or web search to gather evidence Retrieval, browsing, code-run
Writer    Consolidates evidence into fluent, cited answers     Synthesis, style control

The architecture adapts on the fly: trivial queries may bypass planning, while open-ended questions trigger full agent collaboration.
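
The routing logic described above can be sketched with four stub functions. Nothing here reflects Baidu's implementation. The difficulty heuristic, the string-splitting planner and the fake executor are placeholders that only show how a trivial query can skip the planning step.

```python
from dataclasses import dataclass, field

@dataclass
class Answer:
    text: str
    trace: list = field(default_factory=list)   # which agents ran, in order

def master(query: str) -> str:
    """Crude difficulty check: multi-clause or long questions count as complex."""
    return "complex" if " and " in query or len(query.split()) > 12 else "simple"

def planner(query: str) -> list:
    """Split a complex query into ordered sub-tasks."""
    return [part.strip() for part in query.split(" and ") if part.strip()]

def executor(task: str) -> str:
    """Stand-in for a search or tool call."""
    return f"evidence for '{task}'"

def writer(query: str, evidence: list) -> str:
    """Combine gathered evidence into one answer string."""
    return f"{query} -> " + "; ".join(evidence)

def answer(query: str) -> Answer:
    trace = ["master"]
    if master(query) == "simple":
        tasks = [query]                          # trivial queries bypass planning
    else:
        trace.append("planner")
        tasks = planner(query)
    trace.append("executor")
    evidence = [executor(task) for task in tasks]
    trace.append("writer")
    return Answer(writer(query, evidence), trace)

simple = answer("capital of France")
complex_ = answer("find a laptop under $800 and check it runs Blender")
```

Because each agent is just a function with a narrow contract, any one of them can be swapped for a fine-tuned model or a real tool call without touching the others.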

Technical Innovations

  • Dynamic Workflow Graphs – Agents spawn or skip steps in real time based on intermediate results, avoiding rigid “one-size-fits-all” chains.

  • Robust Tool Layer – Executor can invoke search APIs, calculators, code sandboxes, and custom enterprise databases, all via a common interface.

  • Alignment & Safety – Reinforcement learning with human feedback (RLHF) plus retrieval-grounding reduce hallucinations and improve citation accuracy.


Benchmark Results

On a suite of open-web reasoning tasks the system, dubbed Baidu ASP in the paper, surpasses state-of-the-art open-source baselines and even challenges proprietary models that rely on massive context windows alone.

Benchmark                         Prior Best (RAG)  Baidu ASP
Complex QA (avg. F1)              46.2              57.8
Multi-hop HotpotQA (Exact Match)  41.5              53.0
ORION Deep-Search                 37.1              49.6
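
For readers unfamiliar with the two metrics named in the table, the sketch below computes exact match and token-level F1 in the common SQuAD style. The article does not say which scoring scripts produced these numbers, so the normalization rules here (lowercasing, stripping punctuation and articles) are an assumption.

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, then split into tokens."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in {"a", "an", "the"}]

def exact_match(pred, gold):
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    """Harmonic mean of token precision and recall between prediction and gold."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower", "eiffel tower")
f1 = token_f1("Eiffel Tower in Paris", "Eiffel Tower")
```

Exact match rewards only fully correct strings, while F1 gives partial credit for overlapping tokens, which is why the two columns can move by different amounts for the same system.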

Practical Implications

  • Enterprise Knowledge Portals – Route user tickets through Planner→Executor→Writer to surface compliant, fully referenced answers.

  • Academic Research Assistants – Decompose literature reviews into sub-queries, fetch PDFs, and synthesize summaries.

  • E-commerce Assistants – From “Find a laptop under $800 that runs Blender” to a shoppable list with citations in a single interaction.

Because each agent is modular, organisations can fine-tune or swap individual components—e.g., plugging in a domain-specific retrieval tool—without retraining the entire stack.


Looking Ahead

The team plans to open-source a reference implementation and release an evaluation harness so other researchers can benchmark new agent variants under identical conditions. Future work focuses on:

  • Reducing latency by parallelising Executor calls

  • Expanding the Writer’s multimodal output (tables, charts, code diffs)

  • Hardening the Master agent’s self-diagnosis to detect and recover from tool failures


Takeaway
Baidu’s AI Search Paradigm reframes search as a cooperative, multi-agent process, merging planning, tool use, and natural-language synthesis into one adaptable pipeline. For enterprises and researchers seeking deeper, trustable answers—not just blue links—this approach signals how tomorrow’s search engines and internal knowledge bots will be built.

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...