Showing posts with label Multi-Agent Systems. Show all posts

4.8.25

The Agentic Web: when bots become the primary users of the internet

 Search boxes and feeds defined the first two web eras. A new position paper proposes the third: the Agentic Web, where autonomous software agents—often LLM-powered—act on our behalf, coordinate with other agents, and execute long-horizon tasks across services. The authors offer a working definition and argue the shift is already visible in consumer assistants that can plan purchases and book reservations end-to-end. 

A framework in three dimensions

The paper lays out a conceptual stack for this world: intelligence (reasoning, memory, planning), interaction (tools, APIs, multi-agent protocols), and economics (incentives, pricing, marketplaces). These dimensions, taken together, underpin capabilities like retrieval, recommendation, planning and collaboration that move beyond single-turn chat.

From retrieval to planning to coordination

Architecturally, the authors chart algorithmic transitions: user-issued queries give way to agentic retrieval; recommender systems evolve into agent planners; and isolated tools become multi-agent collectives able to decompose and delegate work. A worked example walks through agents co-planning a travel itinerary, highlighting orchestration and memory. 

New pipes: MCP and agent-to-agent messaging

HTTP and RPC weren’t built for autonomous, negotiated workflows. The paper surveys emerging Model Context Protocol (MCP) interfaces and purpose-built agent-to-agent (A2A) messaging layers to support capability discovery, tool brokering and structured negotiations between services—foundational plumbing for an internet of bots. 
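To make the "plumbing" concrete, here is a minimal sketch of the capability-discovery and tool-brokering flow such protocols enable. The registry format, field names, and functions below are illustrative assumptions for this post, not the actual MCP schema.

```python
import json

# Hypothetical MCP-style tool registry a service might publish so that
# visiting agents can discover its capabilities before calling them.
TOOL_REGISTRY = {
    "book_table": {
        "description": "Reserve a restaurant table",
        "input_schema": {
            "type": "object",
            "properties": {
                "party_size": {"type": "integer"},
                "time": {"type": "string"},
            },
            "required": ["party_size", "time"],
        },
    }
}

def discover_tools() -> str:
    """Discovery step: return the registry as a JSON capability listing."""
    return json.dumps({"tools": TOOL_REGISTRY})

def call_tool(name: str, args: dict) -> dict:
    """Brokering step: validate required arguments, then dispatch."""
    schema = TOOL_REGISTRY[name]["input_schema"]
    missing = [k for k in schema["required"] if k not in args]
    if missing:
        return {"error": f"missing arguments: {missing}"}
    return {"result": f"{name} executed", "args": args}
```

An agent would first fetch the listing, match a tool to its current plan step, and only then negotiate the call — the structured schema is what makes that negotiation machine-checkable.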

The Agent Attention Economy

If algorithms once competed for human attention, services on the Agentic Web will compete to be selected by agents mid-plan. That reframes ranking, pricing and attribution around machine decision-makers—an attention market where tools, APIs and even other agents bid for inclusion in workflows. 

What breaks (and who pays)

The authors predict “agent browsers” will disrupt today’s user-centric browsing model, shifting interfaces from manual clicks to delegated execution. They also flag a looming billing problem for complex, multi-step agent services that span providers and time windows—who gets paid, and how, when dozens of tools contribute to one outcome? 

Risks, red teaming and defense

A full section maps threats across layers (prompt-/tool-injection, data exfiltration, compromised marketplaces), and compares human-in-the-loop versus automated red teaming for agent systems. The authors argue for hybrid approaches, inference-time guardrails, and controllable planning to keep autonomous workflows within safe bounds.
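As a toy illustration of an inference-time guardrail of the kind the authors advocate, the sketch below screens tool output for injection-style instructions before it re-enters an agent's context. The patterns are placeholder examples, not a vetted filter.

```python
import re

# Toy deny-list of injection-style phrases; a production guardrail would use
# far richer signals (classifiers, provenance checks, sandboxed parsing).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
]

def screen_tool_output(text: str) -> tuple[bool, str]:
    """Return (is_safe, text_or_redaction) for a piece of retrieved content."""
    for pat in INJECTION_PATTERNS:
        if pat.search(text):
            return False, "[redacted: possible prompt injection]"
    return True, text
```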

Why it matters

If the Agentic Web arrives, the primary “users” of the internet won’t be humans but agents negotiating with each other—demanding new protocols, marketplaces, governance and safety tooling. For startups, the opportunity is to build the pipes, policies and platforms that let those agents cooperate—and compete—reliably.

Paper link: arXiv 2507.21206 (PDF)

31.7.25

“Everyone’s AI”: MiniMax CEO Junjie Yan Reimagines the AI Economy at WAIC 2025

 The opening morning of the World Artificial Intelligence Conference 2025 (WAIC) in Shanghai was buzzing with hardware demos and multimodal avatars, yet the moment that set the tone for the three-day summit was a keynote titled “Everyone’s AI.” Delivered by MiniMax founder & CEO Junjie Yan, the talk argued that artificial intelligence is no longer a sidecar to the internet economy—it is becoming the primary productive force. 

From research toy to societal engine

Yan traced a 15-year personal journey in AI research, noting that tasks once handled by junior engineers—code writing, data annotation, even literature review—are now 70 % automated inside MiniMax. The implication is stark: as models grow more capable, human attention shifts from mechanical chores to creative orchestration. “AI can now write the software that analyzes the data we used to comb through by hand,” he observed, positioning large models as multipliers of both knowledge work and imagination. 

The economics: another 10× drop on the horizon

MiniMax isn’t just waxing philosophical; it is betting on cost curves. Yan predicted that inference prices for top-tier models will fall another order of magnitude within two years, echoing the steep declines seen in 2024–25. Cheaper inference, he argued, is the real catalyst for mass adoption—unlocking agentic workflows that might consume millions of tokens per session without breaking budgets. 

Many models, many values

Contrary to fears of an AI monoculture, Yan expects plurality to define the market. Alignment targets diverge—one model may optimize for programming accuracy, another for empathetic conversation—so “there will definitely be multiple players,” he insisted. Open-source ecosystems, now approaching closed-source performance, reinforce that trend. 

Multi-agent systems change the rules

Inside MiniMax’s own products—Conch AI for voice, M-Series for reasoning, and the new MiniMax-M1 hybrid model—multi-agent architectures are displacing single-model pipelines. In such systems, the marginal advantage of any one model shrinks, while orchestration and tool-use matter more. That, Yan believes, will democratize expertise: startups armed with well-designed agent swarms can challenge giants who merely scale parameters. 

A less money-burning industry

Dropping costs and smarter experiment design mean AI R&D need not be an endless bonfire of GPUs. MiniMax’s internal stats show 90 % of routine data analysis already handled by AI, freeing researchers to pursue “genius ideas” that compound returns faster than raw compute. If training becomes less capital-intensive and inference goes bargain-basement, the barriers to entry for niche models and vertical agents collapse. 

“Everyone’s AI” as call to action

Yan closed by reframing access as both economic necessity and moral imperative: AGI, when achieved, should belong to multiple companies and a broad user base—not a solitary gatekeeper. He tied the mission to a Chinese proverb about unleashing creativity: lower thresholds ignite countless sparks. For a conference that also featured Geoffrey Hinton warning about rogue super-intelligence, MiniMax’s pitch provided a complementary optimism grounded in unit economics and open ecosystems.

Why it matters

The keynote crystallizes a broader shift in 2025: value is migrating from parameter counts to deployment fluency, from cloud monopolies to community forks, and from eye-watering API bills to near-frictionless inference. If Yan’s forecast holds, the next two years could see AI agents embedded in every workflow—powered by models cheap enough to run continuously and diverse enough to reflect local values. In that future, “Everyone’s AI” is not a slogan; it is table stakes.

21.7.25

The rise of Context Engineering: why LLM performance now lives and dies on what you feed it

 Prompt tricks and vector databases used to feel like nice-to-have extras for chatbots. A sprawling new study argues they have matured into a discipline of their own. Titled “A Survey of Context Engineering for Large Language Models,” the 165-page report from the Chinese Academy of Sciences, UC Merced and seven other universities positions context selection, shaping and storage as the primary lever for squeezing more capability out of ever-larger models. The team sifted through 1,400-plus research papers to build the first comprehensive roadmap of the space.

From prompt hacks to a three-pillar stack

The authors split Context Engineering into three foundational components:

  1. Context retrieval & generation – everything from classic prompt templates to dynamic external-knowledge acquisition.

  2. Context processing – long-sequence handling, self-refinement loops and multimodal or structured context fusion.

  3. Context management – memory hierarchies, compression schemes and token-budget optimisation.

These pillars support four dominant system archetypes: Retrieval-Augmented Generation (RAG), long-lived memory agents, tool-integrated reasoning (function calling, code execution) and fully fledged multi-agent frameworks.
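The three pillars can be sketched as stages of one pipeline. The scoring, truncation, and budgeting heuristics below are deliberately naive stand-ins chosen for this post, not methods from the survey:

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Pillar 1: retrieval — naive keyword-overlap scoring over a corpus."""
    q = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def process(docs: list[str], max_words: int = 12) -> list[str]:
    """Pillar 2: processing — crude 'compression' by truncating each doc."""
    return [" ".join(d.split()[:max_words]) for d in docs]

def manage(chunks: list[str], token_budget: int) -> str:
    """Pillar 3: management — pack chunks until the token budget is spent."""
    out, used = [], 0
    for c in chunks:
        cost = len(c.split())
        if used + cost > token_budget:
            break
        out.append(c)
        used += cost
    return "\n".join(out)

corpus = ["cats are small mammals", "transformers use attention layers",
          "attention lets models weigh context tokens"]
context = manage(process(retrieve("how does attention work", corpus)),
                 token_budget=20)
```

Real systems swap in dense retrieval, learned summarizers, and KV-cache-aware budgeting, but the division of labor stays the same.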

Why the stakes keep rising

  • Bigger models, harsher limits. Even GPT-class contexts choke on enterprise-scale corpora; smarter pruning and compression decide whether answers stay on-topic or derail.

  • Agents need persistence. As LLM agents stretch across hours or days, hierarchical memory and context-refresh policies become as critical as the policy network itself.

  • Tool use explodes token demand. Function calls and code snippets are powerful but verbose; context engineering keeps them from crowding out the original question.

A looming research gap

Despite dramatic gains in understanding long and complex contexts, models remain weak at generating equally long, logically coherent outputs—a mismatch the survey brands the field’s “defining priority for future research.”

Practical takeaways for builders

  • Treat context like a first-class system resource—budget, cache and monitor it the way you would GPU memory.

  • Mix retrieval styles. Hybrid pipelines (keyword, dense, graph) outperform single-method RAG on complex queries.

  • Plan for multi-layer memory. Short-term windows, episodic buffers and long-term stores each have distinct TTLs and compression trade-offs.
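The multi-layer memory advice can be made concrete with per-layer TTLs. Layer names and TTL values below are illustrative choices, not prescriptions from the survey:

```python
import time

class LayeredMemory:
    # TTL in seconds per layer; None means "keep forever" (long-term store).
    LAYERS = {"short_term": 60, "episodic": 3600, "long_term": None}

    def __init__(self):
        self.store = {layer: [] for layer in self.LAYERS}

    def write(self, layer: str, item: str) -> None:
        self.store[layer].append((time.time(), item))

    def read(self, layer: str) -> list[str]:
        """Expire entries older than the layer's TTL, then return the rest."""
        ttl = self.LAYERS[layer]
        now = time.time()
        self.store[layer] = [(t, x) for t, x in self.store[layer]
                             if ttl is None or now - t < ttl]
        return [x for _, x in self.store[layer]]
```

In practice each layer would also carry its own compression policy (verbatim for short-term, summarized for episodic, embedded for long-term).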

Published July 17, 2025 with an accompanying GitHub “awesome list,” the survey is already circulating among infra and agent teams looking to squeeze more mileage out of existing checkpoints before the next trillion-parameter beast lands.

Paper link: arXiv 2507.13334 (PDF)

8.7.25

Context Engineering in AI: Designing the Right Inputs for Smarter, Safer Large-Language Models

 

What Is Context Engineering?

In classic software, developers write deterministic code; in today’s AI systems, we compose contexts. Context engineering is the systematic craft of designing, organizing and manipulating every token fed into a large-language model (LLM) at inference time—instructions, examples, retrieved documents, API results, user profiles, safety policies, even intermediate chain-of-thought. Well-engineered context turns a general model into a domain expert; poor context produces hallucinations, leakage or policy violations. 


Core Techniques

| Technique | Goal | Typical Tools / Patterns |
| --- | --- | --- |
| Prompt Design & Templates | Give the model clear role, task, format and constraints | System + user role prompts; XML / JSON schemas; function-calling specs |
| Retrieval-Augmented Generation (RAG) | Supply fresh, external knowledge just-in-time | Vector search, hybrid BM25+embedding, GraphRAG |
| Context Compression | Fit more signal into limited tokens | Summarisation, saliency ranking, LLM-powered “short-former” rewriters |
| Chunking & Windowing | Preserve locality in extra-long inputs | Hierarchical windows, sliding attention, FlashMask / Ring Attention |
| Scratchpads & CoT Scaffolds | Expose model reasoning for better accuracy and debuggability | Self-consistency, tree-of-thought, DST (Directed Self-Testing) |
| Memory & Profiles | Personalise without retraining | Vector memories, episodic caches, preference embeddings |
| Tool / API Context | Let models call and interpret external systems | Model Context Protocol (MCP), JSON-schema function calls, structured tool output |
| Policy & Guardrails | Enforce safety and brand style | Content filters, regex validators, policy adapters, YAML instruction blocks |

Why It Matters

  1. Accuracy & Trust – Fact-filled, well-structured context slashes hallucination rates and citation errors.

  2. Privacy & Governance – Explicit control over what leaves the organisation or reaches the model helps meet GDPR, HIPAA and the EU AI Act.

  3. Cost Efficiency – Compressing or caching context can cut token bills by 50-80 %.

  4. Scalability – Multi-step agent systems live or die by fast, machine-readable context routing; good design tames complexity.


High-Impact Use Cases

| Sector | How Context Engineering Delivers Value |
| --- | --- |
| Customer Support | RAG surfaces the exact policy paragraph and recent ticket history, enabling a single prompt to draft compliant replies. |
| Coding Agents | Function-calling + repository retrieval feed IDE paths, diffs and test logs, letting models patch bugs autonomously. |
| Healthcare Q&A | Context filters strip PHI before retrieval; clinically approved guidelines are injected to guide safe advice. |
| Legal Analysis | Long-context models read entire case bundles; chunk ranking highlights precedent sections for argument drafting. |
| Manufacturing IoT | Streaming sensor data is summarised every minute and appended to a rolling window for predictive-maintenance agents. |

Designing a Context Pipeline: Four Practical Steps

  1. Map the Task Surface
    • What knowledge is static vs. dynamic?
    • Which external tools or databases are authoritative?

  2. Define Context Layers
    Base prompt: role, format, policy
    Ephemeral layer: user query, tool results
    Memory layer: user or session history
    Safety layer: filters, refusal templates

  3. Choose Retrieval & Compression Strategies
    • Exact text (BM25) for short policies; dense vectors for semantic match
    • Summaries or selective quoting for large PDFs

  4. Instrument & Iterate
    • Log token mixes, latency, cost
    • A/B test different ordering, chunking, or reasoning scaffolds
    • Use self-reflection or eval suites (e.g., TruthfulQA-Context) to measure gains
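The layered design from step 2 can be sketched as a single assembly function. The layer ordering and bracket tags are illustrative assumptions, not a standard:

```python
def build_context(base: str, ephemeral: list[str], memory: list[str],
                  safety: str) -> str:
    """Stack the four layers: safety first, then base prompt, memory, input."""
    parts = [safety, base]
    parts += [f"[memory] {m}" for m in memory]
    parts += [f"[input] {e}" for e in ephemeral]
    return "\n".join(parts)

prompt = build_context(
    base="You are a support assistant. Answer in JSON.",
    ephemeral=["user asks: where is my order #123?"],
    memory=["user prefers short answers"],
    safety="Refuse requests for other customers' data.",
)
```

Keeping each layer a separate argument is what makes step 4's instrumentation possible: you can log, budget, and A/B test layers independently.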


Emerging Tools & Standards

  • MCP (Model Context Protocol) – open JSON schema for passing tool output and trace metadata to any LLM, adopted by Claude Code, Gemini CLI and IBM MCP Gateway.

  • Context-Aware Runtimes – vLLM, Flash-Infer and Infinity Lite stream 128K–1M tokens with optimized KV caches.

  • Context Observability Dashboards – Startups like ContextHub show token-level diff, attribution and cost per layer.


The Road Ahead

As context windows expand to a million tokens and multi-agent systems proliferate, context engineering will sit alongside model training and fine-tuning as a first-class AI discipline. Teams that master it will ship assistants that feel domain-expert-smart, honest and cost-efficient—while everyone else will chase unpredictable black boxes.

Whether you’re building a retrieval chatbot, a self-healing codebase or an autonomous research agent, remember: the model is only as good as the context you feed it.

7.7.25

ARAG puts a multi-agent brain inside your RAG stack — and Walmart’s numbers look eye-popping

 Retrieval-augmented generation (RAG) has become the go-to recipe for giving large language models real-world context, but most deployments still treat retrieval as a dumb, one-shot lookup. Researchers at Walmart Global Tech think that leaves serious money on the table — especially in e-commerce, where user intent shifts by the minute. Their new framework, ARAG (Agentic Retrieval-Augmented Generation), adds a four-agent reasoning layer on top of vanilla RAG and reports double-digit gains across every metric that matters.

Four specialists, one conversation

  1. User-Understanding Agent distills long-term history and the current session into a natural-language profile.

  2. NLI Agent performs sentence-level entailment to see whether each candidate item actually supports that intent.

  3. Context-Summary Agent compresses only the NLI-approved evidence into a focused prompt.

  4. Item-Ranker Agent fuses all signals and produces the final ranked list.

Each agent writes to — and reads from — a shared blackboard-style memory, so later agents can reason over earlier rationales rather than raw text alone.
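The blackboard pattern is easy to sketch: each "agent" is a function that reads earlier rationales from a shared dict and appends its own. The agent logic below is stubbed with toy heuristics; in ARAG each step would be an LLM call.

```python
def user_understanding(board: dict) -> None:
    # Distill session signals into a profile other agents can read.
    board["profile"] = f"intent inferred from: {board['session']}"

def nli_filter(board: dict) -> None:
    # Toy stand-in for entailment: keep candidates mentioning a session keyword.
    kw = board["session"].split()[-1]
    board["supported"] = [c for c in board["candidates"] if kw in c]

def context_summary(board: dict) -> None:
    # Compress only the NLI-approved evidence.
    board["summary"] = "; ".join(board["supported"])

def item_ranker(board: dict) -> None:
    # Fuse signals into a final ranking (here: shortest title first).
    board["ranking"] = sorted(board["supported"], key=len)

board = {"session": "browsing running shoes",
         "candidates": ["trail running shoes", "leather office shoes",
                        "running socks"]}
for agent in (user_understanding, nli_filter, context_summary, item_ranker):
    agent(board)
```

The key property is that later agents see structured outputs (profile, supported set, summary) rather than raw text, which is where the paper's ablation gains come from.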

How much better? Try 42 %

On three Amazon Review subsets (Clothing, Electronics, Home), ARAG beats both a recency heuristic and a strong cosine-similarity RAG baseline:

| Dataset | NDCG@5 ↑ | Hit@5 ↑ |
| --- | --- | --- |
| Clothing | +42.1 % | +35.5 % |
| Electronics | +37.9 % | +30.9 % |
| Home & Kitchen | +25.6 % | +22.7 % |

An ablation test shows that yanking either the NLI or context-summary modules knocks as much as 14 points off NDCG, underlining how critical cross-agent reasoning is to the win.

Why it matters

  • Personalization that actually reasons. By turning retrieval and ranking into cooperative LLM agents, ARAG captures the nuance of why an item fits, not just whether embeddings are close.

  • No model surgery required. The team wraps any existing RAG stack; there’s no need to fine-tune the base LLM, making the upgrade cloud-budget friendly.

  • Explainability for free. Each agent logs its own JSON-structured evidence, giving product managers a breadcrumb trail for every recommendation.

The bigger picture

Agentic pipelines have taken off in code generation and web browsing; ARAG shows the same trick pays dividends in recommender systems, a multi-billion-dollar battleground where percent-level lifts translate into real revenue. Expect retailers and streaming platforms to test-drive multi-agent RAG as they chase post-cookie personalization.

Paper link: arXiv 2506.21931 (PDF)

9.6.25

Google’s MASS Revolutionizes Multi-Agent AI by Automating Prompt and Topology Optimization

 Designing multi-agent AI systems—where several AI "agents" collaborate—has traditionally depended on manual tuning of prompt instructions and agent communication structures (topologies). Google AI, in partnership with Cambridge researchers, is aiming to change that with their new Multi-Agent System Search (MASS) framework. MASS brings automation to the design process, ensuring consistent performance gains across complex domains.


🧠 What MASS Actually Does

MASS performs a three-stage automated optimization that iteratively refines:

  1. Block-Level Prompt Tuning
    Fine-tunes individual agent prompts via local search—sharpening their roles (think “questioner”, “solver”).

  2. Topology Optimization
    Identifies the best agent interaction structure. It prunes and evaluates possible communication workflows to find the most impactful design.

  3. Workflow-Level Prompt Refinement
    Final tuning of prompts once the best network topology is set.

By alternating prompt and topology adjustments, MASS achieves optimization that surpasses previous methods, which tackled only one dimension at a time.
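The alternating three-stage search can be sketched in a few lines. The candidate lists and scoring function below are toy placeholders; in MASS, scoring means actually running the agent system on benchmark tasks.

```python
PROMPTS = ["be brief", "think step by step", "verify your answer"]
TOPOLOGIES = ["chain", "tree", "ring"]

def score(prompt: str, topology: str) -> float:
    # Hypothetical evaluation stand-in for benchmark performance.
    return len(prompt) * (2.0 if topology == "tree" else 1.0)

def mass_search() -> tuple[str, str]:
    topology = TOPOLOGIES[0]
    # Stage 1: block-level prompt tuning with a default topology.
    prompt = max(PROMPTS, key=lambda p: score(p, topology))
    # Stage 2: topology optimization with the best prompt so far fixed.
    topology = max(TOPOLOGIES, key=lambda t: score(prompt, t))
    # Stage 3: workflow-level prompt refinement on the chosen topology.
    prompt = max(PROMPTS, key=lambda p: score(p, topology))
    return prompt, topology
```

Holding one dimension fixed while searching the other keeps each stage's search space tractable, which is the core trick behind the "alternating" design.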


🏅 Why It Matters

  • Benchmarked Success: MASS-designed agent systems outperform AFlow and ADAS on challenging benchmarks like MATH, LiveCodeBench, and multi-hop question-answering.

  • Reduced Manual Overhead: Designers no longer need to trial-and-error their way through thousands of prompt-topology combinations.

  • Extended to Real-World Tasks: Whether for reasoning, coding, or decision-making, this framework is broadly applicable across domains.


💬 Community Reactions

Reddit’s r/machinelearningnews highlighted MASS’s leap beyond isolated prompt or topology tuning:

“Multi-Agent System Search (MASS) … reduces manual effort while achieving state-of-the-art performance on tasks like reasoning, multi-hop QA, and code generation.”


📘 Technical Deep Dive

Originating from a February 2025 paper by Zhou et al., MASS represents a methodological advance in agentic AI:

  • Agents are modular: designed for distinct roles through prompts.

  • Topology defines agent communication patterns: linear chain, tree, ring, etc.

  • MASS explores both prompt and topology spaces, sequentially optimizing them across three stages.

  • Final systems demonstrate robustness not just in benchmarks but as a repeatable design methodology.


🚀 Wider Implications

  • Democratizing Agent Design: Non-experts in prompt engineering can deploy effective agent systems from pre-designed searches.

  • Adaptability: Potential for expanding MASS to dynamic, real-world settings like real-time planning and adaptive workflows.

  • Innovation Accelerator: Encourages research into auto-tuned multi-agent frameworks for fields like robotics, data pipelines, and interactive assistants.


🧭 Looking Ahead

As Google moves deeper into its “agentic era”—with initiatives like Project Mariner and Gemini's Agent Mode—MASS offers a scalable blueprint for future agentic-AI applications. Expect to see frameworks that not only generate prompts but also self-optimize their agent networks for performance and efficiency.

19.5.25

AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications, and Challenges

 A recent study by researchers Ranjan Sapkota, Konstantinos I. Roumeliotis, and Manoj Karkee delves into the nuanced differences between AI Agents and Agentic AI, providing a structured taxonomy, application mapping, and an analysis of the challenges inherent to each paradigm. 

Defining AI Agents and Agentic AI

  • AI Agents: These are modular systems primarily driven by Large Language Models (LLMs) and Large Image Models (LIMs), designed for narrow, task-specific automation. They often rely on prompt engineering and tool integration to perform specific functions.

  • Agentic AI: Representing a paradigmatic shift, Agentic AI systems are characterized by multi-agent collaboration, dynamic task decomposition, persistent memory, and orchestrated autonomy. They move beyond isolated tasks to coordinated systems capable of complex decision-making processes.

Architectural Evolution

The transition from AI Agents to Agentic AI involves significant architectural enhancements:

  • AI Agents: Utilize core reasoning components like LLMs, augmented with tools to enhance functionality.

  • Agentic AI: Incorporate advanced architectural components that allow for higher levels of autonomy and coordination among multiple agents, enabling more sophisticated and context-aware operations.

Applications

  • AI Agents: Commonly applied in areas such as customer support, scheduling, and data summarization, where tasks are well-defined and require specific responses.

  • Agentic AI: Find applications in more complex domains like research automation, robotic coordination, and medical decision support, where tasks are dynamic and require adaptive, collaborative problem-solving.

Challenges and Proposed Solutions

Both paradigms face unique challenges:

  • AI Agents: Issues like hallucination and brittleness, where the system may produce inaccurate or nonsensical outputs.

  • Agentic AI: Challenges include emergent behavior and coordination failures among agents.

To address these, the study suggests solutions such as ReAct loops, Retrieval-Augmented Generation (RAG), orchestration layers, and causal modeling to enhance system robustness and explainability.
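One of the cited mitigations, the ReAct loop, alternates Thought → Action → Observation until the model can answer, grounding each step in tool feedback. The sketch below stubs the model and tool to show the control flow only; every name in it is a toy example, not a real API.

```python
def fake_llm(transcript: str) -> str:
    # Stand-in for an LLM policy: act once, then answer from the observation.
    if "Observation" not in transcript:
        return "Action: lookup[capital of France]"
    return "Answer: Paris"

def lookup(query: str) -> str:
    # Toy tool: a one-entry knowledge base.
    return {"capital of France": "Paris"}.get(query, "unknown")

def react(question: str, max_steps: int = 3) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = fake_llm(transcript)
        if step.startswith("Answer:"):
            return step.removeprefix("Answer: ")
        # Parse the action, run the tool, feed the observation back in.
        query = step[len("Action: lookup["):-1]
        transcript += f"\n{step}\nObservation: {lookup(query)}"
    return "no answer"
```

Because each action's result is appended to the transcript before the next reasoning step, hallucinated claims can be checked against tool output rather than generated unchallenged — the robustness property the study is after.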


References

  1. Sapkota, R., Roumeliotis, K. I., & Karkee, M. (2025). AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges. arXiv preprint arXiv:2505.10468.
