Wandering Nomad: Retrieval-Augmented Generation

Showing posts with label Retrieval-Augmented Generation. Show all posts

5.8.25

ReaGAN turns every node into an agent—with a plan, memory, and tools

Classical GNNs push messages with one global rule per layer—great for tidy graphs, brittle for messy ones. ReaGAN (Retrieval-augmented Graph Agentic Network) breaks that mold by treating each node as an autonomous agent that decides whether to aggregate locally, retrieve globally, predict now, or do nothing—based on its own memory and a plan drafted by a frozen LLM.

What’s new

Node-level autonomy. At every layer, a node queries the LLM for an action plan, executes it, and updates memory—no globally synchronized rulebook.
Local + global context. Beyond neighbors in the graph, nodes invoke RAG to retrieve semantically similar but structurally distant nodes, then fuse both sources.
Memory as glue. Nodes persist aggregated text snippets and few-shot (text, label) exemplars, enabling in-context prediction later.

Why it matters

Real-world graphs are sparse and noisy; uniform propagation amplifies junk. ReaGAN’s per-node planning and local-global retrieval adapt to informativeness imbalances and long-range semantics—key gaps in standard GNNs. In experiments, the authors report competitive few-shot performance using only a frozen LLM (no fine-tuning), highlighting a compute-friendly path for graph ML.

How it runs (at a glance)

Each node iterates a loop: perceive → plan → act (LocalAggregation / GlobalAggregation / Predict / NoOp) → update memory. A simple algorithmic skeleton formalizes the layer-wise cycle and action space.

Paper link: https://arxiv.org/pdf/2508.00429

21.7.25

Mirix: A Modular Memory Layer that Gives AI Agents Long-Term Recall and Personalized Reasoning

1 | Why “Memory” Is the Next AI Bottleneck

Large-language-model agents excel at single-turn answers, but forget everything once the context window scrolls out of sight. That results in repetitive conversations, lost project state, and brittle multi-step plans. Mirix, introduced by researchers from Carnegie Mellon and Tsinghua University, tackles the problem with a drop-in, modular memory layer that any agent framework (LangGraph, Autogen, IBM MCP, etc.) can call.

2 | How Mirix Works under the Hood

Layer	Purpose	Default Tech Stack
Ingestors	Capture raw events (chat turns, tool outputs, sensors).	Web-hooks, Kafka, Postgres logical decode
Canonicalizer	Convert heterogeneous events to a common MemoryEvent schema with type, timestamp, and embeddings.	Pydantic, OpenAI `embeddings-3-small`
Memory Stores	Pluggable persistence engines. Ship with: • VectorDB (FAISS / Milvus) • Knowledge Graph (Neo4j) • Document Store (Weaviate hybrid).	Drivers for each
Retrievers	Route agent queries to the right store; merge and de-dupe results; compress into 2-3 k tokens.	Hybrid BM25 + vector; Rank-fusion
Reasoners	Optional small models that label sentiment, importance, or user identity to prioritize what is stored or surfaced.	DistilRoBERTa sentiment, MiniLM ranker

Key insight: memory need not live in a single DB; Mirix treats it as an orchestrated ensemble of stores, each optimised for a particular signal (facts vs. tasks vs. social cues).

3 | What It Enables

Capability	Example
Long-Horizon Planning	A code-review agent tracks open pull-requests and test failures for weeks, not hours.
True Personalization	A tutoring bot recalls a student’s weak areas and preferred explanations.
Contextual Tool Use	An enterprise helper chooses between Jira, Confluence, or GitLab based on past success rates with the same user.

Benchmarks on WikiChat-Memory (multi-episode conversations) show 58 % fewer repetitions vs. vanilla RAG and 3.4 × higher success on 15-step task chains.

4 | Plugging Mirix into an Existing Agent


from mirix.memory import MemoryClient
from agentic import Agent

mem = MemoryClient(
    stores=[
        "faiss://embeddings",
        "neo4j://graph",
        "weaviate://docs"
    ]
)

agent = Agent(llm="mistral-small-3.2", memory=mem)

response = agent.chat("Where did we leave the migration script last week?")
print(response)

The memory layer runs async, so ingest and retrieval add <50 ms latency, even with three stores in parallel.

5 | Governance & Cost Controls

Policy Filters: PII redaction rules determine what is persisted.
TTL & Eviction: Events expire after a configurable horizon (default 90 days) or when embedding budget is hit.
Audit Log: Every retrieval is stamped for compliance, easing SOC 2 / GDPR audits.

6 | Limitations & Roadmap

Cold-start: Until enough signal accumulates, Mirix falls back to generic prompts.
Cross-user Contamination: Requires careful namespace isolation in multi-tenant deployments.
Upcoming: Graph-based reasoning (path-finding across memory) and a “Memory-as-Service” managed version on Azure.

Final Takeaway

Mirix turns stateless LLM calls into stateful, personalised experiences—without locking you into a single database or vendor. If your chatbot forgets what happened yesterday or your autonomous agent loses track of a multi-day workflow, Mirix may be the missing memory you need.

The rise of Context Engineering: why LLM performance now lives and dies on what you feed it

Prompt tricks and vector databases used to feel like nice-to-have extras for chatbots. A sprawling new study argues they have matured into a discipline of their own. Titled “A Survey of Context Engineering for Large Language Models,” the 165-page report from the Chinese Academy of Sciences, UC Merced and seven other universities positions context selection, shaping and storage as the primary lever for squeezing more capability out of ever-larger models. The team sifted through 1,400-plus research papers to build the first comprehensive roadmap of the space.

From prompt hacks to a three-pillar stack

The authors split Context Engineering into three foundational components:

Context retrieval & generation – everything from classic prompt templates to dynamic external-knowledge acquisition.
Context processing – long-sequence handling, self-refinement loops and multimodal or structured context fusion.
Context management – memory hierarchies, compression schemes and token-budget optimisation.

These pillars support four dominant system archetypes: Retrieval-Augmented Generation (RAG), long-lived memory agents, tool-integrated reasoning (function calling, code execution) and fully fledged multi-agent frameworks.

Why the stakes keep rising

Bigger models, harsher limits. Even GPT-class contexts choke on enterprise-scale corpora; smarter pruning and compression decide whether answers stay on-topic or derail.
Agents need persistence. As LLM agents stretch across hours or days, hierarchical memory and context-refresh policies become as critical as the policy network itself.
Tool use explodes token demand. Function calls and code snippets are powerful but verbose; context engineering keeps them from crowding out the original question.

A looming research gap

Despite dramatic gains in understanding long and complex contexts, models remain weak at generating equally long, logically coherent outputs—a mismatch the survey brands the field’s “defining priority for future research.”

Practical takeaways for builders

Treat context like a first-class system resource—budget, cache and monitor it the way you would GPU memory.
Mix retrieval styles. Hybrid pipelines (keyword, dense, graph) outperform single-method RAG on complex queries.
Plan for multi-layer memory. Short-term windows, episodic buffers and long-term stores each have distinct TTLs and compression trade-offs.

Published July 17 2025 with an accompanying GitHub “awesome list,” the survey is already circulating among infra and agent teams looking to squeeze more mileage out of existing checkpoints before the next trillion-parameter beast lands.

Paper link: arXiv 2507.13334 (PDF)

8.7.25

Context Engineering in AI: Designing the Right Inputs for Smarter, Safer Large-Language Models

What Is Context Engineering?

In classic software, developers write deterministic code; in today’s AI systems, we compose contexts. Context engineering is the systematic craft of designing, organizing and manipulating every token fed into a large-language model (LLM) at inference time—instructions, examples, retrieved documents, API results, user profiles, safety policies, even intermediate chain-of-thought. Well-engineered context turns a general model into a domain expert; poor context produces hallucinations, leakage or policy violations.

Core Techniques

Technique	Goal	Typical Tools / Patterns
Prompt Design & Templates	Give the model clear role, task, format and constraints	System + user role prompts; XML / JSON schemas; function-calling specs
Retrieval-Augmented Generation (RAG)	Supply fresh, external knowledge just-in-time	Vector search, hybrid BM25+embedding, GraphRAG
Context Compression	Fit more signal into limited tokens	Summarisation, saliency ranking, LLM-powered “short-former” rewriters
Chunking & Windowing	Preserve locality in extra-long inputs	Hierarchical windows, sliding attention, FlashMask / Ring Attention
Scratchpads & CoT Scaffolds	Expose model reasoning for better accuracy and debuggability	Self-consistency, tree-of-thought, DST (Directed Self-Testing)
Memory & Profiles	Personalise without retraining	Vector memories, episodic caches, preference embeddings
Tool / API Context	Let models call and interpret external systems	Model Context Protocol (MCP), JSON-schema function calls, structured tool output
Policy & Guardrails	Enforce safety and brand style	Content filters, regex validators, policy adapters, YAML instruction blocks

Why It Matters

Accuracy & Trust – Fact-filled, well-structured context slashes hallucination rates and citation errors.
Privacy & Governance – Explicit control over what leaves the organisation or reaches the model helps meet GDPR, HIPAA and the EU AI Act.
Cost Efficiency – Compressing or caching context can cut token bills by 50-80 %.
Scalability – Multi-step agent systems live or die by fast, machine-readable context routing; good design tames complexity.

High-Impact Use Cases

Sector	How Context Engineering Delivers Value
Customer Support	RAG surfaces the exact policy paragraph and recent ticket history, enabling a single prompt to draft compliant replies.
Coding Agents	Function-calling + repository retrieval feed IDE paths, diffs and test logs, letting models patch bugs autonomously.
Healthcare Q&A	Context filters strip PHI before retrieval; clinically-approved guidelines injected to guide safe advice.
Legal Analysis	Long-context models read entire case bundles; chunk ranking highlights precedent sections for argument drafting.
Manufacturing IoT	Streaming sensor data is summarised every minute and appended to a rolling window for predictive-maintenance agents.

Designing a Context Pipeline: Four Practical Steps

Map the Task Surface
• What knowledge is static vs. dynamic?
• Which external tools or databases are authoritative?
Define Context Layers
• Base prompt: role, format, policy
• Ephemeral layer: user query, tool results
• Memory layer: user or session history
• Safety layer: filters, refusal templates
Choose Retrieval & Compression Strategies
• Exact text (BM25) for short policies; dense vectors for semantic match
• Summaries or selective quoting for large PDFs
Instrument & Iterate
• Log token mixes, latency, cost
• A/B test different ordering, chunking, or reasoning scaffolds
• Use self-reflection or eval suites (e.g., TruthfulQA-Context) to measure gains

Emerging Tools & Standards

MCP (Model Context Protocol) – open JSON schema for passing tool output and trace metadata to any LLM, adopted by Claude Code, Gemini CLI and IBM MCP Gateway.
Context-Aware Runtimes – vLLM, Flash-Infer and Infinity Lite stream 128 K-1 M tokens with optimized KV caches.
Context Observability Dashboards – Startups like ContextHub show token-level diff, attribution and cost per layer.

The Road Ahead

As context windows expand to a million tokens and multi-agent systems proliferate, context engineering will sit alongside model training and fine-tuning as a first-class AI discipline. Teams that master it will ship assistants that feel domain-expert-smart, honest and cost-efficient—while everyone else will chase unpredictable black boxes.

Whether you’re building a retrieval chatbot, a self-healing codebase or an autonomous research agent, remember: the model is only as good as the context you feed it.

7.7.25

ARAG puts a multi-agent brain inside your RAG stack — and Walmart’s numbers look eye-popping

Retrieval-augmented generation (RAG) has become the go-to recipe for giving large language models real-world context, but most deployments still treat retrieval as a dumb, one-shot lookup. Researchers at Walmart Global Tech think that leaves serious money on the table — especially in e-commerce, where user intent shifts by the minute. Their new framework, ARAG (Agentic Retrieval-Augmented Generation), adds a four-agent reasoning layer on top of vanilla RAG and reports double-digit gains across every metric that matters.

Four specialists, one conversation

User-Understanding Agent distills long-term history and the current session into a natural-language profile.
NLI Agent performs sentence-level entailment to see whether each candidate item actually supports that intent.
Context-Summary Agent compresses only the NLI-approved evidence into a focused prompt.
Item-Ranker Agent fuses all signals and produces the final ranked list.

Each agent writes to — and reads from — a shared blackboard-style memory, so later agents can reason over earlier rationales rather than raw text alone.

How much better? Try 42 %

On three Amazon Review subsets (Clothing, Electronics, Home), ARAG beats both a recency heuristic and a strong cosine-similarity RAG baseline:

Dataset	NDCG@5 ↑	Hit@5 ↑
Clothing	+42.1 %	+35.5 %
Electronics	+37.9 %	+30.9 %
Home & Kitchen	+25.6 %	+22.7 %

An ablation test shows that yanking either the NLI or context-summary modules knocks as much as 14 points off NDCG, underlining how critical cross-agent reasoning is to the win.

Why it matters

Personalization that actually reasons. By turning retrieval and ranking into cooperative LLM agents, ARAG captures the nuance of why an item fits, not just whether embeddings are close.
No model surgery required. The team wraps any existing RAG stack; there’s no need to fine-tune the base LLM, making the upgrade cloud-budget friendly.
Explainability for free. Each agent logs its own JSON-structured evidence, giving product managers a breadcrumb trail for every recommendation.

The bigger picture

Agentic pipelines have taken off in code generation and web browsing; ARAG shows the same trick pays dividends in recommender systems, a multi-billion-dollar battleground where percent-level lifts translate into real revenue. Expect retailers and streaming platforms to test-drive multi-agent RAG as they chase post-cookie personalization.

Paper link: arXiv 2506.21931 (PDF)

3.7.25

Baidu’s “AI Search Paradigm” Unveils a Four-Agent Framework for Next-Generation Information Retrieval

A Blueprint for Smarter Search

Traditional RAG pipelines handle simple fact look-ups well but struggle when queries require multi-step reasoning, tool use, or synthesis. In response, Baidu Research has introduced the AI Search Paradigm, a unified framework in which four specialized LLM-powered agents collaborate to emulate human research workflows.

Agent	Role	Key Skills
Master	Classifies query difficulty & launches a workflow	Meta-reasoning, task routing
Planner	Breaks the problem into ordered sub-tasks	Decomposition, tool selection
Executor	Calls external APIs or web search to gather evidence	Retrieval, browsing, code-run
Writer	Consolidates evidence into fluent, cited answers	Synthesis, style control

The architecture adapts on the fly: trivial queries may bypass planning, while open-ended questions trigger full agent collaboration.

Technical Innovations

Dynamic Workflow Graphs – Agents spawn or skip steps in real time based on intermediate results, avoiding rigid “one-size-fits-all” chains.
Robust Tool Layer – Executor can invoke search APIs, calculators, code sandboxes, and custom enterprise databases, all via a common interface.
Alignment & Safety – Reinforcement learning with human feedback (RLHF) plus retrieval-grounding reduce hallucinations and improve citation accuracy.

Benchmark Results

On a suite of open-web reasoning tasks the system, dubbed Baidu ASP in the paper, surpasses state-of-the-art open-source baselines and even challenges proprietary models that rely on massive context windows alone.

Benchmark	Prior Best (RAG)	Baidu ASP
Complex QA (avg. F1)	46.2	57.8
Multi-hop HotpotQA (Exact Match)	41.5	53.0
ORION Deep-Search	37.1	49.6

Practical Implications

Enterprise Knowledge Portals – Route user tickets through Planner→Executor→Writer to surface compliant, fully referenced answers.
Academic Research Assistants – Decompose literature reviews into sub-queries, fetch PDFs, and synthesize summaries.
E-commerce Assistants – From “Find a laptop under $800 that runs Blender” to a shoppable list with citations in a single interaction.

Because each agent is modular, organisations can fine-tune or swap individual components—e.g., plugging in a domain-specific retrieval tool—without retraining the entire stack.

Looking Ahead

The team plans to open-source a reference implementation and release an evaluation harness so other researchers can benchmark new agent variants under identical conditions. Future work focuses on:

Reducing latency by parallelising Executor calls
Expanding the Writer’s multimodal output (tables, charts, code diffs)
Hardening the Master agent’s self-diagnosis to detect and recover from tool failures

Takeaway
Baidu’s AI Search Paradigm reframes search as a cooperative, multi-agent process, merging planning, tool use, and natural-language synthesis into one adaptable pipeline. For enterprises and researchers seeking deeper, trustable answers—not just blue links—this approach signals how tomorrow’s search engines and internal knowledge bots will be built.

30.5.25

Mistral Enters the AI Agent Arena with New Agents API

The AI landscape is rapidly evolving, and the latest "status symbol" for billion-dollar AI companies isn't a fancy office or high-end swag, but a robust agents framework or, as Mistral AI has just unveiled, an Agents API. This new offering from the well-funded and innovative French AI startup signals a significant step towards empowering developers to build more capable, useful, and active problem-solving AI applications.

Mistral has been on a roll, recently releasing models like "Devstral," their latest coding-focused LLM. Their new Agents API aims to provide a dedicated, server-side solution for building and orchestrating AI agents, contrasting with local frameworks by being a cloud-pinged service. This approach is reminiscent of OpenAI's "requests API" but tailored for agentic workflows.

Key Features of the Mistral Agents API

Mistral's Agents API isn't trying to be a one-size-fits-all framework. Instead, it focuses on providing powerful tools and capabilities specifically for leveraging Mistral's models in agentic systems. Here are some of the standout features:

Persistent Memory Across Conversations: A significant advantage, this allows agents to maintain context and history over extended interactions, a common pain point in many existing agent frameworks where managing memory can be tedious.

Built-in Connectors (Tools): The API comes equipped with a suite of pre-built tools to enhance agent functionality:

Code Execution: Leveraging models like Devstral, agents can securely run Python code in a server-side sandbox, enabling data visualization, scientific computing, and more.

Web Search: Provides agents with access to up-to-date information from online sources, news outlets, and reputable databases.

Image Generation: Integrates with Black Forest Lab's FLUX models (including FLUX1.1 [pro] Ultra) to allow agents to create custom visuals for diverse applications, from educational aids to artistic images.

Document Library (Beta): Enables agents to access and leverage content from user-uploaded documents stored in Mistral Cloud, effectively providing built-in Retrieval-Augmented Generation (RAG) functionality.

MCP (Model Context Protocol) Tools: Supports function calling, allowing agents to interact with external services and data sources.

Agentic Orchestration Capabilities: The API facilitates complex workflows:

Handoffs: Allows different agents to collaborate as part of a larger workflow, with one agent calling another.

Sequential and Parallel Processing: Supports both step-by-step task execution and parallel subtask processing, similar to concepts seen in LangGraph or LlamaIndex, but managed through the API.

Structured Outputs: The API supports structured outputs, allowing developers to define data schemas (e.g., using Pydantic) for more reliable and predictable agent responses.

Illustrative Use Cases and Examples

Mistral has provided a "cookbook" with various examples demonstrating the Agents API's capabilities. These include:

GitHub Agent: A developer assistant powered by Devstral that can manage tasks like creating repositories, handling pull requests, and improving unit tests, using MCP tools for GitHub interaction.

Financial Analyst Agent: An agent designed to handle user queries about financial data, fetch stock prices, generate reports, and perform analysis using MCP servers and structured outputs.

Multi-Agent Earnings Call Analysis System (MAECAS): A more complex example showcasing an orchestration of multiple specialized agents (Financial, Strategic, Sentiment, Risk, Competitor, Temporal) to process PDF earnings call transcripts (using Mistral OCR), extract insights, and generate comprehensive reports or answer specific queries.

These examples highlight how the API can be used for tasks ranging from simple, chained LLM calls to sophisticated multi-agent systems involving pre-processing, parallel task execution, and synthesized outputs.

Differentiation and Implications

The Mistral Agents API positions itself as a cloud-based service rather than a local library like LangChain or LlamaIndex. This server-side approach, particularly with built-in connectors and orchestration, aims to simplify the development of enterprise-grade agentic platforms.

Key differentiators include:

API-centric approach: Focuses on providing endpoints for agentic capabilities.

Tight integration with Mistral models: Optimized for Mistral's own LLMs, including specialized ones like Devstral for coding and their OCR model.

Built-in, server-side tools: Reduces the need for developers to implement and manage these integrations themselves.

Persistent state management: Addresses a critical aspect of building robust conversational agents.

This offering is particularly interesting for organizations looking at on-premise deployments of AI models. Mistral, like other smaller, agile AI companies, has shown more openness to licensing proprietary models for such use cases. The Agents API provides a clear pathway for these on-prem users to build sophisticated agentic systems.

The Path Forward

Mistral's Agents API is a significant step in making AI more capable, useful, and an active problem-solver. It reflects a broader trend in the AI industry: moving beyond foundational models to building ecosystems and platforms that enable more complex and practical applications.

While still in its early stages, the API, with its focus on robust features like persistent memory, built-in tools, and orchestration, provides a compelling new option for developers looking to build the next generation of AI agents. As the tools and underlying models continue to improve, the potential for what can be achieved with such an API will only grow. Developers are encouraged to explore Mistral's documentation and cookbook to get started.

29.5.25

Introducing s3: A Modular RAG Framework for Efficient Search Agent Training

Researchers at the University of Illinois Urbana-Champaign have developed s3, an open-source framework designed to streamline the training of search agents within Retrieval-Augmented Generation (RAG) systems. By decoupling the retrieval and generation components, s3 allows for efficient training using minimal data, addressing challenges faced by enterprises in deploying AI applications.

Evolution of RAG Systems

The effectiveness of RAG systems largely depends on the quality of their retrieval mechanisms. The researchers categorize the evolution of RAG approaches into three phases:

Classic RAG: Utilizes static retrieval methods with fixed queries, often resulting in a disconnect between retrieval quality and generation performance.
Pre-RL-Zero: Introduces multi-turn interactions between query generation, retrieval, and reasoning, but lacks trainable components to optimize retrieval based on outcomes.
RL-Zero: Employs reinforcement learning to train models as search agents, improving through feedback like answer correctness. However, these approaches often require fine-tuning the entire language model, which can be costly and limit compatibility with proprietary models.

The s3 Framework

s3 addresses these limitations by focusing solely on optimizing the retrieval component. It introduces a novel reward signal called Gain Beyond RAG (GBR), which measures the improvement in generation accuracy when using s3's retrieved documents compared to naive retrieval methods. This approach allows the generator model to remain untouched, facilitating integration with various off-the-shelf or proprietary large language models.

In evaluations across multiple question-answering benchmarks, s3 demonstrated strong performance using only 2.4k training examples, outperforming other methods that require significantly more data. Notably, s3 also showed the ability to generalize to domains it wasn't explicitly trained on, such as medical question-answering tasks.

Implications for Enterprises

For enterprises, s3 offers a practical solution to building efficient and adaptable search agents without the need for extensive data or computational resources. Its modular design ensures compatibility with existing language models and simplifies the deployment of AI-powered search applications.

Paper: "s3: You Don't Need That Much Data to Train a Search Agent via RL" – arXiv, May 20, 2025.

https://arxiv.org/abs/2505.14146

14.5.25

Vectara's Guardian Agents Aim to Reduce AI Hallucinations Below 1% in Enterprise Applications

In the rapidly evolving landscape of enterprise artificial intelligence, the challenge of AI hallucinations—instances where AI models generate false or misleading information—remains a significant barrier to adoption. While techniques like Retrieval-Augmented Generation (RAG) have been employed to mitigate this issue, hallucinations persist, especially in complex, agentic workflows.

Vectara, a company known for its pioneering work in grounded retrieval, has introduced a novel solution: Guardian Agents. These software components are designed to monitor AI outputs in real-time, automatically identifying, explaining, and correcting hallucinations without disrupting the overall content flow. This approach not only preserves the integrity of the AI-generated content but also provides transparency by detailing the changes made and the reasons behind them.

According to Vectara, implementing Guardian Agents can reduce hallucination rates in smaller language models (under 7 billion parameters) to less than 1%. Eva Nahari, Vectara's Chief Product Officer, emphasized the importance of this development, stating that as enterprises increasingly adopt agentic workflows, the potential negative impact of AI errors becomes more pronounced. Guardian Agents aim to address this by enhancing the trustworthiness and reliability of AI systems in critical business applications.

This advancement represents a significant step forward in enterprise AI, offering a proactive solution to one of the industry's most pressing challenges.