Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep the LLM frozen and adapt the agent with a memory that learns from every episode. The team formalizes this as a Memory-augmented MDP and shows it can lift real-world “deep research” performance—without gradient updates to the underlying model.
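In schematic form (a sketch, not necessarily the paper's exact notation): the case memory becomes part of the state, the frozen LLM policy conditions on it, and the only thing that changes between episodes is what gets written back:

\[
(s_t, M_t) \;\xrightarrow{\;a_t \sim \pi_{\mathrm{LLM}}(\cdot \mid s_t, M_t)\;}\; (s_{t+1}, M_{t+1}),
\qquad
M_{t+1} = \mathrm{Write}\big(M_t, (s_t, a_t, r_t)\big)
\]

All of the learning lives in Write and in how cases are retrieved from M; the LLM parameters never move.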
The recipe in one diagram
Memento is a planner–executor architecture wired to a growing Case Bank of episodic traces (state, action, reward). At each step, the planner retrieves similar past cases to guide the next action; after acting, the trajectory (success or failure) is written back—so the memory rewrites itself with environmental feedback. Retrieval can be non-parametric (Top-K by similarity) or parametric via a lightweight Q(s, c) scorer trained online to prefer high-utility cases. Tools are accessed through an MCP-style interface so the executor can browse, run code, or call APIs inside the same loop.
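To make the loop concrete, here is a minimal sketch of the non-parametric path. The names are illustrative, not Memento's actual code: embed() is a toy placeholder for a sentence encoder, and llm / env stand in for the frozen model and the tool environment.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a sentence encoder -- swap in a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

class CaseBank:
    """Episodic memory of (state, action, reward) traces."""
    def __init__(self):
        self.cases = []   # dicts: {"state", "action", "reward"}
        self.keys = []    # embedding of each case's state

    def write(self, state: str, action: str, reward: float) -> None:
        # Successes and failures are both kept; the reward tags the outcome.
        self.cases.append({"state": state, "action": action, "reward": reward})
        self.keys.append(embed(state))

    def retrieve(self, state: str, k: int = 4) -> list:
        # Non-parametric retrieval: cosine Top-K over stored states.
        if not self.cases:
            return []
        sims = np.stack(self.keys) @ embed(state)
        top = np.argsort(-sims)[:k]
        return [self.cases[i] for i in top]

def plan_step(bank: CaseBank, state: str, llm, env):
    """One planner/executor turn: retrieve, act with the frozen LLM, write back."""
    cases = bank.retrieve(state)
    hints = "\n".join(
        f"- [{'success' if c['reward'] > 0 else 'failure'}] {c['action']}"
        for c in cases
    )
    action = llm(f"{state}\n\nSimilar past cases:\n{hints}")  # no gradient updates
    next_state, reward = env.step(action)   # executor runs tools (e.g. via MCP)
    bank.write(state, action, reward)       # the memory, not the weights, learns
    return next_state, reward
```

Storing failures alongside successes is what lets the planner prompt with "what not to do" as well as "what worked."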
Why this beats “prompt more” and “train more”
Unlike static RAG or handcrafted reflections, case-based reasoning (CBR) selectively reuses successful and failed traces; unlike RL-fine-tuning, it avoids catastrophic forgetting and heavy compute. In ablations, adding CBR memory yields +4.7 to +9.6 absolute points on out-of-distribution QA sets (MuSiQue, Bamboogle, PopQA).
The receipts
- GAIA (long-horizon tool use): Top-1 on validation (87.88% Pass@3) and 79.40% on the private test leaderboard.
- DeepResearcher (live web research): 66.6 F1 / 80.4 PM, outperforming training-based systems under the paper’s setup.
- SimpleQA (single-hop factual): 95.0 PM, the highest among reported baselines.
- Humanity’s Last Exam (HLE): 24.4 PM, second overall and within 0.92 of GPT-5 in the authors’ evaluation.
What this means for builders
- Ship updates without re-training. Treat memory as the learning substrate; leave your production LLM untouched.
- Choose your memory: start with non-parametric retrieval; add the parametric Q-head when you need sharper case selection (see the sketch after this list).
- Tooling that scales. MCP-based execution keeps multi-tool orchestration inside one protocol, making traces coherent and reusable.
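When plain similarity retrieval starts surfacing irrelevant cases, the parametric variant replaces Top-K cosine with a learned utility score Q(s, c). Below is a minimal sketch of that idea, reusing the CaseBank from the earlier snippet; the logistic-regression form and features are assumptions for illustration, not the paper's exact head or training objective.

```python
import numpy as np

class QScorer:
    """Lightweight Q(s, c): learned utility of reusing case c in state s."""
    def __init__(self, dim: int = 384, lr: float = 0.05):
        self.w = np.zeros(2 * dim + 1)   # weights over [state_emb, case_emb, bias]
        self.lr = lr

    @staticmethod
    def _features(s_emb: np.ndarray, c_emb: np.ndarray) -> np.ndarray:
        return np.concatenate([s_emb, c_emb, [1.0]])

    def score(self, s_emb: np.ndarray, c_emb: np.ndarray) -> float:
        # Sigmoid of a linear score: estimated probability the case helps here.
        z = self._features(s_emb, c_emb) @ self.w
        return float(1.0 / (1.0 + np.exp(-z)))

    def update(self, s_emb: np.ndarray, c_emb: np.ndarray, outcome: float) -> None:
        # Online logistic step; outcome is 1.0 if the episode that reused
        # this case succeeded, 0.0 otherwise.
        x = self._features(s_emb, c_emb)
        pred = 1.0 / (1.0 + np.exp(-(x @ self.w)))
        self.w += self.lr * (outcome - pred) * x

def retrieve_parametric(bank, scorer: QScorer, s_emb: np.ndarray, k: int = 4) -> list:
    # Rank the whole case bank by learned utility instead of raw similarity.
    ranked = sorted(
        zip(bank.cases, bank.keys),
        key=lambda pair: scorer.score(s_emb, pair[1]),
        reverse=True,
    )
    return [case for case, _ in ranked[:k]]
```

The update is a single online step per episode, so the scorer stays cheap enough to train continuously while the base model stays frozen.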
The upshot: Memento reframes “agent improvement” as memory engineering. If your research agent gets better the more it works—without touching base weights—you’ve got a path to continual learning that’s practical outside the lab.
Paper link: arXiv 2508.16153 (PDF)