
2.9.25

Memento: teach agents to learn on the fly—no LLM fine-tune required

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep the LLM frozen and adapt the agent with a memory that learns from every episode. The team formalizes this as a Memory-augmented MDP and shows it can lift real-world “deep research” performance—without gradient updates to the underlying model. 

The recipe in one diagram

Memento is a planner–executor architecture wired to a growing Case Bank of episodic traces (state, action, reward). At each step, the planner retrieves similar past cases to guide the next action; after acting, the trajectory (success or failure) is written back—so the memory rewrites itself with environmental feedback. Retrieval can be non-parametric (Top-K by similarity) or parametric via a lightweight Q(s, c) scorer trained online to prefer high-utility cases. Tools are accessed through an MCP-style interface so the executor can browse, run code, or call APIs inside the same loop. 
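To make the loop concrete, here is a minimal Python sketch of a Case Bank with non-parametric Top-K retrieval and write-back. The names (Case, CaseBank, embed) are mine for illustration rather than the paper's API, and the embedding function is a placeholder the caller supplies.

```python
# Minimal sketch of the episodic Case Bank loop, under the assumptions above.
from dataclasses import dataclass

import numpy as np


@dataclass
class Case:
    state: str       # task / observation text
    action: str      # what the planner decided
    reward: float    # environmental feedback (e.g. success = 1.0, failure = 0.0)
    vec: np.ndarray  # embedding of the state, used for retrieval


class CaseBank:
    def __init__(self, embed):
        self.embed = embed           # any text -> vector function
        self.cases: list[Case] = []

    def retrieve(self, state: str, k: int = 4) -> list[Case]:
        """Non-parametric retrieval: Top-K past cases by cosine similarity."""
        if not self.cases:
            return []
        q = self.embed(state)
        sims = [float(q @ c.vec / (np.linalg.norm(q) * np.linalg.norm(c.vec)))
                for c in self.cases]
        order = np.argsort(sims)[::-1][:k]
        return [self.cases[i] for i in order]

    def write(self, state: str, action: str, reward: float) -> None:
        """Write-back: every trajectory, successful or failed, becomes a future case."""
        self.cases.append(Case(state, action, reward, self.embed(state)))
```

In the loop, the planner calls retrieve(state) to condition its next decision on similar past episodes, then write(...) records the outcome so later tasks benefit.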

Why this beats “prompt more” and “train more”

Unlike static RAG or handcrafted reflections, case-based reasoning (CBR) selectively reuses successful and failed traces; unlike RL-fine-tuning, it avoids catastrophic forgetting and heavy compute. In ablations, adding CBR memory yields +4.7 to +9.6 absolute points on out-of-distribution QA sets (MuSiQue, Bamboogle, PopQA). 

The receipts

  • GAIA (long-horizon tool use): Top-1 on validation (87.88% Pass@3) and 79.40% on the private test leaderboard. 

  • DeepResearcher (live web research): 66.6 F1 / 80.4 PM, outperforming training-based systems under the paper’s setup. 

  • SimpleQA (single-hop factual): 95.0 PM, the highest among reported baselines. 

  • Humanity’s Last Exam (HLE): 24.4 PM, second overall and within 0.92 points of GPT-5 in the authors’ evaluation.

What this means for builders

  • Ship updates without re-training. Treat memory as the learning substrate; leave your production LLM untouched. 

  • Choose your memory: start with non-parametric retrieval; add the parametric Q-head when you need sharper case selection (a sketch follows this list).

  • Tooling that scales. MCP-based execution keeps multi-tool orchestration inside one protocol, making traces coherent and reusable. 
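For the second option, here is a minimal sketch of what a lightweight Q(s, c) scorer could look like, trained online against observed rewards. The logistic-regression form and feature construction are my assumptions for illustration, not the paper's implementation.

```python
# Sketch of a parametric case scorer Q(s, c), updated online from rewards.
import numpy as np


class QScorer:
    def __init__(self, dim: int, lr: float = 0.01):
        self.w = np.zeros(2 * dim)  # weights over concatenated [state_vec, case_vec]
        self.lr = lr

    def score(self, state_vec: np.ndarray, case_vec: np.ndarray) -> float:
        x = np.concatenate([state_vec, case_vec])
        return 1.0 / (1.0 + np.exp(-self.w @ x))  # predicted utility in [0, 1]

    def update(self, state_vec: np.ndarray, case_vec: np.ndarray, reward: float) -> None:
        """Online logistic-regression step: push the score toward the observed reward."""
        x = np.concatenate([state_vec, case_vec])
        pred = 1.0 / (1.0 + np.exp(-self.w @ x))
        self.w += self.lr * (reward - pred) * x


# Usage: rank candidate cases for the current state and keep the top few, e.g.
# ranked = sorted(cases, key=lambda c: scorer.score(q_vec, c.vec), reverse=True)[:k]
```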

The upshot: Memento reframes “agent improvement” as memory engineering. If your research agent gets better the more it works—without touching base weights—you’ve got a path to continual learning that’s practical outside the lab.

Paper link: arXiv 2508.16153 (PDF)

22.8.25

ComputerRL scales online RL for “desktop agents,” unifying APIs and GUIs

 The next wave of computer-use agents won’t just click around UIs—they’ll mix API calls and GUI interaction in one policy. That’s the bet behind ComputerRL, a new framework that treats desktop work as an end-to-end reinforcement learning problem and introduces an API-GUI paradigm so agents can call services and operate human-oriented interfaces within the same loop. 
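As a toy sketch of what a unified API-GUI action space can look like: one policy output schema, routed either to a structured service call or to a UI operation. The Action schema and the desktop executor methods below are assumptions for illustration, not ComputerRL's actual interface.

```python
# Illustrative dispatch over a single action schema; the "desktop" object and its
# call_api/gui methods are hypothetical stand-ins for a real executor.
from dataclasses import dataclass
from typing import Any


@dataclass
class Action:
    kind: str             # "api" or "gui"
    name: str             # e.g. "calendar.create_event" or "click"
    args: dict[str, Any]  # call parameters or screen coordinates


def execute(action: Action, desktop) -> str:
    """Route the same policy output to either a service call or a UI operation."""
    if action.kind == "api":
        return desktop.call_api(action.name, **action.args)  # structured service call
    if action.kind == "gui":
        return desktop.gui(action.name, **action.args)       # e.g. click/type on screen
    raise ValueError(f"unknown action kind: {action.kind}")
```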

The missing infrastructure for scale

Training desktop agents with online RL has been hamstrung by slow, brittle environments. ComputerRL ships a distributed RL stack that orchestrates thousands of parallel virtual desktops, making long-horizon, on-policy training runs practical for general computer use. 
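A toy sketch of the throughput idea: fan episode collection out across many environments in parallel and hand the batch to the learner. The DesktopEnv mock and random policy below are placeholders; the real stack orchestrates thousands of virtual desktops.

```python
# Parallel rollout collection, heavily simplified; driving real VMs is I/O-bound,
# so threads suffice for this sketch.
import random
from concurrent.futures import ThreadPoolExecutor


class DesktopEnv:
    """Stand-in for a virtual desktop; real environments return screenshots and API results."""
    def __init__(self, env_id: int):
        self.env_id, self.t = env_id, 0

    def reset(self):
        self.t = 0
        return {"screen": f"env{self.env_id}-step0"}

    def step(self, action):
        self.t += 1
        done = self.t >= 5  # toy five-step episodes
        return {"screen": f"env{self.env_id}-step{self.t}"}, float(done), done


def policy(obs):
    return random.choice(["click", "type", "call_api"])  # placeholder policy


def run_episode(env_id: int):
    env, trajectory = DesktopEnv(env_id), []
    obs, done = env.reset(), False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
    return trajectory


def collect(n_envs: int = 8):
    # Fan rollouts out in parallel; the learner consumes the batch on-policy.
    with ThreadPoolExecutor(max_workers=n_envs) as pool:
        return list(pool.map(run_episode, range(n_envs)))
```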

Stabilizing long runs: Entropulse

Pure RL on complex desktops tends to collapse exploration entropy over time. The authors propose Entropulse, a simple but effective schedule that alternates RL with supervised fine-tuning, restoring healthy entropy while retaining the gains from policy improvement. 
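A minimal sketch of that alternating schedule, with the RL and SFT phases left as caller-supplied callables. The entropy threshold and success criterion here are my assumptions, not the paper's hyperparameters.

```python
# Schematic Entropulse-style schedule: alternate on-policy RL with SFT on
# successful rollouts whenever policy entropy collapses.
def entropulse(rl_phase, sft_phase, policy_entropy,
               phases: int = 10, entropy_floor: float = 0.5):
    success_buffer = []
    for _ in range(phases):
        rollouts = rl_phase()  # run one RL phase; returns dicts with a "reward" key
        success_buffer.extend(r for r in rollouts if r["reward"] > 0)
        if policy_entropy() < entropy_floor:  # exploration has collapsed
            sft_phase(success_buffer)         # supervised pass restores entropy
    return success_buffer
```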

Results & models

Using open backbones (GLM-4-9B-0414 and Qwen2.5-14B), the team evaluates on OSWorld and reports 48.1% accuracy with AutoGLM-OS-9B, a new state of the art for general desktop automation in their setup. The framework underpins the group’s AutoGLM system.

Why it matters

  • Bridging the modality gap: Real workflows mix API calls with UI operations; ComputerRL trains a single policy to do both. 

  • Throughput for RL: Parallelized desktops unlock the scale online RL has needed for computer agents. 

  • Simple stability trick: Entropulse offers a practical recipe any lab can try to keep long runs from collapsing. 

If your roadmap includes agents that file expenses, reconcile sheets, or run web apps end-to-end, ComputerRL reads like a blueprint for turning brittle demos into trainable, scalable systems.

Paper link: arXiv 2508.14040 (PDF)
