
1.9.25

Self-evolving AI agents: from static LLMs to systems that learn on the job

 Agent frameworks are great at demo day, brittle in the wild. A sweeping new survey argues the fix isn’t a bigger model but a new self-evolving paradigm: agents that keep improving after deployment using the data and feedback their work naturally produces. The paper pulls scattered ideas under one roof and offers a playbook for researchers and startups building agents that won’t ossify after v1.0. 

The big idea: turn agents into closed-loop learners

The authors formalize a feedback loop with four moving parts—System Inputs, the Agent System, the Environment, and Optimisers—and show how different research threads plug into each stage. Think: collecting richer traces from real use (inputs), upgrading skills or tools (agent system), instrumenting the app surface (environment), and choosing the learning rule (optimisers). 
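To make the loop concrete, here is a minimal sketch of those four components in Python. The names (`Trace`, `AgentSystem`, `optimiser`) and the toy reward are my own illustration of the survey's framing, not code from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trace:
    """One unit of deployment experience: what came in, what the agent did, how it went."""
    task: str
    actions: list[str]
    reward: float

@dataclass
class AgentSystem:
    """The evolving part: a policy prompt plus a growing skill library."""
    policy_prompt: str
    skills: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def run(self, task: str) -> Trace:
        # Placeholder for real LLM + tool execution; here we only record the call.
        actions = [f"solve({task}) with {len(self.skills)} skills"]
        reward = 1.0 if "easy" in task else 0.0  # stand-in for environment feedback
        return Trace(task, actions, reward)

def optimiser(agent: AgentSystem, traces: list[Trace]) -> AgentSystem:
    """One possible learning rule: fold failures back into the policy prompt."""
    failures = [t.task for t in traces if t.reward == 0.0]
    if failures:
        agent.policy_prompt += f"\n# Reflect on failures: {failures}"
    return agent

# The closed loop: inputs -> agent -> environment feedback -> optimiser -> better agent.
agent = AgentSystem(policy_prompt="You are a careful assistant.")
for batch in [["easy task", "hard task"], ["hard task"]]:
    traces = [agent.run(task) for task in batch]   # System Inputs + Environment
    agent = optimiser(agent, traces)               # Optimiser updates the Agent System
print(agent.policy_prompt)
```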

A working taxonomy you can implement

Within that loop, the survey maps techniques you can mix and match (one single-agent recipe is sketched in code after the list):

  • Single-agent evolution: self-reflection, memory growth, tool discovery, skill libraries, meta-learning, and planner refinements driven by interaction data.

  • Multi-agent evolution: division-of-labour curricula, role negotiation, and team-level learning signals so collectives improve—not just individuals.

  • Domain-specific recipes: evolution strategies specialized for biomedicine, programming, and finance, where optimization targets and constraints are domain-specific.
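As promised above, here is one cell of that taxonomy made concrete: a skill library that grows only from successful traces (failures feed reflection instead). This is a hypothetical sketch, not the survey's implementation; the token-overlap retrieval is a toy stand-in for embedding search.

```python
import json

class SkillLibrary:
    """Hypothetical skill library: successful solutions are distilled into
    reusable exemplars, retrieved later for similar tasks."""

    def __init__(self):
        self.skills: list[dict] = []

    def add_from_trace(self, task: str, solution: str, reward: float) -> None:
        # Only successful episodes become skills; failures go to reflection instead.
        if reward >= 1.0:
            self.skills.append({"task": task, "solution": solution})

    def retrieve(self, task: str, k: int = 2) -> list[dict]:
        # Toy retrieval by token overlap; production systems would use embeddings.
        def overlap(skill: dict) -> int:
            return len(set(skill["task"].split()) & set(task.split()))
        return sorted(self.skills, key=overlap, reverse=True)[:k]

lib = SkillLibrary()
lib.add_from_trace("parse a CSV of trades", "use csv.DictReader ...", reward=1.0)
lib.add_from_trace("scrape a login page", "failed: captcha", reward=0.0)
print(json.dumps(lib.retrieve("parse a CSV of invoices"), indent=2))
```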

Evaluation and safety don’t lag behind

The paper argues for verifiable benchmarks (exact-match tasks, executable tests, grounded web tasks) so improvements aren’t just prompt luck. It also centers safety and ethics: guarding against reward hacking, data poisoning, distribution shift, and privacy leaks that can arise when models learn from their own usage. 
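"Executable tests" are the easiest of those verifiable signals to wire up. A minimal sketch, assuming nothing beyond the standard library: run the agent's candidate code plus assertions in a subprocess and treat the exit code as ground truth. A real harness would add sandboxing and resource limits.

```python
import os, subprocess, sys, tempfile, textwrap

def passes_executable_tests(candidate_code: str, test_code: str, timeout: float = 5.0) -> bool:
    """Verifiable signal: the agent's code either passes the tests or it doesn't."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

candidate = textwrap.dedent("""
    def add(a, b):
        return a + b
""")
print(passes_executable_tests(candidate, "assert add(2, 3) == 5"))  # True
```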

Why this matters now

  • Static fine-tunes stagnate. Post-training once, shipping, and hoping for the best leaves quality on the table as tasks drift.

  • Logs are learning fuel. Structured traces, success/failure signals, and user edits are free gradients if you design the loop (a sketch follows this list).

  • From demos to durable systems. The framework gives teams a shared language to plan what to learn, when, and how to verify it—before flipping the “autonomous improvement” switch. 
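Here is the logging sketch referenced above: structured traces with success flags, where user edits become corrected targets for failed episodes. The JSONL schema and `agent_traces.jsonl` path are my own assumptions for illustration.

```python
import json, time

LOG_PATH = "agent_traces.jsonl"  # hypothetical log location

def log_trace(task: str, output: str, success: bool, user_edit: str | None = None) -> None:
    """Append one structured trace; success flags and user edits are the 'free gradients'."""
    record = {"ts": time.time(), "task": task, "output": output,
              "success": success, "user_edit": user_edit}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

def build_training_set(path: str = LOG_PATH) -> list[dict]:
    """Keep successes as-is; treat user edits as corrected targets for failures."""
    examples = []
    with open(path) as f:
        for line in f:
            r = json.loads(line)
            if r["success"]:
                examples.append({"prompt": r["task"], "completion": r["output"]})
            elif r["user_edit"]:
                examples.append({"prompt": r["task"], "completion": r["user_edit"]})
    return examples

log_trace("summarize Q3 report", "Revenue up 12%...", success=True)
log_trace("draft refund email", "Dear valued...", success=False, user_edit="Hi Sam, ...")
print(len(build_training_set()))  # 2
```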

If you’re building an assistant, coder, or web agent you expect to live for months, this survey is a pragmatic roadmap to keep it getting better—safely—long after launch.

Paper link: arXiv 2508.07407 (PDF)

12.8.25

From Jagged Intelligence to World Models: Demis Hassabis’ Case for an “Omni Model” (and Why Evals Must Grow Up)

 DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” models (Deep Think), world models that capture physics, and a path toward an omni model that unifies language, vision, audio, and interactive behavior. As an AI practitioner, I buy the core thesis: pure next-token prediction has hit diminishing returns; reasoning, tool-use, and grounded physical understanding are the new scaling dimensions.

I especially agree with the framing of thinking as planning: AlphaGo/AlphaZero DNA brought into the LLM era. The key is not the longest chain of thought but the right amount of thought: draft plans in parallel, prune, decide, iterate. That's how strong engineers work, and it's how models should spend compute. My caveat: "thinking budgets" still pay a real latency/energy cost. Until tool calls and sandboxed execution are bulletproof, deep reasoning will remain spiky in production.
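To pin down what "the right amount of thought" could mean operationally, here is a minimal budgeted-deliberation sketch. The `propose` and `score` callables stand in for LLM sampling and a value model; none of this is Deep Think's actual mechanism.

```python
import random

def think(task, propose, score, budget: int = 8, keep: int = 3) -> str:
    """Budgeted deliberation: draft plans in parallel, prune to the most
    promising, then commit. A real system would also iterate/refine the shortlist."""
    drafts = [propose(task) for _ in range(budget)]              # parallel plans
    shortlist = sorted(drafts, key=score, reverse=True)[:keep]   # prune
    return max(shortlist, key=score)                             # decide

# Toy stand-ins so the sketch runs end to end.
propose = lambda task: f"plan {random.randint(0, 99)} for {task}"
score = lambda plan: int(plan.split()[1])  # pretend the number is a value estimate
print(think("route the delivery", propose, score))
```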

The world model agenda resonates. If you want robust robotics or assistants like Astra/Gemini Live, you need spatiotemporal understanding, not just good text priors. Genie 3 is a striking signal: it can generate coherent worlds where objects persist and physics behaves sensibly. I’m enthusiastic—and I still want tougher tests than “looks consistent.” Sim-to-real is notorious; we’ll need evaluations for controllable dynamics, invariances (occlusion, lighting, continuity), and goal-conditioned behavior before I call it solved.
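What would a tougher-than-"looks consistent" test look like? A hypothetical probe for one invariance, object persistence under occlusion: roll the world model forward, hide the object mid-rollout, and measure positional drift once it should reappear. The `step`/`position_of` API and the stub model are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Scene:
    positions: dict[str, tuple[float, float]]
    def position_of(self, oid: str) -> tuple[float, float]:
        return self.positions[oid]

class StubWorldModel:
    """Stand-in for a learned simulator; a real test would wrap Genie-style rollouts."""
    def step(self, scene: Scene, action: str) -> Scene:
        return scene  # perfectly persistent stub

def object_persistence_drift(model, scene, oid, horizon=20, occlude=range(5, 10)) -> float:
    """Lower is better; threshold it to get a pass/fail invariance signal."""
    before = scene.position_of(oid)
    for t in range(horizon):
        scene = model.step(scene, "occlude" if t in occlude else "noop")
    after = scene.position_of(oid)
    return sum((a - b) ** 2 for a, b in zip(before, after)) ** 0.5

scene = Scene({"ball": (1.0, 2.0)})
print(object_persistence_drift(StubWorldModel(), scene, "ball"))  # 0.0 for the stub
```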

Hassabis is refreshingly blunt about jagged intelligence. Yes, models ace IMO-style math yet bungle simple logic or even chess legality. Benchmarks saturate (AIME hitting ~99%); we need new stressors. I like Game Arena with Kaggle—self-advancing tournaments give clear, leak-resistant signals and scale with capability. Where I push back: games aren’t the world. Outside well-specified payoffs, reward specification gets messy. The next wave of evals should be multi-objective and long-horizon—measuring planning, memory, tool reliability, and safety traits (e.g., deception) under distribution shift, not just single-shot accuracy.
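Concretely, a multi-objective, long-horizon eval would report a vector, not a single accuracy number. A minimal sketch of that idea, with dimensions taken from the paragraph above (the episode fields and the hard safety gate are my own illustration, not an existing benchmark):

```python
from statistics import mean

# Hypothetical per-episode measurements from a long-horizon run.
episodes = [
    {"plan_steps_ok": 0.9, "memory_recall": 0.8, "tool_success": 0.95, "deception_flags": 0},
    {"plan_steps_ok": 0.7, "memory_recall": 0.9, "tool_success": 0.80, "deception_flags": 1},
]

def scorecard(episodes: list[dict]) -> dict:
    """Keep planning, memory, and tool reliability visible as separate scores,
    and treat safety as a hard gate rather than an averaged-away metric."""
    report = {k: mean(e[k] for e in episodes)
              for k in ("plan_steps_ok", "memory_recall", "tool_success")}
    report["safety_pass"] = all(e["deception_flags"] == 0 for e in episodes)
    return report

print(scorecard(episodes))
```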

Another point I applaud: tools as a scaling axis. Let models reason with search, solvers, and domain AIs (AlphaFold-class tools) during planning. The open question—what becomes a built-in capability versus an external tool—is empirical. Coding/math often lifts general reasoning; chess may or may not. My hesitation: as “models become systems,” provenance and governance get harder. Developers will need traceable tool chains, permissions, and reproducible runs—otherwise we ship beautifully wrong answers faster.
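One way to get that traceability is to wrap every tool call in a provenance layer: check permissions, then append who called what, with which version, and a hash of the result to an append-only log. A minimal sketch under those assumptions; the decorator, fields, and toy solver are illustrative, not a real framework's API.

```python
import hashlib, json, time

TOOL_TRACE: list[dict] = []  # append-only provenance log (in-memory for the sketch)

def traced(tool_name: str, version: str, permissions: set[str]):
    """Decorator: enforce a permission list and record a replayable trace entry."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if tool_name not in permissions:
                raise PermissionError(f"{tool_name} not allowed for this agent")
            out = fn(*args, **kwargs)
            TOOL_TRACE.append({
                "tool": tool_name, "version": version, "ts": time.time(),
                "args": repr((args, kwargs)),
                "output_sha256": hashlib.sha256(repr(out).encode()).hexdigest(),
            })
            return out
        return inner
    return wrap

ALLOWED = {"solver"}

@traced("solver", "1.2.0", ALLOWED)
def solve(expr: str) -> float:
    return eval(expr, {"__builtins__": {}})  # toy stand-in for a real solver

solve("2 + 3 * 4")
print(json.dumps(TOOL_TRACE, indent=2))
```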

Finally, the omni model vision—converging Genie, Veo, and Gemini—feels inevitable. I’m aligned on direction, wary on product surface area. When base models upgrade every few weeks, app teams must design for hot-swappable engines, stable APIs, and eval harnesses that survive version churn.
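Designing for hot-swappable engines mostly means pinning the app and its eval harness to a stable internal interface, so only thin adapters change when the base model does. A sketch of that pattern (the `Engine` protocol and stub engines are hypothetical; a real adapter would wrap a vendor SDK):

```python
from typing import Protocol

class Engine(Protocol):
    """Stable internal contract; only adapters know vendor-specific APIs."""
    name: str
    def complete(self, prompt: str) -> str: ...

class StubEngineV1:
    name = "engine-v1"
    def complete(self, prompt: str) -> str:
        return f"[v1] {prompt.upper()}"

class StubEngineV2:
    name = "engine-v2"
    def complete(self, prompt: str) -> str:
        return f"[v2] {prompt.upper()}"

# The regression harness targets the interface, so it survives version churn.
GOLDEN = [("hello", "HELLO")]

def regression_pass(engine: Engine) -> bool:
    return all(expected in engine.complete(prompt) for prompt, expected in GOLDEN)

for engine in (StubEngineV1(), StubEngineV2()):
    print(engine.name, "passes:", regression_pass(engine))
```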

Net-net: I’m excited by DeepMind’s trajectory—reasoning + tools + world modeling is the right stack. But to turn wow-demos into trustworthy systems, we must grow our evaluations just as aggressively as our models. Give me benchmarks that span days, not prompts; measure alignment under ambiguity; and prove sim-to-real. Do that, and an omni model won’t just impress us—it’ll hold up in the messy, physical, human world it aims to serve.


 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...