
12.8.25

From Jagged Intelligence to World Models: Demis Hassabis’ Case for an “Omni Model” (and Why Evals Must Grow Up)

 DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” models (Deep Think), world models that capture physics, and a path toward an omni model that unifies language, vision, audio, and interactive behavior. As an AI practitioner, I buy the core thesis: pure next-token prediction has hit diminishing returns; reasoning, tool-use, and grounded physical understanding are the new scaling dimensions.

I especially agree with the framing of thinking as planning—AlphaGo/AlphaZero DNA brought into the LLM era. The key is not the longest chain of thought, but the right amount of thought: parallel plans, prune, decide, iterate. That’s how strong engineers work, and it’s how models should spend compute. My caveat: “thinking budgets” still pay a real latency/energy cost. Until tool calls and sandboxed execution are bulletproof, deep reasoning will remain spiky in production.
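
To make "the right amount of thought" concrete, here is a minimal sketch of a budgeted planning loop: propose a few candidate plans, keep the best, refine, and stop once extra thinking stops paying for itself. The `propose` and `score` callables stand in for model calls; every name and parameter here is illustrative, not any DeepMind API.

```python
import time
from dataclasses import dataclass

@dataclass
class Plan:
    steps: list[str]
    score: float  # estimated value from a cheap verifier/critic


def think_with_budget(propose, score, budget_s=5.0, beam=4, min_gain=0.01):
    """Budgeted planning sketch: propose candidates, prune to a beam,
    refine the survivors, and stop at diminishing returns or the deadline."""
    deadline = time.monotonic() + budget_s
    candidates = [Plan(steps=p, score=score(p)) for p in propose(n=beam)]
    best = max(candidates, key=lambda c: c.score)
    while time.monotonic() < deadline:
        survivors = sorted(candidates, key=lambda c: c.score, reverse=True)[:beam]
        refined = [Plan(steps=p, score=score(p))
                   for c in survivors for p in propose(n=1, seed=c.steps)]
        new_best = max(refined + survivors, key=lambda c: c.score)
        if new_best.score - best.score < min_gain:
            break  # extra thinking is no longer worth the latency/energy
        best, candidates = new_best, refined
    return best
```

The point of the `min_gain` cutoff is exactly the caveat above: deep reasoning has to justify its compute, call by call.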

The world model agenda resonates. If you want robust robotics or assistants like Astra/Gemini Live, you need spatiotemporal understanding, not just good text priors. Genie 3 is a striking signal: it can generate coherent worlds where objects persist and physics behaves sensibly. I’m enthusiastic—and I still want tougher tests than “looks consistent.” Sim-to-real is notorious; we’ll need evaluations for controllable dynamics, invariances (occlusion, lighting, continuity), and goal-conditioned behavior before I call it solved.
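
What would a tougher-than-"looks consistent" test look like? A toy example, assuming you can run an off-the-shelf tracker over a generated rollout: fit a simple motion model before an occlusion and check the object reappears where that model predicts. This is my own illustrative probe, not a published benchmark.

```python
import numpy as np

def persistence_error(track, occlusion_start, occlusion_end):
    """Toy object-persistence probe for a generated world.
    `track` is a (T, 2) array of xy positions from a tracker; frames in
    [occlusion_start, occlusion_end) are hidden. Fit constant-velocity
    motion on the pre-occlusion frames, then measure how far the object
    lands from that prediction after it reappears. A 'teleporting' object
    scores high; a persistent one scores low."""
    t = np.arange(len(track))
    pre = t < occlusion_start
    post = t >= occlusion_end
    coeffs = [np.polyfit(t[pre], track[pre, i], deg=1) for i in range(2)]
    pred = np.stack([np.polyval(c, t) for c in coeffs], axis=1)
    return float(np.linalg.norm(pred[post] - track[post], axis=1).mean())
```

Scale that idea up across lighting changes, camera cuts, and goal-conditioned interventions and you start to have an eval, not a vibe check.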

Hassabis is refreshingly blunt about jagged intelligence. Yes, models ace IMO-style math yet bungle simple logic or even chess legality. Benchmarks saturate (AIME hitting ~99%); we need new stressors. I like Game Arena with Kaggle—self-advancing tournaments give clear, leak-resistant signals and scale with capability. Where I push back: games aren’t the world. Outside well-specified payoffs, reward specification gets messy. The next wave of evals should be multi-objective and long-horizon—measuring planning, memory, tool reliability, and safety traits (e.g., deception) under distribution shift, not just single-shot accuracy.
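
Here is a sketch of what I mean by multi-objective, long-horizon scoring: one record per episode, with separate axes for planning, memory, tool reliability, and safety, aggregated per distribution shift rather than collapsed into a single accuracy number. All field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeScore:
    """One long-horizon episode, scored on several axes instead of pass/fail."""
    task_id: str
    horizon_steps: int
    goal_completed: bool
    plan_quality: float       # e.g. fraction of subgoals hit in a sensible order
    memory_recall: float      # fraction of earlier facts reused correctly
    tool_error_rate: float    # failed or malformed tool calls per 100 calls
    safety_flags: list[str] = field(default_factory=list)  # e.g. "deceptive_claim"


def aggregate(episodes, shift_label):
    """Report per-axis means under a named distribution shift, not one number."""
    n = len(episodes)
    return {
        "shift": shift_label,
        "success": sum(e.goal_completed for e in episodes) / n,
        "plan_quality": sum(e.plan_quality for e in episodes) / n,
        "memory_recall": sum(e.memory_recall for e in episodes) / n,
        "tool_error_rate": sum(e.tool_error_rate for e in episodes) / n,
        "safety_incidents": sum(len(e.safety_flags) for e in episodes),
    }
```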

Another point I applaud: tools as a scaling axis. Let models reason with search, solvers, and domain AIs (AlphaFold-class tools) during planning. The open question—what becomes a built-in capability versus an external tool—is empirical. Coding/math often lifts general reasoning; chess may or may not. My hesitation: as “models become systems,” provenance and governance get harder. Developers will need traceable tool chains, permissions, and reproducible runs—otherwise we ship beautifully wrong answers faster.
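
A minimal sketch of what "traceable tool chains" could mean in practice: every tool call goes through an allow-list, gets logged with its arguments and a hash of its output, and the whole trace ships alongside the answer. The class and method names are mine, for illustration only.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ToolCall:
    tool: str
    args: dict
    output_digest: str   # hash of the raw output, so results are checkable later
    started_at: float
    duration_s: float


class ToolTrace:
    """Append-only record of every tool invocation in a run, so an answer can
    be traced back to the calls and permissions that produced it."""
    def __init__(self, allowed_tools):
        self.allowed = set(allowed_tools)
        self.calls: list[ToolCall] = []

    def invoke(self, tool_name, fn, **args):
        if tool_name not in self.allowed:
            raise PermissionError(f"tool {tool_name!r} not in allow-list")
        t0 = time.monotonic()
        result = fn(**args)
        digest = hashlib.sha256(
            json.dumps(result, sort_keys=True, default=str).encode()
        ).hexdigest()
        self.calls.append(ToolCall(tool_name, args, digest, t0, time.monotonic() - t0))
        return result

    def export(self):
        return [asdict(c) for c in self.calls]  # ship this alongside the final answer
```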

Finally, the omni model vision—converging Genie, Veo, and Gemini—feels inevitable. I’m aligned on direction, wary on product surface area. When base models upgrade every few weeks, app teams must design for hot-swappable engines, stable APIs, and eval harnesses that survive version churn.
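
Concretely, "design for hot-swappable engines" looks something like this to me: the app and its eval harness code against one narrow interface, and each vendor SDK gets a thin adapter behind it. The adapter's inner call is a placeholder, not a real SDK method.

```python
from typing import Protocol

class Engine(Protocol):
    """The narrow, stable surface the app and its eval harness depend on."""
    def generate(self, prompt: str, *, max_tokens: int = 512) -> str: ...


class VendorAdapter:
    """Illustrative adapter: wraps whatever this month's SDK looks like behind
    the stable Engine interface, so a base-model upgrade is a one-file change."""
    def __init__(self, client, model_name: str):
        self.client, self.model_name = client, model_name

    def generate(self, prompt: str, *, max_tokens: int = 512) -> str:
        # Placeholder call; only the outer signature is meant to stay stable.
        return self.client.complete(model=self.model_name,
                                    prompt=prompt, max_tokens=max_tokens)


def run_eval(engine: Engine, cases):
    """Harness written against Engine, so it survives version churn."""
    return sum(check(engine.generate(prompt)) for prompt, check in cases) / len(cases)
```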

Net-net: I’m excited by DeepMind’s trajectory—reasoning + tools + world modeling is the right stack. But to turn wow-demos into trustworthy systems, we must grow our evaluations just as aggressively as our models. Give me benchmarks that span days, not prompts; measure alignment under ambiguity; and prove sim-to-real. Do that, and an omni model won’t just impress us—it’ll hold up in the messy, physical, human world it aims to serve.


10.7.25

Meta AI’s grand blueprint for embodied agents: put a world model at the core

Move over "chatbots with arms." Meta AI has published a sweeping manifesto that recasts embodied intelligence as a world-model problem. The 40-page paper, Embodied AI Agents: Modeling the World (July 7, 2025), is signed by a who's who of researchers from EPFL, Carnegie Mellon, NTU and Meta's own labs, and argues that any meaningful agent—virtual, wearable or robotic—must learn a compact, predictive model of both the physical and the mental worlds it inhabits.


Three kinds of bodies, one cognitive engine

The authors sort today’s prototypes into three buckets:

  • Virtual agents (think emotionally intelligent avatars in games or therapy apps)

  • Wearable agents that live in smart glasses and coach you through daily tasks

  • Robotic agents capable of general-purpose manipulation and navigation

Despite wildly different form factors, all three need the same six ingredients: multimodal perception, a physical world model, a mental model of the user, action & control, short-/long-term memory, and a planner that ties them together.
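
To make those six ingredients concrete, here is a minimal composition sketch of the agent loop they imply. The interfaces are my own illustration, not Meta's architecture.

```python
from dataclasses import dataclass
from typing import Any, Protocol

class Module(Protocol):
    def __call__(self, *args: Any, **kwargs: Any) -> Any: ...


@dataclass
class EmbodiedAgent:
    """Illustrative composition of the six ingredients named above."""
    perceive: Module        # multimodal perception -> scene state
    physical_model: Module  # predicts how the world responds to actions
    mental_model: Module    # tracks user goals and social/emotional context
    memory: Module          # working + long-term store (retrieve and write)
    planner: Module         # chooses a plan given state, user model, memory
    act: Module             # action & control: executes one plan step

    def step(self, observation, user_signal):
        state = self.perceive(observation)
        user = self.mental_model(user_signal)
        context = self.memory(state, user)      # pull relevant history
        plan = self.planner(state, user, context, self.physical_model)
        return self.act(plan)
```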

What “world modeling” actually means

Meta’s framework breaks the catch-all term into concrete modules:

  1. Multimodal perception – image, video, audio and even touch encoders deliver a unified scene graph.

  2. Physical world model – predicts object dynamics and plans low- to high-level actions.

  3. Mental world model – tracks user goals, emotions and social context for better collaboration.

  4. Memory – fixed (weights), working and external stores that support life-long learning.

The paper contends that current generative LLMs waste compute by predicting every pixel or token. Instead, Meta is experimenting with transformer-based predictive models and JEPA-style latent learning to forecast just the state abstractions an agent needs to plan long-horizon tasks.
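
The JEPA idea is easiest to see in code: predict the latent of the next observation, not its pixels, against a stop-gradient target encoder. Below is a minimal PyTorch-flavoured sketch with arbitrary sizes; it illustrates the general recipe, not Meta's actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentWorldModel(nn.Module):
    """Minimal JEPA-style sketch: forecast the *latent* of the next
    observation given the current latent and the action taken."""
    def __init__(self, obs_dim=512, act_dim=8, latent_dim=128):
        super().__init__()
        def encoder():
            return nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
        self.encoder = encoder()
        self.target_encoder = encoder()
        self.target_encoder.load_state_dict(self.encoder.state_dict())
        for p in self.target_encoder.parameters():
            p.requires_grad_(False)  # updated by EMA elsewhere, not by gradients
        self.predictor = nn.Sequential(nn.Linear(latent_dim + act_dim, 256),
                                       nn.ReLU(), nn.Linear(256, latent_dim))

    def loss(self, obs_t, action_t, obs_next):
        z_t = self.encoder(obs_t)
        z_pred = self.predictor(torch.cat([z_t, action_t], dim=-1))
        with torch.no_grad():
            z_target = self.target_encoder(obs_next)  # stop-gradient target
        return F.mse_loss(z_pred, z_target)           # loss lives in latent space
```

The payoff is exactly the compute argument above: the model only has to get the task-relevant state abstraction right, not every pixel.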

New benchmarks to keep them honest

To measure progress, the team proposes a suite of “world-model” stress tests—from Minimal Video Pairs for perceptual prediction to CausalVQA and the WorldPrediction benchmark that evaluates high-level procedural planning. Early results show humans near-perfect and SOTA multimodal models barely above chance, highlighting the gap Meta hopes to close.
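
For intuition, minimal-pairs evaluation reduces to a two-alternative forced choice; here is a generic scoring sketch (illustrative, not the official protocol of any of the benchmarks named above).

```python
def minimal_pair_accuracy(model_score, pairs):
    """Each item is (plausible_clip, implausible_clip); the model is correct
    iff it assigns the physically plausible clip the higher score.
    Chance is 50%, roughly where the paper says current models sit."""
    correct = sum(model_score(good) > model_score(bad) for good, bad in pairs)
    return correct / len(pairs)
```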

Where they’re headed next

Two research directions top the agenda:

  • Embodied learning loops that pair System A (learning by passive observation) with System B (learning by physical action), each bootstrapping the other.

  • Multi-agent collaboration, where a family of specialized bodies—your glasses, a kitchen robot, and a home avatar—share a common world model and negotiate tasks.

Ethics is a running theme: privacy for always-on sensors and the risk of over-anthropomorphizing robots both get dedicated sections.

Why it matters

Meta isn’t open-sourcing code here; it’s setting the intellectual agenda. By declaring world models—not ever-larger GPTs—the “missing middle” of embodied AI, the company positions itself for a future where agents must act, not just talk. Expect the next iterations of Meta’s smart-glasses assistant (and perhaps its humanoid robot partners) to lean heavily on the blueprint sketched in this paper.

Paper link: arXiv 2506.22355 (PDF)
