Wandering Nomad

12.8.25

From Jagged Intelligence to World Models: Demis Hassabis’ Case for an “Omni Model” (and Why Evals Must Grow Up)

DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” models (Deep Think), world models that capture physics, and a path toward an omni model that unifies language, vision, audio, and interactive behavior. As an AI practitioner, I buy the core thesis: pure next-token prediction has hit diminishing returns; reasoning, tool-use, and grounded physical understanding are the new scaling dimensions.

I especially agree with the framing of thinking as planning: AlphaGo/AlphaZero DNA brought into the LLM era. The key is not the longest chain of thought but the right amount of thought: draft plans in parallel, prune, decide, iterate. That’s how strong engineers work, and it’s how models should spend compute. My caveat: “thinking budgets” still carry a real latency and energy cost. Until tool calls and sandboxed execution are bulletproof, deep reasoning will remain spiky in production.

The world model agenda resonates. If you want robust robotics or assistants like Astra/Gemini Live, you need spatiotemporal understanding, not just good text priors. Genie 3 is a striking signal: it can generate coherent worlds where objects persist and physics behaves sensibly. I’m enthusiastic—and I still want tougher tests than “looks consistent.” Sim-to-real is notorious; we’ll need evaluations for controllable dynamics, invariances (occlusion, lighting, continuity), and goal-conditioned behavior before I call it solved.

Hassabis is refreshingly blunt about jagged intelligence. Yes, models ace IMO-style math yet bungle simple logic or even chess legality. Benchmarks saturate (AIME hitting ~99%); we need new stressors. I like Game Arena with Kaggle—self-advancing tournaments give clear, leak-resistant signals and scale with capability. Where I push back: games aren’t the world. Outside well-specified payoffs, reward specification gets messy. The next wave of evals should be multi-objective and long-horizon—measuring planning, memory, tool reliability, and safety traits (e.g., deception) under distribution shift, not just single-shot accuracy.

Another point I applaud: tools as a scaling axis. Let models reason with search, solvers, and domain AIs (AlphaFold-class tools) during planning. The open question—what becomes a built-in capability versus an external tool—is empirical. Coding/math often lifts general reasoning; chess may or may not. My hesitation: as “models become systems,” provenance and governance get harder. Developers will need traceable tool chains, permissions, and reproducible runs—otherwise we ship beautifully wrong answers faster.

Finally, the omni model vision—converging Genie, Veo, and Gemini—feels inevitable. I’m aligned on direction, wary on product surface area. When base models upgrade every few weeks, app teams must design for hot-swappable engines, stable APIs, and eval harnesses that survive version churn.
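To make “hot-swappable engines” and “eval harnesses that survive version churn” concrete, here is a minimal Python sketch. Everything in it is hypothetical: the Engine interface, the EchoEngine stand-in, and the safe_to_swap gate are names I made up for illustration, not any vendor’s API. The point is only that the app codes against a small interface, and a fixed eval suite decides whether a new engine version may replace the old one.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Engine(Protocol):
    """Hypothetical minimal interface every model backend must satisfy."""
    name: str

    def generate(self, prompt: str) -> str: ...


@dataclass
class EchoEngine:
    """Stand-in backend for illustration; a real one would call a model."""
    name: str
    transform: Callable[[str], str]

    def generate(self, prompt: str) -> str:
        return self.transform(prompt)


# Each eval case pairs a prompt with a checker, so the suite is engine-agnostic.
EvalCase = tuple[str, Callable[[str], bool]]


def pass_rate(engine: Engine, cases: list[EvalCase]) -> float:
    """Fraction of cases whose output satisfies its checker."""
    passed = sum(1 for prompt, check in cases if check(engine.generate(prompt)))
    return passed / len(cases)


def safe_to_swap(old: Engine, new: Engine, cases: list[EvalCase],
                 tolerance: float = 0.0) -> bool:
    """Allow the swap only if the new engine does not regress beyond tolerance."""
    return pass_rate(new, cases) >= pass_rate(old, cases) - tolerance


cases: list[EvalCase] = [
    ("hello", lambda out: "hello" in out.lower()),
    ("world", lambda out: len(out) > 0),
]
v1 = EchoEngine("engine-v1", lambda p: p.upper())
v2 = EchoEngine("engine-v2", lambda p: "")  # simulates a regressed build

print(safe_to_swap(v1, v1, cases))  # True
print(safe_to_swap(v1, v2, cases))  # False
```

Because the checkers judge outputs rather than engine internals, the same suite keeps working when the base model changes underneath.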

Net-net: I’m excited by DeepMind’s trajectory—reasoning + tools + world modeling is the right stack. But to turn wow-demos into trustworthy systems, we must grow our evaluations just as aggressively as our models. Give me benchmarks that span days, not prompts; measure alignment under ambiguity; and prove sim-to-real. Do that, and an omni model won’t just impress us—it’ll hold up in the messy, physical, human world it aims to serve.

MolmoAct brings editable spatial plans to robot foundation models

Most robot FMs still map pixels + instructions straight to torques—a shortcut that crumbles on long-horizon tasks. MolmoAct proposes a cleaner recipe: an Action Reasoning Model (ARM) that explicitly separates perception, planning, and control so robots can reason about where to act before deciding how.

A three-stage pipeline you can steer

MolmoAct encodes images and instructions into depth-aware perception tokens, then produces a mid-level spatial plan as editable trajectory traces, and finally emits precise low-level actions. Because the plan lives as a manipulable trajectory, behavior is explainable—and steerable—without retraining.
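A toy Python sketch of why an explicit mid-level plan makes behavior steerable. This is not MolmoAct’s code or API: the function names, the 2D waypoint representation, and the fixed planner output are all assumptions made up for illustration. The three stages are separate functions, and editing one waypoint changes the emitted actions with no retraining step anywhere.

```python
from dataclasses import dataclass, replace

Point = tuple[float, float]  # image-plane waypoint, hypothetical units


@dataclass(frozen=True)
class SpatialPlan:
    """Mid-level plan: an ordered trace of 2D waypoints (hypothetical type)."""
    waypoints: tuple[Point, ...]


def perceive(image_id: str, instruction: str) -> dict:
    # Stand-in for depth-aware perception tokens; a real model encodes pixels.
    return {"image": image_id, "instruction": instruction}


def plan(tokens: dict) -> SpatialPlan:
    # Stand-in planner that always returns the same straight-line trace.
    return SpatialPlan(((0.0, 0.0), (0.5, 0.5), (1.0, 1.0)))


def act(p: SpatialPlan) -> list[Point]:
    # Low-level actions as displacements between consecutive waypoints.
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(p.waypoints, p.waypoints[1:])]


def edit_waypoint(p: SpatialPlan, index: int, new_point: Point) -> SpatialPlan:
    # Steering: a user moves one waypoint; no model weights are touched.
    pts = list(p.waypoints)
    pts[index] = new_point
    return replace(p, waypoints=tuple(pts))


original = plan(perceive("frame_0", "pick up the cup"))
steered = edit_waypoint(original, 1, (0.5, 0.0))  # e.g. detour around an obstacle

print(act(original))  # [(0.5, 0.5), (0.5, 0.5)]
print(act(steered))   # [(0.5, 0.0), (0.5, 1.0)]
```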

Numbers that move the needle

SimplerEnv (Visual Matching, zero-shot): 70.5%, beating closed models like Pi-0 and GR00T N1.
LIBERO (avg): 86.6% success, including a +6.3-point gain over ThinkAct on long-horizon tasks.
Real-world fine-tuning: additional +10% task progression on single-arm and +22.7% on bimanual setups vs Pi-0-FAST.
OOD generalization: +23.3% over baselines; also top human-preference scores for instruction following and trajectory steering.

An open blueprint, not just a model

The team releases MolmoAct-7B-D weights, training code, and—importantly—the MolmoAct Dataset, over 10,000 high-quality robot trajectories spanning diverse scenarios. Adding this mid-training set yields an average +5.5% performance lift over the base model, making it a practical plug-in for existing stacks.

Why it matters

By promoting spatial plans to first-class citizens, MolmoAct bridges the gap between language-level intent and controller-level execution. For labs and startups, that means debuggable policies, few-shot steerability, and a realistic path to explainable manipulation at scale, all without locking into a closed stack.

Paper link: arXiv 2508.07917 (PDF)

GLM-4.5 wants to be the open-source workhorse for agents, reasoning, and code

Zhipu AI just dropped GLM-4.5, a Mixture-of-Experts LLM built to juggle three hard modes at once: agentic tasks, deep reasoning, and real-world coding. The headline specs: 355B total parameters with 32B active per token, a 23-trillion-token training run, and a hybrid reasoning switch that flips between “think-out-loud” and terse answers based on task demands. There’s also a slimmer GLM-4.5-Air (106B/12B active) for teams who can’t babysit a mega-model.

Why it stands out

ARC trifecta focus. ARC here stands for agentic, reasoning, and coding. Across 12 benchmarks, GLM-4.5 places #3 overall and #2 on agentic suites, with marquee scores like 91.0 on AIME’24, 64.2 on SWE-bench Verified, and 70.1 on TAU-Bench. It also reports 26.4 on BrowseComp for web agents, near OpenAI’s o4-mini-high in the authors’ runs.
Parameter-efficient MoE. Compared to some giant peers, GLM-4.5 keeps active params modest while going deeper (more layers), with 96 attention heads, partial RoPE, QK-Norm, and a built-in MTP (multi-token prediction) layer for speculative decoding.
Hybrid reasoning as a product feature. Both GLM-4.5 and Air support thinking (for complex tool use) and non-thinking (instant replies) modes from the same checkpoint.
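For a feel of what “32B active out of 355B” means, here is a small Python sketch. The parameter counts come from the specs quoted above. The top-k router is a generic illustration of how MoE layers pick experts, and its four scores and k=2 are made-up values, not GLM-4.5’s actual configuration.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Share of parameters used per token, given counts in billions."""
    return active_b / total_b


def route_top_k(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring experts; only these run for a token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


models = {"GLM-4.5": (355, 32), "GLM-4.5-Air": (106, 12)}
for name, (total, active) in models.items():
    print(f"{name}: {active_fraction(total, active):.1%} of weights active per token")

# Toy router scores for four hypothetical experts.
print(route_top_k([0.10, 0.70, 0.05, 0.15], k=2))  # [1, 3]
```

So roughly a tenth of the weights do the work on any given token, which is where the serving-cost advantage over a dense model of the same total size comes from.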

The training recipe (quick hits)

A two-stage pretraining + mid-training stack mixes high-quality web, multilingual, code, math/science, then adds repo-level code, synthetic reasoning, 128K-token long-context, and agent trajectories to push real software-engineering and planning skills. Post-training distills expert Reasoning, Agent, and General models into one hybrid generalist, followed by targeted RL (including a “pathology RL” cleanup pass).

What you can actually download

Zhipu has published code, evals, and model cards on GitHub; weights are also listed on Hugging Face. The team pitches GLM-4.5 as agent-first and ships a simple eval harness to reproduce scores.

Bottom line

Open-source has plenty of great single-skill models. GLM-4.5 is aiming for a different bullseye: one backbone that can browse, reason, and patch code without feeling second-tier. If the reported ARC numbers hold up in the wild, this could become the go-to open checkpoint for production-grade agents.

Paper link: arXiv 2508.06471 (PDF)

8.8.25

GPT-5 Arrives: A Quantum Leap or an Incremental Step Toward Everyday AGI?

OpenAI CEO Sam Altman opened the launch keynote with a statistic that still jolts me: 700 million weekly ChatGPT users. If accurate, that would rank among the fastest adoption curves of any software platform in history. Altman framed GPT-5 as the model that finally feels like “talking to a PhD-level expert in anything,” capable of planning a birthday party, writing a full software stack, or parsing biopsy results in seconds. As someone who has lived through GPT-3’s flashes of brilliance and GPT-4o’s solid utility, I’m impressed by the live demos, particularly the on-the-fly 3-D castle game and the finance dashboard spun up in minutes. Yet part of me wonders how often real-world edge cases will still trip the model, PhD metaphors aside.

Reasoning + Speed = Default
One genuine breakthrough is that GPT-5 merges OpenAI’s slow “reasoning models” and fast “standard models” into a single pipeline. The system decides—dynamically—how much chain-of-thought to spend on each request. As a developer, I love the promise of no more model-picker gymnastics. But the skeptic in me notes that latency remains physics-bound; the keynote glossed over how much extra compute the “perfect amount of thinking” really burns.

Safer, but Still a Work in Progress
Safety lead Saachi emphasized safe completions: instead of the binary comply/refuse we’ve grown used to, GPT-5 offers partial, contextual answers plus policy pointers. I applaud the nuance (the potassium perchlorate fireworks example was spot-on), and early physician-audited benchmarks suggest lower hallucination rates. Still, bi-modal safety often fails at scale. Until we see longitudinal data from millions of prompts, I reserve judgment on whether “significantly less deceptive” translates into materially fewer bad outcomes.

Coding Superpowers—and Benchmarks That May Be Peaking
On SWE-bench Verified, GPT-5 posts 74.9%, state-of-the-art per OpenAI, and Cursor’s integration shows real autonomy: the model searches code, patches errors after compiling, and writes explanatory READMEs. That’s developer candy. Yet I can’t ignore Michael Truell’s aside that models are saturating classic evals. When a leaderboard hits 99%, the next delta in usefulness won’t come from marginal accuracy boosts; it will come from deeper tool integration, live debugging, and sustained multi-day agent runs, areas GPT-5 only begins to address.

Health and Personalization
The on-stage story of Carolina using GPT-5 to weigh radiation options was moving and highlights the model’s strength as a patient advocate. Free-tier voice chat, Gmail/calendar integration, and memory all point toward a more personal assistant future. My worry is data consent and provenance: when GPT-5 merges personal email with medical queries, the privacy surface expands dramatically. OpenAI’s policies will need the same iterative care the model architecture received.

What I’m Excited About—and Watching Carefully
I love the 400K context window, the new “minimal reasoning” knob for latency-sensitive tasks, and regular-expression-constrained outputs. Those are practical, developer-driven wins. I’m less convinced by the AGI framing; Altman downplayed compute bottlenecks and energy costs, and benchmark fatigue is real. GPT-5 feels like the best general-purpose model we’ve seen, but whether it inaugurates a “team of experts in your pocket” or reveals the limits of current scaling will depend on how it behaves over the next billion prompts.
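To show what a regex constraint buys you downstream, here is a small client-side Python sketch. It is not OpenAI’s API and says nothing about how the constraint is specified there; the date pattern and helper names are assumptions for illustration. It checks the property a constrained decoder would guarantee up front: the whole output matches the pattern.

```python
import re
from typing import Optional

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")  # hypothetical target format


def accept(output: str, pattern: re.Pattern = DATE_PATTERN) -> bool:
    """True only if the entire output matches the pattern."""
    return pattern.fullmatch(output) is not None


def first_valid(candidates: list[str]) -> Optional[str]:
    """Return the first candidate that satisfies the constraint, if any."""
    return next((c for c in candidates if accept(c)), None)


print(accept("2025-08-08"))                                 # True
print(accept("Sure! The date is 2025-08-08."))              # False
print(first_valid(["August 8th", "2025-08-08", "8/8/25"]))  # 2025-08-08
```

With server-side constraints, the second case never reaches your parser in the first place, which is why I count this as a practical win rather than a gimmick.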

Overall, GPT-5 is a thrilling upgrade—smarter, faster, and more context-aware. Just remember: even PhD-level experts can be confidently wrong, and the same will be true for the most intuitive model yet.

6.8.25

OpenAI Unveils GPT-OSS: Two Apache-Licensed Open-Weight Models Aimed at Reasoning, Agents, and Real-World Deployment

OpenAI has released GPT-OSS, a pair of open-weight language models designed for strong reasoning and agentic workflows—gpt-oss-120b and gpt-oss-20b—marking the company’s most significant “open” move since GPT-2. Both models are distributed under Apache 2.0 (with an accompanying GPT-OSS usage policy), positioning them for commercial use, customization, and local deployment.

What’s in the release

Two sizes, one family. The larger gpt-oss-120b targets top-tier reasoning; gpt-oss-20b is a lighter option for edge and on-prem use. OpenAI says 120b achieves near-parity with o4-mini on core reasoning benchmarks, while 20b performs similarly to o3-mini—a notable claim for open-weight models.
Hardware footprint. OpenAI highlights efficient operation for the 120b model (single 80 GB GPU) and 20b running with as little as 16 GB memory in edge scenarios, enabling local inference and rapid iteration without costly infrastructure.
Licensing & model card. The company published a model card and licensing details (Apache 2.0 + usage policy), clarifying intended use, evaluations, and limitations.
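A back-of-envelope check on those footprint claims, as a Python sketch. Two assumptions are mine and not from the announcement as summarized here: I take the nominal 120B and 20B from the model names as parameter counts, and I assume about 4.25 bits per parameter (4-bit weights plus scaling overhead). The estimate covers weights only and ignores KV cache and activations.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for weights only, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


ASSUMED_BITS = 4.25  # assumption: ~4-bit weights plus per-block scale overhead

for name, params_b, budget_gb in [("gpt-oss-120b", 120, 80), ("gpt-oss-20b", 20, 16)]:
    quantized = weight_memory_gb(params_b, ASSUMED_BITS)
    fp16 = weight_memory_gb(params_b, 16)
    print(f"{name}: ~{quantized:.1f} GB at {ASSUMED_BITS} bits "
          f"(budget {budget_gb} GB), ~{fp16:.0f} GB at 16 bits")
```

Under these assumptions both models fit their stated budgets with some room left for cache, while 16-bit weights would not fit either one, so low-bit weights are doing the heavy lifting in the hardware story.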

Why this matters

For years, OpenAI prioritized API-only access to frontier systems. GPT-OSS signals a strategic broadening toward open-weight distribution, meeting developers where they build—local, cloud, or hybrid—and competing more directly with leaders like Llama and DeepSeek. Early coverage underscores the shift: outlets note this is OpenAI’s first open-weight release since GPT-2 and frame it as both an ecosystem and competitive move.

Where you can run it (day one)

OpenAI launched with unusually wide partner support, making GPT-OSS easy to try in existing MLOps stacks:

Hugging Face: downloadable weights and a welcome post with implementation details.
AWS SageMaker JumpStart: curated deployment templates for OSS-20B/120B.
Azure AI Foundry & Windows AI Foundry: managed endpoints and tooling for fine-tuning and inference.
Databricks: native availability with 131k-context serving options and enterprise controls.
NVIDIA: performance tuning for GB200 NVL72 systems; NVIDIA cites up to ~1.5M tokens/sec rack-scale throughput for the 120B variant.

Developer ergonomics: Harmony & agents

OpenAI also published Harmony, a response format and prompt schema that GPT-OSS models are trained to follow. Harmony standardizes conversation structure, reasoning output, and function-calling/tool-use—useful for building agents that require predictable JSON and multi-step plans. If you’re serving via common runtimes (Hugging Face, vLLM, Ollama), the formatting is handled for you; custom servers can adopt the schema from the public repo.

Safety posture

OpenAI says GPT-OSS went through Preparedness Framework testing, including trials where a maliciously fine-tuned 120B model was evaluated for risky capabilities. The company reports that such variants did not reach high-capability thresholds, presenting a measured step forward in open-model safety practices.

How it stacks up (early read)

Early reports highlight the significance of the move and the headline performance claims—near-o4-mini for 120B and o3-mini-like results for 20B—alongside the practical win of local, customizable models under a permissive license. Analysts also point out the competitive context: GPT-OSS arrives as open-weight ecosystems (Llama, DeepSeek, Qwen, Kimi) surge in adoption.

What to build first

Agent backends that rely on structured tool use and local policy control (Harmony + Apache 2.0 helps here).
Sovereign/air-gapped deployments in regulated environments using on-prem GPUs or edge hardware, especially with the 20B model.
Cost-sensitive RAG and analytics where fine-tuning and local inference can beat per-token API economics—now supported across major clouds and MLOps platforms.

The takeaway

GPT-OSS is OpenAI’s clearest embrace of the open-weight ecosystem to date: credible reasoning performance, permissive licensing, broad partner availability, and practical tooling for agents. If your roadmap calls for customizable, locally deployable models with strong reasoning, GPT-OSS belongs on your shortlist—whether you’re targeting laptops, single-GPU servers, or GB200-class scale.

5.8.25

MLE-STAR: Google’s ML Engineering Agent Is Impressive—But Real-World Automation Still Needs Guardrails

Google Research just unveiled MLE-STAR, a machine-learning engineering agent that treats model building like a guided search-and-refine loop rather than a single shot of LLM codegen. The announcement (August 1, 2025) positions MLE-STAR as a state-of-the-art ML engineering agent capable of automating diverse tasks.

At a high level, the system does three things I really like:

Bootstraps from the web. Instead of relying purely on prior LLM knowledge (which often overfits to familiar libraries), MLE-STAR first uses web search to pull task-appropriate, modern model patterns and builds an initial solution from them. In other words, it goes looking for today’s best practice before writing code.
Refines the right part of the pipeline. Many agents rewrite whole scripts every iteration; MLE-STAR runs ablation studies to find the code block with the biggest performance impact (e.g., feature engineering vs. model vs. ensembling), then iteratively refines that block using feedback from prior runs. This targeted loop is far closer to how strong human MLEs work day-to-day.
Ensembles with intent. Rather than naive voting, the agent proposes and improves ensemble strategies to merge multiple candidate solutions into a single, better one.
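The ablation-then-refine idea in the second point can be sketched in a few lines of Python. This is my own toy illustration, not MLE-STAR’s implementation: the block names, the contribution numbers, and the evaluator are made up, and a real evaluator would train and validate a pipeline. The sketch disables one block at a time, measures the score drop, and picks the block whose removal hurts most as the refinement target.

```python
from typing import Callable

Pipeline = dict[str, bool]  # block name -> enabled?


def ablation_impact(blocks: list[str],
                    evaluate: Callable[[Pipeline], float]) -> dict[str, float]:
    """Score drop when each block is disabled, relative to the full pipeline."""
    full = {b: True for b in blocks}
    baseline = evaluate(full)
    return {b: baseline - evaluate({**full, b: False}) for b in blocks}


def pick_target(impact: dict[str, float]) -> str:
    """Choose the block whose removal hurts the most."""
    return max(impact, key=lambda b: impact[b])


# Made-up contributions; stands in for real training and validation runs.
CONTRIB = {"feature_engineering": 0.08, "model": 0.03, "ensembling": 0.01}


def toy_evaluate(p: Pipeline) -> float:
    return 0.70 + sum(v for k, v in CONTRIB.items() if p[k])


impact = ablation_impact(list(CONTRIB), toy_evaluate)
print(pick_target(impact))  # feature_engineering
```

Note how completely the choice depends on the evaluator: if the validation signal is off, the agent confidently refines the wrong block.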

The team also built pragmatic safety rails I’m thrilled to see in an autonomous coder: a debugging agent for traceback-driven fixes, a data-leakage checker to catch test-time contamination, and a data-usage checker so scripts don’t ignore provided modalities. These modules address common failure modes I’ve encountered with LLM-generated pipelines.

On benchmarks, the results are eye-catching. MLE-STAR won medals in ~63–64% of Kaggle competitions in MLE-Bench-Lite, a massive jump over prior agents; the blog cites 63.6% any-medal (with 36% gold), and the arXiv v2 reports 64%. Either way, it’s a big leap.

I also appreciate the ops mindset: there’s open-source code built with Google’s Agent Development Kit (ADK) so teams can reproduce the workflow and extend it.

Now, where I’m cautious:

Generalization. MLE-Bench-Lite is a valuable proxy, but medals on curated Kaggle tasks aren’t the same as long-lived production systems with shifting data, compliance constraints, and messy labels. The refinement loop may still need human “taste” to set success metrics and pick trade-offs (latency vs. accuracy, cost vs. recall). The paper itself stresses targeted refinement and web retrieval as the key innovations—not a claim that human MLEs are obsolete.
Licensing & provenance. Because the agent retrieves models and code from the web, verifying permissive licenses and acceptable usage is non-negotiable—Google explicitly flags MLE-STAR as research-only and expects users to check licensing of retrieved assets. That’s the right call, and enterprises should wire in policy checks before any auto-generated PRs land.
Evaluation drift. The ablation-guided focus is elegant, but it assumes your validation signal is representative. In many real datasets, weak labels or distribution shift can mislead the ablation and push the agent to overfit the “most impactful block.” Tight data splits and independent holdouts remain essential.

Bottom line: MLE-STAR advances the state of autonomous ML engineering—web-aware bootstrapping, ablation-driven targeted refinement, and smarter ensembling are exactly the techniques I want in an agentic MLE. I’m ready to use it as a co-engineer on well-scoped problems, with humans owning metrics, governance, and final review. If we pair this agent with robust eval harnesses and license compliance, the payoff could be faster iteration and stronger baselines—without losing the engineering discipline that production ML demands.

ReaGAN turns every node into an agent—with a plan, memory, and tools

Classical GNNs push messages with one global rule per layer—great for tidy graphs, brittle for messy ones. ReaGAN (Retrieval-augmented Graph Agentic Network) breaks that mold by treating each node as an autonomous agent that decides whether to aggregate locally, retrieve globally, predict now, or do nothing—based on its own memory and a plan drafted by a frozen LLM.

What’s new

Node-level autonomy. At every layer, a node queries the LLM for an action plan, executes it, and updates memory—no globally synchronized rulebook.
Local + global context. Beyond neighbors in the graph, nodes invoke RAG to retrieve semantically similar but structurally distant nodes, then fuse both sources.
Memory as glue. Nodes persist aggregated text snippets and few-shot (text, label) exemplars, enabling in-context prediction later.

Why it matters

Real-world graphs are sparse and noisy; uniform propagation amplifies junk. ReaGAN’s per-node planning and local-global retrieval adapt to informativeness imbalances and long-range semantics—key gaps in standard GNNs. In experiments, the authors report competitive few-shot performance using only a frozen LLM (no fine-tuning), highlighting a compute-friendly path for graph ML.

How it runs (at a glance)

Each node iterates a loop: perceive → plan → act (LocalAggregation / GlobalAggregation / Predict / NoOp) → update memory. A simple algorithmic skeleton formalizes the layer-wise cycle and action space.
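Here is a toy Python version of that loop, under heavy assumptions. The planner is a hand-written rule standing in for the frozen LLM, retrieval is word overlap rather than real RAG, and prediction is a majority vote over remembered exemplars. None of this is the paper’s code; only the four action names come from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    text: str
    neighbors: list[int]
    label: Optional[str] = None
    memory: list[tuple[str, Optional[str]]] = field(default_factory=list)
    prediction: Optional[str] = None


def stub_planner(node: Node, layer: int) -> str:
    """Stand-in for the frozen LLM planner (hypothetical rule)."""
    if node.label is not None:
        return "NoOp"  # labeled nodes have nothing to predict
    if layer == 0:
        return "LocalAggregation" if node.neighbors else "GlobalAggregation"
    return "Predict"


def retrieve_similar(graph: dict[int, Node], nid: int) -> list[int]:
    """Toy retrieval: other nodes sharing at least one word with this node."""
    words = set(graph[nid].text.split())
    return [j for j, n in graph.items()
            if j != nid and words & set(n.text.split())]


def step(graph: dict[int, Node], nid: int, layer: int) -> str:
    node = graph[nid]
    action = stub_planner(node, layer)            # perceive + plan
    if action == "LocalAggregation":              # act
        sources = node.neighbors
    elif action == "GlobalAggregation":
        sources = retrieve_similar(graph, nid)
    else:
        sources = []
    for j in sources:                             # update memory
        node.memory.append((graph[j].text, graph[j].label))
    if action == "Predict":
        labels = [lab for _, lab in node.memory if lab is not None]
        node.prediction = max(set(labels), key=labels.count) if labels else None
    return action


graph = {
    0: Node("graph neural networks", [2], label="ML"),
    1: Node("protein folding", [], label="Bio"),
    2: Node("neural message passing", [0]),
    3: Node("protein structure", []),  # isolated: must retrieve globally
}
for layer in range(2):
    for nid in graph:
        step(graph, nid, layer)

print(graph[2].prediction, graph[3].prediction)  # ML Bio
```

Node 3 has no neighbors at all, so uniform message passing would give it nothing; the global retrieval action is what lets it borrow a labeled exemplar.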

Paper link: https://arxiv.org/pdf/2508.00429

4.8.25

The Agentic Web: when bots become the primary users of the internet

Search boxes and feeds defined the first two web eras. A new position paper proposes the third: the Agentic Web, where autonomous software agents—often LLM-powered—act on our behalf, coordinate with other agents, and execute long-horizon tasks across services. The authors offer a working definition and argue the shift is already visible in consumer assistants that can plan purchases and book reservations end-to-end.

A framework in three dimensions

The paper lays out a conceptual stack for this world: intelligence (reasoning, memory, planning), interaction (tools, APIs, multi-agent protocols), and economics (incentives, pricing, marketplaces). These dimensions, taken together, underpin capabilities like retrieval, recommendation, planning and collaboration that move beyond single-turn chat.

From retrieval to planning to coordination

Architecturally, the authors chart algorithmic transitions: user-issued queries give way to agentic retrieval; recommender systems evolve into agent planners; and isolated tools become multi-agent collectives able to decompose and delegate work. A worked example walks through agents co-planning a travel itinerary, highlighting orchestration and memory.

New pipes: MCP and agent-to-agent messaging

HTTP and RPC weren’t built for autonomous, negotiated workflows. The paper surveys emerging Model Context Protocol (MCP) interfaces and purpose-built agent-to-agent (A2A) messaging layers to support capability discovery, tool brokering and structured negotiations between services—foundational plumbing for an internet of bots.

The Agent Attention Economy

If algorithms once competed for human attention, services on the Agentic Web will compete to be selected by agents mid-plan. That reframes ranking, pricing and attribution around machine decision-makers—an attention market where tools, APIs and even other agents bid for inclusion in workflows.

What breaks (and who pays)

The authors predict “agent browsers” will disrupt today’s user-centric browsing model, shifting interfaces from manual clicks to delegated execution. They also flag a looming billing problem for complex, multi-step agent services that span providers and time windows—who gets paid, and how, when dozens of tools contribute to one outcome?

Risks, red teaming and defense

A full section maps threats across layers (prompt-/tool-injection, data exfiltration, compromised marketplaces), and compares human-in-the-loop versus automated red teaming for agent systems. The authors argue for hybrid approaches, inference-time guardrails, and controllable planning to keep autonomous workflows within safe bounds.

Why it matters

If the Agentic Web arrives, the primary “users” of the internet won’t be humans but agents negotiating with each other—demanding new protocols, marketplaces, governance and safety tooling. For startups, the opportunity is to build the pipes, policies and platforms that let those agents cooperate—and compete—reliably.

Paper link: arXiv 2507.21206 (PDF)

2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently.

Three “reasoning effort” modes, one model

Because the verifier is built-in, MetaStone-S1 exposes controllable thinking lengths—low, medium, high—implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick.
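The effort dial boils down to best-of-k selection, which a short Python sketch makes concrete. The k values per mode come from the text above. Everything else is a stand-in I made up: the sampler emits noisy answers to a toy problem, and the scorer peeks at the ground truth, which a real scoring head obviously cannot do.

```python
import random
from typing import Callable

EFFORT_TO_K = {"low": 2, "medium": 8, "high": 32}  # candidate counts per mode


def best_of_k(sample: Callable[[random.Random], str],
              score: Callable[[str], float],
              effort: str,
              seed: int = 0) -> str:
    """Draw k candidates and return the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(EFFORT_TO_K[effort])]
    return max(candidates, key=score)


TARGET = 51  # toy ground truth, used only by the stand-in scorer


def toy_sample(rng: random.Random) -> str:
    """Stand-in for one sampled reasoning trajectory ending in an answer."""
    return str(TARGET + rng.choice([-3, -2, -1, 0, 1, 2, 3]))


def toy_score(answer: str) -> float:
    """Stand-in for the scoring head: higher means closer to the target."""
    return -abs(int(answer) - TARGET)


for effort, k in EFFORT_TO_K.items():
    best = best_of_k(toy_sample, toy_score, effort)
    print(f"{effort:>6} (k={k}): answer={best}, score={toy_score(best)}")
```

With a fixed seed, the candidates for a higher effort level are a superset of those for a lower one, so the selected score can only stay the same or improve. The cost is k forward samples per query, which is why a cheap built-in scorer matters.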

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land comparable to OpenAI o3-mini (medium). On the numbers quoted here, the high mode leads on AIME’24 and C-Eval and trails slightly on AIME’25 and LiveCodeBench. Example table slice (Pass@1), MetaStone-S1-32B-high vs. o3-mini-medium: AIME’24 85.2 vs. 79.6, AIME’25 73.6 vs. 74.8, LiveCodeBench 64.2 vs. 67.4, C-Eval 89.7 vs. 75.9.

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack.

Why this matters

Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong gains from external test-time scaling (TTS), meaning sampling several candidates and selecting among them.
Label-free verifier training. The SPRM (self-supervised process reward model) head learns step scoring from outcome signals, sidestepping costly, noisy process annotations.
Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers.
Open release. Code and checkpoints are public, inviting replication and adaptation.

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)