
12.8.25

From Jagged Intelligence to World Models: Demis Hassabis’ Case for an “Omni Model” (and Why Evals Must Grow Up)

 DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” models (Deep Think), world models that capture physics, and a path toward an omni model that unifies language, vision, audio, and interactive behavior. As an AI practitioner, I buy the core thesis: pure next-token prediction has hit diminishing returns; reasoning, tool-use, and grounded physical understanding are the new scaling dimensions.

I especially agree with the framing of thinking as planning—AlphaGo/AlphaZero DNA brought into the LLM era. The key is not the longest chain of thought, but the right amount of thought: parallel plans, prune, decide, iterate. That’s how strong engineers work, and it’s how models should spend compute. My caveat: “thinking budgets” still pay a real latency/energy cost. Until tool calls and sandboxed execution are bulletproof, deep reasoning will remain spiky in production.
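
To make “the right amount of thought” concrete, here is a toy sketch of a budgeted generate-prune-refine loop. The generate_plans, refine, and score helpers are hypothetical stand-ins for model calls, nothing DeepMind has described; the point is only the shape of the compute spend.

    import random

    # Hypothetical stand-ins for model calls; a real system would query an LLM here.
    def generate_plans(task, n):
        return [f"{task}: draft {i}" for i in range(n)]

    def refine(plan):
        return plan + " +refined"

    def score(plan):
        return plan.count("+refined") + random.random()   # toy quality heuristic

    def solve(task, rounds=3, width=8, keep=2):
        """Spend a fixed thinking budget: spawn parallel plans, prune, refine, decide."""
        candidates = generate_plans(task, n=width)
        for _ in range(rounds):
            candidates.sort(key=score, reverse=True)        # evaluate the drafts
            candidates = candidates[:keep]                  # prune hard
            candidates += [refine(c) for c in candidates]   # iterate on survivors
        return max(candidates, key=score)                   # decide

    print(solve("route the delivery fleet"))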

The world model agenda resonates. If you want robust robotics or assistants like Astra/Gemini Live, you need spatiotemporal understanding, not just good text priors. Genie 3 is a striking signal: it can generate coherent worlds where objects persist and physics behaves sensibly. I’m enthusiastic—and I still want tougher tests than “looks consistent.” Sim-to-real is notorious; we’ll need evaluations for controllable dynamics, invariances (occlusion, lighting, continuity), and goal-conditioned behavior before I call it solved.

Hassabis is refreshingly blunt about jagged intelligence. Yes, models ace IMO-style math yet bungle simple logic or even chess legality. Benchmarks saturate (AIME hitting ~99%); we need new stressors. I like Game Arena with Kaggle—self-advancing tournaments give clear, leak-resistant signals and scale with capability. Where I push back: games aren’t the world. Outside well-specified payoffs, reward specification gets messy. The next wave of evals should be multi-objective and long-horizon—measuring planning, memory, tool reliability, and safety traits (e.g., deception) under distribution shift, not just single-shot accuracy.

Another point I applaud: tools as a scaling axis. Let models reason with search, solvers, and domain AIs (AlphaFold-class tools) during planning. The open question—what becomes a built-in capability versus an external tool—is empirical. Coding/math often lifts general reasoning; chess may or may not. My hesitation: as “models become systems,” provenance and governance get harder. Developers will need traceable tool chains, permissions, and reproducible runs—otherwise we ship beautifully wrong answers faster.
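
To make “traceable tool chains” a bit more concrete, here is a minimal sketch of a provenance wrapper around a tool call. The log schema and the toy search tool are my own assumptions, not anything from the interview or any Google API.

    import hashlib, json, time

    def traced_call(tool_name, tool_fn, args, log):
        # Run one tool call and record enough metadata to audit and replay it later.
        started = time.time()
        result = tool_fn(**args)
        log.append({
            "tool": tool_name,
            "args": args,
            "result_digest": hashlib.sha256(repr(result).encode()).hexdigest(),
            "latency_s": round(time.time() - started, 3),
        })
        return result

    trace = []
    traced_call("search", lambda query: f"results for {query}",
                {"query": "FP8 quantisation"}, trace)
    print(json.dumps(trace, indent=2))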

Finally, the omni model vision—converging Genie, Veo, and Gemini—feels inevitable. I’m aligned on direction, wary on product surface area. When base models upgrade every few weeks, app teams must design for hot-swappable engines, stable APIs, and eval harnesses that survive version churn.
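
One way app teams cope with that churn is a thin adapter layer: code against a stable interface, swap the engine behind it, and keep the eval harness pointed at the interface rather than at any single model version. The sketch below is my own illustration under those assumptions, not a Google SDK.

    from typing import Protocol

    class Engine(Protocol):
        # The stable surface the app and the eval harness depend on.
        def complete(self, prompt: str) -> str: ...

    class StubEngine:
        # Hypothetical placeholder; a real adapter would wrap a vendor SDK call.
        def complete(self, prompt: str) -> str:
            return f"[stub] {prompt[:40]}"

    ENGINES = {"stub-v1": StubEngine()}   # hot-swap by key, not by rewriting app code

    def run_eval(engine: Engine, cases: list[str]) -> float:
        # Version-agnostic harness: the metric survives even when the engine changes.
        return sum("[stub]" in engine.complete(c) for c in cases) / len(cases)

    print(run_eval(ENGINES["stub-v1"], ["plan a trip", "sort this list"]))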

Net-net: I’m excited by DeepMind’s trajectory—reasoning + tools + world modeling is the right stack. But to turn wow-demos into trustworthy systems, we must grow our evaluations just as aggressively as our models. Give me benchmarks that span days, not prompts; measure alignment under ambiguity; and prove sim-to-real. Do that, and an omni model won’t just impress us—it’ll hold up in the messy, physical, human world it aims to serve.


22.7.25

Qwen3-235B-A22B-Instruct-2507: Alibaba’s New Open-Weight Flagship Redefines Efficient Megamodels

 When the Qwen team hit “post” on X announcing Qwen3-235B-A22B-Instruct-2507—plus a lightweight FP8 variant—the tweet felt less like routine release notes and more like a thunderclap across AI Twitter. The thread promised “better across the board” performance and immediate open-weights access, positioning Qwen as the most aggressive big-model vendor in the open ecosystem. 



Inside the Model

Under the hood, the new model keeps the mixture-of-experts (MoE) recipe that made earlier Qwen3 builds special: 128 experts, of which only 8 are activated for each token, so just 22 B parameters do the work even though the full network tops out at 235 B. That efficiency allows 256 K tokens of native context and puts deployments that once demanded datacenter GPUs within reach of far more modest hardware.
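
To see why only 8 of the 128 experts are ever read for a given token, here is a toy sketch of top-k expert routing using the published 128-expert / 8-active shape. The code is illustrative only, not Qwen’s implementation, and the tensor sizes are made up for readability.

    import numpy as np

    NUM_EXPERTS, TOP_K, HIDDEN = 128, 8, 16   # Qwen3 routes 8 of 128 experts per token

    rng = np.random.default_rng(0)
    router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS))
    experts = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN))   # toy expert weights

    def moe_forward(token_h):
        logits = token_h @ router_w                     # router scores for one token
        top = np.argsort(logits)[-TOP_K:]               # only 8 experts "fire"
        gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
        # Only the selected experts' weights are read; the other 120 stay idle.
        return sum(g * (token_h @ experts[i]) for g, i in zip(gates, top))

    print(moe_forward(rng.standard_normal(HIDDEN)).shape)   # (16,)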

Benchmark Shockwaves

Numbers published with the release show why the community’s jaw dropped. On the notoriously tricky ARC-AGI benchmark, Qwen3-235B-A22B-Instruct-2507 scores 41.8 %, eclipsing Moonshot’s freshly minted Kimi K2 by nearly 29 points and edging ahead of Claude Opus 4 in non-thinking mode. Coding (LiveCodeBench v6) jumps to 51.8 %, and reasoning tasks like AIME25 leap to 70.3 %. In most rows of the evaluation table, the new Qwen flagship sits comfortably ahead of DeepSeek-V3, o3-mini, and OpenAI’s o1 reference.

Why an FP8 Build Matters

Alongside the bf16 release, Alibaba published a fully FP8-quantised version. Dropping to eight-bit floats slashes VRAM by roughly 40 % while preserving accuracy, paving the way for single-node inference or even well-provisioned multi-GPU workstation rigs. Apache-2.0 licensing means startups can bake the FP8 weights directly into commercial products without costly negotiations.
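
If you want to try the FP8 weights yourself, a vLLM launch along these lines is the obvious starting point. Treat it as a sketch: the repo id, GPU count, and context length should be checked against the model card rather than taken from here.

    # Sketch: serving the FP8 checkpoint with vLLM on a multi-GPU node.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507-FP8",  # assumed repo id; verify on the card
        tensor_parallel_size=4,                          # shard across 4 GPUs
        max_model_len=262144,                            # native 256 K context
    )
    params = SamplingParams(temperature=0.7, max_tokens=512)
    out = llm.generate(["Summarise this contract: ..."], params)
    print(out[0].outputs[0].text)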

Community Reception: K2 Who?

Reddit’s r/singularity lit up within minutes: “Kimi K2 is already irrelevant,” read the top-voted post, linking to the Qwen tweet and noting that the model is roughly 4.2× smaller in total parameters yet wins more of the head-to-head benchmark rows. Analysts on Interconnects echoed the sentiment, framing the drop as part of a summer in which Chinese labs “continue to dominate” the open-weight leaderboard and openly court Western builders.

Beyond Benchmarks: Agentic DNA

The Qwen team stresses that the instruct model is tuned for tool-calling and agent workflows. The official model card shows code snippets for integrating with Qwen-Agent and MCP config files, underscoring Alibaba’s push toward practical automation at the full 262,144-token (256 K) context scale: think mega-docs, legal contracts, or multi-day chat histories without windowing hacks.
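
The model card leans on Qwen-Agent for its agent examples; a lighter-weight way to poke at the same tool-calling behaviour is any OpenAI-compatible client pointed at a locally served endpoint. The base URL, model name, and get_weather schema below are my own assumptions for illustration, not lifted from the card.

    # Sketch: exercising tool-calling against a locally served Qwen3 endpoint.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",                      # hypothetical tool
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507",
        messages=[{"role": "user", "content": "Do I need an umbrella in Hangzhou?"}],
        tools=tools,
    )
    print(resp.choices[0].message.tool_calls)   # the model decides whether to call the tool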

Why It Matters

Qwen3-235B-A22B-Instruct-2507 sets a new bar for “open yet frontier-grade.” By decoupling “thinking” and “non-thinking” modes into separate models, Alibaba embraced community feedback while sidestepping latency complaints. The result is a release that:

  • outperforms larger proprietary models on knowledge, reasoning, and multilingual tests;

  • ships under a permissive license;

  • arrives in both bf16 and FP8 flavors for hobbyists and enterprises alike;

  • proves that giant MoEs can be resource-friendly—and, crucially, available today.

For AI enthusiasts and builders, the message is clear: grab the weights, spin up your agent stack, and see how far 22 B active parameters can take you. The open-source race just found a new pacesetter.
