12.8.25

GLM-4.5 wants to be the open-source workhorse for agents, reasoning, and code

 Zhipu AI just dropped GLM-4.5, a Mixture-of-Experts LLM built to juggle three hard modes at once: agentic tasks, deep reasoning, and real-world coding. The headline specs: 355B total parameters with 32B active per token, a 23-trillion-token training run, and a hybrid reasoning switch that flips between “think-out-loud” and terse answers based on task demands. There’s also a slimmer GLM-4.5-Air (106B/12B active) for teams who can’t babysit a mega-model. 

Why it stands out

  • ARC trifecta focus (agentic, reasoning, coding). Across 12 benchmarks, GLM-4.5 places #3 overall and #2 on agentic suites—with marquee scores like 91.0 on AIME’24, 64.2 on SWE-bench Verified, and 70.1 on TAU-Bench. It also reports 26.4 on BrowseComp for web agents, near OpenAI’s o4-mini-high in the authors’ runs.

  • Parameter-efficient MoE. Compared to some giant peers, GLM-4.5 keeps active params modest while stacking deeper layers, 96 attention heads, partial RoPE, QK-Norm, and a built-in MTP (multi-token prediction) layer for speculative decoding.

  • Hybrid reasoning as a product feature. Both GLM-4.5 and Air support thinking (for complex tool use) and non-thinking (instant replies) modes from the same checkpoint. 
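
For teams wiring this into a stack, here is a minimal sketch of flipping between the two modes from a single checkpoint with Hugging Face transformers. The repo id and the `enable_thinking` chat-template flag are assumptions for illustration (hybrid-reasoning releases commonly expose such a switch); check the GLM-4.5 model card for the exact knob.

```python
# Minimal sketch: one checkpoint, two reasoning modes.
# Assumptions: the repo id "zai-org/GLM-4.5-Air" and the `enable_thinking`
# chat-template kwarg are illustrative; verify against the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.5-Air"  # assumed id, for illustration
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)

messages = [{"role": "user", "content": "Plan the API calls needed to book a flight."}]

for thinking in (True, False):
    # Thinking mode emits a reasoning trace before the answer; non-thinking replies tersely.
    prompt = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True, enable_thinking=thinking
    )
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=512)
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```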

The training recipe (quick hits)

A two-stage pretraining + mid-training stack mixes high-quality web, multilingual, code, and math/science data, then adds repo-level code, synthetic reasoning, 128K-token long-context material, and agent trajectories to push real software-engineering and planning skills. Post-training distills expert Reasoning, Agent, and General models into one hybrid generalist, followed by targeted RL (including a “pathology RL” cleanup pass).

What you can actually download

Zhipu has published code, evals, and model cards on GitHub; weights are also listed on Hugging Face. The team pitches GLM-4.5 as agent-first and ships a simple eval harness to reproduce scores. 

Bottom line

Open-source has plenty of great single-skill models. GLM-4.5 is aiming for a different bullseye: one backbone that can browse, reason, and patch code without feeling second-tier. If the reported ARC numbers hold up in the wild, this could become the go-to open checkpoint for production-grade agents.

Paper link: arXiv 2508.06471 (PDF)

8.8.25

GPT-5 Arrives: A Quantum Leap or an Incremental Step Toward Everyday AGI?

 OpenAI CEO Sam Altman opened the launch keynote with a statistic that still jolts me: 700 million weekly ChatGPT users. If accurate, that is the fastest adoption curve of any software platform in history. Altman framed GPT-5 as the model that finally feels like “talking to a PhD-level expert in anything,” capable of planning a birthday party, writing a full software stack, or parsing biopsy results in seconds. As someone who has lived through GPT-3’s flashes of brilliance and GPT-4o’s solid utility, I’m impressed by the live demos—particularly the on-the-fly 3-D castle game and the finance dashboard spun up in minutes. Yet part of me wonders how often real-world edge-cases will still trip the model, PhD metaphors aside.

Reasoning + Speed = Default
One genuine breakthrough is that GPT-5 merges OpenAI’s slow “reasoning models” and fast “standard models” into a single pipeline. The system decides—dynamically—how much chain-of-thought to spend on each request. As a developer, I love the promise of no more model-picker gymnastics. But the skeptic in me notes that latency remains physics-bound; the keynote glossed over how much extra compute the “perfect amount of thinking” really burns.
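
If the routing works as advertised, developers still get an explicit override. Here is a minimal sketch with the OpenAI Python SDK, assuming the `reasoning.effort` parameter and the "gpt-5" model name as presented at launch; verify against the current API reference.

```python
# Minimal sketch: steering how much GPT-5 "thinks" per request.
# Assumes the Responses API's reasoning.effort parameter and the "gpt-5" model
# name shown at launch; check the API reference before relying on it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Latency-sensitive call: request as little hidden chain-of-thought as possible.
quick = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "minimal"},
    input="Summarize this changelog in two sentences: ...",
)

# Harder request: allow more reasoning before answering.
deep = client.responses.create(
    model="gpt-5",
    reasoning={"effort": "high"},
    input="Find the likely off-by-one bug in this loop and propose a fix: ...",
)

print(quick.output_text)
print(deep.output_text)
```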

Safer, but Still a Work in Progress
Safety lead Saachi emphasized safe completions: instead of the binary comply/refuse we’ve grown used to, GPT-5 offers partial, contextual answers plus policy pointers. I applaud the nuance (the potassium perchlorate fireworks example was spot-on), and early physician-audited benchmarks suggest lower hallucination rates. Still, bi-modal safety often fails at scale. Until we see longitudinal data from millions of prompts, I reserve judgment on whether “significantly less deceptive” translates into materially fewer bad outcomes.

Coding Superpowers—and Benchmarks That May Be Peaking
On SWE-bench Verified, GPT-5 posts 74.9%—state-of-the-art by a wide margin—and Cursor’s integration shows real autonomy: the model searches code, patches errors after compiling, and writes explanatory READMEs. That’s developer candy. Yet I can’t ignore Michael Truell’s aside that models are saturating classic evals. When a leaderboard hits 99%, the next delta in usefulness won’t come from marginal accuracy boosts; it will come from deeper tool integration, live debugging, and sustained multi-day agent runs—areas GPT-5 only begins to address.

Health and Personalization
The on-stage story of Carolina using GPT-5 to weigh radiation options was moving and highlights the model’s strength as a patient advocate. Free-tier voice chat, Gmail/calendar integration, and memory all point toward a more personal assistant future. My worry is data consent and provenance: when GPT-5 merges personal email with medical queries, the privacy surface expands dramatically. OpenAI’s policies will need the same iterative care the model architecture received.

What I’m Excited About—and Watching Carefully
I love the 400 K context window, the new “minimal reasoning” knob for latency-sensitive tasks, and regular-expression-constrained outputs. Those are practical, developer-driven wins. I’m less convinced by the AGI framing; Altman downplayed compute bottlenecks and energy costs, and benchmark fatigue is real. GPT-5 feels like the best general-purpose model we’ve seen—but whether it inaugurates a “team of experts in your pocket” or reveals the limits of current scaling will depend on how it behaves over the next billion prompts.

Overall, GPT-5 is a thrilling upgrade—smarter, faster, and more context-aware. Just remember: even PhD-level experts can be confidently wrong, and the same will be true for the most intuitive model yet.

6.8.25

OpenAI Unveils GPT-OSS: Two Apache-Licensed Open-Weight Models Aimed at Reasoning, Agents, and Real-World Deployment

 OpenAI has released GPT-OSS, a pair of open-weight language models designed for strong reasoning and agentic workflows—gpt-oss-120b and gpt-oss-20b—marking the company’s most significant “open” move since GPT-2. Both models are distributed under Apache 2.0 (with an accompanying GPT-OSS usage policy), positioning them for commercial use, customization, and local deployment. 

What’s in the release

  • Two sizes, one family. The larger gpt-oss-120b targets top-tier reasoning; gpt-oss-20b is a lighter option for edge and on-prem use. OpenAI says 120b achieves near-parity with o4-mini on core reasoning benchmarks, while 20b performs similarly to o3-mini—a notable claim for open-weight models. 

  • Hardware footprint. OpenAI highlights efficient operation for the 120b model (single 80 GB GPU) and 20b running with as little as 16 GB of memory in edge scenarios, enabling local inference and rapid iteration without costly infrastructure (see the sketch after this list).

  • Licensing & model card. The company published a model card and licensing details (Apache 2.0 + usage policy), clarifying intended use, evaluations, and limitations. 
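
To make the footprint claims concrete, here is a minimal local-inference sketch for the smaller model. The Hugging Face id "openai/gpt-oss-20b" and the precision handling are assumptions based on the launch materials; the model card is the source of truth.

```python
# Minimal sketch: running the 20B GPT-OSS model locally with transformers.
# Assumes the repo id "openai/gpt-oss-20b" and a recent transformers release
# that understands the published (quantized) weights; see the model card.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="openai/gpt-oss-20b",
    torch_dtype="auto",   # keep whatever precision the weights ship in
    device_map="auto",    # spread across available GPU(s)/CPU
)

messages = [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Write a Python function that deduplicates a list while preserving order."},
]

out = pipe(messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])  # last message is the assistant reply
```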

Why this matters

For years, OpenAI prioritized API-only access to frontier systems. GPT-OSS signals a strategic broadening toward open-weight distribution, meeting developers where they build—local, cloud, or hybrid—and competing more directly with leaders like Llama and DeepSeek. Early coverage underscores the shift: outlets note this is OpenAI’s first open-weight release since GPT-2 and frame it as both an ecosystem and competitive move. 

Where you can run it (day one)

OpenAI launched with unusually wide partner support, making GPT-OSS easy to try in existing MLOps stacks:

  • Hugging Face: downloadable weights and a welcome post with implementation details. 

  • AWS SageMaker JumpStart: curated deployment templates for OSS-20B/120B. 

  • Azure AI Foundry & Windows AI Foundry: managed endpoints and tooling for fine-tuning and inference. 

  • Databricks: native availability with 131k-context serving options and enterprise controls. 

  • NVIDIA: performance tuning for GB200 NVL72 systems; NVIDIA cites up to ~1.5M tokens/sec rack-scale throughput for the 120B variant. 

Developer ergonomics: Harmony & agents

OpenAI also published Harmony, a response format and prompt schema that GPT-OSS models are trained to follow. Harmony standardizes conversation structure, reasoning output, and function-calling/tool-use—useful for building agents that require predictable JSON and multi-step plans. If you’re serving via common runtimes (Hugging Face, vLLM, Ollama), the formatting is handled for you; custom servers can adopt the schema from the public repo. 
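
For intuition, here is an illustrative rendering of Harmony-style markup. The role/channel tokens below reflect my reading of the public repo and may not match the current spec exactly; treat it as a sketch and use the official renderer or a supported runtime in practice.

```python
# Illustrative sketch of Harmony-style markup (roles plus channels).
# Token names (<|start|>, <|channel|>, <|message|>, <|end|>) and the
# "analysis"/"final" channels are assumptions drawn from the public repo.
def harmony_prompt(system: str, user: str) -> str:
    """Render a short conversation in Harmony-like markup and open an assistant turn."""
    return (
        f"<|start|>system<|message|>{system}<|end|>"
        f"<|start|>user<|message|>{user}<|end|>"
        "<|start|>assistant"
    )

# A completion then interleaves channels, roughly:
#   <|channel|>analysis<|message|> ...hidden reasoning... <|end|>
#   <|start|>assistant<|channel|>final<|message|> ...user-facing answer... <|end|>
# Agent frameworks key off "final" for display and a commentary/tool channel for
# structured function calls, keeping reasoning and actions machine-separable.

print(harmony_prompt("You are a terse planning agent.", "Plan a 3-step data backfill."))
```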

Safety posture

OpenAI says GPT-OSS went through Preparedness Framework testing, including trials where a maliciously fine-tuned 120B model was evaluated for risky capabilities. The company reports that such variants did not reach high-capability thresholds, presenting a measured step forward in open-model safety practices. 

How it stacks up (early read)

Early reports highlight the significance of the move and the headline performance claims—near-o4-mini for 120B and o3-mini-like results for 20B—alongside the practical win of local, customizable models under a permissive license. Analysts also point out the competitive context: GPT-OSS arrives as open-weight ecosystems (Llama, DeepSeek, Qwen, Kimi) surge in adoption. 

What to build first

  • Agent backends that rely on structured tool use and local policy control (Harmony + Apache 2.0 helps here). 

  • Sovereign/air-gapped deployments in regulated environments using on-prem GPUs or edge hardware, especially with the 20B model. 

  • Cost-sensitive RAG and analytics where fine-tuning and local inference can beat per-token API economics—now supported across major clouds and MLOps platforms.  

The takeaway

GPT-OSS is OpenAI’s clearest embrace of the open-weight ecosystem to date: credible reasoning performance, permissive licensing, broad partner availability, and practical tooling for agents. If your roadmap calls for customizable, locally deployable models with strong reasoning, GPT-OSS belongs on your shortlist—whether you’re targeting laptops, single-GPU servers, or GB200-class scale.

5.8.25

MLE-STAR: Google’s ML Engineering Agent Is Impressive—But Real-World Automation Still Needs Guardrails

 Google Research just unveiled MLE-STAR, a machine-learning engineering agent that treats model building like a guided search-and-refine loop rather than a single shot of LLM codegen. The announcement (August 1, 2025) positions MLE-STAR as a state-of-the-art ML engineering agent capable of automating diverse tasks. 

At a high level, the system does three things I really like:

  1. Bootstraps from the web. Instead of relying purely on prior LLM knowledge (which often overfits to familiar libraries), MLE-STAR first uses web search to pull task-appropriate, modern model patterns and builds an initial solution from them. In other words, it goes looking for today’s best practice before writing code. 

  2. Refines the right part of the pipeline. Many agents rewrite whole scripts every iteration; MLE-STAR runs ablation studies to find the code block with the biggest performance impact (e.g., feature engineering vs. model vs. ensembling), then iteratively refines that block using feedback from prior runs (see the sketch after this list). This targeted loop is far closer to how strong human MLEs work day-to-day. 

  3. Ensembles with intent. Rather than naive voting, the agent proposes and improves ensemble strategies to merge multiple candidate solutions into a single, better one. 
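
Here is a hedged sketch of that outer loop as I read the paper: ablate, pick the highest-impact block, refine only that block, keep the change if validation improves. Every helper below is a placeholder, not the ADK or MLE-STAR API.

```python
# Pseudocode-style sketch of ablation-guided targeted refinement.
# evaluate / baseline / propose_refinement stand in for real training runs and
# LLM calls; they are stubbed so the control flow is runnable end to end.
import random

def evaluate(solution: dict, task: str) -> float:
    # Stand-in for "train the pipeline and score it on a held-out split".
    random.seed(hash((tuple(sorted(solution.items())), task)) % (2**32))
    return random.random()

def baseline(block_name: str) -> str:
    return f"# trivial {block_name} baseline"

def propose_refinement(solution: dict, target: str, task: str) -> str:
    # Stand-in for an LLM rewrite conditioned on prior attempts and their scores.
    return solution[target] + "  # refined"

def mle_star_refine(solution: dict, task: str, n_rounds: int = 5) -> dict:
    best_score = evaluate(solution, task)
    for _ in range(n_rounds):
        # 1) Ablation: how much does the score drop when each block is reverted?
        impact = {name: best_score - evaluate({**solution, name: baseline(name)}, task)
                  for name in solution}
        target = max(impact, key=impact.get)   # the block that matters most right now

        # 2) Targeted refinement of that block only.
        candidate = {**solution, target: propose_refinement(solution, target, task)}
        score = evaluate(candidate, task)

        # 3) Keep the candidate only if validation improves.
        if score > best_score:
            solution, best_score = candidate, score
    return solution

seed_solution = {"features": "raw columns", "model": "gradient boosting", "ensemble": "single model"}
print(mle_star_refine(seed_solution, task="tabular-demo"))
```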

The team also built pragmatic safety rails I’m thrilled to see in an autonomous coder: a debugging agent for traceback-driven fixes, a data-leakage checker to catch test-time contamination, and a data-usage checker so scripts don’t ignore provided modalities. These modules address common failure modes I’ve encountered with LLM-generated pipelines. 

On benchmarks, the results are eye-catching. MLE-STAR won medals in ~63–64% of Kaggle competitions in MLE-Bench-Lite, a massive jump over prior agents; the blog cites 63.6% any-medal (with 36% gold), and the arXiv v2 reports 64%. Either way, it’s a big leap. 

I also appreciate the ops mindset: there’s open-source code built with Google’s Agent Development Kit (ADK) so teams can reproduce the workflow and extend it. 

Now, where I’m cautious:

  • Generalization. MLE-Bench-Lite is a valuable proxy, but medals on curated Kaggle tasks aren’t the same as long-lived production systems with shifting data, compliance constraints, and messy labels. The refinement loop may still need human “taste” to set success metrics and pick trade-offs (latency vs. accuracy, cost vs. recall). The paper itself stresses targeted refinement and web retrieval as the key innovations—not a claim that human MLEs are obsolete. 

  • Licensing & provenance. Because the agent retrieves models and code from the web, verifying permissive licenses and acceptable usage is non-negotiable—Google explicitly flags MLE-STAR as research-only and expects users to check licensing of retrieved assets. That’s the right call, and enterprises should wire in policy checks before any auto-generated PRs land. 

  • Evaluation drift. The ablation-guided focus is elegant, but it assumes your validation signal is representative. In many real datasets, weak labels or distribution shift can mislead the ablation and push the agent to overfit the “most impactful block.” Tight data splits and independent holdouts remain essential.

Bottom line: MLE-STAR advances the state of autonomous ML engineering—web-aware bootstrapping, ablation-driven targeted refinement, and smarter ensembling are exactly the techniques I want in an agentic MLE. I’m ready to use it as a co-engineer on well-scoped problems, with humans owning metrics, governance, and final review. If we pair this agent with robust eval harnesses and license compliance, the payoff could be faster iteration and stronger baselines—without losing the engineering discipline that production ML demands. 

ReaGAN turns every node into an agent—with a plan, memory, and tools

Classical GNNs push messages with one global rule per layer—great for tidy graphs, brittle for messy ones. ReaGAN (Retrieval-augmented Graph Agentic Network) breaks that mold by treating each node as an autonomous agent that decides whether to aggregate locally, retrieve globally, predict now, or do nothing—based on its own memory and a plan drafted by a frozen LLM.

What’s new

  • Node-level autonomy. At every layer, a node queries the LLM for an action plan, executes it, and updates memory—no globally synchronized rulebook. 

  • Local + global context. Beyond neighbors in the graph, nodes invoke RAG to retrieve semantically similar but structurally distant nodes, then fuse both sources. 

  • Memory as glue. Nodes persist aggregated text snippets and few-shot (text, label) exemplars, enabling in-context prediction later. 

Why it matters

Real-world graphs are sparse and noisy; uniform propagation amplifies junk. ReaGAN’s per-node planning and local-global retrieval adapt to informativeness imbalances and long-range semantics—key gaps in standard GNNs. In experiments, the authors report competitive few-shot performance using only a frozen LLM (no fine-tuning), highlighting a compute-friendly path for graph ML. 

How it runs (at a glance)

Each node iterates a loop: perceive → plan → act (LocalAggregation / GlobalAggregation / Predict / NoOp) → update memory. A simple algorithmic skeleton formalizes the layer-wise cycle and action space. 
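
A hedged sketch of that loop, with the frozen-LLM planner and the retriever stubbed out (the action names follow the paper; everything else is illustrative):

```python
# Toy sketch of ReaGAN's per-node agent loop: perceive -> plan -> act -> update memory.
# plan_with_llm and global_retrieve are stubs for the frozen LLM planner and RAG.
from dataclasses import dataclass, field
from typing import Optional

ACTIONS = ("LocalAggregation", "GlobalAggregation", "Predict", "NoOp")

@dataclass
class NodeAgent:
    node_id: int
    text: str
    memory: list = field(default_factory=list)   # aggregated snippets + few-shot exemplars
    prediction: Optional[str] = None

def plan_with_llm(agent: NodeAgent, layer: int) -> str:
    # Stand-in for prompting the frozen LLM with the node's memory; returns one action.
    if not agent.memory:
        return "LocalAggregation"
    return "GlobalAggregation" if layer == 1 else "Predict"

def local_aggregate(agent: NodeAgent, graph: dict, nodes: dict) -> list:
    return [nodes[j].text for j in graph.get(agent.node_id, [])]

def global_retrieve(agent: NodeAgent, nodes: dict, k: int = 2) -> list:
    # Toy RAG: lexical overlap stands in for semantic similarity across the whole graph.
    others = [n for n in nodes.values() if n.node_id != agent.node_id]
    others.sort(key=lambda n: -len(set(n.text.split()) & set(agent.text.split())))
    return [n.text for n in others[:k]]

def run_layer(nodes: dict, graph: dict, layer: int) -> None:
    for agent in nodes.values():
        action = plan_with_llm(agent, layer)                       # plan
        if action == "LocalAggregation":                           # act + update memory
            agent.memory += local_aggregate(agent, graph, nodes)
        elif action == "GlobalAggregation":
            agent.memory += global_retrieve(agent, nodes)
        elif action == "Predict":
            agent.prediction = f"label inferred from {len(agent.memory)} snippets"
        # NoOp: leave memory and prediction untouched

nodes = {i: NodeAgent(i, t) for i, t in enumerate(
    ["graph neural networks", "retrieval augmented graphs", "noisy graph labels"])}
graph = {0: [1], 1: [0, 2], 2: [1]}
for layer in range(3):
    run_layer(nodes, graph, layer)
print({i: n.prediction for i, n in nodes.items()})
```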

Paper link: https://arxiv.org/pdf/2508.00429

4.8.25

The Agentic Web: when bots become the primary users of the internet

 Search boxes and feeds defined the first two web eras. A new position paper proposes the third: the Agentic Web, where autonomous software agents—often LLM-powered—act on our behalf, coordinate with other agents, and execute long-horizon tasks across services. The authors offer a working definition and argue the shift is already visible in consumer assistants that can plan purchases and book reservations end-to-end. 

A framework in three dimensions

The paper lays out a conceptual stack for this world: intelligence (reasoning, memory, planning), interaction (tools, APIs, multi-agent protocols), and economics (incentives, pricing, marketplaces). These dimensions, taken together, underpin capabilities like retrieval, recommendation, planning and collaboration that move beyond single-turn chat.

From retrieval to planning to coordination

Architecturally, the authors chart algorithmic transitions: user-issued queries give way to agentic retrieval; recommender systems evolve into agent planners; and isolated tools become multi-agent collectives able to decompose and delegate work. A worked example walks through agents co-planning a travel itinerary, highlighting orchestration and memory. 

New pipes: MCP and agent-to-agent messaging

HTTP and RPC weren’t built for autonomous, negotiated workflows. The paper surveys emerging Model Context Protocol (MCP) interfaces and purpose-built agent-to-agent (A2A) messaging layers to support capability discovery, tool brokering and structured negotiations between services—foundational plumbing for an internet of bots. 
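
To get a feel for the plumbing, here is an illustrative MCP-style exchange. The JSON-RPC 2.0 framing and the tools/list and tools/call method names come from my reading of the public spec; the tool itself and its fields are hypothetical.

```python
# Illustrative MCP-style capability discovery and tool invocation.
# Framing follows JSON-RPC 2.0; the "search_flights" tool and its arguments are
# hypothetical, and the exact message set should be checked against the spec.
import json

discover = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",            # agent asks a server what it can broker
}

invoke = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_flights",      # hypothetical tool exposed by a travel service
        "arguments": {"origin": "TLV", "destination": "BER", "date": "2025-09-01"},
    },
}

print(json.dumps(discover, indent=2))
print(json.dumps(invoke, indent=2))
```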

The Agent Attention Economy

If algorithms once competed for human attention, services on the Agentic Web will compete to be selected by agents mid-plan. That reframes ranking, pricing and attribution around machine decision-makers—an attention market where tools, APIs and even other agents bid for inclusion in workflows. 

What breaks (and who pays)

The authors predict “agent browsers” will disrupt today’s user-centric browsing model, shifting interfaces from manual clicks to delegated execution. They also flag a looming billing problem for complex, multi-step agent services that span providers and time windows—who gets paid, and how, when dozens of tools contribute to one outcome? 

Risks, red teaming and defense

A full section maps threats across layers (prompt-/tool-injection, data exfiltration, compromised marketplaces), and compares human-in-the-loop versus automated red teaming for agent systems. The authors argue for hybrid approaches, inference-time guardrails, and controllable planning to keep autonomous workflows within safe bounds.

Why it matters

If the Agentic Web arrives, the primary “users” of the internet won’t be humans but agents negotiating with each other—demanding new protocols, marketplaces, governance and safety tooling. For startups, the opportunity is to build the pipes, policies and platforms that let those agents cooperate—and compete—reliably.

Paper link: arXiv 2507.21206 (PDF)

2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

 Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently. 

Three “reasoning effort” modes, one model

Because the verifier is built-in, MetaStone-S1 exposes controllable thinking lengths—low, medium, high—implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick. 
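
Conceptually, the dial is just best-of-k selection with a cheap verifier in the loop. A hedged sketch (the sampler and the scoring head below are stubs, not the paper's SPRM):

```python
# Sketch of verifier-guided test-time scaling: sample k candidate reasoning
# trajectories, score each with a lightweight head sharing the policy backbone,
# and return the best one. sample_chain / score_chain are placeholders.
import random

EFFORT_TO_K = {"low": 2, "medium": 8, "high": 32}   # the k = 2/8/32 dial

def sample_chain(question: str, seed: int) -> dict:
    rng = random.Random(seed)
    return {"text": f"candidate reasoning #{seed} for {question!r}", "quality": rng.random()}

def score_chain(chain: dict) -> float:
    # Stand-in for the shared-backbone process-reward head scoring the trajectory.
    return chain["quality"]

def answer(question: str, effort: str = "medium") -> str:
    k = EFFORT_TO_K[effort]
    candidates = [sample_chain(question, seed) for seed in range(k)]
    best = max(candidates, key=score_chain)          # select one trajectory, don't average
    return best["text"]

print(answer("AIME-style geometry problem", effort="high"))
```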

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land comparable to OpenAI o3-mini (medium)—with the high mode leading math by a sizable margin. Example slice (Pass@1): MetaStone-S1-32B-high scores 85.2 on AIME’24, 73.6 on AIME’25, 64.2 on LiveCodeBench, and 89.7 on C-Eval, versus 79.6 / 74.8 / 67.4 / 75.9 for o3-mini-medium.

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack. 

Why this matters

  • Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong external test-time scaling (TTS) gains. 

  • Label-free verifier training. The SPRM head learns step scoring from outcome signals, sidestepping costly, noisy process annotations. 

  • Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers. 

  • Open release. Code and checkpoints are public, inviting replication and adaptation. 

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)

Computing Changes How We Think—But Creativity, Not Just GPUs, Will Decide AI’s Next Decade

 In a wide-ranging Bloomberg interview, Dr. Wang Jian (founder of Alibaba Cloud) makes a forceful case that the era of AI “toy problems” is over. I agree. The last two years moved us from brittle demos to systems that reliably draft code, analyze documents, and support human decision-making. His analogy that more compute is like upgrading from a bicycle to a rocket is compelling: when the cost and scale of computation change, the feasible solution space—and our mental models—change with it.

Where I especially align is his view that markets are not just places to sell, but living testbeds where technology matures under real constraints. This resonates with best practices in ML ops: no benchmark, however well chosen, substitutes for deployment feedback. China’s dense competitive landscape, as he notes, creates short iteration loops—startups push features, rivals answer, users vote—accelerating collective learning. In ML terms, it’s a virtuous cycle of data, gradient steps, and evaluation at production scale.

I also appreciate his skepticism about tidy labels like AI → AGI → ASI. In practice, capability is a continuum: larger context windows, better tool use, richer memory, and planning—these blur categorical boundaries. Treating progress as increasing capability across tasks avoids false thresholds and keeps builders focused on measurable gains.

That said, I diverge on several points.

First, Dr. Wang downplays compute as a long-term bottleneck. I’m not fully convinced. While creativity and product insight absolutely dominate value creation, frontier training remains capital- and energy-intensive. Export controls, supply chain variability, and power availability still shape who can train or serve the most advanced models. For many labs, clever data curation and distillation help—but they don’t erase the physics and economics of scaling laws.

Second, on robotics, he frames AI as a new “engine” for an existing vehicle. Conceptually useful—but today’s embodied intelligence also requires tight integration across perception, control, simulation, and safety, not just swapping motors. Progress is real (foundation models for vision and language transfer surprisingly well), yet reliable grasping, long-horizon autonomy, and recovery from edge cases remain research frontiers. The “AI engine” metaphor risks underestimating those system-level challenges.

Third, the notion that no current advantage forms a durable moat is directionally optimistic and healthy for competition; still, moats can emerge from datasets with verified provenance, reinforcement-learning pipelines at scale, distribution, and compliance. Even if individual components commoditize, the orchestration (agents, tools, retrieval, evals, and workflow integration) can compound into real defensibility.

Finally, I agree with his emphasis that creativity is the scarcest input. Where I’d extend the argument is execution discipline: teams need evaluation harnesses, safety checks, and shipping cadences so creativity feeds a measurable loop. In other words, pair inspired ideas with ruthless metrics.

The upshot: Dr. Wang’s thesis—compute reshapes thinking, markets mature tech, creativity drives breakthroughs—captures much of what’s powering AI right now. My caveats don’t negate his vision; they refine it. The winners will be those who marry inventive product design with pragmatic engineering and acknowledge that, even in a marathon, hardware, data, and distribution still set the course.

Hierarchical Reasoning Model: a tiny, brain-inspired model that out-reasons giant CoT LLMs

 Most frontier models “reason” by narrating token-by-token chains of thought. Sapient Intelligence’s Hierarchical Reasoning Model (HRM) argues you don’t need that narration—or billions of parameters—to solve hard puzzles. The 27M-parameter model runs two coupled recurrent modules at different timescales (a slow H-module for abstract planning and a fast L-module for detailed computation) to perform deep latent reasoning in a single forward pass. Trained from scratch with no pretraining and no CoT supervision, HRM hits standout scores across inductive-reasoning and search-heavy tasks.

Why it works: depth without the usual pain

HRM’s core trick is hierarchical convergence: the fast L-module iterates to a local equilibrium, then the slow H-module updates once and “resets” context for the next refinement cycle—stacking many effective computation steps without vanishing into a fixed point. To train it efficiently, the authors derive a one-step gradient approximation that avoids backpropagation-through-time, cutting memory from O(T) to O(1) per sequence. 

There’s also an adaptive halting head (a small Q-learner) that decides whether to stop or continue another reasoning segment, enabling “think-more-if-needed” behavior at inference time—useful when a problem demands longer planning. 
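
A toy PyTorch rendering of that control flow helps make it concrete: the fast module iterates inside each slow cycle, gradients are kept only for the final cycle (a crude stand-in for the one-step approximation), and a small halting head decides whether to run another segment. Module sizes and update rules here are illustrative, not the paper's architecture.

```python
# Toy sketch of HRM-style nested recurrence with a halting head (PyTorch).
# All dimensions and cell choices are illustrative.
import torch
import torch.nn as nn

class ToyHRM(nn.Module):
    def __init__(self, dim: int = 64, n_cycles: int = 4, l_steps: int = 6):
        super().__init__()
        self.l_cell = nn.GRUCell(dim, dim)   # fast, detailed computation
        self.h_cell = nn.GRUCell(dim, dim)   # slow, abstract planning
        self.halt_head = nn.Linear(dim, 2)   # Q-values: [continue, halt]
        self.readout = nn.Linear(dim, dim)
        self.n_cycles, self.l_steps = n_cycles, l_steps

    def segment(self, x, z_h, z_l):
        # All but the last cycle run without building a graph (O(1) memory);
        # only the final cycle keeps gradients, echoing the one-step approximation.
        with torch.no_grad():
            for _ in range(self.n_cycles - 1):
                for _ in range(self.l_steps):
                    z_l = self.l_cell(x + z_h, z_l)   # L iterates toward a local equilibrium
                z_h = self.h_cell(z_l, z_h)           # H updates once, resetting the local loop
        for _ in range(self.l_steps):
            z_l = self.l_cell(x + z_h, z_l)
        z_h = self.h_cell(z_l, z_h)
        return z_h, z_l

    def forward(self, x, max_segments: int = 3):
        z_h = torch.zeros_like(x)
        z_l = torch.zeros_like(x)
        for _ in range(max_segments):
            z_h, z_l = self.segment(x, z_h, z_l)
            q_continue, q_halt = self.halt_head(z_h).unbind(-1)
            if (q_halt > q_continue).all():           # "think more only if needed"
                break
        return self.readout(z_h)

model = ToyHRM()
print(model(torch.randn(8, 64)).shape)   # torch.Size([8, 64])
```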

The receipts

With roughly 1,000 training examples per task, HRM posts numbers that would make far larger CoT systems blush:

  • ARC-AGI-1: 40.3 %, beating o3-mini-high (34.5), Claude-3.7 8K (21.2) and DeepSeek-R1 (21.0); a Transformer trained directly on IO pairs manages 15.8. 

  • ARC-AGI-2: HRM reaches 5.0 % where strong CoT baselines hover near zero—consistent with the benchmark’s step-up in compositional difficulty. 

  • Sudoku-Extreme (9×9, 1k ex.): 55.0 % accuracy; on the full Sudoku-Extreme-Full (3.83 M puzzles), HRM approaches near-perfect accuracy. 

  • Maze-Hard (30×30, 1k ex.): 74.5 % optimal-path success—where CoT baselines flatline. 

What this means for builders

  • Latent > linguistic reasoning: HRM shows you can get deep, backtracking-style reasoning inside hidden states—no verbose CoT, fewer tokens, lower latency. 

  • Tiny models, big compute depth: By recycling computation through nested recurrent cycles, HRM attains “depth” that standard Transformers don’t, even when you stack layers. 

  • Knob for “thinking time”: The halting mechanism effectively scales compute at inference—handy for tasks like Sudoku where a few extra cycles pay off more than on ARC-style transformations. 

Dataset & evaluation notes

Sudoku-Extreme combines easier Kaggle-style puzzles with community “forum-hard” sets; difficulty is measured by average backtracks (≈22 per puzzle on the new subset—much tougher than common datasets). Maze-Hard requires optimal 30×30 paths; ARC-AGI results follow the official challenge protocols with standard augmentations. 

If the open-sourced code (the paper links a GitHub repo) spurs replication, expect a wave of BPTT-free recurrent designs and “reason-more-on-demand” controls to show up in lightweight agents—especially where token budgets and latency matter more than eloquent chains of thought. 

Paper link: arXiv 2506.21734 (PDF)

 Anthropic has expanded Claude Sonnet 4’s context window to a full 1,000,000 tokens, a five-fold jump that shifts what teams can do in a sin...