Wandering Nomad

6.8.25

OpenAI Unveils GPT-OSS: Two Apache-Licensed Open-Weight Models Aimed at Reasoning, Agents, and Real-World Deployment

OpenAI has released GPT-OSS, a pair of open-weight language models designed for strong reasoning and agentic workflows—gpt-oss-120b and gpt-oss-20b—marking the company’s most significant “open” move since GPT-2. Both models are distributed under Apache 2.0 (with an accompanying GPT-OSS usage policy), positioning them for commercial use, customization, and local deployment.

What’s in the release

Two sizes, one family. The larger gpt-oss-120b targets top-tier reasoning; gpt-oss-20b is a lighter option for edge and on-prem use. OpenAI says 120b achieves near-parity with o4-mini on core reasoning benchmarks, while 20b performs similarly to o3-mini—a notable claim for open-weight models.
Hardware footprint. OpenAI highlights efficient operation for the 120b model (single 80 GB GPU) and 20b running with as little as 16 GB memory in edge scenarios, enabling local inference and rapid iteration without costly infrastructure.
Licensing & model card. The company published a model card and licensing details (Apache 2.0 + usage policy), clarifying intended use, evaluations, and limitations.

Why this matters

For years, OpenAI prioritized API-only access to frontier systems. GPT-OSS signals a strategic broadening toward open-weight distribution, meeting developers where they build—local, cloud, or hybrid—and competing more directly with leaders like Llama and DeepSeek. Early coverage underscores the shift: outlets note this is OpenAI’s first open-weight release since GPT-2 and frame it as both an ecosystem and competitive move.

Where you can run it (day one)

OpenAI launched with unusually wide partner support, making GPT-OSS easy to try in existing MLOps stacks:

Hugging Face: downloadable weights and a welcome post with implementation details.
AWS SageMaker JumpStart: curated deployment templates for OSS-20B/120B.
Azure AI Foundry & Windows AI Foundry: managed endpoints and tooling for fine-tuning and inference.
Databricks: native availability with 131k-context serving options and enterprise controls.
NVIDIA: performance tuning for GB200 NVL72 systems; NVIDIA cites up to ~1.5M tokens/sec rack-scale throughput for the 120B variant.

Developer ergonomics: Harmony & agents

OpenAI also published Harmony, a response format and prompt schema that GPT-OSS models are trained to follow. Harmony standardizes conversation structure, reasoning output, and function-calling/tool-use—useful for building agents that require predictable JSON and multi-step plans. If you’re serving via common runtimes (Hugging Face, vLLM, Ollama), the formatting is handled for you; custom servers can adopt the schema from the public repo.

Safety posture

OpenAI says GPT-OSS went through Preparedness Framework testing, including trials where a maliciously fine-tuned 120B model was evaluated for risky capabilities. The company reports that such variants did not reach high-capability thresholds, presenting a measured step forward in open-model safety practices.

How it stacks up (early read)

Early reports highlight the significance of the move and the headline performance claims—near-o4-mini for 120B and o3-mini-like results for 20B—alongside the practical win of local, customizable models under a permissive license. Analysts also point out the competitive context: GPT-OSS arrives as open-weight ecosystems (Llama, DeepSeek, Qwen, Kimi) surge in adoption.

What to build first

Agent backends that rely on structured tool use and local policy control (Harmony + Apache 2.0 helps here).
Sovereign/air-gapped deployments in regulated environments using on-prem GPUs or edge hardware, especially with the 20B model.
Cost-sensitive RAG and analytics where fine-tuning and local inference can beat per-token API economics—now supported across major clouds and MLOps platforms.

The takeaway

GPT-OSS is OpenAI’s clearest embrace of the open-weight ecosystem to date: credible reasoning performance, permissive licensing, broad partner availability, and practical tooling for agents. If your roadmap calls for customizable, locally deployable models with strong reasoning, GPT-OSS belongs on your shortlist—whether you’re targeting laptops, single-GPU servers, or GB200-class scale.

5.8.25

MLE-STAR: Google’s ML Engineering Agent Is Impressive—But Real-World Automation Still Needs Guardrails

Google Research just unveiled MLE-STAR, a machine-learning engineering agent that treats model building like a guided search-and-refine loop rather than a single shot of LLM codegen. The announcement (August 1, 2025) positions MLE-STAR as a state-of-the-art ML engineering agent capable of automating diverse tasks.

At a high level, the system does three things I really like:

Bootstraps from the web. Instead of relying purely on prior LLM knowledge (which often overfits to familiar libraries), MLE-STAR first uses web search to pull task-appropriate, modern model patterns and builds an initial solution from them. In other words, it goes looking for today’s best practice before writing code.
Refines the right part of the pipeline. Many agents rewrite whole scripts every iteration; MLE-STAR runs ablation studies to find the code block with the biggest performance impact (e.g., feature engineering vs. model vs. ensembling), then iteratively refines that block using feedback from prior runs. This targeted loop is far closer to how strong human MLEs work day-to-day.
Ensembles with intent. Rather than naive voting, the agent proposes and improves ensemble strategies to merge multiple candidate solutions into a single, better one.

The team also built pragmatic safety rails I’m thrilled to see in an autonomous coder: a debugging agent for traceback-driven fixes, a data-leakage checker to catch test-time contamination, and a data-usage checker so scripts don’t ignore provided modalities. These modules address common failure modes I’ve encountered with LLM-generated pipelines.

On benchmarks, the results are eye-catching. MLE-STAR won medals in ~63–64% of Kaggle competitions in MLE-Bench-Lite, a massive jump over prior agents; the blog cites 63.6% any-medal (with 36% gold), and the arXiv v2 reports 64%. Either way, it’s a big leap.

I also appreciate the ops mindset: there’s open-source code built with Google’s Agent Development Kit (ADK) so teams can reproduce the workflow and extend it.

Now, where I’m cautious:

Generalization. MLE-Bench-Lite is a valuable proxy, but medals on curated Kaggle tasks aren’t the same as long-lived production systems with shifting data, compliance constraints, and messy labels. The refinement loop may still need human “taste” to set success metrics and pick trade-offs (latency vs. accuracy, cost vs. recall). The paper itself stresses targeted refinement and web retrieval as the key innovations—not a claim that human MLEs are obsolete.
Licensing & provenance. Because the agent retrieves models and code from the web, verifying permissive licenses and acceptable usage is non-negotiable—Google explicitly flags MLE-STAR as research-only and expects users to check licensing of retrieved assets. That’s the right call, and enterprises should wire in policy checks before any auto-generated PRs land.
Evaluation drift. The ablation-guided focus is elegant, but it assumes your validation signal is representative. In many real datasets, weak labels or distribution shift can mislead the ablation and push the agent to overfit the “most impactful block.” Tight data splits and independent holdouts remain essential.

Bottom line: MLE-STAR advances the state of autonomous ML engineering—web-aware bootstrapping, ablation-driven targeted refinement, and smarter ensembling are exactly the techniques I want in an agentic MLE. I’m ready to use it as a co-engineer on well-scoped problems, with humans owning metrics, governance, and final review. If we pair this agent with robust eval harnesses and license compliance, the payoff could be faster iteration and stronger baselines—without losing the engineering discipline that production ML demands.

ReaGAN turns every node into an agent—with a plan, memory, and tools

Classical GNNs push messages with one global rule per layer—great for tidy graphs, brittle for messy ones. ReaGAN (Retrieval-augmented Graph Agentic Network) breaks that mold by treating each node as an autonomous agent that decides whether to aggregate locally, retrieve globally, predict now, or do nothing—based on its own memory and a plan drafted by a frozen LLM.

What’s new

Node-level autonomy. At every layer, a node queries the LLM for an action plan, executes it, and updates memory—no globally synchronized rulebook.
Local + global context. Beyond neighbors in the graph, nodes invoke RAG to retrieve semantically similar but structurally distant nodes, then fuse both sources.
Memory as glue. Nodes persist aggregated text snippets and few-shot (text, label) exemplars, enabling in-context prediction later.

Why it matters

Real-world graphs are sparse and noisy; uniform propagation amplifies junk. ReaGAN’s per-node planning and local-global retrieval adapt to informativeness imbalances and long-range semantics—key gaps in standard GNNs. In experiments, the authors report competitive few-shot performance using only a frozen LLM (no fine-tuning), highlighting a compute-friendly path for graph ML.

How it runs (at a glance)

Each node iterates a loop: perceive → plan → act (LocalAggregation / GlobalAggregation / Predict / NoOp) → update memory. A simple algorithmic skeleton formalizes the layer-wise cycle and action space.

Paper link: https://arxiv.org/pdf/2508.00429

4.8.25

The Agentic Web: when bots become the primary users of the internet

Search boxes and feeds defined the first two web eras. A new position paper proposes the third: the Agentic Web, where autonomous software agents—often LLM-powered—act on our behalf, coordinate with other agents, and execute long-horizon tasks across services. The authors offer a working definition and argue the shift is already visible in consumer assistants that can plan purchases and book reservations end-to-end.

A framework in three dimensions

The paper lays out a conceptual stack for this world: intelligence (reasoning, memory, planning), interaction (tools, APIs, multi-agent protocols), and economics (incentives, pricing, marketplaces). These dimensions, taken together, underpin capabilities like retrieval, recommendation, planning and collaboration that move beyond single-turn chat.

From retrieval to planning to coordination

Architecturally, the authors chart algorithmic transitions: user-issued queries give way to agentic retrieval; recommender systems evolve into agent planners; and isolated tools become multi-agent collectives able to decompose and delegate work. A worked example walks through agents co-planning a travel itinerary, highlighting orchestration and memory.

New pipes: MCP and agent-to-agent messaging

HTTP and RPC weren’t built for autonomous, negotiated workflows. The paper surveys emerging Model Context Protocol (MCP) interfaces and purpose-built agent-to-agent (A2A) messaging layers to support capability discovery, tool brokering and structured negotiations between services—foundational plumbing for an internet of bots.

The Agent Attention Economy

If algorithms once competed for human attention, services on the Agentic Web will compete to be selected by agents mid-plan. That reframes ranking, pricing and attribution around machine decision-makers—an attention market where tools, APIs and even other agents bid for inclusion in workflows.

What breaks (and who pays)

The authors predict “agent browsers” will disrupt today’s user-centric browsing model, shifting interfaces from manual clicks to delegated execution. They also flag a looming billing problem for complex, multi-step agent services that span providers and time windows—who gets paid, and how, when dozens of tools contribute to one outcome?

Risks, red teaming and defense

A full section maps threats across layers (prompt-/tool-injection, data exfiltration, compromised marketplaces), and compares human-in-the-loop versus automated red teaming for agent systems. The authors argue for hybrid approaches, inference-time guardrails, and controllable planning to keep autonomous workflows within safe bounds.

Why it matters

If the Agentic Web arrives, the primary “users” of the internet won’t be humans but agents negotiating with each other—demanding new protocols, marketplaces, governance and safety tooling. For startups, the opportunity is to build the pipes, policies and platforms that let those agents cooperate—and compete—reliably.

Paper link: arXiv 2507.21206 (PDF)

2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently.

Three “reasoning effort” modes, one model

Because the verifier is built-in, MetaStone-S1 exposes controllable thinking lengths—low, medium, high—implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick.

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land comparable to OpenAI o3-mini (medium)—with the high mode leading math by a sizable margin. Example table slice (Pass@1): AIME’24 85.2, AIME’25 73.6, LiveCodeBench 64.2, C-Eval 89.7 for MetaStone-S1-32B-high vs. o3-mini-medium 79.6 / 74.8 / 67.4 / 75.9.

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack.

Why this matters

Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong external TTS gains.
Label-free verifier training. The SPRM head learns step scoring from outcome signals, sidestepping costly, noisy process annotations.
Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers.
Open release. Code and checkpoints are public, inviting replication and adaptation.

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)

Computing Changes How We Think—But Creativity, Not Just GPUs, Will Decide AI’s Next Decade

In a wide-ranging Bloomberg interview, Dr. Wang Jian (founder of Alibaba Cloud) makes a forceful case that the era of AI “toy problems” is over. I agree. The last two years moved us from brittle demos to systems that reliably draft code, analyze documents, and support human decision-making. His analogy that more compute is like upgrading from a bicycle to a rocket is compelling: when the cost and scale of computation change, the feasible solution space—and our mental models—change with it.

Where I especially align is his view that markets are not just places to sell, but living testbeds where technology matures under real constraints. This resonates with best practices in ML ops: no benchmark, however well chosen, substitutes for deployment feedback. China’s dense competitive landscape, as he notes, creates short iteration loops—startups push features, rivals answer, users vote—accelerating collective learning. In ML terms, it’s a virtuous cycle of data, gradient steps, and evaluation at production scale.

I also appreciate his skepticism about tidy labels like AI → AGI → ASI. In practice, capability is a continuum: larger context windows, better tool use, richer memory, and planning—these blur categorical boundaries. Treating progress as increasing capability across tasks avoids false thresholds and keeps builders focused on measurable gains.

That said, I diverge on several points.

First, Dr. Wang downplays compute as a long-term bottleneck. I’m not fully convinced. While creativity and product insight absolutely dominate value creation, frontier training remains capital- and energy-intensive. Export controls, supply chain variability, and power availability still shape who can train or serve the most advanced models. For many labs, clever data curation and distillation help—but they don’t erase the physics and economics of scaling laws.

Second, on robotics, he frames AI as a new “engine” for an existing vehicle. Conceptually useful—but today’s embodied intelligence also requires tight integration across perception, control, simulation, and safety, not just swapping motors. Progress is real (foundation models for vision and language transfer surprisingly well), yet reliable grasping, long-horizon autonomy, and recovery from edge cases remain research frontiers. The “AI engine” metaphor risks underestimating those system-level challenges.

Third, the notion that no current advantage forms a durable moat is directionally optimistic and healthy for competition; still, moats can emerge from datasets with verified provenance, reinforcement-learning pipelines at scale, distribution, and compliance. Even if individual components commoditize, the orchestration (agents, tools, retrieval, evals, and workflow integration) can compound into real defensibility.

Finally, I agree with his emphasis that creativity is the scarcest input. Where I’d extend the argument is execution discipline: teams need evaluation harnesses, safety checks, and shipping cadences so creativity feeds a measurable loop. In other words, pair inspired ideas with ruthless metrics.

The upshot: Dr. Wang’s thesis—compute reshapes thinking, markets mature tech, creativity drives breakthroughs—captures much of what’s powering AI right now. My caveats don’t negate his vision; they refine it. The winners will be those who marry inventive product design with pragmatic engineering and acknowledge that, even in a marathon, hardware, data, and distribution still set the course.

Hierarchical Reasoning Model: a tiny, brain-inspired model that out-reasons giant CoT LLMs

Most frontier models “reason” by narrating token-by-token chains of thought. Sapient Intelligence’s Hierarchical Reasoning Model (HRM) argues you don’t need that narration—or billions of parameters—to solve hard puzzles. The 27 M-parameter model runs two coupled recurrent modules at different timescales (a slow H-module for abstract planning and a fast L-module for detailed computation) to perform deep latent reasoning in a single forward pass. Trained from scratch with no pretraining and no CoT supervision, HRM hits standout scores across inductive-reasoning and search-heavy tasks.

Why it works: depth without the usual pain

HRM’s core trick is hierarchical convergence: the fast L-module iterates to a local equilibrium, then the slow H-module updates once and “resets” context for the next refinement cycle—stacking many effective computation steps without vanishing into a fixed point. To train it efficiently, the authors derive a one-step gradient approximation that avoids backpropagation-through-time, cutting memory from O(T) to O(1) per sequence.

There’s also an adaptive halting head (a small Q-learner) that decides whether to stop or continue another reasoning segment, enabling “think-more-if-needed” behavior at inference time—useful when a problem demands longer planning.

The receipts

With roughly 1,000 training examples per task, HRM posts numbers that would make far larger CoT systems blush:

ARC-AGI-1: 40.3 %, beating o3-mini-high (34.5), Claude-3.7 8K (21.2) and DeepSeek-R1 (21.0); a Transformer trained directly on IO pairs manages 15.8.
ARC-AGI-2: HRM reaches 5.0 % where strong CoT baselines hover near zero—consistent with the benchmark’s step-up in compositional difficulty.
Sudoku-Extreme (9×9, 1k ex.): 55.0 % accuracy; on the full Sudoku-Extreme-Full (3.83 M puzzles), HRM approaches near-perfect accuracy.
Maze-Hard (30×30, 1k ex.): 74.5 % optimal-path success—where CoT baselines flatline.

What this means for builders

Latent > linguistic reasoning: HRM shows you can get deep, backtracking-style reasoning inside hidden states—no verbose CoT, fewer tokens, lower latency.
Tiny models, big compute depth: By recycling computation through nested recurrent cycles, HRM attains “depth” that standard Transformers don’t, even when you stack layers.
Knob for “thinking time”: The halting mechanism effectively scales compute at inference—handy for tasks like Sudoku where a few extra cycles pay off more than on ARC-style transformations.

Dataset & evaluation notes

Sudoku-Extreme combines easier Kaggle-style puzzles with community “forum-hard” sets; difficulty is measured by average backtracks (≈22 per puzzle on the new subset—much tougher than common datasets). Maze-Hard requires optimal 30×30 paths; ARC-AGI results follow the official challenge protocols with standard augmentations.

If subsequent open-sourced code (the paper links a GitHub repo) spurs replication, expect a wave of BPTT-free recurrent designs and “reason-more-on-demand” controls to show up in lightweight agents—especially where token budgets and latency matter more than eloquent chain-of-thoughts.

Paper link: arXiv 2506.21734 (PDF)

Stargate Norway: OpenAI’s First European AI Data Center Bets Big on Clean Power and Local Ecosystems

OpenAI has announced Stargate Norway, its first AI data center initiative in Europe, marking a major step in the company’s plan to place world-class compute closer to the communities that use it. The project debuts under the OpenAI for Countries program, which aims to pair national priorities with frontier-grade AI infrastructure. The announcement was posted on July 31, 2025.

The site will rise in Narvik, Norway, chosen for its abundant hydropower, cool climate, and established industrial base—factors that make it a compelling home for sustainable, at-scale AI. OpenAI frames Stargate Norway as “one of the most ambitious AI infrastructure investments in Europe to date,” designed to boost productivity and growth for developers, researchers, startups, and public bodies across the region.

Two heavyweight partners anchor the build: Nscale, an AI infrastructure provider with deployments across Europe and North America, and Aker, whose century-long industrial track record in energy makes it a natural fit. Nscale will design and build the facility, and ownership is expected to be a 50/50 joint venture between Nscale and Aker. OpenAI is positioned as an initial offtaker, with the option to scale usage over time through OpenAI for Countries.

On capacity, the numbers are striking: 230 MW at launch, with ambitions to add another 290 MW as demand grows. The plan targets 100,000 NVIDIA GPUs by the end of 2026, with room to expand significantly thereafter. For a continent grappling with surging AI workloads, that’s meaningful headroom—and a signal that sovereign compute is moving from rhetoric to reality.

Sustainability is built in, not bolted on. The facility will run entirely on renewable power and incorporate closed-loop, direct-to-chip liquid cooling for high thermal efficiency. Even better, waste heat from the GPU systems will be made available to local low-carbon enterprises, turning a by-product into regional value. This approach pairs performance with environmental responsibility in a way that European stakeholders have been demanding.

Crucially, OpenAI stresses that priority access will flow to Norway’s AI ecosystem—supporting homegrown startups and scientific teams—while surplus capacity will be available to public and private users across the UK, Nordics, and Northern Europe. That regional framing aims to accelerate Europe’s AI development while strengthening resilience and choice for organizations seeking high-end compute.

Stargate Norway follows Stargate UAE earlier this year and sits alongside OpenAI’s growing collaborations with European governments, including a recent MOU with the UK Government, partnerships in Estonia’s schools, and expressions of interest for the EU’s AI Gigafactories initiative. It’s part of a larger strategy to meet demand locally and support sovereign AI goals with credible infrastructure.

As an AI enthusiast, I see Stargate Norway as more than a data center—it’s an ecosystem commitment. By blending renewable energy, advanced cooling, heat-reuse, and regional access policies, OpenAI is sketching a blueprint for how frontier compute can serve communities, not just workloads. If Europe wants AI’s benefits widely shared, this is the kind of build that makes it possible.

1.8.25

Inside Gemini Deep Think: Google’s Gold-Medal Reasoning Engine with a 16-Minute Brain-Cycle

When Google DeepMind quietly flipped the switch on Gemini 2.5 Deep Think, it wasn’t just another toggle in the Gemini app. The same enhanced-reasoning mode had already notched a gold-medal-level score at the 2025 International Mathematical Olympiad (IMO)—solving five of six notoriously brutal problems and tying the human cutoff for gold. That feat put DeepMind shoulder-to-shoulder with OpenAI’s own experimental “gold-IMO” model, announced the very same week .

What makes the IMO special?

Founded in 1959, the IMO pits six pre-university prodigies from each country against six problems spanning algebra, geometry, number theory, and combinatorics. Every question is worth seven points, so 42 is perfection; a score of 35 secured this year’s gold cutoff. DeepMind’s best 2024 system managed silver, but needed more time than the four-and-a-half hours allotted to humans. In 2025, Deep Think achieved the same result within the human time window, using only plain-language prompts instead of formal proof assistants .

Under the hood: parallel minds at work

Deep Think is Gemini 2.5 Pro running in a multi-agent “parallel thinking” mode. Instead of one chain-of-thought, it spins up dozens, scores them against intermediate goals, and fuses the strongest ideas into a final answer. Google says the approach boosts benchmark scores for math, logic, and coding, at the cost of far longer inference times .

A field test from the transcript

In the YouTube walkthrough, the host pastes a 2025 IMO geometry problem into Deep Think. The clock ticks 16 minutes before the first full token arrives—but the model nails the official solution, listing the only valid values of k as 0, 1, 3. A second experiment on an AIME-25 algebra question takes 13 minutes yet again lands the correct answer (204) with detailed derivations. The lesson: breakthroughs come after a coffee break, not in real time.

Beyond math: voxel temples and half-baked Angry Birds

Deep Think’s slow-burn genius extends to generative tasks. Asked to script a colorful 3D “Sala Thai” pavilion in Three.js, the model architected a fully navigable voxel scene—complete with stylized roof eaves—on the first pass. A tougher challenge—re-creating Angry Birds in Pygame—showed its iterative potential: the first build lacked obstacles, but a follow-up prompt produced pigs, wood, glass, and workable physics. Still, each refinement added another ten-plus minutes to the wait.

When speed matters more than brilliance

Because Deep Think withholds partial streams until it has weighed all candidate thoughts, users stare at a blank screen for up to ten minutes. Google engineers admit the mode “isn’t practical for everyday coding” unless you fire a prompt and walk away—then return to review the answer or receive a push notification. For everyday tasks, plain Gemini 2.5 Pro or Flash-Lite may offer better latency-to-value ratios.

How to try it—and what’s next

Deep Think is already live for Gemini Ultra subscribers inside the consumer app, and Google says an API endpoint will roll out in the “next few weeks” to AI Studio and Vertex AI . Once that lands, developers can add a “deep-think” flag to long-form reasoning jobs—think automated theorem proving, contract analysis, or multi-step coding agents.

Bottom line: Gemini Deep Think proves massive parallel reflection can push public models into Olympiad territory, but it also shows there’s no free lunch—each extra IQ point costs time and compute. The next frontier won’t just be smarter LLMs; it will be orchestration layers that decide when a 16-minute think-tank is worth the wait and when a quick, cheaper model will do.