2.8.25

Stargate Norway: OpenAI’s First European AI Data Center Bets Big on Clean Power and Local Ecosystems

OpenAI has announced Stargate Norway, its first AI data center initiative in Europe, marking a major step in the company’s plan to place world-class compute closer to the communities that use it. The project debuts under the OpenAI for Countries program, which aims to pair national priorities with frontier-grade AI infrastructure. The announcement was posted on July 31, 2025.

The site will rise in Narvik, Norway, chosen for its abundant hydropower, cool climate, and established industrial base—factors that make it a compelling home for sustainable, at-scale AI. OpenAI frames Stargate Norway as “one of the most ambitious AI infrastructure investments in Europe to date,” designed to boost productivity and growth for developers, researchers, startups, and public bodies across the region. 

Two heavyweight partners anchor the build: Nscale, an AI infrastructure provider with deployments across Europe and North America, and Aker, whose century-long industrial track record in energy makes it a natural fit. Nscale will design and build the facility, and ownership is expected to be a 50/50 joint venture between Nscale and Aker. OpenAI is positioned as an initial offtaker, with the option to scale usage over time through OpenAI for Countries. 

On capacity, the numbers are striking: 230 MW at launch, with ambitions to add another 290 MW as demand grows. The plan targets 100,000 NVIDIA GPUs by the end of 2026, with room to expand significantly thereafter. For a continent grappling with surging AI workloads, that’s meaningful headroom—and a signal that sovereign compute is moving from rhetoric to reality. 

Sustainability is built in, not bolted on. The facility will run entirely on renewable power and incorporate closed-loop, direct-to-chip liquid cooling for high thermal efficiency. Even better, waste heat from the GPU systems will be made available to local low-carbon enterprises, turning a by-product into regional value. This approach pairs performance with environmental responsibility in a way that European stakeholders have been demanding. 

Crucially, OpenAI stresses that priority access will flow to Norway’s AI ecosystem—supporting homegrown startups and scientific teams—while surplus capacity will be available to public and private users across the UK, Nordics, and Northern Europe. That regional framing aims to accelerate Europe’s AI development while strengthening resilience and choice for organizations seeking high-end compute. 

Stargate Norway follows Stargate UAE earlier this year and sits alongside OpenAI’s growing collaborations with European governments, including a recent MOU with the UK Government, partnerships in Estonia’s schools, and expressions of interest for the EU’s AI Gigafactories initiative. It’s part of a larger strategy to meet demand locally and support sovereign AI goals with credible infrastructure. 

As an AI enthusiast, I see Stargate Norway as more than a data center—it’s an ecosystem commitment. By blending renewable energy, advanced cooling, heat-reuse, and regional access policies, OpenAI is sketching a blueprint for how frontier compute can serve communities, not just workloads. If Europe wants AI’s benefits widely shared, this is the kind of build that makes it possible.

1.8.25

Inside Gemini Deep Think: Google’s Gold-Medal Reasoning Engine with a 16-Minute Brain-Cycle

When Google DeepMind quietly flipped the switch on Gemini 2.5 Deep Think, it wasn’t just another toggle in the Gemini app. The same enhanced-reasoning mode had already notched a gold-medal-level score at the 2025 International Mathematical Olympiad (IMO)—solving five of six notoriously brutal problems and tying the human cutoff for gold. That feat put DeepMind shoulder-to-shoulder with OpenAI’s own experimental “gold-IMO” model, announced the very same week.

What makes the IMO special?

Founded in 1959, the IMO pits six pre-university prodigies from each country against six problems spanning algebra, geometry, number theory, and combinatorics. Every question is worth seven points, so 42 is perfection; a score of 35 secured this year’s gold cutoff. DeepMind’s best 2024 system managed silver, but needed more time than the four-and-a-half hours allotted to humans. In 2025, Deep Think achieved the same result within the human time window, using only plain-language prompts instead of formal proof assistants.

Under the hood: parallel minds at work

Deep Think is Gemini 2.5 Pro running in a multi-agent “parallel thinking” mode. Instead of one chain-of-thought, it spins up dozens, scores them against intermediate goals, and fuses the strongest ideas into a final answer. Google says the approach boosts benchmark scores for math, logic, and coding, at the cost of far longer inference times.
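
The fan-out-and-fuse pattern can be sketched in a few lines. This is a toy stand-in only: `generate_candidate` and `score` are hypothetical placeholders for the model call and the intermediate-goal scorer, not Google's implementation.

```python
import concurrent.futures

def generate_candidate(prompt: str, seed: int) -> str:
    """Stand-in for sampling one chain-of-thought (hypothetical)."""
    # Toy behaviour: higher seeds yield "stronger" candidates.
    return f"answer-{seed}" + "!" * seed

def score(candidate: str) -> float:
    """Stand-in for scoring a candidate against intermediate goals."""
    return float(len(candidate))

def deep_think(prompt: str, n_candidates: int = 8) -> str:
    # Fan out: sample many reasoning paths in parallel.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda s: generate_candidate(prompt, s),
                                   range(n_candidates)))
    # Fuse: here we simply keep the highest-scoring path; the real
    # system merges ideas across paths before answering.
    return max(candidates, key=score)

best = deep_think("2025 IMO Problem 3")
```

The expensive part is that every candidate path runs to completion before anything is shown, which is exactly why users see nothing until the full answer lands.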

A field test from the transcript

In the YouTube walkthrough, the host pastes a 2025 IMO geometry problem into Deep Think. The clock ticks 16 minutes before the first full token arrives—but the model nails the official solution, listing the only valid values of k as 0, 1, and 3. A second experiment on an AIME-25 algebra question takes 13 minutes yet again lands the correct answer (204) with detailed derivations. The lesson: breakthroughs come after a coffee break, not in real time.

Beyond math: voxel temples and half-baked Angry Birds

Deep Think’s slow-burn genius extends to generative tasks. Asked to script a colorful 3D “Sala Thai” pavilion in Three.js, the model architected a fully navigable voxel scene—complete with stylized roof eaves—on the first pass. A tougher challenge—re-creating Angry Birds in Pygame—showed its iterative potential: the first build lacked obstacles, but a follow-up prompt produced pigs, wood, glass, and workable physics. Still, each refinement added another ten-plus minutes to the wait.

When speed matters more than brilliance

Because Deep Think withholds partial streams until it has weighed all candidate thoughts, users stare at a blank screen for up to ten minutes. Google engineers admit the mode “isn’t practical for everyday coding” unless you fire a prompt and walk away—then return to review the answer or receive a push notification. For everyday tasks, plain Gemini 2.5 Pro or Flash-Lite may offer better latency-to-value ratios.

How to try it—and what’s next

Deep Think is already live for Gemini Ultra subscribers inside the consumer app, and Google says an API endpoint will roll out in the “next few weeks” to AI Studio and Vertex AI. Once that lands, developers can add a “deep-think” flag to long-form reasoning jobs—think automated theorem proving, contract analysis, or multi-step coding agents.


Bottom line: Gemini Deep Think proves massive parallel reflection can push public models into Olympiad territory, but it also shows there’s no free lunch—each extra IQ point costs time and compute. The next frontier won’t just be smarter LLMs; it will be orchestration layers that decide when a 16-minute think-tank is worth the wait and when a quick, cheaper model will do.



Wide Research: Manus Unleashes 100-Agent Parallel Processing for Lightning-Fast, Large-Scale Insight

 Manus—the Singapore-based startup behind the namesake autonomous AI agent—has flipped the research workflow on its head with Wide Research, a system-level mechanism that sends hundreds of parallel agents after every angle of a complex question. Whether you want a side-by-side on 500 MBA programs or a 360° scan of GenAI tools, Wide Research chews through the workload in a fraction of the time sequential agents would take. 


From Deep to Wide

Most “deep research” agents operate like meticulous librarians: a single high-capacity model crawls source after source, sequentially synthesising answers. It’s thorough—but agonisingly slow at scale. Wide Research replaces that linear approach with an agent-cluster collaboration protocol. Each sub-agent is a full Manus instance, not a narrow specialist, so any of them can read, reason and write. The orchestration layer splinters a task into sub-queries, distributes them, then merges the results into one coherent report. 
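
Conceptually, the splinter/distribute/merge loop looks like the sketch below. All names here (`decompose`, `sub_agent`, `wide_research`) are illustrative stand-ins; Manus's actual coordination protocol is proprietary.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(master_query: str, items: list[str]) -> list[str]:
    """Split the master query into one sub-query per item."""
    return [f"{master_query}: {item}" for item in items]

def sub_agent(sub_query: str) -> dict:
    """Stand-in for a full Manus instance working one sub-query."""
    return {"query": sub_query, "finding": f"summary of {sub_query!r}"}

def wide_research(master_query: str, items: list[str]) -> list[dict]:
    sub_queries = decompose(master_query, items)
    # Each sub-query gets its own worker, mirroring the agent cluster.
    with ThreadPoolExecutor(max_workers=len(sub_queries)) as pool:
        results = list(pool.map(sub_agent, sub_queries))
    # Merge step: the real aggregator synthesises one coherent report.
    return results

report = wide_research("compare GenAI startups", ["A", "B", "C"])
```

The key departure from "deep" agents is in the middle line: work is mapped across workers instead of looping through sources one at a time.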

Why general-purpose sub-agents matter

Traditional multi-agent designs hard-code roles—“planner,” “coder,” “critic.” Those rigid templates break when a project veers off script. Because every Wide Research worker is general-purpose, task boundaries dissolve: one sub-agent might scrape SEC filings, another might summarise IEEE papers, and a third could draft executive bullets—then hand the baton seamlessly. 


Inside the Architecture

| Layer | Function | Default Tech |
|---|---|---|
| Task Decomposer | Splits the master query into 100-plus granular prompts | LLM-based planner |
| Agent Fabric | Launches isolated, cloud-hosted Manus instances; scales elastically | K8s + Firecracker VMs |
| Coordination Protocol | Routes intermediate results, resolves duplicates, merges insights | Proprietary RPC |
| Aggregator & Formatter | Synthesises final doc, slides, or CSV | Manus core model |

The entire pipeline is asynchronous; users can park a query (“compare 1 000 stocks”) and return later to a ready-made dashboard—no tab babysitting required. 

Performance Snapshot

| Scenario | Deep-style Single Agent | Wide Research (100+ agents) |
|---|---|---|
| Analyse 100 sneakers for price, reviews, specs | ~70 min | < 7 min |
| Rank Fortune 500 by AI spend, ESG score | ~3 h | 18 min |
| Cross-compare 1 000 GenAI startups | Time-out | 45 min |

(Internal Manus demo data shown during launch.) 

Early Use Cases

  1. Competitive Intelligence – Product teams ingest hundreds of rival SKUs, markets and patents overnight.

  2. Financial Screening – Analysts filter thousands of equities or tokens with bespoke metrics—faster than spreadsheet macros can update.

  3. Academic Surveys – Researchers pull citations across disciplines, summarising 200+ papers into thematic clusters in a single afternoon.

Because Wide Research is model-agnostic, enterprises can plug in Anthropic Claude, Qwen, or local Llama checkpoints to meet data-sovereignty rules. 


Pricing & Roll-Out

  • Today: Wide Research is live for Pro subscribers (US $199/month).

  • Q3 2025: Gradual access for Plus and Basic tiers.

  • Future: Manus hints at an on-prem “WideKit” for regulated industries that can’t leave their firewall. 


Limitations & Trade-Offs

  • Compute Cost: Hundreds of VM-backed agents aren’t cheap; budget accordingly for very large jobs.

  • Cold-Start Results: Until sub-agents gather enough signal, early outputs can be uneven—iteration helps.

  • Benchmark Transparency: Manus hasn’t yet published formal speed/quality benchmarks vs. sequential baselines, though third-party analyses are emerging. 


The Bigger Picture

Wide Research is less a one-off feature than a proof-of-concept for “scaling laws of agentic AI.” Manus argues that throwing more capable agents—not merely larger context windows—can yield super-linear gains in throughput and idea diversity. It’s a thesis with broad implications for everything from autonomous coding swarms to AI-driven drug pipelines.

As parallel agent frameworks proliferate (think IBM’s MCP Gateway, Baidu’s AI Search Paradigm, Anthropic’s Claude tool plugins), context engineering and agent coordination will rival model size as the key levers of performance.


Key Takeaway

Wide Research reframes high-volume, messy analysis as a parallel rather than serial challenge—turning hours of manual slog into minutes of delegated computation. For teams drowning in data and deadlines, Manus just opened a wormhole to faster, broader insight—no prompt cajoling required.

31.7.25

X-Omni proves RL can make token-based image generators great again

 Diffusion may rule today’s text-to-image scene, but Tencent researchers just reminded everyone why discrete autoregressive models still matter. In a paper titled “X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again,” they show that a sprinkle of reward learning turns a 7 B LLM that predicts visual tokens into a Sora-class image engine—while natively sharing weights with language generation.

Three moving parts

| Module | Job | RL impact |
|---|---|---|
| Semantic image tokenizer | Converts 32 × 32 patch features into a 65 k-token vocabulary without vector-quantization blur. | Supplies denser reward signals than pixel-level losses. |
| Unified AR backbone | One transformer handles both language and image tokens; no diffusion head during training. | After SFT it over-fits, but RL fixes fidelity & instruction following. |
| Offline diffusion decoder | A lightweight “decompressor” turns token grids into crisp 1 K-px frames. | Keeps inference < 2 s on a single A100. |

Why reinforcement learning?

Supervised fine-tuning left the model with warped faces and garbled typography. Policy-gradient updates—rewarded for CLIP aesthetics, OCR accuracy and prompt adherence—steadily cleaned up artifacts and nailed complex layouts, something best-of-N sampling couldn’t match.
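
To make the mechanics concrete, here is a toy policy-gradient loop: a weighted blend of the three reward signals steers a 3-way token choice toward the highest-reward option. The weights and rewards are synthetic, and this uses the expected REINFORCE gradient for determinism; the paper's actual reward design and optimizer will differ.

```python
import numpy as np

def composite_reward(clip_score, ocr_acc, adherence,
                     weights=(0.4, 0.3, 0.3)):
    """Hypothetical weighted blend of the three reward signals."""
    return (weights[0] * clip_score
            + weights[1] * ocr_acc
            + weights[2] * adherence)

# Toy policy over 3 candidate tokens; theta are logits.
theta = np.zeros(3)
rewards = np.array([0.2, 0.9, 0.5])      # pretend per-token rewards

for _ in range(200):
    probs = np.exp(theta) / np.exp(theta).sum()
    baseline = rewards @ probs           # variance-reducing baseline
    # Expected REINFORCE gradient: raise the log-probability of
    # above-baseline tokens, lower it for below-baseline ones.
    theta += 0.1 * probs * (rewards - baseline)

# After training, probability mass concentrates on token 1,
# the best-rewarded choice.
```

Best-of-N sampling only reranks what SFT already produces; updates like the one above actually move the distribution, which is why RL could fix faces and typography where resampling could not.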

Early numbers worth noting

  • FID 1.7 on ImageNet-256 (beating DiT-XL by 9 %)

  • 99.2 % prompt compliance on the new LongText-Bench (Chinese + English captions up to 120 chars)

  • 3.5× faster than diffusion baselines at 1024 × 1024 when streaming tokens with Flash-Attn 3.0

  • < 8.5 GB VRAM for a distilled 1.3 B variant (coming soon, according to the repo)

Why it matters

  1. Unified model, unified budget – No separate diffusion tower; language and image share the same 7 B weights, making deployment simpler and cheaper.

  2. Long-text rendering solved – Posters, UI mock-ups and meme creators finally get reliable lettering without kludgy diffusion guidance.

  3. Open everything – Code, checkpoints and the 200-prompt LongText-Bench live on GitHub under Apache-2.0. Fine-tune away.

The bigger picture

Until now, researchers had mostly written off discrete AR image models as artifacts-prone hold-overs from DALL·E 1. X-Omni flips that narrative: with the right reward design, token predictors can match (and in text rendering, beat) diffusion’s photorealism while keeping the door open for seamless language–vision fusion and future any-to-any generation. Expect a resurgence of AR tokenizers, LoRA packs for brand fonts, and perhaps a new front in the multimodal model wars.

Paper link: arXiv 2507.22058 (PDF)

From Tedious Edits to Autonomous IDEs: How Kiro’s AI Agent Hooks Turbo-Charge Your Dev Workflow

 The modern codebase is a living organism: files mutate, requirements shift, tests trail behind and docs go stale. Kiro—the Amazon-backed, Claude-powered IDE—thinks the fix is automation that lives inside your editor. On July 16 2025 the team introduced Agent Hooks, a rules-engine plus AI copilot that fires the moment you hit “save” or merge a pull request.

What exactly is an Agent Hook?

Each hook couples a trigger (file edit, creation, deletion, or even a manual slash-command) with an AI action such as “update the related unit tests” or “refresh my README”. Unlike brittle shell scripts, the action is described in plain English and executed by a Claude-class agent that understands project context. The result feels less like CI glue and more like a junior dev who never sleeps.

Five headline benefits

  1. Natural-language config – type “Whenever I touch *.py, update the matching test_*.py” and the hook YAML writes itself.

  2. Context-aware reasoning – the agent sees your entire workspace, so it can refactor imports or respect custom test frameworks.

  3. Real-time execution – actions run instantly, keeping flow intact instead of kicking chores to a nightly job.

  4. Shareable recipes – hook files live in .kiro/hooks, so teams version them like code and inherit automation on git pull.

  5. Stack-agnostic events – docs list triggers for save, create, delete, plus a user-initiated option for ad-hoc tasks.

Building your first hook in three clicks

Open Kiro’s sidebar, hit “Agent Hooks ➕”, and either select a template or just describe what you need. The UI scaffolds a config you can fine-tune—patterns, prompt, and whether it auto-runs or waits for manual confirmation. Behind the scenes, Kiro writes a .kiro.hook file so you’re always one git diff away from auditing the logic.
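
Stripped to its essence, a hook is a trigger pattern bound to a prompt. The registry below is purely illustrative (the dictionaries and `fire` helper are invented for this sketch; Kiro's real .kiro.hook format is richer):

```python
import fnmatch

# Illustrative trigger -> prompt registry, not Kiro's actual schema.
HOOKS = [
    {"event": "save", "pattern": "*.py",
     "prompt": "Update the matching test_*.py for this change."},
    {"event": "save", "pattern": "docs/*.md",
     "prompt": "Check that code samples in this doc still compile."},
]

def fire(event: str, path: str) -> list[str]:
    """Return the AI prompts that a file event should trigger."""
    return [h["prompt"] for h in HOOKS
            if h["event"] == event and fnmatch.fnmatch(path, h["pattern"])]

prompts = fire("save", "utils.py")
```

In Kiro the matched prompt goes to the context-aware agent rather than a shell, which is what separates a hook from a plain file watcher.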

Real-world recipes

  • Test synchroniser – Every Python edit triggers the agent to inspect changes and regenerate the paired test module, ensuring 100 % coverage drifts aren’t ignored.

  • Doc updater – Modify a public API and the hook patches your Markdown docs so onboarding guides never lag behind shipping code.

  • Git concierge – On commit, a hook can draft a concise changelog entry and polish the commit message to match team conventions.

  • I18N helper – Save a UI string file and watch the agent push auto-translations to language packs.

Best-practice tips

Start small—a single file pattern and a succinct prompt—then iterate by reading the hook execution history shown in Kiro’s chat pane. Give the agent richer guidance (“follow Google Python Style”) and reference project docs inside the prompt for tighter alignment. Finally, commit hooks so teammates inherit them; over time your repo becomes a cookbook of living automation rules the whole squad benefits from.

Why this matters

Developers already rely on AI for autocomplete and chat, but those tools are reactive—you ask, they answer. Agent Hooks flip the script to proactive assistance that runs without explicit prompts, erasing the cognitive tax of context switching. In a world of sprawling microservices and relentless release cadences, the ability to delegate routine upkeep to an always-on agent is a genuine force multiplier.

Kiro doesn’t claim to replace developers; it aims to amplify craftsmanship by letting humans stay in the creative loop while machines patrol the trenches. If your backlog is clogged with “fix tests” and “update docs” tickets, Agent Hooks might be the invisible intern you’ve been wishing for. Install Kiro, write your first hook, and watch housekeeping melt away—one automated trigger at a time.

AlphaEarth Foundations: Google DeepMind’s “Virtual Satellite” Sets a New Baseline for Planet-Scale Mapping

 

A virtual satellite built from data

On July 30 2025, Google DeepMind unwrapped AlphaEarth Foundations, an AI model that ingests optical, radar, lidar and climate-simulation feeds and distills them into a single 64-dimensional “embedding field” for every 10 × 10 meter patch of terrestrial land and coastal waters. Think of it as a software satellite constellation: instead of waiting for the next orbital pass, analysts query a unified representation that already encodes land cover, surface materials and temporal change. 

How it works

AlphaEarth tackles two long-standing headaches—data overload and inconsistency. First, it merges dozens of public observation streams, weaving them into time-aligned “video” frames of the planet. Second, it compresses those frames 16× more efficiently than previous AI pipelines, slashing storage and compute for downstream tasks. Each embedding becomes a compact, loss-aware summary that models can reason over without re-processing raw pixels. 
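
Once the planet is an embedding field, common queries reduce to vector math. The sketch below finds a patch's nearest neighbours by cosine similarity over synthetic 64-dimensional vectors (the real embeddings ship via Google Earth Engine; these numbers are random stand-ins):

```python
import numpy as np

# Toy embedding field: one 64-d unit vector per 10 m x 10 m patch.
rng = np.random.default_rng(42)
field = rng.normal(size=(1000, 64))
field /= np.linalg.norm(field, axis=1, keepdims=True)

def most_similar(query_idx: int, k: int = 5) -> np.ndarray:
    """Indices of the k patches most similar to a query patch."""
    sims = field @ field[query_idx]      # cosine similarity (unit vectors)
    order = np.argsort(-sims)
    return order[1:k + 1]                # skip the query patch itself

neighbours = most_similar(0)
```

This is the workflow the compression enables: a similarity search over compact vectors replaces re-processing raw multi-sensor pixels for every question.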

A leap in accuracy and efficiency

In head-to-head evaluations spanning land-use, surface-property and seasonal-change tasks, AlphaEarth posted a 24 % lower error rate than both classical remote-sensing methods and recent deep-learning baselines. Crucially, it excelled when label data was sparse—proof that its self-supervised pre-training truly generalises. The accompanying research paper on arXiv highlights consistent out-performance across “diverse mapping evaluations” without fine-tuning. 

From blog post to real-world maps

To jump-start adoption, DeepMind and Google Earth Engine released the Satellite Embedding dataset: annual global snapshots containing 1.4 trillion embedding footprints per year. More than 50 organisations—including the UN’s Food and Agriculture Organisation, MapBiomas, the Global Ecosystems Atlas and Stanford University—are already piloting projects that range from rainforest monitoring to precision agriculture. Users report faster map production and higher classification accuracy, even in cloudy tropics or sparsely imaged polar regions. 

Why it matters for climate and beyond

Accurate, up-to-date geospatial data underpins decisions on food security, infrastructure and conservation. Yet researchers often juggle incompatible satellite products or wrestle with GPU-hungry vision models. AlphaEarth shrinks that friction: a single API call retrieves embeddings that are both information-dense and provenance-rich, ready for plug-and-play into GIS tools, LLM agents or custom model fine-tunes. Cheaper storage and lower latency also mean national agencies with modest budgets can now run continent-scale analyses weekly instead of yearly. 

The road ahead

DeepMind hints at extending the framework to real-time streams and coupling it with Gemini-class reasoning agents capable of answering open-ended “why” and “what-if” questions about Earth systems. For AI builders, the combination of long-context language models and AlphaEarth embeddings could enable chatbots that diagnose crop stress or forecast urban heat islands—all grounded in verifiable pixels.

Bottom line: AlphaEarth Foundations compresses the planet into a query-ready lattice of vectors, handing scientists, policymakers and hobbyist mappers a new lens on Earth’s shifting surface. With open data, documented gains and an Apache-style license, DeepMind has effectively democratized a planetary observatory—one 10-meter square at a time.

“Everyone’s AI”: MiniMax CEO Junjie Yan Reimagines the AI Economy at WAIC 2025

 The opening morning of the World Artificial Intelligence Conference 2025 (WAIC) in Shanghai was buzzing with hardware demos and multimodal avatars, yet the moment that set the tone for the three-day summit was a keynote titled “Everyone’s AI.” Delivered by MiniMax founder & CEO Junjie Yan, the talk argued that artificial intelligence is no longer a sidecar to the internet economy—it is becoming the primary productive force. 

From research toy to societal engine

Yan traced a 15-year personal journey in AI research, noting that tasks once handled by junior engineers—code writing, data annotation, even literature review—are now 70 % automated inside MiniMax. The implication is stark: as models grow more capable, human attention shifts from mechanical chores to creative orchestration. “AI can now write the software that analyzes the data we used to comb through by hand,” he observed, positioning large models as multipliers of both knowledge work and imagination. 

The economics: another 10× drop on the horizon

MiniMax isn’t just waxing philosophical; it is betting on cost curves. Yan predicted that inference prices for top-tier models will fall another order of magnitude within two years, echoing the steep declines seen in 2024–25. Cheaper inference, he argued, is the real catalyst for mass adoption—unlocking agentic workflows that might consume millions of tokens per session without breaking budgets. 

Many models, many values

Contrary to fears of an AI monoculture, Yan expects plurality to define the market. Alignment targets diverge—one model may optimize for programming accuracy, another for empathetic conversation—so “there will definitely be multiple players,” he insisted. Open-source ecosystems, now approaching closed-source performance, reinforce that trend. 

Multi-agent systems change the rules

Inside MiniMax’s own products—Conch AI for voice, M-Series for reasoning, and the new MiniMax-M1 hybrid model—multi-agent architectures are displacing single-model pipelines. In such systems, the marginal advantage of any one model shrinks, while orchestration and tool-use matter more. That, Yan believes, will democratize expertise: startups armed with well-designed agent swarms can challenge giants who merely scale parameters. 

A less money-burning industry

Dropping costs and smarter experiment design mean AI R&D need not be an endless bonfire of GPUs. MiniMax’s internal stats show 90 % of routine data analysis already handled by AI, freeing researchers to pursue “genius ideas” that compound returns faster than raw compute. If training becomes less capital-intensive and inference goes bargain-basement, the barriers to entry for niche models and vertical agents collapse. 

“Everyone’s AI” as call to action

Yan closed by reframing access as both economic necessity and moral imperative: AGI, when achieved, should belong to multiple companies and a broad user base—not a solitary gatekeeper. He tied the mission to a Chinese proverb about unleashing creativity: lower thresholds ignite countless sparks. For a conference that also featured Geoffrey Hinton warning about rogue super-intelligence, MiniMax’s pitch provided a complementary optimism grounded in unit economics and open ecosystems.

Why it matters

The keynote crystallizes a broader shift in 2025: value is migrating from parameter counts to deployment fluency, from cloud monopolies to community forks, and from eye-watering API bills to near-frictionless inference. If Yan’s forecast holds, the next two years could see AI agents embedded in every workflow—powered by models cheap enough to run continuously and diverse enough to reflect local values. In that future, “Everyone’s AI” is not a slogan; it is table stakes.

LangExtract: Google’s Gemini-Powered Library That Turns Raw Text into Reliable Data

 

A new way to mine insight from messy text

On July 30 2025 the Google Developers Blog unveiled LangExtract, an open-source Python package that promises to “unlock the data within” any text-heavy corpus, from clinical notes to customer feedback threads. Built around Gemini models but compatible with any LLM, the project aims to replace brittle regex pipelines with a single declarative interface for extraction, visualization and traceability. 

Why LangExtract stands out

LangExtract combines seven features that rarely appear together in one tool:

  1. Precise source grounding – every entity you pull out is linked back to its exact character span in the original document, so auditors can see where a value came from.

  2. Schema-enforced outputs – you describe the JSON you want, add a few examples, and the library leverages Gemini’s controlled generation to keep responses on-spec.

  3. Long-context optimisation – chunking, parallel passes and multi-stage recall tame “needle-in-a-haystack” searches across million-token inputs.

  4. Interactive HTML visualisation – one command turns results into a self-contained page where extractions glow inside the source text.

  5. Flexible back-ends – swap Gemini for on-device Ollama models or any OpenAI-compatible endpoint.

  6. Domain agnosticism – the same prompt-plus-examples recipe works for finance, law, medicine or literature.

  7. Apache-2.0 licence – no gating, just pip install langextract.
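
The source grounding in point 1 boils down to character spans. This stdlib sketch shows the idea only (the `ground` helper is invented here; LangExtract computes and visualises these spans itself):

```python
def ground(text: str, extractions: list[str]) -> list[dict]:
    """Tie each extracted value back to its exact character span."""
    grounded = []
    for value in extractions:
        start = text.find(value)
        if start != -1:
            grounded.append({"value": value,
                             "span": (start, start + len(value))})
    return grounded

play = "ROMEO: But, soft! what light through yonder window breaks?"
result = ground(play, ["ROMEO", "light"])
```

Because every value carries its offsets, an auditor can slice the original document and verify the extraction verbatim, which is what makes the approach viable for clinical and legal text.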

How it works in practice

A “quick-start” script pulls Shakespeare characters, emotions and relationships in about a dozen lines of code, then writes an interactive HTML overlay showing each extraction highlighted inside the play. The same pattern scales: push the full Romeo and Juliet text through three extraction passes and LangExtract surfaces hundreds of grounded entities while keeping recall high.

The GitHub repository already counts 200+ stars less than a week after launch, and ships with examples for medication extraction and structured radiology reporting—fields where provenance and accuracy are critical. A live Hugging Face demo called RadExtract shows the library converting free-text X-ray reports into structured findings, then color-coding the original sentences that justify each data point. 

Under the hood: Gemini plus controlled generation

When you pass model_id="gemini-2.5-flash" (or -pro for harder tasks), LangExtract automatically applies Google’s controlled generation API to lock output into the schema you defined. That means fewer JSON-parse errors and cleaner downstream pipelines—something traditional LLM calls often fumble. For massive workloads, Google recommends a Tier-2 Gemini quota to avoid rate limits. 

Why developers should pay attention

Information extraction has long oscillated between hand-tuned rules (fast but brittle) and heavyweight ML pipelines (accurate but slow to build). LangExtract offers a third path: prompt-programming simplicity with enterprise-grade traceability. Because it’s open-source, teams can audit the chain of custody and fine-tune prompts to their own compliance rules instead of black-box vendor filters.

Whether you’re structuring earnings calls, tagging sentiment in product reviews, or mapping drug-dosage relationships in EMRs, LangExtract turns unreadable text into queryable data—without sacrificing transparency. For AI enthusiasts, it’s also a practical showcase of what Gemini’s long-context and schema-control features can do today.

Bottom line: install the package, craft a clear prompt, add a few gold examples, and LangExtract will handle the rest—from parallel chunking to an HTML dashboard—so you can move straight from raw documents to actionable datasets.

30.7.25

ChatGLM’s GLM‑4 family levels up—and brings its own toolbox

 Tsinghua‑spun Zhipu AI has spent three years iterating on ChatGLM, a Chinese‑English rival to GPT. Its new report zooms in on the GLM‑4 series, a trio that stretches from a data‑center‑class behemoth to a 9 B‑parameter fine‑tune you can run at home. The headline: GLM‑4 “closely rivals or outperforms GPT‑4” on marquee leaderboards—while an All Tools variant autonomously fires up external apps to finish harder jobs. 

Under the hood

| Piece | Why it matters |
|---|---|
| 10 T‑token corpus (Chinese & English‑heavy, 24 other languages) | Gives the model near‑par bilingual parity—something GPT‑4 still chases in Chinese. |
| Multi‑stage alignment (SFT → RLHF) | Drives instruction following to GPT‑4‑Turbo levels on IFEval without bloating answers. |
| All Tools post‑training | Lets GLM‑4 decide if a prompt needs web search, Python, text‑to‑image, or any user‑defined API—no manual tool triggers. |
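
The All Tools behaviour amounts to learned routing: the model itself picks web search, Python, or image generation per prompt. The trivial keyword router below is purely conceptual (GLM-4's routing is learned end-to-end, not rule-based):

```python
def choose_tool(prompt: str) -> str:
    """Conceptual stand-in for GLM-4's learned tool selection."""
    p = prompt.lower()
    if any(w in p for w in ("draw", "image", "picture")):
        return "text-to-image"
    if any(w in p for w in ("compute", "solve", "integral")):
        return "python"
    if any(w in p for w in ("latest", "news", "today")):
        return "web-search"
    return "direct-answer"

tool = choose_tool("Solve the integral of x^2 from 0 to 1")
```

The point of the post-training stage is to fold this routing decision into the same checkpoint that answers the question, so no orchestration layer has to guess for it.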


The SKUs

  • GLM‑4 – flagship ~130 B active params, 128 K context, up to 1 M with sparse attention.

  • GLM‑4‑Air – latency‑trimmed 34 B variant tuned for GPU serving.

  • GLM‑4‑9B / 9B‑Chat – consumer‑grade checkpoint (128 K / 1 M context) already live on Hugging Face.

Scorecard highlights

  • General reasoning: beats or ties GPT‑4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval. 

  • Chinese alignment: tops GPT‑4 on AlignBench. 

  • Long context: matches GPT‑4‑Turbo 128 K and Claude 3 at 256 K spill‑tests. 

  • Tool use: in dev‑set trials, GLM‑4 All Tools edges GPT‑4 All Tools in web‑info retrieval and Python‑powered math.

Why it matters

  1. Bilingual crown – China finally has an open(-ish) model that doesn’t trade English chops for Mandarin mastery.

  2. Tool autonomy – A single checkpoint that chooses whether to browse, code or draw marks a step toward plug‑and‑play agent workflows.

  3. Open‑source momentum – Previous ChatGLM releases logged 10 M+ Hugging Face downloads in 2023; GLM‑4‑9B is expected to super‑charge that hobbyist wave. 

Rapid timeline of the GLM ecosystem

(Timeline figure omitted.) The paper’s timeline shows an 18‑month sprint from GLM‑130B to GLM‑4‑All Tools, with side quests into code (CodeGeeX), vision (GLM‑4V‑9B) and agents (AutoWebGLM).

The road ahead

Zhipu AI hints at an MoE‑style GLM‑5 and deeper tool libraries (SQL, vector search, proprietary APIs). For builders already juggling browser calls, Python sandboxes and image pipes, GLM‑4 All Tools may offer a cleaner, unified brain—especially if your product needs to speak both English and Mandarin with equal poise.

Paper link: arXiv 2406.12793 (PDF)

 DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” model...