Showing posts with label SWE-bench.

12.8.25

GLM-4.5 wants to be the open-source workhorse for agents, reasoning, and code

 Zhipu AI just dropped GLM-4.5, a Mixture-of-Experts LLM built to juggle three hard modes at once: agentic tasks, deep reasoning, and real-world coding. The headline specs: 355B total parameters with 32B active per token, a 23-trillion-token training run, and a hybrid reasoning switch that flips between “think-out-loud” and terse answers based on task demands. There’s also a slimmer GLM-4.5-Air (106B/12B active) for teams who can’t babysit a mega-model. 
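
For readers new to Mixture-of-Experts, the sketch below shows the basic top-k routing idea in PyTorch: each token is routed to only a couple of experts, which is how a 355B-parameter model can activate roughly 32B parameters per token. The expert count, layer sizes, and top-k value here are made up for illustration; this is not Zhipu’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts FFN (not GLM-4.5's actual code).

    Each token is routed to `top_k` of `num_experts` feed-forward experts,
    so only a fraction of the layer's parameters is touched per token.
    """
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        gate_logits = self.router(x)             # (tokens, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

moe = ToyTopKMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token
```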

Why it stands out

  • ARC (Agentic, Reasoning, Coding) trifecta focus. Across 12 benchmarks, GLM-4.5 places #3 overall and #2 on agentic suites, with marquee scores like 91.0 on AIME’24, 64.2 on SWE-bench Verified, and 70.1 on TAU-Bench. It also reports 26.4 on BrowseComp for web agents, near OpenAI’s o4-mini-high in the authors’ runs. 

  • Parameter-efficient MoE. Compared to some giant peers, GLM-4.5 keeps active parameters modest while going deeper: more layers, 96 attention heads, partial RoPE, QK-Norm, and a built-in MTP (multi-token prediction) layer for speculative decoding. 

  • Hybrid reasoning as a product feature. Both GLM-4.5 and Air support thinking (for complex tool use) and non-thinking (instant replies) modes from the same checkpoint. 
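
If you want to try the two modes, the snippet below is a rough sketch using Hugging Face transformers. The `enable_thinking` template flag is an assumption modeled on how other hybrid-reasoning checkpoints expose the switch, and the repo id should be checked against the official model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id as listed on Hugging Face; verify against the official model card.
model_id = "zai-org/GLM-4.5-Air"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Plan the steps to fix a failing CI job."}]

# ASSUMPTION: a chat-template flag toggles "thinking" vs. instant replies.
# The real flag name/mechanism may differ -- consult the model card.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,          # False => terse, non-thinking replies
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```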

The training recipe (quick hits)

A two-stage pretraining-plus-mid-training stack mixes high-quality web, multilingual, code, and math/science data, then adds repo-level code, synthetic reasoning, 128K-token long-context material, and agent trajectories to push real software-engineering and planning skills. Post-training distills expert Reasoning, Agent, and General models into one hybrid generalist, followed by targeted RL (including a “pathology RL” cleanup pass). 

What you can actually download

Zhipu has published code, evals, and model cards on GitHub; weights are also listed on Hugging Face. The team pitches GLM-4.5 as agent-first and ships a simple eval harness to reproduce scores. 

Bottom line

Open-source has plenty of great single-skill models. GLM-4.5 is aiming for a different bullseye: one backbone that can browse, reason, and patch code without feeling second-tier. If the reported ARC numbers hold up in the wild, this could become the go-to open checkpoint for production-grade agents.

Paper link: arXiv 2508.06471 (PDF)

26.5.25

The 3 Biggest Bombshells from Last Week’s AI Extravaganza

The week of May 23, 2025, marked a significant milestone in the AI industry, with major announcements from Microsoft, Anthropic, and Google during their respective developer conferences. These developments signal a transformative shift in AI capabilities and their applications.

1. Microsoft's Push for Interoperable AI Agents

At Microsoft Build, the company announced support for the Model Context Protocol (MCP), a standard that lets AI agents communicate even when they are built on different large language models (LLMs). Originally developed by Anthropic in November 2024, MCP is now integrated into Microsoft’s Azure AI Foundry, enabling developers to build AI agents that interact seamlessly and paving the way for more cohesive, efficient AI-driven workflows. 
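
To make the protocol concrete, here is the shape of a single MCP tool invocation on the wire: a JSON-RPC 2.0 request from the agent (client) to an MCP server. Only the envelope comes from the MCP spec; the tool name and arguments below are hypothetical.

```python
import json

# JSON-RPC 2.0 envelope used by MCP; the "search_tickets" tool and its
# arguments are hypothetical examples, not part of the spec itself.
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_tickets",
        "arguments": {"query": "build failures", "limit": 5},
    },
}

# A client typically first discovers what the server exposes, then calls a tool.
list_tools_request = {"jsonrpc": "2.0", "id": 0, "method": "tools/list"}

print(json.dumps(tool_call_request, indent=2))
```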

2. Anthropic's Claude 4 Sets New Coding Benchmarks

Anthropic unveiled Claude 4, including its Opus and Sonnet variants, surprising the developer community with its enhanced coding capabilities. Notably, Claude 4 achieved a 72.5% score on the SWE-bench software engineering benchmark, surpassing OpenAI's o3 (69.1%) and Google's Gemini 2.5 Pro (63.2%). Its "extended thinking" mode allows for up to seven hours of continuous reasoning, utilizing tools like web search to tackle complex problems. 

3. Google's AI Mode Revolutionizes Search

During Google I/O, the company introduced AI Mode for its search engine, integrating the Gemini model more deeply into the search experience. Employing a "query fan-out technique," AI Mode decomposes user queries into multiple sub-queries, executes them in parallel, and synthesizes the results. Previously limited to Google Labs users, AI Mode is now being rolled out to a broader audience, potentially reshaping how users interact with search engines and impacting SEO strategies.
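
Google has not published the implementation, but the described fan-out pattern is straightforward to sketch: split a query into sub-queries, run them concurrently, and synthesize the results. Every function below is a hypothetical stand-in for what the Gemini model and search backend would do, not Google’s code.

```python
import asyncio

async def decompose(query: str) -> list[str]:
    # Hypothetical: in AI Mode this decomposition would be done by the Gemini model.
    return [f"{query} overview", f"{query} recent news", f"{query} comparisons"]

async def run_subquery(subquery: str) -> str:
    # Hypothetical search backend call; here we just simulate latency.
    await asyncio.sleep(0.1)
    return f"results for: {subquery}"

async def synthesize(query: str, results: list[str]) -> str:
    # Hypothetical: the model would merge sub-results into one answer.
    return f"answer to '{query}' built from {len(results)} parallel searches"

async def ai_mode(query: str) -> str:
    subqueries = await decompose(query)
    results = await asyncio.gather(*(run_subquery(s) for s in subqueries))
    return await synthesize(query, list(results))

print(asyncio.run(ai_mode("best electric bikes")))
```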

 Most “agent” papers either hard-code reflection workflows or pay the bill to fine-tune the base model. Memento offers a third path: keep t...