Showing posts with label Alibaba Cloud. Show all posts
Showing posts with label Alibaba Cloud. Show all posts

23.7.25

Qwen3‑Coder: Alibaba’s 480‑B Agentic Code Model Aims for One‑Million‑Token Repos

 When Alibaba’s Qwen research group dropped the link to “Qwen3‑Coder: Agentic Coding in the World,” AI Twitter lit up in minutes. The post introduces Qwen3‑Coder‑480B‑A35B‑Instruct, a gargantuan 480‑billion‑parameter Mixture‑of‑Experts (MoE) language model in which only 35 B parameters activate per token, making deployment far leaner than raw size suggests. Released on July 22, 2025 with permissive access points on GitHub, Hugging Face, and ModelScope, the model claims state‑of‑the‑art results in agent‑style coding and tool use—rivaling Anthropic’s Claude 4 Sonnet while remaining fully open‑weight. 

Architecture built for truly big code

The Qwen team doubled down on “scaling in three dimensions.” First, tokens: 7.5 T training tokens with a hefty 70 % code ratio to anchor programming skill while preserving math and general reasoning. Second, context: the model handles a native 256 K‑token window and can stretch to 1 M tokens using YaRN extrapolation, making whole‑repository prompts or week‑long chat traces finally practical. Third, synthetic data: Qwen2.5‑Coder was used to rewrite noisy corpora, boosting baseline cleanliness before fine‑tuning even starts. 

Reinforcement learning at industrial scale

Rather than stopping at supervised fine‑tune, Qwen3‑Coder undergoes two novel RL phases. “Scaling Code RL” turns automated unit‑test generation into millions of execution‑checked training rounds—improving code‑run accuracy and even general abilities. Then comes Agent RL, where 20 000 parallel cloud environments simulate real SWE‑Bench tickets. The model learns to plan, invoke tools, and iterate until tests pass, producing best‑in‑class scores on SWE‑Bench Verified without any test‑time tricks. 

Benchmarks and agentic chops

Early numbers show Qwen3‑Coder topping every open‑source competitor on Agentic Coding, Agentic Browser‑Use, and Agentic Tool‑Use tracks; Alibaba positions it as “comparable to Claude Sonnet 4” in practical autonomy. In short, it doesn’t just spit snippets—it reasons across multi‑file repos, calls compilers, and revises until green checks appear. For developers chasing fully automated pull‑request bots, that’s a milestone. 

Meet Qwen Code—your command‑line copilot

To make those agentic skills tangible, the team open‑sourced Qwen Code, a Node‑based CLI forked from Gemini CLI. With a one‑line npm i -g @qwen-code/qwen-code, users gain a prompt‑driven shell that speaks directly to Qwen3‑Coder via an OpenAI‑compatible endpoint. Prefer other tooling? The blog shows drop‑in guides for Claude Code, Cline, and generic REST calls, so the model can slot into VS Code, Git hooks, or CI pipelines in minutes. 

Why it matters

Qwen3‑Coder is more than another “bigger‑is‑better” headline. By combining MoE efficiency, million‑token context, and reinforcement learning tuned for agent workflows, Alibaba delivers a bridge between research hype and developer reality. Hobbyists with a single A100 can experiment with 256 K‑token coding agents, while enterprises get an Apache‑friendly alternative to closed, usage‑metered APIs. For AI enthusiasts, it’s an invitation: wire up Qwen3‑Coder to your build system, hand it a failing test, and watch an open model patch your codebase—all without leaving the command line. The age of end‑to‑end agentic coding just took a decisive step forward. 

22.7.25

Qwen3-235B-A22B-Instruct-2507: Alibaba’s New Open-Weight Flagship Redefines Efficient Megamodels

 When the Qwen team hit “post” on X announcing Qwen3-235B-A22B-Instruct-2507—plus a lightweight FP8 variant—the tweet felt less like routine release notes and more like a thunderclap across AI Twitter. The thread promised “better across the board” performance and immediate open-weights access, positioning Qwen as the most aggressive big-model vendor in the open ecosystem. 



Inside the Model

Under the hood, the new model keeps the mixture-of-experts (MoE) recipe that made earlier Qwen3 builds special: 128 experts, but only 8 fire on each forward pass, so just 22 B parameters are active even though the full network tops out at 235 B. That efficiency allows 256 K tokens of native context and enables consumer-grade deployments that once demanded datacenter GPUs. 

Benchmark Shockwaves

Numbers published with the release show why the community’s jaw dropped. On the notoriously tricky ARC-AGI benchmark, Qwen3-235B-A22B-Instruct-2507 scores 41.8 %, eclipsing Moonshot’s freshly minted Kimi K2 by nearly 29 points and edging ahead of Claude Opus 4 in non-thinking mode. Coding (LiveCodeBench v6) jumps to 51.8 %, and reasoning tasks like AIME25 leap to 70.3 %. In most rows of the evaluation table, the new Qwen flags sit comfortably ahead of DeepSeek-V3, o3-mini, and OpenAI’s o1 reference. 

Why an FP8 Build Matters

Alongside the bf16 release, Alibaba published a fully FP8-quantised version. Dropping to eight-bit floats slashes VRAM by roughly 40 % while preserving accuracy, paving the way for single-GPU inference or even multi-GPU laptop rigs. Apache-2.0 licensing means startups can bake the FP8 weights directly into commercial products without costly negotiations. 

Community Reception: K2 Who?

Reddit’s r/singularity lit up within minutes: “Kimi K2 is already irrelevant,” read the top-voted post, linking to the Qwen tweet and highlighting the model’s 4.2× smaller total size yet broader win-rate.  Analysts on Interconnects echoed the sentiment, framing the drop as part of a summer in which Chinese labs “continue to dominate” the open-weight leaderboard and openly court Western builders. 

Beyond Benchmarks: Agentic DNA

Qwen3’s team stresses that the instruct model is tuned for tool-calling and agent workflows. The official model card shows code snippets for integrating with Qwen-Agent and MCP config files, underscoring Alibaba’s push toward practical automation at 262 K-token scale—think mega-docs, legal contracts or multi-day chat histories without windowing hacks. 

Why It Matters

Qwen3-235B-A22B-Instruct-2507 sets a new bar for “open yet frontier-grade.” By decoupling “thinking” and “non-thinking” modes into separate models, Alibaba embraced community feedback while sidestepping latency complaints. The result is a release that:

  • outperforms larger proprietary models on knowledge, reasoning, and multilingual tests;

  • ships under a permissive license;

  • arrives in both bf16 and FP8 flavors for hobbyists and enterprises alike;

  • proves that giant MoEs can be resource-friendly—and, crucially, available today.

For AI enthusiasts and builders, the message is clear: grab the weights, spin up your agent stack, and see how far 22 B active parameters can take you. The open-source race just found a new pacesetter.

 Anyone who has watched today’s end‑to‑end robot policies fail a complex kitchen task knows the weakness: they map pixels to motors with no ...