Showing posts with label Mixture-of-Experts.

16.8.25

“Speed Always Wins” is the field guide to building faster, cheaper LLMs

 Transformers scaled LLMs to jaw-dropping capabilities—but quadratic attention and ballooning KV caches are throttling real-world deployment. A new survey from Shanghai AI Lab, HKUST(GZ) and collaborators takes stock of what’s next, categorizing the ecosystem of efficient LLM architectures and where each shines. Think of it as a build sheet for teams trying to cut latency and cost without giving up quality. 

The efficiency playbook, in seven parts

  • Linear sequence modeling: from linearized attention to linear RNNs and state-space models that drop the KV cache and push complexity toward O(N) (a minimal sketch follows this list).

  • Sparse sequence modeling: static, dynamic, and training-free sparsity to compute only the most useful token-token interactions. 

  • Efficient full attention: keep softmax attention but make it practical with IO-aware, grouped, mixture, and quantized attention variants. 

  • Sparse Mixture-of-Experts: routing, expert designs and MoE conversion to grow capacity without proportional FLOPs.

  • Hybrid architectures: inter-layer and intra-layer mixes that blend linear blocks with full attention for a better speed/quality trade-off. 

  • Diffusion LLMs: non-autoregressive generation, bridges back to AR, and early steps to extend diffusion approaches to multimodality. 

  • Beyond text: how these efficiency ideas transfer to vision, audio, and multimodal stacks. 
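
To make the first bullet concrete, here is a minimal NumPy sketch of kernelized linear attention (illustrative, not code from the survey; the ELU-based feature map and function names are my own choices). The O(N²) softmax score matrix is replaced by a constant-size running state, which is why these methods need no growing KV cache:

```python
import numpy as np

def elu_feature_map(x):
    # A common positive feature map phi(x); the specific choice varies by method.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_decode(q, k, v):
    """Causal linear attention in O(N * d^2) time with O(d^2) state.

    q, k: (N, d) queries/keys; v: (N, d_v) values.
    Maintains a running state S = sum_i phi(k_i) v_i^T and normalizer z = sum_i phi(k_i),
    so each step touches only the current token -- no growing KV cache.
    """
    N, d = q.shape
    d_v = v.shape[1]
    S = np.zeros((d, d_v))          # running key-value outer-product sum
    z = np.zeros(d)                 # running key sum for normalization
    out = np.zeros((N, d_v))
    for t in range(N):
        phi_k = elu_feature_map(k[t])
        phi_q = elu_feature_map(q[t])
        S += np.outer(phi_k, v[t])  # constant-size state update
        z += phi_k
        out[t] = phi_q @ S / (phi_q @ z + 1e-6)
    return out

# Toy usage: 1,024 tokens, 64-dim heads -- the state stays 64x64 regardless of length.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((1024, 64)) * 0.1 for _ in range(3))
print(linear_attention_decode(q, k, v).shape)  # (1024, 64)
```

Real linear-attention and state-space variants differ in the feature map, gating, and state decay, but they share this fixed-size-state recurrence.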

Why this matters now

Long-context patterns—RAG, agentic tool use, deliberate reasoning, and multimodal inputs—are pushing sequence lengths and memory pressure through the roof. The survey frames these usage patterns and argues that architectural efficiency, not just better prompts or hardware, is the lever that scales the next wave of applications. 

A roadmap, not just a reading list

Beyond taxonomy, the paper stitches trends into a blueprint: pick linear/sparse methods to kill KV bloat, use efficient-full-attention where fidelity matters, layer in MoE for capacity, and consider hybrids or diffusion LLMs where generation style allows. There’s also a companion GitHub “Awesome-Efficient-Arch” list to track the space as it moves. 

If you’re building agents that browse, reason and call tools all day—or multimodal systems juggling video and audio—this survey is a timely map of the fastest lanes through today’s LLM bottlenecks.

Paper link: arXiv 2508.09834 (PDF)

12.8.25

GLM-4.5 wants to be the open-source workhorse for agents, reasoning, and code

 Zhipu AI just dropped GLM-4.5, a Mixture-of-Experts LLM built to juggle three hard modes at once: agentic tasks, deep reasoning, and real-world coding. The headline specs: 355B total parameters with 32B active per token, a 23-trillion-token training run, and a hybrid reasoning switch that flips between “think-out-loud” and terse answers based on task demands. There’s also a slimmer GLM-4.5-Air (106B/12B active) for teams who can’t babysit a mega-model. 

Why it stands out

  • ARC (agentic, reasoning, coding) trifecta focus. Across 12 benchmarks, GLM-4.5 places #3 overall and #2 on agentic suites—with marquee scores like 91.0 on AIME’24, 64.2 on SWE-bench Verified, and 70.1 on TAU-Bench. It also reports 26.4 on BrowseComp for web agents, near OpenAI’s o4-mini-high in the authors’ runs. 

  • Parameter-efficient MoE. Compared to some giant peers, GLM-4.5 keeps active params modest while stacking deeper layers, 96 attention heads, partial RoPE, QK-Norm, and a built-in MTP layer for speculative decoding (a sketch of the idea follows this list).

  • Hybrid reasoning as a product feature. Both GLM-4.5 and Air support thinking (for complex tool use) and non-thinking (instant replies) modes from the same checkpoint. 
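
The MTP (multi-token prediction) layer acts as a cheap draft head so the full model can run speculative decoding. The sketch below is not GLM’s implementation; it just illustrates the standard greedy verify-the-draft loop with toy stand-in models (target_next and draft_next are hypothetical callables):

```python
import numpy as np

def greedy_speculative_step(target_next, draft_next, prefix, k=4):
    """One round of greedy speculative decoding with a cheap draft model.

    target_next(tokens) -> next token from the big model (e.g. the full MoE).
    draft_next(tokens)  -> next token from the cheap draft (e.g. an MTP head).
    The draft proposes k tokens; the target accepts the longest prefix it agrees
    with, then supplies one corrected (or bonus) token.
    """
    # 1) Draft proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) Target verifies the proposals position by position (in practice this
    #    is a single batched forward pass over the k positions).
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        t_star = target_next(ctx)
        if t_star == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(t_star)   # replace the first mismatch with the target's token
            return prefix + accepted
    # All k accepted: append one bonus token from the target.
    return prefix + accepted + [target_next(ctx)]

# Toy models over a 10-symbol vocabulary: the draft agrees with the target ~80% of the time.
rng = np.random.default_rng(0)
target = lambda ctx: (sum(ctx) * 7 + len(ctx)) % 10
draft  = lambda ctx: target(ctx) if rng.random() < 0.8 else int(rng.integers(10))
print(greedy_speculative_step(target, draft, [1, 2, 3]))
```

In production the verification is one batched forward pass, so the big MoE is invoked roughly once per accepted run of tokens rather than once per token.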

The training recipe (quick hits)

A two-stage pretraining-plus-mid-training stack mixes high-quality web, multilingual, code, and math/science data, then adds repo-level code, synthetic reasoning, 128K-token long-context material, and agent trajectories to push real software-engineering and planning skills. Post-training distills expert Reasoning, Agent, and General models into one hybrid generalist, followed by targeted RL (including a “pathology RL” cleanup pass). 

What you can actually download

Zhipu has published code, evals, and model cards on GitHub; weights are also listed on Hugging Face. The team pitches GLM-4.5 as agent-first and ships a simple eval harness to reproduce scores. 

Bottom line

Open-source has plenty of great single-skill models. GLM-4.5 is aiming for a different bullseye: one backbone that can browse, reason, and patch code without feeling second-tier. If the reported ARC numbers hold up in the wild, this could become the go-to open checkpoint for production-grade agents.

Paper link: arXiv 2508.06471 (PDF)

22.7.25

Qwen3-235B-A22B-Instruct-2507: Alibaba’s New Open-Weight Flagship Redefines Efficient Megamodels

 When the Qwen team hit “post” on X announcing Qwen3-235B-A22B-Instruct-2507—plus a lightweight FP8 variant—the tweet felt less like routine release notes and more like a thunderclap across AI Twitter. The thread promised “better across the board” performance and immediate open-weights access, positioning Qwen as the most aggressive big-model vendor in the open ecosystem. 



Inside the Model

Under the hood, the new model keeps the mixture-of-experts (MoE) recipe that made earlier Qwen3 builds special: 128 experts, of which only 8 fire on each forward pass, so just 22 B parameters are active even though the full network tops out at 235 B. That sparsity is what makes the 256 K tokens of native context practical and brings deployments within reach of setups that once demanded datacenter-scale GPU fleets. 
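
As a rough illustration of what “only 8 of 128 experts fire” means in code (a sketch under simplifying assumptions, not Qwen’s implementation; real systems add load-balancing losses and fused dispatch kernels):

```python
import numpy as np

def moe_layer(x, router_w, experts, k=8):
    """Token-level top-k mixture-of-experts routing (illustrative sketch).

    x: (tokens, d) hidden states; router_w: (d, n_experts) gate weights;
    experts: list of callables, each mapping (tokens, d) -> (tokens, d).
    Only k of the n_experts are evaluated per token.
    """
    logits = x @ router_w                          # (tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -k:]     # indices of the k best experts per token
    gates = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(gates - gates.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)          # renormalize over the selected experts

    out = np.zeros_like(x)
    for e_idx, expert in enumerate(experts):
        token_idx, slot = np.where(topk == e_idx)  # which tokens picked this expert
        if token_idx.size:
            out[token_idx] += gates[token_idx, slot, None] * expert(x[token_idx])
    return out

# Toy usage: 4 tokens, 16-dim hidden size, 8 experts, top-2 routing.
rng = np.random.default_rng(0)
d, n = 16, 8
experts = [lambda h, W=rng.standard_normal((d, d)) / np.sqrt(d): h @ W for _ in range(n)]
print(moe_layer(rng.standard_normal((4, d)), rng.standard_normal((d, n)), experts, k=2).shape)
```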

Benchmark Shockwaves

Numbers published with the release show why the community’s jaw dropped. On the notoriously tricky ARC-AGI benchmark, Qwen3-235B-A22B-Instruct-2507 scores 41.8 %, eclipsing Moonshot’s freshly minted Kimi K2 by nearly 29 points and edging ahead of Claude Opus 4 in non-thinking mode. Coding (LiveCodeBench v6) jumps to 51.8 %, and reasoning tasks like AIME25 leap to 70.3 %. In most rows of the evaluation table, the new Qwen flagship sits comfortably ahead of DeepSeek-V3, o3-mini, and OpenAI’s o1 reference. 

Why an FP8 Build Matters

Alongside the bf16 release, Alibaba published a fully FP8-quantised version. Dropping to eight-bit floats cuts VRAM by roughly 40 % while preserving accuracy, shrinking the GPU footprint needed to serve a 235 B-parameter model from a datacenter rack to far leaner multi-GPU rigs. Apache-2.0 licensing means startups can bake the FP8 weights directly into commercial products without costly negotiations. 
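
A quick back-of-the-envelope check on that claim, counting weight memory only (assumed byte sizes; KV cache, activations, and runtime overhead are ignored):

```python
# Rough weight-memory estimate for the 235B-parameter checkpoint.
# (Back-of-the-envelope only: ignores KV cache, activations, and runtime overhead.)
total_params = 235e9

for name, bytes_per_param in [("bf16", 2), ("fp8", 1)]:
    gb = total_params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:,.0f} GB of weights")

# bf16: ~470 GB of weights
# fp8:  ~235 GB of weights
# Weights alone drop ~50%; with the KV cache and activations still kept in higher
# precision, the end-to-end VRAM saving lands closer to the ~40% quoted above.
```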

Community Reception: K2 Who?

Reddit’s r/singularity lit up within minutes: “Kimi K2 is already irrelevant,” read the top-voted post, linking to the Qwen tweet and highlighting the model’s 4.2× smaller total size yet broader win-rate.  Analysts on Interconnects echoed the sentiment, framing the drop as part of a summer in which Chinese labs “continue to dominate” the open-weight leaderboard and openly court Western builders. 

Beyond Benchmarks: Agentic DNA

Qwen3’s team stresses that the instruct model is tuned for tool-calling and agent workflows. The official model card shows code snippets for integrating with Qwen-Agent and MCP config files, underscoring Alibaba’s push toward practical automation at the full 256 K-token (262,144-token) context—think mega-docs, legal contracts, or multi-day chat histories without windowing hacks. 
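
The official snippets target Qwen-Agent and MCP; as a generic, hedged stand-in, the same tool-calling pattern looks like this against any OpenAI-compatible server hosting the weights (the base URL, API key, and get_weather tool are placeholders, not from the model card):

```python
from openai import OpenAI

# Placeholder endpoint and model name: point this at whatever server hosts the weights
# (e.g. a local vLLM or SGLang deployment exposing the OpenAI-compatible API).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # hypothetical example tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen3-235B-A22B-Instruct-2507",
    messages=[{"role": "user", "content": "Do I need an umbrella in Hangzhou today?"}],
    tools=tools,
)
# The model either answers directly or returns a structured tool call for the agent to execute.
print(resp.choices[0].message.tool_calls or resp.choices[0].message.content)
```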

Why It Matters

Qwen3-235B-A22B-Instruct-2507 sets a new bar for “open yet frontier-grade.” By decoupling “thinking” and “non-thinking” modes into separate models, Alibaba embraced community feedback while sidestepping latency complaints. The result is a release that:

  • outperforms larger proprietary models on knowledge, reasoning, and multilingual tests;

  • ships under a permissive license;

  • arrives in both bf16 and FP8 flavors for hobbyists and enterprises alike;

  • proves that giant MoEs can be resource-friendly—and, crucially, available today.

For AI enthusiasts and builders, the message is clear: grab the weights, spin up your agent stack, and see how far 22 B active parameters can take you. The open-source race just found a new pacesetter.

Mono-InternVL-1.5 makes monolithic multimodal LLMs cheap (and fast) enough for real workloads

 Modular multimodal models bolt a vision encoder onto a language model—simple but memory-hungry. Monolithic MLLMs promise sleeker deployment by folding both roles into one network, yet they struggle with catastrophic forgetting and GPU burn. Mono-InternVL-1.5—unveiled this week by OpenGVLab, Shanghai AI Lab and Tsinghua collaborators—takes a big step toward solving both problems.

How they rebuilt the brain

  • Standalone visual parameter space. Instead of retraining the whole LLM, the team delta-tunes a fresh set of visual parameters—packed as a multimodal Mixture-of-Experts—so language weights stay frozen and stable (a minimal sketch of this routing follows the list).

  • EViP → EViP++. Their Endogenous Visual Pre-training pipeline now adds visual-attention experts and a progressive schedule that learns from noisy web data without wiping language skills.

  • Fused CUDA kernel for MoE inference. A custom kernel collapses expert routing into one GPU call, trimming real-time latency.
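
A simplified PyTorch sketch of the delta-tuning idea referenced in the first bullet (module names and sizes are illustrative, not the released code): the pretrained text FFN is frozen, a parallel visual expert is trainable, and tokens are routed by modality rather than by a learned gate.

```python
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    """Frozen text FFN plus a trainable visual expert, routed by token modality.

    Illustrative sketch of the 'standalone visual parameter space' idea:
    language weights are frozen so text ability is preserved, and only the
    newly added visual expert receives gradients during visual pre-training.
    """
    def __init__(self, d_model=1024, d_ff=4096):
        super().__init__()
        self.text_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.visual_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        for p in self.text_expert.parameters():   # delta-tuning: freeze the LLM path
            p.requires_grad = False

    def forward(self, hidden, is_visual):
        # hidden: (batch, seq, d_model); is_visual: (batch, seq) bool mask
        text_out = self.text_expert(hidden)
        vis_out = self.visual_expert(hidden)
        return torch.where(is_visual.unsqueeze(-1), vis_out, text_out)

# Toy usage: the first 16 positions are image patches, the rest are text tokens.
layer = ModalityMoEFFN()
h = torch.randn(2, 64, 1024)
mask = torch.zeros(2, 64, dtype=torch.bool)
mask[:, :16] = True
print(layer(h, mask).shape)  # torch.Size([2, 64, 1024])
```

The released fused CUDA kernel collapses this per-modality dispatch into a single GPU call; the sketch above naively evaluates both paths for clarity.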

Numbers that matter

Metric | Mono-InternVL | Mono-InternVL-1.5 | Δ
Pre-training data | 1.1 B tokens | 0.5 B tokens | −58 %
Inference speed | 61 tok/s | 77 tok/s | +26 %
VQA Bench | 70.1 | 70.4 | +0.3
MLLM Bench | 53.7 | 55.6 | +1.9

Across 15 public benchmarks the older Mono-InternVL already led on 12; the new model keeps that edge while slashing first-token latency by up to 69 % against the modular InternVL-1.5 baseline. It even lands a headline-grabbing +114-point jump over Emu-3 on OCRBench.

Why it matters

  1. Design simplicity meets deployment thrift. One model now sees and talks without an external vision tower, fits in less VRAM, and streams responses faster—handy for edge boxes or consumer GPUs.

  2. Delta-tuning shows its muscle. Freezing language weights while grafting “visual experts” offers a clean recipe other labs can copy to preserve text quality.

  3. Open weights, real code. Checkpoints, the fused CUDA kernel and training scripts are live on GitHub, inviting startups to fine-tune for retail search, doc-QA or AR glasses.

Mono-InternVL-1.5 won’t end the debate between modular and monolithic designs, but it proves you don’t need billion-token budgets or exotic hardware to get state-of-the-art multimodal accuracy—and you might even gain a few milliseconds back for the user.

Paper link: arXiv 2507.12566 (PDF)

19.5.25

DeepSeek V3: High-Performance Language Modeling with Minimal Hardware Overhead

 DeepSeek-AI has unveiled DeepSeek V3, a large language model (LLM) that delivers high performance while minimizing hardware overhead and maximizing computational efficiency. This advancement positions DeepSeek V3 as a competitive alternative to leading models like GPT-4o and Claude 3.5 Sonnet, offering comparable capabilities with significantly reduced resource requirements. 

Innovative Architectural Design

DeepSeek V3 employs a Mixture-of-Experts (MoE) architecture, featuring 671 billion total parameters with 37 billion active per token. This design allows the model to activate only a subset of parameters during inference, reducing computational load without compromising performance. 

The model also employs Multi-Head Latent Attention (MLA), which compresses the key-value cache into a compact per-token latent, improving memory efficiency and enabling effective handling of long-context inputs. Additionally, DeepSeek V3 uses FP8 mixed-precision training, which balances computational speed and accuracy, further contributing to its efficiency. 
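
The memory savings from MLA come from caching one small latent per token instead of full per-head keys and values. The sketch below is illustrative only (shapes and module names are assumptions, and it omits details such as DeepSeek’s decoupled RoPE path):

```python
import torch
import torch.nn as nn

class MLASketch(nn.Module):
    """Multi-Head Latent Attention, reduced to its caching idea (illustrative).

    Instead of caching per-head keys and values (n_heads * d_head * 2 numbers per
    token), only a small latent c_t of size d_latent is cached; keys and values
    are re-expanded from it at attention time.
    """
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_down_kv = nn.Linear(d_model, d_latent)   # compress: h_t -> c_t
        self.w_up_k = nn.Linear(d_latent, d_model)      # expand:   c_t -> k_t
        self.w_up_v = nn.Linear(d_latent, d_model)      # expand:   c_t -> v_t
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, h, latent_cache):
        # h: (batch, 1, d_model) for the newest token; latent_cache: (batch, T, d_latent)
        latent_cache = torch.cat([latent_cache, self.w_down_kv(h)], dim=1)
        q = self._split(self.w_q(h))                    # (batch, heads, 1, d_head)
        k = self._split(self.w_up_k(latent_cache))      # (batch, heads, T+1, d_head)
        v = self._split(self.w_up_v(latent_cache))
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.d_head**0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).flatten(2)     # (batch, 1, d_model)
        return self.w_o(out), latent_cache              # cache grows by d_latent per token

    def _split(self, x):
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

# Toy decode step: the cache holds 32 latents of size 128 instead of 32 full K/V pairs.
m = MLASketch()
y, cache = m(torch.randn(2, 1, 1024), torch.randn(2, 32, 128))
print(y.shape, cache.shape)  # torch.Size([2, 1, 1024]) torch.Size([2, 33, 128])
```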

Efficient Training and Deployment

Trained on 14.8 trillion high-quality tokens, DeepSeek V3 underwent supervised fine-tuning and reinforcement learning stages to refine its capabilities. The training process was completed using 2,048 NVIDIA H800 GPUs over 55 days, incurring a total cost of approximately $5.58 million—a fraction of the expenditure associated with comparable models. 
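
Those figures are roughly self-consistent; here is a quick sanity check (the rental rate per GPU-hour below is an assumption for illustration, not a number from the report):

```python
# Sanity-check the reported training budget (illustrative; the $/GPU-hour is assumed).
gpus, days = 2048, 55
gpu_hours = gpus * days * 24            # ~2.70M H800 GPU-hours
cost_per_gpu_hour = 2.0                 # assumed rental rate in USD
print(f"{gpu_hours/1e6:.2f}M GPU-hours -> ~${gpu_hours * cost_per_gpu_hour / 1e6:.1f}M")
# ~2.70M GPU-hours -> ~$5.4M, in the same ballpark as the quoted ~$5.58 million.
```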

The model's training infrastructure was optimized to minimize communication latency and maximize throughput, employing strategies such as overlapping computation and communication, and dynamic load balancing across GPUs. 

Benchmark Performance

DeepSeek V3 demonstrates superior performance across various benchmarks, outperforming open-source models like LLaMA 3.1 and Qwen 2.5, and matching the capabilities of closed-source counterparts such as GPT-4o and Claude 3.5 Sonnet. 

Open-Source Accessibility

Committed to transparency and collaboration, DeepSeek-AI has released DeepSeek V3 under the MIT License, providing the research community with access to its architecture and training methodologies. The model's checkpoints and related resources are available on GitHub and Hugging Face.


References

  1. "This AI Paper from DeepSeek-AI Explores How DeepSeek V3 Delivers High-Performance Language Modeling by Minimizing Hardware Overhead and Maximizing Computational Efficiency" – MarkTechPost MarkTechPost

  2. DeepSeek V3 Technical Report – arXiv 

  3. Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures – arXiv

4.5.25

Alibaba Launches Qwen3: A New Contender in Open-Source AI

 Alibaba has introduced Qwen3, a series of open-source large language models (LLMs) designed to rival leading AI models in performance and accessibility. The Qwen3 lineup includes eight models: six dense and two built on the Mixture-of-Experts (MoE) architecture, which activates only a small subset of the network's experts for each token, enhancing efficiency.

Benchmark Performance

The flagship model, Qwen3-235B-A22B, packs 235 billion total parameters (22 billion active per token) and has demonstrated superior performance compared to OpenAI's o1 and DeepSeek's R1 on benchmarks like ArenaHard, which assesses capabilities in software engineering and mathematics. Its performance approaches that of proprietary models such as Google's Gemini 2.5 Pro. 

Hybrid Reasoning Capabilities

Qwen3 introduces hybrid reasoning, allowing users to toggle between rapid responses and more in-depth, compute-intensive reasoning processes. This feature is accessible via the Qwen Chat interface or through specific prompts like /think and /no_think, providing flexibility based on task complexity. 
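
In practice the soft switch is just text appended to the user turn. A hedged sketch against a hypothetical OpenAI-compatible deployment of Qwen3 (the endpoint, API key, and model name are placeholders):

```python
from openai import OpenAI

# Placeholder endpoint and model name for a local deployment serving Qwen3 weights.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(question, think=True):
    # The /think and /no_think markers are appended to the user turn to toggle
    # between deliberate, compute-intensive reasoning and instant, terse answers.
    switch = "/think" if think else "/no_think"
    resp = client.chat.completions.create(
        model="Qwen3-235B-A22B",
        messages=[{"role": "user", "content": f"{question} {switch}"}],
    )
    return resp.choices[0].message.content

print(ask("What is 17 * 24?", think=False))                         # quick answer
print(ask("Prove there are infinitely many primes.", think=True))   # step-by-step reasoning
```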

Accessibility and Deployment

All Qwen3 models are released under the Apache 2.0 open-source license, ensuring broad accessibility for developers and researchers. They are available on platforms such as Hugging Face, ModelScope, Kaggle, and GitHub, and can be interacted with directly through the Qwen Chat web interface and mobile applications.


Takeaway:
Alibaba's Qwen3 series marks a significant advancement in open-source AI, delivering performance that rivals proprietary models while maintaining accessibility and flexibility. Its hybrid reasoning capabilities and efficient architecture position it as a valuable resource for developers and enterprises seeking powerful, adaptable AI solutions.
