Transformers scaled LLMs to jaw-dropping capabilities—but quadratic attention and ballooning KV caches are throttling real-world deployment. A new survey from Shanghai AI Lab, HKUST(GZ) and collaborators takes stock of what’s next, categorizing the ecosystem of efficient LLM architectures and where each shines. Think of it as a build sheet for teams trying to cut latency and cost without giving up quality.
The efficiency playbook, in seven parts
- Linear sequence modeling: from linearized attention to linear RNNs and state-space models that drop the KV cache and push complexity toward O(N) (sketch below).
- Sparse sequence modeling: static, dynamic, and training-free sparsity to compute only the most useful token-to-token interactions (sketch below).
- Efficient full attention: keep softmax attention but make it practical with IO-aware, grouped, mixture, and quantized attention variants (sketch below).
- Sparse Mixture-of-Experts: routing, expert designs, and MoE conversion to grow capacity without proportional FLOPs (sketch below).
- Hybrid architectures: inter-layer and intra-layer mixes that blend linear blocks with full attention for a better speed/quality trade-off (sketch below).
- Diffusion LLMs: non-autoregressive generation, bridges back to AR decoding, and early steps to extend diffusion approaches to multimodality (sketch below).
- Beyond text: how these efficiency ideas transfer to vision, audio, and multimodal stacks.
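To make the first category concrete, here's a minimal sketch of causal linear attention, assuming the elu+1 feature map popularized by the linear-transformers line of work; function names are illustrative, and real linear layers (Mamba, RWKV, GLA) are considerably more elaborate. The point is the constant-size recurrent state that replaces the KV cache:

```python
import torch

def linear_attention(q, k, v):
    """O(N) causal attention via a kernel feature map (sketch).

    Replaces softmax(QK^T)V with phi(Q) applied to a running sum of
    phi(K)^T V, so a fixed-size (d_k x d_v) state stands in for the KV cache.
    """
    phi = lambda x: torch.nn.functional.elu(x) + 1  # positive feature map
    q, k = phi(q), phi(k)
    d_k, d_v = q.shape[-1], v.shape[-1]
    state = torch.zeros(d_k, d_v)   # constant-size recurrent state
    norm = torch.zeros(d_k)         # running normalizer
    out = torch.empty_like(v)
    for t in range(q.shape[0]):     # one token at a time, no KV cache
        state += torch.outer(k[t], v[t])
        norm += k[t]
        out[t] = (q[t] @ state) / (q[t] @ norm + 1e-6)
    return out

q = torch.randn(16, 8); k = torch.randn(16, 8); v = torch.randn(16, 8)
print(linear_attention(q, k, v).shape)  # torch.Size([16, 8])
```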
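For sparse sequence modeling, the simplest static pattern is a sliding window: each query attends only to its recent neighbors. This dense-mask version shows the pattern but not the speedup; production kernels skip the masked blocks entirely:

```python
import torch

def sliding_window_attention(q, k, v, window=4):
    """Causal attention restricted to a local window (static sparsity sketch).

    Each query attends only to the `window` most recent keys, so per-token
    compute is O(window) instead of O(N) once masked blocks are skipped.
    """
    n, d = q.shape
    scores = (q @ k.T) / d**0.5
    idx = torch.arange(n)
    # keep (i, j) only if i - window < j <= i
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(10, 8); k = torch.randn(10, 8); v = torch.randn(10, 8)
print(sliding_window_attention(q, k, v).shape)  # torch.Size([10, 8])
```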
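For efficient full attention, grouped-query attention (GQA) is the easiest variant to sketch: many query heads share a smaller set of KV heads, shrinking the KV cache without leaving softmax attention. Head counts and shapes below are illustrative:

```python
import torch

def grouped_query_attention(q, k, v, n_q_heads=8, n_kv_heads=2):
    """GQA sketch: query heads share a few KV heads, cutting KV-cache size.

    q: (n, n_q_heads, d); k, v: (n, n_kv_heads, d)
    """
    n, _, d = q.shape
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)  # broadcast KV heads to query heads
    v = v.repeat_interleave(group, dim=1)
    q, k, v = (t.transpose(0, 1) for t in (q, k, v))  # (heads, n, d)
    scores = q @ k.transpose(-2, -1) / d**0.5
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    return (torch.softmax(scores, dim=-1) @ v).transpose(0, 1)

q = torch.randn(6, 8, 16); k = torch.randn(6, 2, 16); v = torch.randn(6, 2, 16)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([6, 8, 16])
```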
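For sparse MoE, here's a top-k token router in a few lines. The load-balancing losses and per-expert capacity limits that real systems need are omitted:

```python
import torch

def moe_layer(x, experts, router, k=2):
    """Top-k sparse MoE sketch: each token is sent to k of E experts.

    Capacity grows with E while per-token FLOPs grow only with k.
    """
    logits = router(x)                                 # (n_tokens, n_experts)
    weights, picks = torch.topk(torch.softmax(logits, dim=-1), k)
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize over top-k
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx, slot = torch.where(picks == e)      # tokens routed to expert e
        if token_idx.numel():
            out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
    return out

d, n_experts = 16, 4
experts = [torch.nn.Linear(d, d) for _ in range(n_experts)]
router = torch.nn.Linear(d, n_experts)
x = torch.randn(10, d)
print(moe_layer(x, experts, router).shape)  # torch.Size([10, 16])
```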
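For hybrids, an inter-layer sketch: a stack that is mostly linear-time mixers with a full softmax-attention layer every few blocks. The GRU is just a stand-in for a linear-attention/SSM block, and the 1-in-4 ratio is illustrative, not a recommendation from the survey:

```python
import torch

class HybridStack(torch.nn.Module):
    """Inter-layer hybrid sketch: mostly linear-time mixers, periodic full attention."""

    def __init__(self, d=32, n_layers=8, ratio=4):
        super().__init__()
        self.layers = torch.nn.ModuleList()
        for i in range(n_layers):
            if (i + 1) % ratio == 0:
                # full softmax attention: high fidelity, quadratic cost
                # (causal mask omitted for brevity)
                self.layers.append(torch.nn.MultiheadAttention(
                    d, num_heads=4, batch_first=True))
            else:
                # stand-in for a linear-time mixer (SSM / linear attention)
                self.layers.append(torch.nn.GRU(d, d, batch_first=True))

    def forward(self, x):
        for layer in self.layers:
            if isinstance(layer, torch.nn.MultiheadAttention):
                x = x + layer(x, x, x, need_weights=False)[0]
            else:
                x = x + layer(x)[0]
        return x

x = torch.randn(2, 64, 32)     # (batch, seq, dim)
print(HybridStack()(x).shape)  # torch.Size([2, 64, 32])
```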
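And for diffusion LLMs, a toy confidence-based parallel-unmasking decoder, assuming a masked-token formulation as in recent diffusion LMs; the model and schedule here are placeholders, not the survey's method:

```python
import torch

def diffusion_decode(model, length, mask_id, steps=4):
    """Diffusion-LM decoding sketch: start fully masked, unmask in parallel rounds.

    Unlike left-to-right AR decoding, each step predicts every position at once
    and commits the most confident still-masked tokens (a simplified schedule).
    """
    seq = torch.full((length,), mask_id)
    masked = torch.ones(length, dtype=torch.bool)
    for s in range(steps):
        probs = torch.softmax(model(seq), dim=-1)   # (length, vocab)
        conf, pred = probs.max(dim=-1)
        conf[~masked] = -1.0                        # skip committed positions
        n_commit = int(masked.sum()) // (steps - s) # finish by the last step
        commit = torch.topk(conf, n_commit).indices
        seq[commit] = pred[commit]
        masked[commit] = False
    return seq

# toy "model": an embedding table used as a random scorer, just to run the loop
vocab, length, mask_id = 50, 12, 0
emb = torch.nn.Embedding(vocab, vocab)
print(diffusion_decode(lambda s: emb(s), length, mask_id))
```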
Why this matters now
Long-context patterns—RAG, agentic tool use, deliberate reasoning, and multimodal inputs—are pushing sequence lengths and memory pressure through the roof. The survey frames these usage patterns and argues that architectural efficiency, not just better prompts or hardware, is the lever that scales the next wave of applications.
A roadmap, not just a reading list
Beyond taxonomy, the paper stitches trends into a blueprint: pick linear/sparse methods to kill KV-cache bloat, use efficient full attention where fidelity matters, layer in MoE for capacity, and consider hybrids or diffusion LLMs where the generation style allows. There's also a companion GitHub "Awesome-Efficient-Arch" list to track the space as it moves.
If you’re building agents that browse, reason and call tools all day—or multimodal systems juggling video and audio—this survey is a timely map of the fastest lanes through today’s LLM bottlenecks.
Paper link: arXiv:2508.09834 (PDF)