Showing posts with label linear attention.

2.9.25

Jet-Nemotron: NVIDIA’s post-training NAS makes small LLMs fast and smart

 For years, efficient-attention models traded speed for smarts. Jet-Nemotron, from NVIDIA researchers, tries to end that bargain with a pragmatic recipe: don’t pretrain a new architecture—start from a strong full-attention model, keep its MLPs, and search only the attention stack. They call it Post Neural Architecture Search (PostNAS), and the result is a 2–4B-parameter family that rivals or beats same-size full-attention baselines while massively upping tokens-per-second. 

What PostNAS actually does

PostNAS is a four-step, hardware-aware exploration loop layered on a pre-trained LLM: (1) learn where to keep or drop full-attention layers; (2) select the best linear-attention block; (3) optionally design a new block (“JetBlock”); and (4) tune hyperparameters for real GPUs. Freezing MLP weights keeps search cheap while letting attention do the heavy lifting. 
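
In code, the loop looks roughly like the sketch below. Every helper name here is hypothetical (this is not NVIDIA's implementation), and the candidate linear-attention blocks are only examples of the kind of blocks such a search would compare:

    # Conceptual sketch of the PostNAS loop; all helpers are hypothetical placeholders.
    def postnas(pretrained_model, train_data, gpu_profile):
        # MLP weights stay frozen throughout; only the attention stack is searched/trained.
        for block in pretrained_model.blocks:
            block.mlp.requires_grad_(False)

        # (1) Learn where full attention is worth keeping (placement search).
        placement = search_full_attention_placement(pretrained_model, train_data)

        # (2) Pick the best existing linear-attention block for the remaining layers.
        linear_block = select_linear_attention_block(
            candidates=["gla", "mamba2", "rwkv", "gated_deltanet"],  # example candidates
            model=pretrained_model, placement=placement, data=train_data)

        # (3) Optionally design a new block (JetBlock) on top of the winner.
        jet_block = design_new_block(base=linear_block, data=train_data)

        # (4) Hardware-aware hyperparameter search: head counts and key/value dims,
        #     scored by measured throughput on the target GPU rather than FLOPs.
        config = hardware_aware_search(jet_block, placement, gpu_profile)
        return assemble_hybrid_model(pretrained_model, jet_block, placement, config)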

JetBlock in a sentence

JetBlock mixes linear attention with dynamic, input-conditioned causal convolutions on values (and trims redundant static convs on Q/K), yielding accuracy gains with little runtime overhead. 
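
For intuition, here is a minimal, didactic PyTorch sketch of that idea: a per-token (input-conditioned) depthwise causal convolution on V in front of a plain causal linear-attention recurrence. This is one plausible reading of the description, not the released JetBlock code, and real kernels are chunked/parallel rather than a Python loop:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JetBlockSketch(nn.Module):
        """Didactic sketch: linear attention whose value stream is filtered by a
        causal depthwise convolution with input-conditioned (dynamic) weights."""

        def __init__(self, dim, kernel_size=4):
            super().__init__()
            self.q = nn.Linear(dim, dim, bias=False)
            self.k = nn.Linear(dim, dim, bias=False)
            self.v = nn.Linear(dim, dim, bias=False)
            self.kernel_size = kernel_size
            # Small hyper-network: per-token, per-channel conv kernel for V.
            self.kernel_gen = nn.Linear(dim, dim * kernel_size)

        def forward(self, x):                       # x: (batch, seq, dim)
            b, t, d = x.shape
            q = F.elu(self.q(x)) + 1                # positive feature maps
            k = F.elu(self.k(x)) + 1
            v = self.v(x)

            # Dynamic causal conv on values: each position mixes its previous
            # `kernel_size` value vectors with weights generated from the input.
            w = self.kernel_gen(x).view(b, t, d, self.kernel_size).softmax(-1)
            v_pad = F.pad(v, (0, 0, self.kernel_size - 1, 0))   # left-pad along time
            v_win = v_pad.unfold(1, self.kernel_size, 1)        # (b, t, d, k)
            v = (v_win * w).sum(-1)

            # Plain causal linear attention: running state S_t = sum_i k_i v_i^T.
            s = torch.zeros(b, d, d, device=x.device, dtype=x.dtype)
            z = torch.zeros(b, d, device=x.device, dtype=x.dtype)
            out = []
            for i in range(t):                      # didactic loop; real kernels are parallel
                s = s + k[:, i].unsqueeze(-1) * v[:, i].unsqueeze(1)
                z = z + k[:, i]
                num = torch.einsum("bd,bde->be", q[:, i], s)
                den = (q[:, i] * z).sum(-1, keepdim=True).clamp_min(1e-6)
                out.append(num / den)
            return torch.stack(out, dim=1)          # (batch, seq, dim)

    # quick shape check: JetBlockSketch(64)(torch.randn(2, 16, 64)).shape == (2, 16, 64)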

The headline numbers

  • Throughput: On H100s, Jet-Nemotron-2B logs up to 53.6× decoding and 6.14× prefilling speedups at 256K context vs Qwen3-1.7B-Base—and still shows gains at shorter contexts. 

  • Accuracy: Despite being hybrid (mostly linear attention), Jet-Nemotron-2B/4B match or beat leading full-attention peers (Qwen2.5/3, Gemma3, Llama3.2) across MMLU/Pro, math, retrieval, coding, and long-context suites at similar scales. 

  • Coding & long-context: In the paper’s tables, Jet-Nemotron-4B leads average coding accuracy and outpaces Qwen3-1.7B-Base on long-context tasks while running ~21× faster.

Why it’s fast (and why that matters)

A core finding is blunt but useful: KV-cache size, not parameter count, is the dominant limiter of long-context throughput. Keep the KV cache small and you can batch more sequences; decoding is typically memory-bandwidth-bound. PostNAS bakes that into a hardware-aware search that tunes head counts and key/value dimensions to hold speed while buying back accuracy. 
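
To see why, here is back-of-the-envelope arithmetic with illustrative shapes (not Jet-Nemotron's actual configuration):

    def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
        # K and V each store layers * kv_heads * head_dim values per token (fp16 = 2 bytes).
        return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

    # Illustrative full-attention model: 28 layers, 8 KV heads, head_dim 128, 256K context.
    full = kv_cache_bytes(layers=28, kv_heads=8, head_dim=128, seq_len=256_000)
    # Illustrative hybrid: only 2 layers keep full attention (and a 256K KV cache);
    # the linear-attention layers carry a small constant-size state instead.
    hybrid = kv_cache_bytes(layers=2, kv_heads=8, head_dim=128, seq_len=256_000)

    print(f"full-attention KV cache: {full / 1e9:.1f} GB per sequence")    # ~29.4 GB
    print(f"hybrid KV cache:         {hybrid / 1e9:.1f} GB per sequence")  # ~2.1 GB
    # With a fixed HBM budget, the smaller cache lets you batch roughly 14x more
    # sequences, and since decoding is memory-bandwidth-bound, throughput scales with it.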

Why it’s interesting for builders

  • Upgrade path, not a moonshot. You can retrofit an existing model: freeze MLPs, swap/search attention, and ship meaningful speedups without full pretraining. 

  • Hybrid done right. Strategically retain a few full-attention layers (learned placement beats uniform) to keep retrieval and tricky benchmarks strong. 

  • Long-context economics. If you serve 128K–256K prompts, the 53.6× decoding and 6.14× prefilling gains translate directly into lower latency or higher concurrency; a rough cost sketch follows below.
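
A deliberately crude serving-cost sketch, using the paper's 53.6× decoding figure but made-up baseline throughput and GPU pricing:

    reported_speedup = 53.6   # Jet-Nemotron-2B decoding speedup vs Qwen3-1.7B-Base at 256K
    baseline_tps = 50         # hypothetical tokens/s per GPU for the baseline at 256K context
    gpu_hour_usd = 3.00       # hypothetical H100 hourly cost

    def usd_per_million_tokens(tokens_per_second):
        return gpu_hour_usd / (tokens_per_second * 3600) * 1e6

    print(usd_per_million_tokens(baseline_tps))                      # ~$16.7 per M tokens
    print(usd_per_million_tokens(baseline_tps * reported_speedup))   # ~$0.31 per M tokens
    # Read the other way: at fixed cost you serve ~54x the concurrent long-context
    # sessions, or cut queueing latency accordingly.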

Bottom line

Jet-Nemotron reframes efficient LMs as an architecture-search problem on top of pre-trained backbones. With JetBlock and a KV-aware, GPU-realistic search, it shows you don’t have to choose between accuracy and speed—especially at long context lengths that crush classic Transformers. 

Paper link: arXiv 2508.15884 (PDF)

16.8.25

“Speed Always Wins” is the field guide to building faster, cheaper LLMs

 Transformers scaled LLMs to jaw-dropping capabilities—but quadratic attention and ballooning KV caches are throttling real-world deployment. A new survey from Shanghai AI Lab, HKUST(GZ) and collaborators takes stock of what’s next, categorizing the ecosystem of efficient LLM architectures and where each shines. Think of it as a build sheet for teams trying to cut latency and cost without giving up quality. 

The efficiency playbook, in seven parts

  • Linear sequence modeling: from linearized attention to linear RNNs and state-space models that drop the KV cache and push complexity toward O(N).

  • Sparse sequence modeling: static, dynamic, and training-free sparsity to compute only the most useful token-token interactions. 

  • Efficient full attention: keep softmax attention but make it practical with IO-aware, grouped, mixture, and quantized attention variants; a minimal grouped-query sketch follows this list.

  • Sparse Mixture-of-Experts: routing, expert designs and MoE conversion to grow capacity without proportional FLOPs.

  • Hybrid architectures: inter-layer and intra-layer mixes that blend linear blocks with full attention for a better speed/quality trade-off. 

  • Diffusion LLMs: non-autoregressive generation, bridges back to AR, and early steps to extend diffusion approaches to multimodality. 

  • Beyond text: how these efficiency ideas transfer to vision, audio, and multimodal stacks. 
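
To make the "efficient full attention" bucket concrete, here is a minimal grouped-query attention (GQA) layer, where many query heads share a smaller set of K/V heads so the KV cache shrinks by their ratio. It is a generic sketch, not any particular model's implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GQASketch(nn.Module):
        """Didactic grouped-query attention: n_q query heads share n_kv << n_q
        key/value heads, shrinking the KV cache by n_q / n_kv."""

        def __init__(self, dim, n_q_heads=16, n_kv_heads=4):
            super().__init__()
            assert n_q_heads % n_kv_heads == 0
            self.hd = dim // n_q_heads
            self.n_q, self.n_kv = n_q_heads, n_kv_heads
            self.q = nn.Linear(dim, n_q_heads * self.hd, bias=False)
            self.kv = nn.Linear(dim, 2 * n_kv_heads * self.hd, bias=False)
            self.o = nn.Linear(n_q_heads * self.hd, dim, bias=False)

        def forward(self, x):                                        # x: (b, t, dim)
            b, t, _ = x.shape
            q = self.q(x).view(b, t, self.n_q, self.hd).transpose(1, 2)   # (b, nq, t, hd)
            k, v = self.kv(x).chunk(2, dim=-1)
            k = k.view(b, t, self.n_kv, self.hd).transpose(1, 2)          # (b, nkv, t, hd)
            v = v.view(b, t, self.n_kv, self.hd).transpose(1, 2)
            # Each group of n_q / n_kv query heads attends to the same shared K/V head.
            rep = self.n_q // self.n_kv
            k = k.repeat_interleave(rep, dim=1)                            # (b, nq, t, hd)
            v = v.repeat_interleave(rep, dim=1)
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (b, nq, t, hd)
            return self.o(out.transpose(1, 2).reshape(b, t, -1))

    # quick shape check: GQASketch(256)(torch.randn(2, 32, 256)).shape == (2, 32, 256)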

Why this matters now

Long-context patterns—RAG, agentic tool use, deliberate reasoning, and multimodal inputs—are pushing sequence lengths and memory pressure through the roof. The survey frames these usage patterns and argues that architectural efficiency, not just better prompts or hardware, is the lever that scales the next wave of applications. 

A roadmap, not just a reading list

Beyond taxonomy, the paper stitches trends into a blueprint: pick linear/sparse methods to kill KV bloat, use efficient-full-attention where fidelity matters, layer in MoE for capacity, and consider hybrids or diffusion LLMs where generation style allows. There’s also a companion GitHub “Awesome-Efficient-Arch” list to track the space as it moves. 
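
Read as configuration, that blueprint looks roughly like the sketch below; every field name and value is illustrative, not taken from the survey or any released model:

    efficient_lm_recipe = {
        "sequence_mixer": {
            "default": "linear_attention",               # kills KV-cache growth on most layers
            "full_attention_layers": [2, 15, 27],        # keep a few where fidelity matters
            "full_attention_variant": "grouped_query",   # efficient full attention
        },
        "channel_mixer": {"type": "sparse_moe", "experts": 16, "top_k": 2},  # capacity without proportional FLOPs
        "decoder": "autoregressive",                     # or "diffusion" where the generation style allows
    }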

If you’re building agents that browse, reason and call tools all day—or multimodal systems juggling video and audio—this survey is a timely map of the fastest lanes through today’s LLM bottlenecks.

Paper link: arXiv 2508.09834 (PDF)
