
14.7.25

Lumos-1: the LLM playbook comes to video — and it only needed 48 GPUs

 Large language models have already devoured text, images and audio. Video, with its crushing spatiotemporal footprint, has been harder to tame. Lumos-1, a new release from Alibaba DAMO Academy, claims to crack the problem without exotic architectures or 1,000-GPU clusters. The 32-page paper positions Lumos-1 as “an autoregressive video generator that keeps the vanilla LLM stack—just smarter.” 

What’s new under the hood

  • MM-RoPE (Multimodal Rotary Position Embedding): extends 2-D RoPE to 3-D tokens while balancing frequency spectra, so the model can juggle width, height and time without corrupting text embeddings.

  • Token-dependency strategy: inside every frame self-attention is bidirectional (better detail); between frames it stays causal (keeps narrative flow). A sketch of the resulting attention mask follows this list.

  • AR-DF (Autoregressive Discrete Diffusion Forcing): adds tube masking during training plus a matching inference mask, fixing the frame-loss imbalance that torpedoes earlier LLM-video hybrids.
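Concretely, the token-dependency rule is just an attention mask over the flattened video tokens. Here is a minimal PyTorch sketch, assuming frame-major token order; the function name and sizes are illustrative, not the released Lumos-1 code.

```python
import torch

def lumos_style_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed).

    Tokens inside the same frame attend bidirectionally; tokens in different
    frames follow a causal, past-frames-only rule. Illustrative sketch, not
    the released Lumos-1 implementation.
    """
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame   # frame index per position
    q_frame = frame_id.unsqueeze(1)                  # (n, 1) query frame ids
    k_frame = frame_id.unsqueeze(0)                  # (1, n) key frame ids
    same_frame = q_frame == k_frame                  # bidirectional within a frame
    past_frame = k_frame < q_frame                   # causal across frames
    return same_frame | past_frame                   # (n, n)

mask = lumos_style_mask(num_frames=3, tokens_per_frame=4)
print(mask.int())  # block-diagonal ones plus a lower block-triangular band
```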

Training on a start-up budget

Memory-efficient tricks—activation recompute, 8-bit optimizers and a custom tokenizer—let the team pre-train on just 48 GPUs yet still scale to competitive resolution and clip length. 
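For readers who want to reproduce the budget, the two generic tricks named above (activation recomputation and 8-bit optimizer states) look roughly like this in PyTorch with bitsandbytes; the toy model and hyperparameters are stand-ins, not the Lumos-1 training setup.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
import bitsandbytes as bnb  # pip install bitsandbytes

class CheckpointedBlock(nn.Module):
    """Transformer-style block whose activations are recomputed in backward."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Activation recomputation: intermediate activations inside `self.ff`
        # are discarded after the forward pass and recomputed during backward.
        return x + checkpoint(self.ff, x, use_reentrant=False)

model = nn.Sequential(*[CheckpointedBlock() for _ in range(4)]).cuda()

# 8-bit optimizer states cut optimizer memory roughly 4x versus fp32 Adam.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-4)

x = torch.randn(2, 16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
```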

Benchmark results

  • GenEval (text-to-video) – on par with EMU-3

  • VBench-I2V (image-to-video) – ties COSMOS-Video2World

  • VBench-T2V (text-to-video) – neck-and-neck with OpenSoraPlan 

That’s a first for an autoregressive model that never leaves the standard LLM decoder loop.

Open weights and real-world demos

Inference notebooks, fine-tuning scripts and checkpoints are already live on GitHub under the Lumos Project umbrella. Early Twitter/X clips show 3-second 512×512 videos generated from simple prompts at roughly real-time speed. 

Why it matters

  1. Unification over specialization. A single backbone now supports text-to-image, T2V and I2V; no extra encoders or diffusion cascades.

  2. Greener training curve. 48 GPUs is weekend-hackathon territory compared with the hundreds used by diffusion-based rivals.

  3. Plug-and-play ideas. MM-RoPE and AR-DF are drop-ins for any LLM aiming to swallow video tokens.

If future benchmarks confirm the paper’s claims, Lumos-1 may mark the moment autoregressive models became a serious alternative to diffusion pipelines for generative video. At the very least, it hands open-source developers a lean blueprint for multimodal LLMs that don’t melt the power bill.

Paper link: arXiv 2507.08801 (PDF)    

4.7.25

DiffuCoder rewrites the code-LLM playbook with diffusion and smarter RL

Autoregressive (AR) giants like GPT-4o and Qwen2.5 dominate today's leaderboard-driven coding scene, but Apple's research group thinks the next breakthrough may come from an entirely different generation paradigm. In a paper published late last week, the team unveiled DiffuCoder — a 7B-parameter masked diffusion language model (dLLM) designed specifically for program synthesis and repair. Unlike AR models that predict the next token left-to-right, DiffuCoder iteratively denoises whole sequences, enabling global planning and out-of-order refinement.
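To make the contrast with left-to-right decoding concrete, the sketch below shows a toy confidence-based unmasking loop of the kind masked dLLMs use at inference time; the `model` interface, mask id and unmasking schedule are placeholders rather than DiffuCoder's actual sampler.

```python
import torch

MASK_ID = 0  # placeholder mask-token id

@torch.no_grad()
def masked_diffusion_decode(model, prompt_ids: torch.Tensor, gen_len: int, steps: int = 8):
    """Toy sketch of iterative denoising: start fully masked, then at each step
    fill in the positions the model is most confident about, in any order."""
    seq = torch.cat([prompt_ids, torch.full((gen_len,), MASK_ID, dtype=torch.long)])
    gen = slice(len(prompt_ids), len(seq))
    per_step = max(1, gen_len // steps)

    for _ in range(steps):
        still_masked = (seq[gen] == MASK_ID).nonzero(as_tuple=True)[0]
        if still_masked.numel() == 0:
            break
        logits = model(seq.unsqueeze(0)).squeeze(0)[gen]   # (gen_len, vocab), model is a placeholder
        conf, pred = logits.softmax(-1).max(-1)            # per-position confidence and argmax token
        # unmask the most confident still-masked positions (order is not left-to-right)
        top = still_masked[conf[still_masked].topk(min(per_step, still_masked.numel())).indices]
        seq[gen][top] = pred[top]
    return seq
```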

What’s new under the hood

  • Scaled training for code. DiffuCoder is pretrained on 130 billion code tokens, then instruction-tuned and further refined with RL on curated problem sets. That makes it one of the largest diffusion-first code models publicly documented.

  • Decoding insights. The authors introduce local and global AR-ness metrics to quantify how often a diffusion model falls back to sequential generation. They show that raising the sampling temperature diversifies not only token choice but also the order in which tokens are filled in — a property AR models lack.

  • Coupled-GRPO. To tame the high-variance log-likelihood estimates that plague diffusion policy gradients, Apple proposes coupled Group Relative Policy Optimization, a two-pass masking strategy that evaluates complementary token subsets in one RL rollout (a schematic sketch follows this list). The technique reduces gradient noise without resorting to semi-AR “block decoding,” keeping the model fully diffusion-native.
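Reading the description above literally, the heart of coupled-GRPO is scoring each rollout under a mask and its complement, so every completion token is evaluated exactly once per pair. A schematic sketch under that reading follows; the model interface and the way the estimate plugs into the GRPO objective are simplified assumptions, not Apple's implementation.

```python
import torch

def coupled_masked_logprob(model, tokens: torch.Tensor, mask_id: int, p: float = 0.5):
    """Schematic: estimate a completion's log-likelihood with two complementary masks.

    Each token is masked in exactly one of the two passes, so the pair covers
    every position once; summing the two passes gives a lower-variance estimate
    than a single random mask. Simplified sketch for illustration only.
    """
    keep = torch.rand(tokens.shape) < p          # positions visible in pass 1
    total = 0.0
    for visible in (keep, ~keep):                # pass 1 and its complement
        corrupted = tokens.clone()
        corrupted[~visible] = mask_id            # hide the other subset
        logits = model(corrupted.unsqueeze(0)).squeeze(0)   # (L, vocab), model is a placeholder
        logp = logits.log_softmax(-1)
        scored = ~visible                        # score only the tokens hidden in this pass
        total = total + logp[scored, tokens[scored]].sum()
    return total  # stands in for the usual AR log-prob inside a GRPO-style update
```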

Benchmark scores that matter

DiffuCoder’s base model already lands in the same ballpark as leading 7B–8B AR coders. After instruction tuning and coupled-GRPO, it posts:

Model | HumanEval+ | MBPP+ | EvalPlus (avg.) | BigCodeBench C-Full
DiffuCoder-Instruct | 72.0 | 65.2 | 75.1 | 61.9
+ coupled-GRPO | 73.2 | 68.3 | 78.6 | 67.5

That +4.4-point jump on EvalPlus brings the diffusion model within striking distance of Qwen2.5-Coder-SFT while comfortably outpacing earlier dLLMs like Dream-7B and LLaDA-Instruct.

Why it matters

Diffusion’s parallel denoising lets models “think in drafts,” revisiting earlier lines without paying the quadratic attention tax AR models incur for long contexts. For enterprise dev-ops teams staring down thousand-line files, a diffusion-native coder that no longer needs block-wise hacks could slash latency and memory. And because coupled-GRPO is plug-and-play, the method can in theory retrofit any masked diffusion LLM — not just Apple’s.

Early tooling and ecosystem

A DiffuCoder-7B-Instruct checkpoint is already live on Hugging Face, and the GitHub repo ships with sampling scripts, RL rewards and evaluation harnesses. That means startups building unit-test agents or code-review copilots can kick the tires today on a single A100.
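Loading the checkpoint follows the usual Hugging Face pattern; in the sketch below the repo id and the need for `trust_remote_code` are assumptions based on the post, and actual diffusion sampling should go through the scripts in the DiffuCoder repo rather than the standard `generate()` loop.

```python
# Minimal loading sketch (verify the repo id on Hugging Face before use).
import torch
from transformers import AutoModel, AutoTokenizer

repo = "apple/DiffuCoder-7B-Instruct"  # assumed checkpoint id from the post

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda").eval()

prompt = "Write a Python function that checks whether a string is a palindrome."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
# Diffusion LLMs are not decoded with the standard autoregressive generate() loop;
# use the sampling scripts shipped with the DiffuCoder repo for actual generation.
```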

The bigger question is whether diffusion LLMs can climb the performance ladder as fast as their image cousins did in 2022. Apple’s coupled-GRPO shows one path forward: make RL native to diffusion instead of forcing AR habits onto a fundamentally different beast. If follow-up work scales the idea to 34 B or 70 B parameters, AR incumbents may soon find themselves sharing the podium.

Paper link: arXiv 2506.20639 (PDF)
