
31.7.25

X-Omni proves RL can make token-based image generators great again

Diffusion may rule today’s text-to-image scene, but Tencent researchers just reminded everyone why discrete autoregressive models still matter. In a paper titled “X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again,” they show that a sprinkle of reward learning turns a 7B LLM that predicts visual tokens into a Sora-class image engine, all while natively sharing weights with language generation.

Three moving parts

| Module | Job | RL impact |
| --- | --- | --- |
| Semantic image tokenizer | Converts 32 × 32 patch features into a 65k-token vocabulary without vector-quantization blur. | Supplies denser reward signals than pixel-level losses. |
| Unified AR backbone | One transformer handles both language and image tokens; no diffusion head during training. | After SFT it overfits, but RL fixes fidelity and instruction following. |
| Offline diffusion decoder | A lightweight “decompressor” turns token grids into crisp 1K-px frames. | Keeps inference < 2 s on a single A100. |
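
Put together, the three modules form a single generate loop: the backbone streams visual tokens, and the diffusion decoder renders them into pixels afterward. Here's a minimal sketch of that flow; every class, method, and placeholder internal (SemanticTokenizer, ARBackbone, DiffusionDecoder) is a hypothetical stand-in, not the paper's actual API.

```python
import torch

class SemanticTokenizer:
    """Training-side encoder: maps patch features to a 65k discrete vocabulary."""
    VOCAB_SIZE = 65_536

    def __init__(self, dim: int = 16):
        # Random codebook stands in for the learned semantic quantizer.
        self.codebook = torch.randn(self.VOCAB_SIZE, dim)

    def encode(self, patch_features: torch.Tensor) -> torch.Tensor:
        # Nearest-codebook lookup: (num_patches, dim) -> (num_patches,) token ids.
        return torch.cdist(patch_features, self.codebook).argmin(dim=-1)

class ARBackbone:
    """One transformer for both text and image tokens; stubbed with random logits."""
    def next_token(self, context: torch.Tensor) -> torch.Tensor:
        logits = torch.randn(SemanticTokenizer.VOCAB_SIZE)  # placeholder LM head
        return torch.multinomial(logits.softmax(-1), 1)

class DiffusionDecoder:
    """Offline 'decompressor': token grid -> RGB frame; no role in AR training."""
    def decode(self, tokens: torch.Tensor, size: int = 1024) -> torch.Tensor:
        return torch.rand(3, size, size)  # placeholder pixels

def generate(prompt_ids: torch.Tensor, num_image_tokens: int = 32 * 32) -> torch.Tensor:
    backbone, decoder = ARBackbone(), DiffusionDecoder()
    context = prompt_ids.clone()
    for _ in range(num_image_tokens):  # stream the full visual-token grid
        context = torch.cat([context, backbone.next_token(context)])
    return decoder.decode(context[-num_image_tokens:])

ids = SemanticTokenizer().encode(torch.randn(1024, 16))  # training-side encoding
image = generate(torch.tensor([1, 2, 3]))                # (3, 1024, 1024) tensor
```

Note that the decoder sits entirely outside the autoregressive loop, which is why the language side never sees diffusion gradients.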

Why reinforcement learning?

Supervised fine-tuning left the model with warped faces and garbled typography. Policy-gradient updates—rewarded for CLIP aesthetics, OCR accuracy and prompt adherence—steadily cleaned up artifacts and nailed complex layouts, something best-of-N sampling couldn’t match.
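
The update itself has the shape of REINFORCE over sampled image tokens, with the three signals blended into one scalar reward. Below is a toy sketch of that loop; the blend weights, the stub scorers, and the ToyARPolicy class are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn

VOCAB = 1024  # toy visual vocabulary; the real tokenizer uses ~65k entries

class ToyARPolicy(nn.Module):
    """Stand-in for the 7B AR backbone, reduced to what the RL step needs."""
    def __init__(self):
        super().__init__()
        self.lm_head = nn.Linear(8, VOCAB)

    def sample_with_logprobs(self, prompt_ids, num_tokens=16):
        h = torch.randn(num_tokens, 8)  # placeholder hidden states
        dist = torch.distributions.Categorical(logits=self.lm_head(h))
        tokens = dist.sample()
        return tokens, dist.log_prob(tokens)

    def decode(self, tokens):
        return torch.rand(3, 64, 64)  # placeholder decoded image

# Illustrative stand-ins for the three reward models named in the post.
def clip_aesthetic_score(image): return image.mean().item()
def ocr_accuracy(image, prompt_ids): return 0.5
def prompt_adherence(image, prompt_ids): return 0.5

def composite_reward(image, prompt_ids) -> float:
    # Blend weights are assumptions, not the paper's values.
    return (0.3 * clip_aesthetic_score(image)
            + 0.4 * ocr_accuracy(image, prompt_ids)
            + 0.3 * prompt_adherence(image, prompt_ids))

def policy_gradient_step(model, optimizer, prompt_ids, baseline=0.5):
    """One REINFORCE update: sample an image, score it, reinforce its tokens."""
    tokens, log_probs = model.sample_with_logprobs(prompt_ids)
    image = model.decode(tokens)
    advantage = composite_reward(image, prompt_ids) - baseline  # variance reduction
    loss = -advantage * log_probs.sum()  # gradient ascent on expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model = ToyARPolicy()
policy_gradient_step(model, torch.optim.Adam(model.parameters(), lr=1e-5),
                     prompt_ids=torch.tensor([1, 2, 3]))
```

Because the reward is computed on the decoded image, failure modes an SFT loss never sees (broken glyphs, off-prompt content) feed directly back into the token policy.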

Early numbers worth noting

  • FID 1.7 on ImageNet-256 (beating DiT-XL by 9 %)

  • 99.2 % prompt compliance on the new LongText-Bench (Chinese + English captions up to 120 chars)

  • 3.5× faster than diffusion baselines at 1024 × 1024 when streaming tokens with Flash-Attn 3.0

  • < 8.5 GB VRAM for a distilled 1.3B variant (coming soon, according to the repo)

Why it matters

  1. Unified model, unified budget – No separate diffusion tower; language and image share the same 7B weights, making deployment simpler and cheaper.

  2. Long-text rendering solved – Posters, UI mock-ups and meme creators finally get reliable lettering without kludgy diffusion guidance.

  3. Open everything – Code, checkpoints and the 200-prompt LongText-Bench live on GitHub under Apache-2.0. Fine-tune away.

The bigger picture

Until now, researchers had mostly written off discrete AR image models as artifact-prone holdovers from DALL·E 1. X-Omni flips that narrative: with the right reward design, token predictors can match (and in text rendering, beat) diffusion’s photorealism while keeping the door open for seamless language–vision fusion and future any-to-any generation. Expect a resurgence of AR tokenizers, LoRA packs for brand fonts, and perhaps a new front in the multimodal model wars.

Paper link: arXiv 2507.22058 (PDF)
