
31.7.25

X-Omni proves RL can make token-based image generators great again

Diffusion may rule today’s text-to-image scene, but Tencent researchers just reminded everyone why discrete autoregressive models still matter. In a paper titled “X-Omni: Reinforcement Learning Makes Discrete Autoregressive Image Generative Models Great Again,” they show that a sprinkle of reward learning turns a 7B LLM that predicts visual tokens into a Sora-class image engine, all while natively sharing weights with language generation.

Three moving parts

| Module | Job | RL impact |
| --- | --- | --- |
| Semantic image tokenizer | Converts 32 × 32 patch features into a 65k-token vocabulary without vector-quantization blur. | Supplies denser reward signals than pixel-level losses. |
| Unified AR backbone | One transformer handles both language and image tokens; no diffusion head during training. | After SFT it overfits, but RL fixes fidelity and instruction following. |
| Offline diffusion decoder | A lightweight “decompressor” turns token grids into crisp 1K-px frames. | Keeps inference < 2 s on a single A100. |
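
Put together, the three modules form a single generate loop: the backbone streams visual tokens, and the diffusion decoder renders them into pixels afterward. Here's a minimal sketch of that flow; every class, method, and placeholder internal (SemanticTokenizer, ARBackbone, DiffusionDecoder) is a hypothetical stand-in, not the paper's actual API.

```python
import torch

class SemanticTokenizer:
    """Training-side encoder: maps patch features to a 65k discrete vocabulary."""
    VOCAB_SIZE = 65_536

    def __init__(self, dim: int = 16):
        # Random codebook stands in for the learned semantic quantizer.
        self.codebook = torch.randn(self.VOCAB_SIZE, dim)

    def encode(self, patch_features: torch.Tensor) -> torch.Tensor:
        # Nearest-codebook lookup: (num_patches, dim) -> (num_patches,) token ids.
        return torch.cdist(patch_features, self.codebook).argmin(dim=-1)

class ARBackbone:
    """One transformer for both text and image tokens; stubbed with random logits."""
    def next_token(self, context: torch.Tensor) -> torch.Tensor:
        logits = torch.randn(SemanticTokenizer.VOCAB_SIZE)  # placeholder LM head
        return torch.multinomial(logits.softmax(-1), 1)

class DiffusionDecoder:
    """Offline 'decompressor': token grid -> RGB frame; no role in AR training."""
    def decode(self, tokens: torch.Tensor, size: int = 1024) -> torch.Tensor:
        return torch.rand(3, size, size)  # placeholder pixels

def generate(prompt_ids: torch.Tensor, num_image_tokens: int = 32 * 32) -> torch.Tensor:
    backbone, decoder = ARBackbone(), DiffusionDecoder()
    context = prompt_ids.clone()
    for _ in range(num_image_tokens):  # stream the full visual-token grid
        context = torch.cat([context, backbone.next_token(context)])
    return decoder.decode(context[-num_image_tokens:])

ids = SemanticTokenizer().encode(torch.randn(1024, 16))  # training-side encoding
image = generate(torch.tensor([1, 2, 3]))                # (3, 1024, 1024) tensor
```

Note that the decoder sits entirely outside the autoregressive loop, which is why the language side never sees diffusion gradients.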

Why reinforcement learning?

Supervised fine-tuning left the model with warped faces and garbled typography. Policy-gradient updates—rewarded for CLIP aesthetics, OCR accuracy and prompt adherence—steadily cleaned up artifacts and nailed complex layouts, something best-of-N sampling couldn’t match.
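
The update itself has the shape of REINFORCE over sampled image tokens, with the three signals blended into one scalar reward. Below is a toy sketch of that loop; the blend weights, the stub scorers, and the ToyARPolicy class are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn

VOCAB = 1024  # toy visual vocabulary; the real tokenizer uses ~65k entries

class ToyARPolicy(nn.Module):
    """Stand-in for the 7B AR backbone, reduced to what the RL step needs."""
    def __init__(self):
        super().__init__()
        self.lm_head = nn.Linear(8, VOCAB)

    def sample_with_logprobs(self, prompt_ids, num_tokens=16):
        h = torch.randn(num_tokens, 8)  # placeholder hidden states
        dist = torch.distributions.Categorical(logits=self.lm_head(h))
        tokens = dist.sample()
        return tokens, dist.log_prob(tokens)

    def decode(self, tokens):
        return torch.rand(3, 64, 64)  # placeholder decoded image

# Illustrative stand-ins for the three reward models named in the post.
def clip_aesthetic_score(image): return image.mean().item()
def ocr_accuracy(image, prompt_ids): return 0.5
def prompt_adherence(image, prompt_ids): return 0.5

def composite_reward(image, prompt_ids) -> float:
    # Blend weights are assumptions, not the paper's values.
    return (0.3 * clip_aesthetic_score(image)
            + 0.4 * ocr_accuracy(image, prompt_ids)
            + 0.3 * prompt_adherence(image, prompt_ids))

def policy_gradient_step(model, optimizer, prompt_ids, baseline=0.5):
    """One REINFORCE update: sample an image, score it, reinforce its tokens."""
    tokens, log_probs = model.sample_with_logprobs(prompt_ids)
    image = model.decode(tokens)
    advantage = composite_reward(image, prompt_ids) - baseline  # variance reduction
    loss = -advantage * log_probs.sum()  # gradient ascent on expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model = ToyARPolicy()
policy_gradient_step(model, torch.optim.Adam(model.parameters(), lr=1e-5),
                     prompt_ids=torch.tensor([1, 2, 3]))
```

Because the reward is computed on the decoded image, failure modes an SFT loss never sees (broken glyphs, off-prompt content) feed directly back into the token policy.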

Early numbers worth noting

  • FID 1.7 on ImageNet-256 (beating DiT-XL by 9 %)

  • 99.2 % prompt compliance on the new LongText-Bench (Chinese + English captions up to 120 chars)

  • 3.5× faster than diffusion baselines at 1024 × 1024 when streaming tokens with Flash-Attn 3.0

  • < 8.5 GB VRAM for a distilled 1.3B variant (coming soon, according to the repo)

Why it matters

  1. Unified model, unified budget – No separate diffusion tower; language and image share the same 7B weights, making deployment simpler and cheaper.

  2. Long-text rendering solved – Posters, UI mock-ups and meme creators finally get reliable lettering without kludgy diffusion guidance.

  3. Open everything – Code, checkpoints and the 200-prompt LongText-Bench live on GitHub under Apache-2.0. Fine-tune away.

The bigger picture

Until now, researchers had mostly written off discrete AR image models as artifact-prone holdovers from DALL·E 1. X-Omni flips that narrative: with the right reward design, token predictors can match (and in text rendering, beat) diffusion’s photorealism while keeping the door open for seamless language–vision fusion and future any-to-any generation. Expect a resurgence of AR tokenizers, LoRA packs for brand fonts, and perhaps a new front in the multimodal model wars.

Paper link: arXiv 2507.22058 (PDF)
