If image-centric multimodal large language models (MLLMs) were last year’s breakout stars, 2025 is shaping up to be all about video. Today Kuaishou’s research arm quietly published the Kwai Keye-VL Technical Report, unveiling an 8-billion-parameter model that claims state-of-the-art results across the major short-video benchmarks, all while staying lean enough to fine-tune on a single A100 or RTX 6000.
Built on data — 600 billion tokens of it
Keye-VL’s recipe starts with scale where it matters: data. The team curated a 600 billion-token corpus heavily skewed toward short videos, supplementing it with images and pure text for balance. Training unfolds in a four-stage pre-train pipeline (image-text matching ➜ ViT-LLM alignment ➜ multi-task pre-train ➜ annealing) and a two-phase post-train that injects reasoning skill through a five-mode “cold-start” mixture (think / no-think / auto-think / think-with-image / high-quality video) plus reinforcement-learning alignment to squash repetition and hallucination.
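To make the cold-start idea concrete, here is a minimal sketch of how a five-mode mixture like the one described above could be sampled during post-training. The mode names follow the report’s description; the sampling weights are illustrative placeholders of ours, not figures from the paper.

```python
import random

# The five cold-start modes named in the report; the weights below are
# illustrative placeholders, NOT values taken from the paper.
COLD_START_MODES = {
    "think": 0.25,               # long chain-of-thought answers
    "no_think": 0.25,            # direct answers with no visible reasoning
    "auto_think": 0.20,          # model decides whether to reason
    "think_with_image": 0.15,    # reasoning grounded in visual evidence
    "high_quality_video": 0.15,  # curated video QA samples
}

def sample_mode(rng: random.Random) -> str:
    """Draw one training-sample mode according to the mixture weights."""
    modes, weights = zip(*COLD_START_MODES.items())
    return rng.choices(modes, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    counts = {mode: 0 for mode in COLD_START_MODES}
    for _ in range(10_000):
        counts[sample_mode(rng)] += 1
    print(counts)  # empirical mixture should roughly match the weights
```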
A hybrid SigLIP + Qwen3 backbone
Under the hood, Keye-VL bolts a SigLIP vision encoder onto Qwen3-8B, then unifies text, image and video tokens with 3-D RoPE positional encoding. Dynamic-resolution support keeps aspect ratios intact, while an isomorphic-heterogeneous parameter-fusion trick averages weights from differently mixed data regimes to boost robustness without extra FLOPs.
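For readers curious what that fusion step looks like mechanically, the sketch below uniformly averages parameters across checkpoints that share one architecture but were trained on different data mixtures. It illustrates the idea only; the report’s exact merging recipe may differ, and the checkpoint paths are hypothetical.

```python
import torch

def average_checkpoints(state_dicts: list[dict]) -> dict:
    """Uniformly average same-shaped parameters from several checkpoints.

    A minimal sketch of isomorphic-heterogeneous fusion: identical model
    structure, heterogeneous data regimes. The report's actual recipe may
    weight or select checkpoints differently.
    """
    assert state_dicts, "need at least one checkpoint"
    merged = {}
    for key in state_dicts[0]:
        tensors = [sd[key] for sd in state_dicts]
        if tensors[0].is_floating_point():
            stacked = torch.stack([t.float() for t in tensors], dim=0)
            merged[key] = stacked.mean(dim=0).to(tensors[0].dtype)
        else:
            # Integer buffers (e.g. step counters) are copied, not averaged.
            merged[key] = tensors[0].clone()
    return merged

# Hypothetical usage: merge three annealing-stage checkpoints trained on
# different data mixtures, then load the result back into the model.
# sds = [torch.load(p, map_location="cpu") for p in
#        ("ckpt_mix_a.pt", "ckpt_mix_b.pt", "ckpt_mix_c.pt")]
# model.load_state_dict(average_checkpoints(sds))
```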
Crushing the video leaderboards
On Video-MME, Video-MMMU, TempCompass, LongVideoBench and MMVU, Keye-VL outperforms every open-source or proprietary model in its size class, according to the authors. They also introduce KC-MMBench, a purpose-built benchmark of real-world short-video tasks, where Keye-VL “shows a significant advantage” over larger rivals. While the paper withholds exact deltas pending conference review, the accompanying GitHub charts depict double-digit gains on several suites.
Why it matters
Short-form video is the lingua franca of Gen Z commerce and social search — but decoding dozens of rapid cuts, subtitles and visual gags is still a blind spot for many MLLMs. By feeding a video-centric diet into a lightweight backbone, Kuaishou positions Keye-VL as both a production-ready recommendation engine for its 600-million-user platform and a developer-friendly alternative to heavyweight research models like Gemini 1.5 Pro or OpenAI’s rumored VideoGPT.
Open weights, open benchmark
An 8B preview checkpoint is already live on Hugging Face, complete with a keye-vl-utils helper library and Colab demo. KC-MMBench’s evaluation scripts ship in the same repo, inviting outside labs to reproduce — or refute — Kuaishou’s numbers. For startups building shopping stream copilots or automated highlight reels, a smaller, video-savvy foundation could be the missing piece.
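For developers who want to kick the tires, the snippet below shows one plausible way to load the preview checkpoint with Hugging Face transformers. The repo id, the video-aware chat template, and the generation settings are assumptions based on how similar open MLLMs are packaged; check the model card and the keye-vl-utils README for the exact calls.

```python
# Loading sketch, assuming the preview checkpoint follows the usual
# transformers remote-code pattern for multimodal chat models. The repo id
# and processor behavior below are assumptions to verify on the model card.
import torch
from transformers import AutoModel, AutoProcessor

MODEL_ID = "Kwai-Keye/Keye-VL-8B-Preview"  # assumed Hugging Face repo id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},
        {"type": "text", "text": "Summarize what happens in this short video."},
    ],
}]

# apply_chat_template with tokenize=True is a standard transformers call;
# whether Keye's processor also packs the video frames here is an assumption,
# and the repo's keye-vl-utils helpers may handle that step instead.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```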
Keye-VL still faces unanswered questions — latency under real-time loads, licensing around its internal data, and how well the “think-with-image” mode generalizes beyond curated prompts. But if the benchmarks hold up, Kuaishou just proved you don’t need GPT-sized weights to understand the world in motion.
Paper link: arXiv:2507.01949 (PDF)