
23.7.25

KAT‑V1 teaches big models when to think—smarter answers, fewer tokens

Large language models excel at reasoning, but they often over-reason, spewing page-long chains of thought that waste tokens and inflate latency. Kuaishou says its new KAT-V1 fixes that inefficiency with an AutoThink paradigm that dynamically switches between explicit reasoning and terse replies based on task difficulty. The result: a 40B-parameter model that matches or beats much larger rivals on the toughest benchmarks in its class while trimming compute.

Three ingredients behind AutoThink

  • Dual-regime dataset: a tagging pipeline plus multi-agent synthesis labels each sample as reasoning or no-reasoning, creating paired traces for mode training. Why it matters: it gives the model a supervised sense of when to think aloud.

  • MTP-enhanced knowledge distillation: Multi-Token-Prediction transfers fine-grained reasoning skills from a tutor model at far lower pre-training cost. Why it matters: fine-grained signal without billions of tokens.

  • Step-SRPO RL: reinforcement learning that adds intermediate supervision to GRPO so the agent optimises both mode selection and answer accuracy in one loop. Why it matters: it aligns "think vs. skip" decisions with final reward.
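The Step-SRPO idea of adding intermediate supervision to GRPO can be pictured with a toy reward. This is a minimal sketch, not the paper's actual implementation: the weighted-sum shaping, the weights, and the function names are illustrative assumptions.

```python
def step_srpo_reward(chosen_mode: str, target_mode: str, answer_ok: bool,
                     w_mode: float = 0.3, w_answer: float = 0.7) -> float:
    """Toy Step-SRPO-style reward: intermediate mode supervision + final answer reward."""
    # Intermediate supervision: did the policy pick the labelled mode?
    r_mode = 1.0 if chosen_mode == target_mode else 0.0
    # Final supervision: is the answer correct?
    r_answer = 1.0 if answer_ok else 0.0
    return w_mode * r_mode + w_answer * r_answer

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style group-relative advantages: normalise rewards within one sampled group."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    sd = var ** 0.5 or 1.0  # avoid division by zero for constant groups
    return [(r - mu) / sd for r in rewards]
```

With both terms in one scalar, a single policy-gradient loop can credit the "think vs. skip" decision and the final answer together, which is the alignment the table describes.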

Benchmark highlights

  • LiveCodeBench Pro (leakage‑controlled): tops all open models and edges past OpenAI o3‑mini.

  • Math, logic & reasoning suites: consistently equals or beats DeepSeek‑R1‑0528 and Qwen3‑235B‑A22B with 40 % fewer active parameters.

  • Token efficiency: AutoThink cuts average response length and thus total token usage (exact numbers vary by task but run tens of percent lower than straight chain‑of‑thought baselines).

Why this matters

  • Saves tokens, not quality. AutoThink shows you can claw back compute cost without the accuracy drop that usually comes with shorter outputs.

  • Controllable verbosity. Developers can enforce hard token budgets or latency targets by toggling mode thresholds.

  • Scales up. A 200 B Mixture‑of‑Experts version with 40 B active weights is already training and showing bigger gains, hinting at a fresh scaling path that isn’t just “more parameters.”

Open for business

KAT‑V1 weights, Step‑SRPO code, and the dual‑regime dataset are live on Hugging Face, and the model already powers Kwaipilot, Kuaishou’s internal coding copilot, where engineers report faster completions and fewer hallucinations.

AutoThink is a reminder that the next leap in LLM performance may come not from thinking harder—but from knowing when not to think at all.

Paper link: arXiv 2507.08297 (PDF)

4.7.25

Keye-VL: Kuaishou’s 8-billion-parameter bid to dominate video-first AI

 If image-centric multimodal large language models (MLLMs) were last year’s breakout stars, 2025 is shaping up to be all about video. Today Kuaishou’s research arm quietly published the Kwai Keye-VL Technical Report, unveiling an 8-billion-parameter model that claims state-of-the-art results across every major short-video benchmark — all while staying lean enough to fine-tune on a single A100 or RTX 6000.

Built on data — 600 billion tokens of it

Keye-VL’s recipe starts with scale where it matters: data. The team curated a 600 billion-token corpus heavily skewed toward short videos, supplementing it with images and pure text for balance. Training unfolds in a four-stage pre-train pipeline (image-text matching ➜ ViT-LLM alignment ➜ multi-task pre-train ➜ annealing) and a two-phase post-train that injects reasoning skill through a five-mode “cold-start” mixture (think / no-think / auto-think / think-with-image / high-quality video) plus reinforcement-learning alignment to squash repetition and hallucination.
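The five-mode cold-start mixture amounts to sampling each training example's mode from a weighted distribution. A minimal sketch, assuming illustrative mixture weights (the paper does not publish the actual proportions):

```python
import random

# Hypothetical mixture weights for the five cold-start modes; not from the paper.
MODE_MIX = {
    "think": 0.30,
    "no_think": 0.25,
    "auto_think": 0.20,
    "think_with_image": 0.15,
    "high_quality_video": 0.10,
}

def sample_mode(rng: random.Random) -> str:
    """Draw a training mode for one example from the cold-start mixture."""
    modes, weights = zip(*MODE_MIX.items())
    return rng.choices(modes, weights=weights, k=1)[0]
```

Skewing the mixture toward "think" early and annealing toward "auto_think" later would be one natural schedule, though the report doesn't specify one.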

A hybrid SigLIP + Qwen3 backbone

Under the hood, Keye-VL bolts a SigLIP vision encoder onto Qwen3-8B, then unifies text, image and video tokens with 3-D RoPE positional encoding. Dynamic-resolution support keeps aspect ratios intact, while an isomorphic-heterogeneous parameter-fusion trick averages weights from differently mixed data regimes to boost robustness without extra FLOPs.

Crushing the video leaderboards

On Video-MME, Video-MMMU, TempCompass, LongVideoBench and MMVU, Keye-VL outperforms every open-source or proprietary model in its size class, according to the authors. They also introduce KC-MMBench, a purpose-built benchmark of real-world short-video tasks, where Keye-VL “shows a significant advantage” over larger rivals. While the paper withholds exact deltas pending conference review, the accompanying GitHub charts depict double-digit gains on several suites.

Why it matters

Short-form video is the lingua franca of Gen Z commerce and social search — but decoding dozens of rapid cuts, subtitles and visual gags is still a blind spot for many MLLMs. By feeding a video-centric diet into a lightweight backbone, Kuaishou positions Keye-VL as both a production-ready recommendation engine for its 600-million-user platform and a developer-friendly alternative to heavyweight research models like Gemini 1.5 Pro or OpenAI’s rumored VideoGPT.

Open weights, open benchmark

An 8B preview checkpoint is already live on Hugging Face, complete with a keye-vl-utils helper library and Colab demo. KC-MMBench’s evaluation scripts ship in the same repo, inviting outside labs to reproduce — or refute — Kuaishou’s numbers. For startups building shopping stream copilots or automated highlight reels, a smaller, video-savvy foundation could be the missing piece.

Keye-VL still faces unanswered questions — latency under real-time loads, licensing around its internal data, and how well the “think-with-image” mode generalizes beyond curated prompts. But if the benchmarks hold up, Kuaishou just proved you don’t need GPT-sized weights to understand the world in motion.

Paper link: arXiv 2507.01949 (PDF)
