Large language models excel at reasoning, but they often over-reason, producing page-long chains of thought that waste tokens and inflate latency. Kuaishou's Kwaipilot team says its new KAT-V1 solves that inefficiency with an AutoThink paradigm that dynamically switches between explicit reasoning and terse replies based on task difficulty. The result: a 40B-parameter model that matches or beats much larger rivals on demanding benchmarks while trimming compute.
Three ingredients behind AutoThink
| Building block | What it does | Why it matters |
|---|---|---|
| Dual-regime dataset | A tagging pipeline plus multi-agent synthesis label each sample as reasoning or no-reasoning, creating paired traces for mode training. | Gives the model a supervised sense of when to think aloud. |
| MTP-enhanced knowledge distillation | Multi-Token Prediction transfers fine-grained reasoning skills from a tutor model at far lower pre-training cost. | Fine-grained signal without billions of tokens. |
| Step-SRPO RL | Reinforcement learning that adds intermediate supervision to GRPO, so the agent optimises both mode selection and answer accuracy in one loop. | Aligns "think vs. skip" decisions with final reward. |
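To make the third ingredient concrete, here is a minimal sketch of what a Step-SRPO-style reward might look like: an intermediate term supervises the think/no-think choice alongside the usual final-answer reward. The function name, binary rewards, and weights are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical Step-SRPO-style reward: blends intermediate supervision on
# mode selection with the final answer reward. Weights are illustrative.
def step_srpo_reward(chose_think: bool, needs_think: bool,
                     answer_correct: bool,
                     mode_weight: float = 0.3,
                     answer_weight: float = 0.7) -> float:
    """Reward both picking the right mode and getting the answer right."""
    mode_reward = 1.0 if chose_think == needs_think else 0.0
    answer_reward = 1.0 if answer_correct else 0.0
    return mode_weight * mode_reward + answer_weight * answer_reward

# An easy question answered tersely and correctly earns the full reward;
# needlessly entering think mode forfeits the mode term even if the
# answer is right.
```

A single scalar like this is what lets one RL loop align "think vs. skip" decisions with answer accuracy, rather than training the two behaviours separately.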
Benchmark highlights
- LiveCodeBench Pro (leakage-controlled): tops all open models and edges past OpenAI o3-mini.
- Math, logic & reasoning suites: consistently equals or beats DeepSeek-R1-0528 and Qwen3-235B-A22B with 40% fewer active parameters.
- Token efficiency: AutoThink cuts average response length and thus total token usage (exact numbers vary by task but run tens of percent lower than straight chain-of-thought baselines).
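The "tens of percent" token savings fall out of simple expected-value arithmetic: if only a fraction of queries trigger long reasoning, average response length drops sharply. All numbers below are illustrative assumptions, not measurements from the paper.

```python
# Back-of-envelope estimate of token savings from mode mixing.
def avg_tokens(frac_think: float, think_len: int, terse_len: int) -> float:
    """Expected tokens per response when only some queries trigger thinking."""
    return frac_think * think_len + (1 - frac_think) * terse_len

baseline = avg_tokens(1.0, 1200, 150)   # always-think baseline: 1200 tokens
autothink = avg_tokens(0.4, 1200, 150)  # assume 40% of queries need thought
savings = 1 - autothink / baseline      # fraction of tokens saved
```

Under these assumed lengths the mixed regime uses 570 tokens on average, a saving of over 50%, which is the kind of "tens of percent" reduction the benchmarks describe.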
Why this matters
- Token savings without a quality tax: AutoThink shows you can claw back compute cost without the typical accuracy drop.
- Controllable verbosity: developers can enforce hard token budgets or latency targets by toggling mode thresholds.
- It scales up: a 200B Mixture-of-Experts version with 40B active weights is already training and showing bigger gains, hinting at a scaling path that isn't just "more parameters."
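The "controllable verbosity" point could look something like the sketch below on the deployment side: a threshold on an estimated difficulty score, gated by the remaining token budget. The scoring interface and cost figure are hypothetical, not Kuaishou's actual API.

```python
# Hypothetical deployment-side gate: enter think mode only when the task
# looks hard enough AND the token budget can absorb a long reasoning trace.
def select_mode(difficulty: float, think_threshold: float,
                remaining_budget: int, think_cost: int = 1000) -> str:
    """Return 'think' or 'no-think' given a difficulty score in [0, 1]."""
    if difficulty >= think_threshold and remaining_budget >= think_cost:
        return "think"
    return "no-think"

# Raising think_threshold trades some accuracy on borderline tasks for a
# hard cap on latency and spend.
```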
Open for business
KAT‑V1 weights, Step‑SRPO code, and the dual‑regime dataset are live on Hugging Face, and the model already powers Kwaipilot, Kuaishou’s internal coding copilot, where engineers report faster completions and fewer hallucinations.
AutoThink is a reminder that the next leap in LLM performance may come not from thinking harder—but from knowing when not to think at all.
Paper link: arXiv 2507.08297 (PDF)