Showing posts with label dual‑regime dataset. Show all posts

23.7.25

KAT‑V1 teaches big models when to think—smarter answers, fewer tokens

Large language models excel at reasoning but often over‑reason, producing page‑long chains of thought that waste tokens and add latency. Kuaishou’s Kwaipilot team says its new KAT‑V1 tackles that inefficiency with an AutoThink paradigm that dynamically switches between explicit reasoning and terse replies based on task difficulty. The result is a 40B‑parameter model that matches or beats much larger rivals on demanding reasoning benchmarks while trimming compute.

Three ingredients behind AutoThink

Building block | What it does | Why it matters
Dual‑regime dataset | A tagging pipeline and a multi‑agent synthesis strategy label each sample as reasoning or no‑reasoning, creating paired traces for mode training. | Gives the model a supervised sense of when to think aloud.
MTP‑enhanced knowledge distillation | Multi‑Token Prediction transfers fine‑grained reasoning skills from a teacher model at far lower pre‑training cost. | Fine‑grained signal without billions of extra training tokens.
Step‑SRPO RL | Reinforcement learning that adds intermediate supervision to GRPO, so the model optimises both mode selection and answer accuracy in one loop. | Aligns “think vs. skip” decisions with the final reward.
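The post does not spell out Step‑SRPO’s reward, so here is only a minimal Python sketch of the GRPO core it builds on, under stated assumptions: each prompt gets a group of sampled rollouts, each rollout is scored by a hypothetical composite reward (answer correctness plus a bonus for choosing the mode the dual‑regime label asks for), and advantages are computed relative to the group. `Rollout`, `composite_reward`, `should_think` and `mode_weight` are illustrative names, not part of the KAT‑V1 code.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Rollout:
    used_thinking: bool  # True if the policy chose the explicit-reasoning mode
    correct: bool        # True if the final answer passed the checker


def composite_reward(r: Rollout, should_think: bool, mode_weight: float = 0.5) -> float:
    """Hypothetical reward: answer accuracy plus a bonus for picking the labeled mode."""
    answer_reward = 1.0 if r.correct else 0.0
    mode_reward = mode_weight if r.used_thinking == should_think else 0.0
    return answer_reward + mode_reward


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: (reward - group mean) / group std, per sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(x - mean) / (std + eps) for x in rewards]


# Four rollouts for one easy prompt that the (assumed) dataset label tags as no-reasoning.
group = [
    Rollout(used_thinking=False, correct=True),
    Rollout(used_thinking=True, correct=True),
    Rollout(used_thinking=False, correct=False),
    Rollout(used_thinking=True, correct=False),
]
rewards = [composite_reward(r, should_think=False) for r in group]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.5, 1.0, 0.5, 0.0]
```

Normalising within the group means a rollout is rewarded only for beating its siblings on the same prompt, so on a prompt labeled as not needing reasoning, a correct terse answer outranks an equally correct but verbose one.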

Benchmark highlights

  • LiveCodeBench Pro (leakage‑controlled): tops all open models and edges past OpenAI o3‑mini.

  • Math, logic and reasoning suites: consistently matches or beats DeepSeek‑R1‑0528 and Qwen3‑235B‑A22B despite a fraction of their total size (40B versus 671B and 235B total parameters).

  • Token efficiency: AutoThink shortens average responses and thus cuts total token usage (savings vary by task but run tens of percent below straight chain‑of‑thought baselines).

Why this matters

  • Saves tokens, not quality. AutoThink shows you can claw back compute cost without the typical accuracy drop.

  • Controllable verbosity. Developers can enforce hard token budgets or latency targets by toggling mode thresholds.

  • Scales up. A 200B Mixture‑of‑Experts version with 40B active parameters is already in training, and early results show further gains, hinting at a fresh scaling path that isn’t just “more parameters.”
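The post does not document how KAT‑V1 exposes its verbosity controls, so what follows is only a hypothetical serving‑side wrapper in Python, assuming some upstream difficulty score is available. Raising `think_threshold` routes more traffic to terse replies, while `max_think_tokens` and a remaining budget put a hard ceiling on reasoning length. `ModePolicy`, `choose_mode` and the difficulty score are illustrative assumptions, not the model’s actual API.

```python
from dataclasses import dataclass


@dataclass
class ModePolicy:
    think_threshold: float  # hypothetical: minimum difficulty score that unlocks reasoning
    max_think_tokens: int   # hypothetical hard cap on reasoning tokens per request


def choose_mode(difficulty: float, policy: ModePolicy, remaining_budget: int) -> tuple[str, int]:
    """Pick ("think", token_cap) or ("no_think", 0) for one request.

    `difficulty` is an assumed score in [0, 1] from some upstream estimator;
    `remaining_budget` is the caller's leftover token allowance.
    """
    if difficulty >= policy.think_threshold and remaining_budget > 0:
        # Allow explicit reasoning, but never beyond the per-request cap or the budget.
        return "think", min(policy.max_think_tokens, remaining_budget)
    # Easy request or exhausted budget: answer tersely with no reasoning trace.
    return "no_think", 0


policy = ModePolicy(think_threshold=0.6, max_think_tokens=2048)
print(choose_mode(0.9, policy, remaining_budget=10_000))  # ('think', 2048)
print(choose_mode(0.3, policy, remaining_budget=10_000))  # ('no_think', 0)
```

Keeping the gate outside the model makes the budget easy to audit, at the cost of overriding AutoThink’s own mode choice whenever the two disagree.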

Open for business

KAT‑V1 weights, Step‑SRPO code, and the dual‑regime dataset are live on Hugging Face, and the model already powers Kwaipilot, Kuaishou’s internal coding copilot, where engineers report faster completions and fewer hallucinations.

AutoThink is a reminder that the next leap in LLM performance may come not from thinking harder—but from knowing when not to think at all.

Paper link: arXiv 2507.08297 (PDF)

 Anyone who has watched today’s end‑to‑end robot policies fail a complex kitchen task knows the weakness: they map pixels to motors with no ...