
4.7.25

Keye-VL: Kuaishou’s 8-billion-parameter bid to dominate video-first AI

If image-centric multimodal large language models (MLLMs) were last year’s breakout stars, 2025 is shaping up to be all about video. Today Kuaishou’s research arm quietly published the Kwai Keye-VL Technical Report, unveiling an 8-billion-parameter model that claims state-of-the-art results across every major short-video benchmark — all while staying lean enough to fine-tune on a single A100 or RTX 6000.

Built on data — 600 billion tokens of it

Keye-VL’s recipe starts with scale where it matters: data. The team curated a 600 billion-token corpus heavily skewed toward short videos, supplementing it with images and pure text for balance. Training unfolds in a four-stage pre-train pipeline (image-text matching ➜ ViT-LLM alignment ➜ multi-task pre-train ➜ annealing) and a two-phase post-train that injects reasoning skill through a five-mode “cold-start” mixture (think / no-think / auto-think / think-with-image / high-quality video) plus reinforcement-learning alignment to squash repetition and hallucination.
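
The report describes this schedule only in prose; restated as a runnable outline, it looks roughly like the sketch below. Every identifier is illustrative and assumed, not Kuaishou’s actual code.

```python
# Keye-VL's training schedule as described in the technical report,
# restated as data. All names here are illustrative, not Kuaishou's code.

PRETRAIN_STAGES = [
    ("image_text_matching", "warm up vision-language pairing"),
    ("vit_llm_alignment",   "align ViT features with the LLM token space"),
    ("multi_task_pretrain", "joint objectives over video, image, and text"),
    ("annealing",           "final pass on high-quality data"),
]

POST_TRAIN_PHASES = [
    # Phase 1: five-mode "cold start" supervised mixture
    ("cold_start_sft", ["think", "no_think", "auto_think",
                        "think_with_image", "high_quality_video"]),
    # Phase 2: RL alignment targeting repetition and hallucination
    ("rl_alignment", ["anti_repetition", "anti_hallucination"]),
]

if __name__ == "__main__":
    for name, detail in PRETRAIN_STAGES + POST_TRAIN_PHASES:
        print(f"{name}: {detail}")
```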

A hybrid SigLIP + Qwen3 backbone

Under the hood, Keye-VL bolts a SigLIP vision encoder onto Qwen3-8B, then unifies text, image and video tokens with 3-D RoPE positional encoding. Dynamic-resolution support keeps aspect ratios intact, while an isomorphic-heterogeneous parameter-fusion trick averages weights from differently mixed data regimes to boost robustness without extra FLOPs.
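
Parameter fusion of this kind is typically plain weight averaging across checkpoints that share one architecture but saw different data mixtures. A minimal sketch, assuming uniform averaging (the report does not publish the exact recipe):

```python
# Weight-space fusion: average same-shape checkpoints trained on
# differently mixed data. The averaging scheme is an assumption;
# Keye-VL's exact fusion recipe is not published.
import torch

def fuse_checkpoints(state_dicts, weights=None):
    """Return the (optionally weighted) average of same-architecture state dicts."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

# Usage (hypothetical paths): fuse two data-mixture variants, then load.
# sd_video = torch.load("ckpt_video_heavy.pt")
# sd_image = torch.load("ckpt_image_heavy.pt")
# model.load_state_dict(fuse_checkpoints([sd_video, sd_image]))
```

Because the fused checkpoints are isomorphic, averaging happens once in weight space and costs nothing at inference time — which is what lets the trick add robustness without extra FLOPs.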

Crushing the video leaderboards

On Video-MME, Video-MMMU, TempCompass, LongVideoBench and MMVU, Keye-VL outperforms every open-source or proprietary model in its size class, according to the authors. They also introduce KC-MMBench, a purpose-built benchmark of real-world short-video tasks, where Keye-VL “shows a significant advantage” over larger rivals. While the paper withholds exact deltas pending conference review, the accompanying GitHub charts depict double-digit gains on several suites.

Why it matters

Short-form video is the lingua franca of Gen Z commerce and social search — but decoding dozens of rapid cuts, subtitles and visual gags is still a blind spot for many MLLMs. By feeding a video-centric diet into a lightweight backbone, Kuaishou positions Keye-VL as both a production-ready recommendation engine for its 600-million-user platform and a developer-friendly alternative to heavyweight research models like Gemini 1.5 Pro or OpenAI’s rumored VideoGPT.

Open weights, open benchmark

An 8B preview checkpoint is already live on Hugging Face, complete with a keye-vl-utils helper library and Colab demo. KC-MMBench’s evaluation scripts ship in the same repo, inviting outside labs to reproduce — or refute — Kuaishou’s numbers. For startups building shopping stream copilots or automated highlight reels, a smaller, video-savvy foundation could be the missing piece.
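
For orientation, loading the preview will likely look like any other remote-code vision-language model in transformers. The repo id and processor behavior below are assumptions; defer to the Hugging Face model card and the keye-vl-utils README.

```python
# Hypothetical quick start -- the repo id is assumed; check the model card.
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Kwai-Keye/Keye-VL-8B-Preview"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
```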

Keye-VL still faces unanswered questions — latency under real-time loads, licensing around its internal data, and how well the “think-with-image” mode generalizes beyond curated prompts. But if the benchmarks hold up, Kuaishou just proved you don’t need GPT-sized weights to understand the world in motion.

Paper link: arXiv 2507.01949 (PDF)

4.5.25

Alibaba Launches Qwen3: A New Contender in Open-Source AI

Alibaba has introduced Qwen3, a series of open-source large language models (LLMs) designed to rival leading AI models in performance and accessibility. The Qwen3 lineup includes eight models: six dense models and two built on the Mixture-of-Experts (MoE) architecture, which activates only a subset of the model’s parameters for each input, improving efficiency.

Benchmark Performance

The flagship model, Qwen3-235B-A22B, has 235 billion total parameters, of which roughly 22 billion are active per token (the “A22B” in its name). It has demonstrated superior performance to OpenAI’s o1 and DeepSeek’s R1 on benchmarks like ArenaHard, which assesses capabilities in software engineering and mathematics, and its performance approaches that of proprietary models such as Google’s Gemini 2.5-Pro.

Hybrid Reasoning Capabilities

Qwen3 introduces hybrid reasoning, allowing users to toggle between rapid responses and more in-depth, compute-intensive reasoning processes. This feature is accessible via the Qwen Chat interface or through specific prompts like /think and /no_think, providing flexibility based on task complexity. 
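
A minimal sketch of the prompt-based switch, using the dense Qwen3-8B checkpoint and the stock transformers chat flow; swapping /think for /no_think requests a fast, direct answer instead:

```python
# Toggle Qwen3's reasoning with the /think soft switch described above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Appending /think asks for the slower, compute-intensive reasoning mode.
messages = [{"role": "user", "content": "Solve 37 * 43 step by step. /think"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```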

Accessibility and Deployment

All Qwen3 models are released under the Apache 2.0 open-source license, ensuring broad accessibility for developers and researchers. They are available on platforms such as Hugging Face, ModelScope, Kaggle, and GitHub, and can be interacted with directly through the Qwen Chat web interface and mobile applications.


Takeaway:
Alibaba's Qwen3 series marks a significant advancement in open-source AI, delivering performance that rivals proprietary models while maintaining accessibility and flexibility. Its hybrid reasoning capabilities and efficient architecture position it as a valuable resource for developers and enterprises seeking powerful, adaptable AI solutions.
