Showing posts with label Alibaba. Show all posts

6.7.25

WebSailor charts an open-source course to super-human web reasoning

 For the past year, open-source web agents have looked like dinghies chasing aircraft carriers: even 70-billion-parameter models scraped single-digit accuracy on BrowseComp-en, the field’s toughest information-seeking benchmark, while closed systems such as DeepResearch and Grok-3 cruised far ahead. Tongyi Lab, Alibaba’s applied-AI skunkworks, says it has all but closed that gap with WebSailor, a post-training recipe that rewires large language models to “think like uncertainty-slayers.” 

Turning the web into a maze on purpose

At the heart of WebSailor is SailorFog-QA, a synthetic dataset that bombards the model with “Level-3” problems—questions whose answers hide behind tangled entity graphs and deliberately obfuscated clues (“a musician later honored in the early 21st century,” “a chronology that ends the same year a late-antique poet died”). Random walks over real web pages build those graphs; masking, vagueness and partial names turn each query into a fog bank the agent must burn off through multi-step reasoning. 
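The pipeline can be pictured with a toy sketch: walk a small entity graph to collect a chain of facts, then swap concrete names for vague descriptors so the question can only be answered by multi-step search. This is illustrative only; the graph, the hand-written descriptors, and the walk length are all made up here, not taken from the SailorFog-QA implementation.

```python
import random

# Toy knowledge graph: entity -> list of (relation, entity) edges.
GRAPH = {
    "Arvo Pärt": [("honored", "Praemium Imperiale 2008")],
    "Praemium Imperiale 2008": [("awarded_in", "2008")],
    "2008": [],
}

# Hand-written vague paraphrases standing in for the paper's large-scale
# masking/partial-name obfuscation.
OBFUSCATIONS = {
    "Arvo Pärt": "a musician honored in the early 21st century",
    "2008": "the late 2000s",
}

def random_walk(graph, start, steps, rng):
    """Collect (head, relation, tail) facts along a random walk."""
    facts, node = [], start
    for _ in range(steps):
        edges = graph.get(node)
        if not edges:
            break
        rel, nxt = rng.choice(edges)
        facts.append((node, rel, nxt))
        node = nxt
    return facts

def obfuscate(facts):
    """Replace concrete entities with vague descriptors -- the 'fog'."""
    return [(OBFUSCATIONS.get(h, h), r, OBFUSCATIONS.get(t, t))
            for h, r, t in facts]

rng = random.Random(0)
facts = random_walk(GRAPH, "Arvo Pärt", 3, rng)
foggy = obfuscate(facts)
```

An agent then sees only the foggy triples, and must burn the fog off by searching its way back to the concrete entities.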

DUPO: reinforcement learning that isn’t painfully slow

Tool-using agents learn painfully slowly because every step calls a browser, but Tongyi Lab’s Duplicating Sampling Policy Optimization (DUPO) makes each RL batch pull double duty: one pass samples harder trajectories, the next re-samples mid-episode to squeeze more signal from sparse rewards. A small rejection-sampling fine-tuning (RFT) “cold start” of just 2k expert traces primes the model so DUPO has something to optimize.
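One way to read the “duplicating sampling” idea: drop rollout groups whose rewards don’t vary (they carry no advantage signal), then refill the batch by duplicating the informative groups. The sketch below is a loose interpretation of that filtering step, not the paper’s exact algorithm, and the group format is invented for illustration.

```python
import random

def dupo_batch(groups, batch_size, rng):
    """Keep only rollout groups whose rewards vary (non-zero advantage),
    then pad the batch by duplicating those informative groups.
    groups: list of per-question reward lists from sampled trajectories."""
    informative = [g for g in groups if max(g) != min(g)]
    if not informative:
        return []
    batch = list(informative)
    while len(batch) < batch_size:
        batch.append(rng.choice(informative))
    return batch[:batch_size]

# All-success and all-failure groups are dropped; the mixed group fills the batch.
groups = [[1, 1, 1], [0, 1, 0], [0, 0, 0]]
batch = dupo_batch(groups, 4, random.Random(0))
```

With sparse browser-step rewards, most groups are all-zero, so this filtering is where the wall-clock savings come from.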

Four sizes, one giant leap

WebSailor comes in 3B, 7B, 32B and 72B flavors. Even the 7-billion-parameter version hits 6.7% pass@1 on BrowseComp-en, trouncing agents built on 32B backbones that manage barely 2–3%. The 32B and 72B models push further, outscoring open-source peers on BrowseComp-en/zh, GAIA and XBench and edging past proprietary offerings like Grok-3 and Doubao-Search when those systems add browsing tools.

Why it matters

  • Democratizing deep search. BrowseComp-level tasks—ask a question, navigate a dozen-plus pages, synthesize an answer—are what corporate knowledge bases and vertical search startups need. WebSailor shows you no longer need a closed-source giant to play.

  • A recipe, not a model. The RFT cold start, uncertainty-first data and DUPO optimizer are architecture-agnostic; any ReAct-style agent with tool APIs can adopt them.

  • Downward compatibility. Despite training only on headache-grade puzzles, WebSailor’s 72B model scores >90% pass@1 on the single-hop SimpleQA benchmark, proving that hard-first curricula don’t break easy tasks. 

Open weights, open benchmark

Code, data-generation scripts and checkpoints live in Tongyi Lab’s GitHub repo, alongside a dockerized evaluator so outside teams can reproduce—or dispute—the numbers. 

With WebSailor, the open-source fleet finally has a flagship capable of keeping proprietary juggernauts in sight. The real question now: how long before someone splices SailorFog-style data and DUPO into a general-purpose agent that can shop, schedule and navigate enterprise wikis with the same super-human calm?

Paper link: arXiv 2507.02592 (PDF)

7.6.25

Alibaba's Qwen3-Embedding and Qwen3-Reranker: Redefining Multilingual Embedding and Ranking Standards

 Alibaba's Qwen team has unveiled two groundbreaking models: Qwen3-Embedding and Qwen3-Reranker, aiming to revolutionize multilingual text embedding and relevance ranking. These models are designed to address the complexities of multilingual natural language processing (NLP) tasks, offering enhanced performance and versatility.

Key Features and Capabilities

  • Multilingual Proficiency:
    Both models support an impressive array of 119 languages, making them among the most versatile open-source offerings available today. 

  • Model Variants:
    Available in three sizes—0.6B, 4B, and 8B parameters—these models cater to diverse deployment needs, balancing efficiency and performance. 

  • State-of-the-Art Performance:
    Qwen3-Embedding and Qwen3-Reranker have achieved top rankings on multiple benchmarks, including MTEB, MMTEB, and MTEB-Code, outperforming leading models like Gemini. 

  • Versatile Applications:
    These models are optimized for a range of tasks such as semantic retrieval, classification, retrieval-augmented generation (RAG), sentiment analysis, and code search. 

Technical Innovations

The Qwen3 models are built upon a dense transformer-based architecture with causal attention, producing high-fidelity embeddings by taking the hidden state of the sequence's final token as the text representation. The training pipeline incorporates large-scale weak supervision and supervised fine-tuning, ensuring robustness and adaptability across various applications. 
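The last-token pooling step, and the cosine ranking an embedding model feeds, can be sketched with plain numpy; the real models of course use learned transformer weights, and the toy arrays below are invented for illustration.

```python
import numpy as np

def last_token_embedding(hidden_states, attention_mask):
    """Pick the hidden state of each sequence's final non-padded token.
    hidden_states: (batch, seq, dim); attention_mask: (batch, seq) of 0/1."""
    last_idx = attention_mask.sum(axis=1) - 1
    return hidden_states[np.arange(hidden_states.shape[0]), last_idx]

def cosine_rank(query_vec, doc_vecs):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))

# Batch of 2 sequences, 3 tokens, 4-dim states; first sequence has 1 pad token.
hidden = np.arange(24, dtype=float).reshape(2, 3, 4)
mask = np.array([[1, 1, 0], [1, 1, 1]])
emb = last_token_embedding(hidden, mask)
```

A reranker like Qwen3-Reranker instead scores each (query, document) pair jointly, trading this cheap dot-product ranking for higher accuracy on the top candidates.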

Open-Source Commitment

In line with Alibaba's commitment to fostering open research, the Qwen3-Embedding and Qwen3-Reranker models are released under the Apache 2.0 license. They are accessible on platforms like Hugging Face, GitHub, and ModelScope, providing researchers and developers with the tools to innovate and build upon these models. 

Implications for the AI Community

The introduction of Qwen3-Embedding and Qwen3-Reranker marks a significant advancement in the field of multilingual NLP. By offering high-performance, open-source models capable of handling complex tasks across numerous languages, Alibaba empowers the AI community to develop more inclusive and effective language processing tools.

References:

  1. Qwen GitHub

  2. Qwen3-Embedding Collection

  3. Hugging Face Collection

9.5.25

Alibaba’s ZeroSearch: Empowering AI to Self-Train and Slash Costs by 88%

 On May 8, 2025, Alibaba Group unveiled ZeroSearch, an innovative reinforcement learning framework designed to train large language models (LLMs) in information retrieval without relying on external search engines. This approach not only enhances the efficiency of AI training but also significantly reduces associated costs.

Revolutionizing AI Training Through Simulation

Traditional AI training methods for search capabilities depend heavily on real-time interactions with search engines, leading to substantial API expenses and unpredictable data quality. ZeroSearch addresses these challenges by enabling LLMs to simulate search engine interactions within a controlled environment. The process begins with a supervised fine-tuning phase, transforming an LLM into a retrieval module capable of generating both relevant and irrelevant documents in response to queries. Subsequently, a curriculum-based rollout strategy is employed during reinforcement learning to gradually degrade the quality of generated documents, enhancing the model's ability to discern and retrieve pertinent information. 
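The curriculum step above amounts to a schedule: as training progresses, a growing fraction of the simulated documents are noisy or irrelevant. Here is a minimal sketch using a linear ramp; the actual schedule and the 0.75 ceiling are assumptions for illustration, not the paper's parameters.

```python
import random

def noise_probability(step, total_steps, p_start=0.0, p_end=0.75):
    """Linear curriculum: the share of noisy simulated documents rises
    over training (illustrative schedule, not ZeroSearch's exact one)."""
    frac = min(step / total_steps, 1.0)
    return p_start + (p_end - p_start) * frac

def simulate_docs(n_docs, step, total_steps, rng):
    """Stand-in for the fine-tuned simulation LLM: emit a mix of useful
    and noisy documents according to the current curriculum stage."""
    p = noise_probability(step, total_steps)
    return ["noisy" if rng.random() < p else "useful" for _ in range(n_docs)]

docs = simulate_docs(5, step=50, total_steps=100, rng=random.Random(0))
```

Early batches are almost all useful documents, so the policy learns the basic retrieve-and-answer loop before it has to cope with a degraded, adversarial "search engine."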

Achieving Superior Performance at Reduced Costs

In extensive evaluations across seven question-answering datasets, ZeroSearch demonstrated performance on par with, and in some cases surpassing, models trained using actual search engines. Notably, a 14-billion-parameter retrieval module trained with ZeroSearch outperformed Google Search in specific benchmarks. Financially, the benefits are substantial; training with approximately 64,000 search queries using Google Search via SerpAPI would cost about $586.70, whereas utilizing a 14B-parameter simulation LLM on four A100 GPUs incurs only $70.80—a remarkable 88% reduction in costs. 
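The 88% figure checks out against the article's own numbers:

```python
serpapi_cost = 586.70  # ~64,000 Google queries via SerpAPI (article's figure)
sim_cost = 70.80       # 14B simulation LLM on four A100 GPUs (article's figure)

reduction = 1 - sim_cost / serpapi_cost  # fraction of cost eliminated
print(f"{reduction:.0%}")
```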

Implications for the AI Industry

ZeroSearch's introduction marks a significant shift in AI development paradigms. By eliminating dependence on external search engines, developers gain greater control over training data quality and reduce operational costs. This advancement democratizes access to sophisticated AI training methodologies, particularly benefiting startups and organizations with limited resources. Furthermore, the open-source release of ZeroSearch's code, datasets, and pre-trained models on platforms like GitHub and Hugging Face fosters community engagement and collaborative innovation. 

Looking Ahead

As AI continues to evolve, frameworks like ZeroSearch exemplify the potential for self-sufficient learning models that minimize external dependencies. This development not only streamlines the training process but also paves the way for more resilient and adaptable AI systems in various applications.

4.5.25

Alibaba Launches Qwen3: A New Contender in Open-Source AI

 Alibaba has introduced Qwen3, a series of open-source large language models (LLMs) designed to rival leading AI models in performance and accessibility. The Qwen3 lineup includes eight models: six dense and two utilizing the Mixture-of-Experts (MoE) architecture, which activates specific subsets of the model for different tasks, enhancing efficiency.

Benchmark Performance

The flagship model, Qwen3-235B-A22B, boasts 235 billion parameters and has demonstrated superior performance compared to OpenAI's o1 and DeepSeek's R1 on benchmarks like ArenaHard, which assesses capabilities in software engineering and mathematics. Its performance approaches that of proprietary models such as Google's Gemini 2.5-Pro. 

Hybrid Reasoning Capabilities

Qwen3 introduces hybrid reasoning, allowing users to toggle between rapid responses and more in-depth, compute-intensive reasoning processes. This feature is accessible via the Qwen Chat interface or through specific prompts like /think and /no_think, providing flexibility based on task complexity. 
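In practice the soft switch is just a tag appended to the user turn; a minimal sketch, assuming the chat-template plumbing around it is handled elsewhere:

```python
def build_prompt(user_msg, deep_reasoning):
    """Append Qwen3's per-turn soft switch for hybrid reasoning:
    /think requests the compute-intensive reasoning trace,
    /no_think requests a rapid direct answer."""
    switch = "/think" if deep_reasoning else "/no_think"
    return f"{user_msg} {switch}"

prompt = build_prompt("Prove that the square root of 2 is irrational.", True)
```

The switch applies per turn, so a conversation can mix quick lookups with occasional deep-reasoning requests without reloading the model.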

Accessibility and Deployment

All Qwen3 models are released under the Apache 2.0 open-source license, ensuring broad accessibility for developers and researchers. They are available on platforms such as Hugging Face, ModelScope, Kaggle, and GitHub, and can be interacted with directly through the Qwen Chat web interface and mobile applications.


Takeaway:
Alibaba's Qwen3 series marks a significant advancement in open-source AI, delivering performance that rivals proprietary models while maintaining accessibility and flexibility. Its hybrid reasoning capabilities and efficient architecture position it as a valuable resource for developers and enterprises seeking powerful, adaptable AI solutions.

Qwen2.5-Omni-3B: Bringing Advanced Multimodal AI to Consumer Hardware

Alibaba's Qwen team has unveiled Qwen2.5-Omni-3B, a streamlined 3-billion-parameter version of its flagship multimodal AI model. Tailored for consumer-grade PCs and laptops, this model delivers robust performance across text, audio, image, and video inputs without the need for high-end enterprise hardware.

Key Features:

  • Multimodal Capabilities: Processes diverse inputs including text, images, audio, and video, generating coherent text and natural speech outputs in real time.

  • Thinker-Talker Architecture: Employs a dual-module system where the "Thinker" handles text generation and the "Talker" manages speech synthesis, ensuring synchronized and efficient processing.

  • TMRoPE (Time-aligned Multimodal RoPE): Introduces a novel position embedding technique that aligns audio and video inputs temporally, enhancing the model's comprehension and response accuracy.

  • Resource Efficiency: Optimized for devices with 24GB VRAM, the model reduces memory usage by over 50% compared to its 7B-parameter predecessor, facilitating deployment on standard consumer hardware.

  • Voice Customization: Offers built-in voice options, "Chelsie" (female) and "Ethan" (male), allowing users to tailor speech outputs to specific applications or audiences.
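The time-alignment idea behind TMRoPE can be sketched crudely: tokens from the audio and video streams that were captured at the same moment get the same temporal position index. This is a loose illustration of the alignment only; the real TMRoPE factors rotary embeddings across temporal, height, and width axes.

```python
def temporal_positions(audio_times, video_times):
    """Assign audio/video tokens position ids from one shared timeline,
    so tokens captured at the same instant share a temporal index
    (a loose sketch of TMRoPE's alignment, not the full multimodal RoPE)."""
    timeline = sorted(set(audio_times) | set(video_times))
    index = {t: i for i, t in enumerate(timeline)}
    return ([index[t] for t in audio_times],
            [index[t] for t in video_times])

# Audio sampled at 0.5s intervals, video frames at 1s intervals.
audio_pos, video_pos = temporal_positions([0.0, 0.5, 1.0], [0.0, 1.0])
```

Sharing indices across modalities is what lets attention line up a spoken word with the video frame it occurred in.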

Deployment and Accessibility:

Qwen2.5-Omni-3B is available for download and integration via platforms like Hugging Face, GitHub, and ModelScope. Developers can deploy the model using frameworks such as Hugging Face Transformers, Docker containers, or vLLM. Optional optimizations, including FlashAttention 2 and BF16 precision, are supported to enhance performance and reduce memory consumption.

Licensing Considerations:

Currently, Qwen2.5-Omni-3B is released under a research-only license. Commercial use requires obtaining a separate license from Alibaba’s Qwen team.


Takeaway:
Alibaba's Qwen2.5-Omni-3B signifies a pivotal advancement in making sophisticated multimodal AI accessible to a broader audience. By delivering high-performance capabilities in a compact, resource-efficient model, it empowers developers and researchers to explore and implement advanced AI solutions on standard consumer hardware.
