3.7.25

Baidu Open-Sources ERNIE 4.5: A Full LLM Family from 0.3 B to 424 B Parameters

 

A Flagship Release for the Open-Source Community

On July 1 2025, Baidu announced the open-source launch of ERNIE 4.5, a complete large-language-model family scaling from 0.3 billion to 424 billion parameters. The weights, training code, and evaluation suites are now freely available to researchers and enterprises under the Apache 2.0 license.

Six Sizes, One Architecture

ModelDense / MoEContext WindowTarget Hardware*Intended Use
ERNIE-Tiny 0.3BDense16 KMobile/EdgeLightweight chat & IoT
ERNIE-Base 7BDense32 K1× A10 24 GBMainstream apps
ERNIE-Large 34BDense128 K2× A100 80 GBRAG & agents
ERNIE-XL 124BMoE (8 experts)256 K4× H100 80 GBMultimodal research
ERNIE-Mega 276BMoE (16)256 K8× H100 80 GBEnterprise AI
ERNIE-Ultra 424BMoE (24)1 MTPU v5p / 16× H100Frontier-level reasoning

*at int8 + FlashAttention-2 settings

Technology Highlights

  • FlashMask Dynamic Attention – a masking scheme that activates only the most relevant key-value blocks per token, cutting memory by 40 % while retaining context depth.

  • Heterogeneous Multimodal MoE – vision-audio experts share early layers with text, enabling cross-modal reasoning without separate encoders.

  • Knowledge-Centric Corpus – Baidu’s in-house “Wenxin KG-2” injects 4 T tokens of curated facts and regulations, boosting compliance answers.

  • Self-Feedback Post-Training – iterative reflection steps reduce hallucination rate by 28 % vs. ERNIE 4.0.

Benchmark Performance

Benchmark (June 2025)GPT-4.5*ERNIE 4.5-Ultra 424BERNIE 4.5-Large 34B
MMLU (5-shot)88.7 %89.3 %82.1 %
MathGLUE55.4 %57.2 %48.0 %
VQA-v2 (zero-shot)83.0 %84.6 %78.9 %
Code HumanEval+93.5 %94.1 %87.3 %

*closed model; public leaderboard values. ERNIE 4.5 data from Baidu release notes.

Why It Matters

  1. End-to-End Transparency – full training configs (FlashMask, MoE routing, safety filters) are published, enabling reproducible research.

  2. Scalable Deployment – identical API across sizes lets startups choose Tiny/7B locally and swap to 424B in the cloud without prompt changes.

  3. Multilingual & Multimodal – supports 34 languages and native image, audio, and short-video tokens out of the box.

  4. Cost Innovation – FlashMask and MoE shrink inference FLOPs by up to 55 % versus dense GPT-4-class models, lowering GPU bills for enterprise users.

Access & Tooling

  • Hugging Face Hub – weights and safetensors for all six checkpoints.

  • Docker & vLLM Images – ready-to-serve stacks with Triton / TensorRT-LLM.

  • Agent Starter Kits – sample Model-Context-Protocol (MCP) tools for retrieval, calculators, and code execution.

  • Chinese & English Docs – prompt templates, fine-tuning scripts, and safety policy examples.

Roadmap

Baidu’s research blog notes upcoming “ERNIE 4.6” experiments with FlashMask-2 and sparse Mixture-of-Experts vision heads, plus a policy-aligned Turbo variant targeting 80 % cheaper inference for chat applications.


Takeaway
With ERNIE 4.5, Baidu throws open the doors to a fully transparent, parameter-scalable, multimodal LLM family—giving practitioners a home-grown alternative to closed giants and pushing the frontier of what open-source models can achieve.

Together AI’s DeepSWE Turns Qwen3-32B into an Open-Source Coding Agent that Tops SWEBench

 

A New State of the Art for Open-Source Coding Agents

Together AI has unveiled DeepSWE, a software-engineering agent that sets a new open-weight record on the notoriously difficult SWEBench-Verified benchmark with 59 % accuracy and 42.2 % Pass@1. Built on Alibaba’s Qwen3-32B language model and trained purely with reinforcement learning, DeepSWE offers a transparent alternative to closed-source dev assistants like GitHub Copilot and Claude Code. 


Inside the Training Pipeline

StageDetails
Warm-StartInitializes from base Qwen3-32B weights (dense, 32 B params).
R2E-Gym Curriculum4,500 real GitHub issues converted into step-by-step repair tasks spanning six languages (Python, Java, JS, Go, Rust, C++).
RLHF LoopUses a reward model that scores test-suite pass rates and diff conciseness; policy optimized with PPO across 64 × H100s for six days.
Self-Reflect & DistillHigh-reward trajectories distilled back into the policy to improve “first-try” success.

The team openly publishes all training code, reward scripts, and checkpoints under Apache 2.0, enabling independent replication or domain-specific finetuning. 

Why DeepSWE Matters

  1. One-Shot Repairs over Multi-Tool Chains
    DeepSWE fixes repository-level bugs in a single forward pass, skipping heavyweight agent stacks that juggle search, planning, and external compilers.

  2. Reinforcement Learning at Scale
    Proves that RL alone—without supervised trace data—can yield production-grade coding skills when paired with a high-capacity base model.

  3. Transparent & Portable
    Enterprises can self-host the model, audit its reward functions, and retrain on private codebases without licensing friction.


Benchmark Highlights

BenchmarkDeepSWE (32 B)DeepSeek-R1-Synth (67 B)GPT-4o (closed)
SWEBench-Verified59 %46 %64 %
HumanEval Plus93.1 %87.4 %95 %
CommitPackBench71.3 %63.0 %74 %

DeepSWE closes nearly half of the gap to GPT-4-class tools while running on a single 80 GB H100 GPU in int8 mode.

Real-World Capabilities

  • Bug Repair & Refactor – Generates minimal diffs that compile and pass project test suites.

  • Feature Stubs – Adds new endpoints, CLI flags, or unit tests on request.

  • Context Stretch – Accepts up to 64 K tokens, allowing multi-file reasoning across large repos.

Together AI provides an OpenAI-compatible API plus a VS Code extension that surfaces proposed patches as Git diffs for quick human review.


Roadmap

The team plans to:

  • Release a 13 B “consumer PC” variant trained on the same reward curriculum.

  • Add tool-augmented variants that can invoke package managers and linters dynamically.

  • Expand R2E-Gym to 10 K tasks, covering Android and .NET ecosystems.


Takeaway

DeepSWE demonstrates that meticulous RL on a strong open base (Qwen3-32B) can rival closed commercial coders—while remaining fully inspectable and modifiable. For organizations seeking sovereign AI development stacks, it’s a compelling invitation to “clone the repo, load the weights, and start fixing code.”

Baidu’s “AI Search Paradigm” Unveils a Four-Agent Framework for Next-Generation Information Retrieval

 

A Blueprint for Smarter Search

Traditional RAG pipelines handle simple fact look-ups well but struggle when queries require multi-step reasoning, tool use, or synthesis. In response, Baidu Research has introduced the AI Search Paradigm, a unified framework in which four specialized LLM-powered agents collaborate to emulate human research workflows. 

AgentRoleKey Skills
MasterClassifies query difficulty & launches a workflowMeta-reasoning, task routing
PlannerBreaks the problem into ordered sub-tasksDecomposition, tool selection
ExecutorCalls external APIs or web search to gather evidenceRetrieval, browsing, code-run
WriterConsolidates evidence into fluent, cited answersSynthesis, style control

The architecture adapts on the fly: trivial queries may bypass planning, while open-ended questions trigger full agent collaboration.

Technical Innovations

  • Dynamic Workflow Graphs – Agents spawn or skip steps in real time based on intermediate results, avoiding rigid “one-size-fits-all” chains.

  • Robust Tool Layer – Executor can invoke search APIs, calculators, code sandboxes, and custom enterprise databases, all via a common interface.

  • Alignment & Safety – Reinforcement learning with human feedback (RLHF) plus retrieval-grounding reduce hallucinations and improve citation accuracy.


Benchmark Results

On a suite of open-web reasoning tasks the system, dubbed Baidu ASP in the paper, surpasses state-of-the-art open-source baselines and even challenges proprietary models that rely on massive context windows alone.

Benchmark    Prior Best (RAG)    Baidu ASP
Complex QA (avg. F1)                    46.2           57.8
Multi-hop HotpotQA (Exact Match)                41.5               53.0
ORION Deep-Search                37.1            49.6

Practical Implications

  • Enterprise Knowledge Portals – Route user tickets through Planner→Executor→Writer to surface compliant, fully referenced answers.

  • Academic Research Assistants – Decompose literature reviews into sub-queries, fetch PDFs, and synthesize summaries.

  • E-commerce Assistants – From “Find a laptop under $800 that runs Blender” to a shoppable list with citations in a single interaction.

Because each agent is modular, organisations can fine-tune or swap individual components—e.g., plugging in a domain-specific retrieval tool—without retraining the entire stack.


Looking Ahead

The team plans to open-source a reference implementation and release an evaluation harness so other researchers can benchmark new agent variants under identical conditions. Future work focuses on:

  • Reducing latency by parallelising Executor calls

  • Expanding the Writer’s multimodal output (tables, charts, code diffs)

  • Hardening the Master agent’s self-diagnosis to detect and recover from tool failures


Takeaway
Baidu’s AI Search Paradigm reframes search as a cooperative, multi-agent process, merging planning, tool use, and natural-language synthesis into one adaptable pipeline. For enterprises and researchers seeking deeper, trustable answers—not just blue links—this approach signals how tomorrow’s search engines and internal knowledge bots will be built.

 There's a video making the rounds where someone claims to build an entire affiliate marketing business in about an hour — a website, Pi...