A New State of the Art for Open-Source Coding Agents
Together AI has unveiled DeepSWE, a software-engineering agent that sets a new open-weight record on the notoriously difficult SWEBench-Verified benchmark with 59 % accuracy and 42.2 % Pass@1. Built on Alibaba’s Qwen3-32B language model and trained purely with reinforcement learning, DeepSWE offers a transparent alternative to closed-source dev assistants like GitHub Copilot and Claude Code.
Inside the Training Pipeline
Stage | Details |
---|---|
Warm-Start | Initializes from base Qwen3-32B weights (dense, 32 B params). |
R2E-Gym Curriculum | 4,500 real GitHub issues converted into step-by-step repair tasks spanning six languages (Python, Java, JS, Go, Rust, C++). |
RLHF Loop | Uses a reward model that scores test-suite pass rates and diff conciseness; policy optimized with PPO across 64 × H100s for six days. |
Self-Reflect & Distill | High-reward trajectories distilled back into the policy to improve “first-try” success. |
Why DeepSWE Matters
-
One-Shot Repairs over Multi-Tool Chains
DeepSWE fixes repository-level bugs in a single forward pass, skipping heavyweight agent stacks that juggle search, planning, and external compilers. -
Reinforcement Learning at Scale
Proves that RL alone—without supervised trace data—can yield production-grade coding skills when paired with a high-capacity base model. -
Transparent & Portable
Enterprises can self-host the model, audit its reward functions, and retrain on private codebases without licensing friction.
Benchmark Highlights
Benchmark | DeepSWE (32 B) | DeepSeek-R1-Synth (67 B) | GPT-4o (closed) |
---|---|---|---|
SWEBench-Verified | 59 % | 46 % | 64 % |
HumanEval Plus | 93.1 % | 87.4 % | 95 % |
CommitPackBench | 71.3 % | 63.0 % | 74 % |
Real-World Capabilities
-
Bug Repair & Refactor – Generates minimal diffs that compile and pass project test suites.
-
Feature Stubs – Adds new endpoints, CLI flags, or unit tests on request.
-
Context Stretch – Accepts up to 64 K tokens, allowing multi-file reasoning across large repos.
Together AI provides an OpenAI-compatible API plus a VS Code extension that surfaces proposed patches as Git diffs for quick human review.
Roadmap
The team plans to:
-
Release a 13 B “consumer PC” variant trained on the same reward curriculum.
-
Add tool-augmented variants that can invoke package managers and linters dynamically.
-
Expand R2E-Gym to 10 K tasks, covering Android and .NET ecosystems.
Takeaway
DeepSWE demonstrates that meticulous RL on a strong open base (Qwen3-32B) can rival closed commercial coders—while remaining fully inspectable and modifiable. For organizations seeking sovereign AI development stacks, it’s a compelling invitation to “clone the repo, load the weights, and start fixing code.”