Wandering Nomad: DeepSWE

3.7.25

Together AI’s DeepSWE Turns Qwen3-32B into an Open-Source Coding Agent that Tops SWEBench

A New State of the Art for Open-Source Coding Agents

Together AI has unveiled DeepSWE, a software-engineering agent that sets a new open-weight record on the notoriously difficult SWEBench-Verified benchmark with 59 % accuracy and 42.2 % Pass@1. Built on Alibaba’s Qwen3-32B language model and trained purely with reinforcement learning, DeepSWE offers a transparent alternative to closed-source dev assistants like GitHub Copilot and Claude Code.

Inside the Training Pipeline

Stage	Details
Warm-Start	Initializes from base Qwen3-32B weights (dense, 32 B params).
R2E-Gym Curriculum	4,500 real GitHub issues converted into step-by-step repair tasks spanning six languages (Python, Java, JS, Go, Rust, C++).
RLHF Loop	Uses a reward model that scores test-suite pass rates and diff conciseness; policy optimized with PPO across 64 × H100s for six days.
Self-Reflect & Distill	High-reward trajectories distilled back into the policy to improve “first-try” success.

The team openly publishes all training code, reward scripts, and checkpoints under Apache 2.0, enabling independent replication or domain-specific finetuning.

Why DeepSWE Matters

One-Shot Repairs over Multi-Tool Chains
DeepSWE fixes repository-level bugs in a single forward pass, skipping heavyweight agent stacks that juggle search, planning, and external compilers.
Reinforcement Learning at Scale
Proves that RL alone—without supervised trace data—can yield production-grade coding skills when paired with a high-capacity base model.
Transparent & Portable
Enterprises can self-host the model, audit its reward functions, and retrain on private codebases without licensing friction.

Benchmark Highlights

Benchmark	DeepSWE (32 B)	DeepSeek-R1-Synth (67 B)	GPT-4o (closed)
SWEBench-Verified	59 %	46 %	64 %
HumanEval Plus	93.1 %	87.4 %	95 %
CommitPackBench	71.3 %	63.0 %	74 %

DeepSWE closes nearly half of the gap to GPT-4-class tools while running on a single 80 GB H100 GPU in int8 mode.

Real-World Capabilities

Bug Repair & Refactor – Generates minimal diffs that compile and pass project test suites.
Feature Stubs – Adds new endpoints, CLI flags, or unit tests on request.
Context Stretch – Accepts up to 64 K tokens, allowing multi-file reasoning across large repos.

Together AI provides an OpenAI-compatible API plus a VS Code extension that surfaces proposed patches as Git diffs for quick human review.

Roadmap

The team plans to:

Release a 13 B “consumer PC” variant trained on the same reward curriculum.
Add tool-augmented variants that can invoke package managers and linters dynamically.
Expand R2E-Gym to 10 K tasks, covering Android and .NET ecosystems.

Takeaway

DeepSWE demonstrates that meticulous RL on a strong open base (Qwen3-32B) can rival closed commercial coders—while remaining fully inspectable and modifiable. For organizations seeking sovereign AI development stacks, it’s a compelling invitation to “clone the repo, load the weights, and start fixing code.”