The next wave of computer-use agents won’t just click around UIs—they’ll mix API calls and GUI interaction in one policy. That’s the bet behind ComputerRL, a new framework that treats desktop work as an end-to-end reinforcement learning problem and introduces an API-GUI paradigm so agents can call services and operate human-oriented interfaces within the same loop.
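In concrete terms, the paradigm gives the agent a single action space containing both structured service calls and GUI primitives. Here is a minimal sketch of what such an interface could look like; the `ApiCall`/`GuiAction` types and the `desktop` handle are illustrative assumptions, not the paper's actual API.

```python
# Sketch of a unified API-GUI action space (names are hypothetical, not the
# paper's interface): the policy emits either a structured API call or a
# low-level GUI action, and one dispatch function executes both kinds.
from dataclasses import dataclass
from typing import Union

@dataclass
class ApiCall:             # call a service (calendar, file system, web API) directly
    endpoint: str
    payload: dict

@dataclass
class GuiAction:           # operate the human-oriented interface
    kind: str              # "click", "type", "scroll", ...
    target: str             # element description or screen coordinates
    text: str = ""

Action = Union[ApiCall, GuiAction]

def execute(action: Action, desktop) -> str:
    """Dispatch one agent action inside a virtual desktop (assumed interface)."""
    if isinstance(action, ApiCall):
        return desktop.call_api(action.endpoint, action.payload)
    return desktop.perform_gui(action.kind, action.target, action.text)
```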
The missing infrastructure for scale
Training desktop agents with online RL has been hamstrung by slow, brittle environments. ComputerRL ships a distributed RL stack that orchestrates thousands of parallel virtual desktops, making long-horizon, on-policy training runs practical for general computer use.
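The engineering idea is straightforward to picture: many desktop environments run concurrently, each producing on-policy rollouts that a central learner consumes. A rough sketch under assumed `env` and `policy` interfaces (not the paper's actual stack):

```python
# Illustrative sketch of collecting on-policy rollouts from many virtual
# desktops in parallel; env.reset/env.step and policy.act are placeholder
# interfaces, not ComputerRL's real APIs.
from concurrent.futures import ThreadPoolExecutor

def rollout(env, policy, max_steps=50):
    """Run one episode in a single virtual desktop and return its trajectory."""
    obs, trajectory = env.reset(), []
    for _ in range(max_steps):
        action = policy.act(obs)
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        if done:
            break
    return trajectory

def collect_batch(envs, policy):
    """Fan rollouts out across all desktops at once; the RL learner consumes the batch."""
    with ThreadPoolExecutor(max_workers=len(envs)) as pool:
        return list(pool.map(lambda env: rollout(env, policy), envs))
```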
Stabilizing long runs: Entropulse
Pure RL on complex desktops tends to collapse exploration entropy over time. The authors propose Entropulse, a simple but effective schedule that alternates RL with supervised fine-tuning, restoring healthy entropy while retaining the gains from policy improvement.
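In pseudocode terms, the schedule looks roughly like the sketch below: run an RL phase, bank successful trajectories, fine-tune on them, and repeat. The phase lengths, success filter, and `rl_update`/`sft_update` methods are assumptions for illustration, not the paper's exact recipe.

```python
# Hedged sketch of an Entropulse-style schedule: alternate on-policy RL with
# supervised fine-tuning on collected successes so policy entropy recovers
# between RL phases. All hyperparameters and method names here are assumed.
def entropulse_training(policy, envs, collect_batch, num_cycles=4,
                        rl_steps=1000, sft_steps=200):
    replay = []                                   # successful trajectories kept for SFT
    for _ in range(num_cycles):
        # Phase 1: on-policy RL across the parallel desktops
        for _ in range(rl_steps):
            batch = collect_batch(envs, policy)   # e.g. the parallel collection sketched earlier
            policy.rl_update(batch)               # a policy-gradient step (assumed method)
            replay.extend(t for t in batch if t and t[-1][2] > 0)  # keep rewarded episodes

        # Phase 2: supervised fine-tuning on accumulated successes,
        # which restores exploration entropy before the next RL phase
        for _ in range(sft_steps):
            policy.sft_update(replay)             # assumed method
    return policy
```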
Results & models
Using open backbones (GLM-4-9B-0414 and Qwen2.5-14B), the team evaluates on OSWorld and reports a 48.1% success rate with AutoGLM-OS-9B, a new state of the art for general desktop automation in their setup. The framework underpins the group's AutoGLM system.
Why it matters
- Bridging the modality gap: Real workflows mix API calls with UI operations; ComputerRL trains a single policy to do both.
- Throughput for RL: Parallelized desktops unlock the scale online RL has needed for computer agents.
- Simple stability trick: Entropulse offers a practical recipe any lab can try to keep long runs from collapsing.
If your roadmap includes agents that file expenses, reconcile sheets, or run web apps end-to-end, ComputerRL reads like a blueprint for turning brittle demos into trainable, scalable systems.
Paper link: arXiv 2508.14040 (PDF)