Why this matters
If you’ve ever tried to train an LLM-powered agent with many tool calls spread across a genuine back-and-forth conversation, you’ve probably discovered that “multi-turn” means different things to different frameworks. Yanbin Jiang’s latest post shows how the VeRL team punched through that ceiling by grafting LangGraph directly onto VeRL’s reinforcement-learning rollout engine. The result is a training loop that speaks the same language as production code.
1. Where they started
- Native VeRL multi-turn – great for quick experiments. You enable `multi_turn: True`, write a YAML schema for each tool, implement an async Python class, and you’re off; their GSM8K benchmark ran in two days.
- Pain points:
  - Double bookkeeping: every tool had to be declared twice (YAML + Python).
  - Drift: schema and code fell out of sync, and prod tools (written for LangChain/LangGraph) diverged from the “training” clones.
2. A quick stop-gap: automatic tool wrapping
Yanbin added `BaseTool.from_callable()`, which introspects any plain Python function with `transformers.utils.get_json_schema`, then fabricates a VeRL-compatible wrapper on the fly. One list of callables (`tool_list = [multiply, add, …]`) now powers both training and prod.
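To see why that kills the YAML step, here’s a minimal sketch of the introspection half (`multiply` is a stand-in callable; the real `from_callable()` goes further and builds the VeRL-compatible wrapper around it):

```python
from transformers.utils import get_json_schema

def multiply(a: int, b: int) -> int:
    """Multiply two integers.

    Args:
        a: First factor.
        b: Second factor.
    """
    return a * b

# The signature + docstring become an OpenAI-style tool schema,
# so nobody hand-maintains a second YAML copy of the tool.
schema = get_json_schema(multiply)
print(schema["function"]["name"])        # -> "multiply"
print(schema["function"]["parameters"])  # -> JSON schema for a and b
```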
My dev take: this is the same pattern I use in LangChain when I decorate business logic with `@tool`. Nice to see VeRL admit “if you can’t beat reflection, join it.”
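For comparison, the LangChain version of the same reflection trick (standard `langchain_core` API; `multiply` is again a stand-in):

```python
from langchain_core.tools import tool

@tool
def multiply(a: int, b: int) -> int:
    """Multiply two integers."""
    return a * b

# The decorator reflects the signature + docstring into the tool's
# schema, and the result stays an ordinary runnable you can ship.
print(multiply.name)                      # -> "multiply"
print(multiply.invoke({"a": 6, "b": 7}))  # -> 42
```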
3. The real blocker: orchestration power
Research quickly outgrew VeRL’s built-in rollout:
| Need | Why VeRL fell short |
| --- | --- |
| Dynamic branches & backtracking | Native graph was too rigid. |
| True multi-turn dialogue (user follow-ups) | Any assistant message without tool calls ended the convo. |
| Per-node sampling / chat-template tweaks | Global settings only. |
4. Architectural insight: separation of concerns
> “Let VeRL manage actor weights & hardware; let LangGraph drive the conversation.”

So they built a LangChain-compatible chat-model client for VeRL’s SGLang server. Training now works like this:

1. VeRL hands the initial messages + model handle to the user’s LangGraph.
2. The graph does its thing—branching, retrying, invoking tools—using the exact actor weights being optimized.
3. When the graph stops, VeRL collects the message history and rewards.
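The post doesn’t reproduce the client itself, but since SGLang speaks the OpenAI protocol, a rough stand-in conveys the shape of the bridge (the URL and model name below are my assumptions, not the PR’s actual client):

```python
# Conceptual stand-in, not the PR's client: SGLang exposes an
# OpenAI-compatible endpoint, so a LangChain chat model pointed at the
# rollout server drives the live actor weights being optimized.
from langchain_openai import ChatOpenAI

chat = ChatOpenAI(
    base_url="http://localhost:30000/v1",  # assumed SGLang address
    api_key="EMPTY",                       # assumed; local server, no auth
    model="default",                       # placeholder model name
)
```

From the graph’s point of view, `chat` is just another LangChain model, which is exactly what lets training and prod share one code path.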
The PR shows a seven-line YAML snippet that swaps the old rollout for the LangGraph one, plus a 60-line example graph that binds tools, counts turns, and lets you vary `temperature` node-by-node.
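Not the PR’s example graph, but a sketch in the same spirit, assuming the `chat` client and `tool_list` from above are in scope:

```python
from typing import Annotated
from typing_extensions import TypedDict

from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

MAX_TURNS = 4  # assumed cap; the real graph picks its own stopping rule

class State(TypedDict):
    messages: Annotated[list, add_messages]  # new messages are appended
    turns: int

def agent(state: State):
    # Hypothetical per-node sampling tweak: sample hotter on later turns.
    temp = 0.2 if state["turns"] == 0 else 0.8
    llm = chat.bind_tools(tool_list).bind(temperature=temp)
    return {"messages": [llm.invoke(state["messages"])],
            "turns": state["turns"] + 1}

def route(state: State):
    last = state["messages"][-1]
    # The graph, not a hard-coded rollout loop, decides when the episode
    # ends; you could just as easily route to a user-simulator node here.
    if getattr(last, "tool_calls", None) and state["turns"] < MAX_TURNS:
        return "tools"
    return END

builder = StateGraph(State)
builder.add_node("agent", agent)
builder.add_node("tools", ToolNode(tool_list))
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", route)
builder.add_edge("tools", "agent")
graph = builder.compile()
```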
5. Why I’m excited
- One graph to rule them all – deployment and training share code; no more “but it worked in prod!”
- Easier ablations – want to test a new branch strategy? Edit the graph script; the RL pipeline stays untouched.
- Framework-agnostic future – the same bridge pattern could plug VeRL into OpenAI Function Calling, Microsoft’s AutoGen, or whatever framework wins next year.
My takeaway
VeRL just became a lot more attractive for serious agent RL work. By leaning on LangGraph instead of extending an in-house orchestration DSL, the team keeps VeRL laser-focused on fast rollouts, leaves graph logic to a dedicated library, and—crucially—lets devs iterate on one codebase. If you’re juggling duplicate tool definitions or fighting mismatch between training and production, clone Yanbin’s PR and breathe easier.
Explore it more here: https://jybsuper.github.io/posts/langgraph_rollout/