10.7.25

CriticLean makes the AI “grader” the hero of math formalization

 Automating the translation of plain-English math into Lean code has felt like grading your own exam: language models write a proof, a compiler checks syntax, and everyone hopes the semantics line up. CriticLean flips that script by training a dedicated critic model—dubbed CriticLeanGPT—that learns to catch logical slips the compiler can’t. Guided by reinforcement learning, that critic doesn’t just reject bad code; it drives an iterative rewrite loop that more than doubles end-to-end accuracy.

From passive judge to active coach

The team fine-tunes a lightweight Qwen backbone to score whether a Lean statement truly matches its natural-language prompt, then bakes those scores into a reward signal. Each failed attempt becomes a teaching moment, producing richer feedback than the usual “compiler error” one-liner. The critic also powers CriticLeanBench, a 500-item test set (half correct, half adversarially wrong) that shows CriticLeanGPT trouncing both open and closed-source baselines at spotting semantic mistakes.

Hard numbers: 38 % → 84 % accuracy

On a 50-problem slice of the Omni-MATH benchmark, a 7 B “Kimina-Autoformalizer” model alone solved just 38 % of tasks. A traditional compiler-feedback loop nudged that to 54 %. Swap in CriticLean’s RL-trained critic and the success rate soars to 84 %—a 30-point leap even seasoned theorem-prover veterans will notice.

A broader 500-problem stress test tells the same story: the multi-attempt CriticLean pipeline verified 52.8 % of statements under a 200-try cap, recovering forty extra points of yield that single-pass systems would toss out.

A new 285 k-problem corpus (and 36 k “diamond” stumpers)

Because the critic can certify semantic correctness without humans, the authors bootstrapped FineLeanCorpus, a 285 ,957-entry Lean dataset spanning 16 math domains with a flatter difficulty curve than the skewed Lean-Workbook previously used for fine-tuning. They also carved out a FineLeanCorpus-Diamond subset—36 k brutal problems meant to push future models beyond textbook algebra.

Why this matters

  • Reliability over compilation. Syntax is easy; semantics are king. CriticLean proves that investing compute in the grading phase pays bigger dividends than ever-bigger generators.

  • Plug-and-play RL recipe. The critic-guided loop is model-agnostic and could supervise any auto-formalizer—Lean, Isabelle, even Coq.

  • Dataset flywheel. With FineLeanCorpus open-sourced, researchers finally have a large, semantically vetted playground instead of noisy web scrapes.

Whether you’re chasing fully automated theorem proving or just want ChatGPT to stop hallucinating Lean syntax, CriticLean’s message is clear: the smartest way forward is to teach your models how to critique themselves.

Paper link: arXiv 2507.06181 (PDF)

Phi-4-mini-flash-reasoning: Microsoft’s 3.8 B “Pocket” LLM that Delivers 10× Faster Math & Logic on Edge Devices

 

Why Another “Mini” Phi Model?

After a year of shipping tightly-focused small language models (SLMs) for reasoning, Microsoft’s Azure AI team has unveiled Phi-4-mini-flash-reasoning—a drop-in upgrade to the earlier Phi-4-mini that targets one pain point: speed. Where the original model excelled at step-by-step maths and logic, the new flash edition achieves up to 10 × higher token throughput and 2-3 × lower latency without sacrificing accuracy. It is purpose-built for resource-constrained hardware such as mobile handsets, single-GPU servers, classroom laptops, and IoT gateways. 


Inside the New Architecture

InnovationWhat It DoesWhy It Matters
SambaY Self-DecoderBlends state-space Mamba blocks with Sliding-Window Attention (SWA).Provides linear-time prefilling and local context capture.
Gated Memory Units (GMU)Tiny gating layers share representations between decoder blocks.Slashes compute during generation without harming quality.
Decoder-Hybrid-Decoder LayoutOne full-attention layer for KV cache, surrounded by lightweight Sambas and GMUs.Maintains long-context power (64 K tokens) while accelerating every other step.

Together these tricks let Phi-4-mini-flash-reasoning outrun not only its mini predecessor but also larger 6-7 B dense models on vLLM in real-time tests. 

Benchmark Snapshot

Metric (single A100-80 GB)Phi-4-mini-flashPhi-4-miniLlama-3-8B-Instruct
Inference latency (256 tok)≈ 40 ms95 ms120 ms
Throughput (tok/s)> 1 000110240
AIME 24/25 (Math, Pass@1)72 %70 %68 %
Math50081 %78 %73 %
GPQA-Diamond62 %60 %55 %

Microsoft internal numbers shown in the blog post graphs 

Developer Access & Tooling

  • Open Weights: Download from Hugging Face or the NVIDIA API Catalog under a permissive MIT-style licence.

  • Azure AI Foundry: One-click deployment with managed GPUs, safety filters, and function-calling.

  • vLLM-Ready: Microsoft supplies a reference --flash config enabling the advertised latency on a single GPU.

  • Edge Builds: TensorRT-LLM and ONNX Runtime packages for Jetson Orin, Apple Silicon, and high-end Android phones.


Ideal Use-Cases

  1. On-Device STEM Tutors – Real-time solution steps for maths homework without cloud calls.

  2. Industrial Logic Controllers – Quick symbolic reasoning for quality-control or robotics arms.

  3. AR/VR Headsets – Localised puzzle hints or game logic with < 50 ms response.

  4. Classroom Labs – Affordable single-GPU servers hosting dozens of simultaneous reasoning sessions.


Looking Ahead

The Azure team hints that the SambaY + GMU blueprint will flow into Phi-4-multimodal-flash later this year, targeting low-latency image and audio reasoning on the same small-footprint devices. Meanwhile, Phi-4-mini-flash-reasoning is live today—ready for developers who need big-brain logic in a micro power envelope.

Whether you’re building an educational app, a smart sensor, or just trimming cloud compute bills, “flash” Phi brings full reasoning to the edge—no compromise required.

Meta AI’s grand blueprint for embodied agents: put a world model at the core

 Move over “chatbots with arms.” Meta AI has published a sweeping manifesto that recasts embodied intelligence as a world-model problem. The 40-page paper, Embodied AI Agents: Modeling the World (July 7, 2025), is signed by a who’s-who of researchers from EPFL, Carnegie Mellon, NTU and Meta’s own labs, and argues that any meaningful agent—virtual, wearable or robotic—must learn a compact, predictive model of both the physical and the mental worlds it inhabits.

Three kinds of bodies, one cognitive engine

The authors sort today’s prototypes into three buckets:

  • Virtual agents (think emotionally intelligent avatars in games or therapy apps)

  • Wearable agents that live in smart glasses and coach you through daily tasks

  • Robotic agents capable of general-purpose manipulation and navigation

Despite wildly different form factors, all three need the same six ingredients: multimodal perception, a physical world model, a mental model of the user, action & control, short-/long-term memory, and a planner that ties them together.

What “world modeling” actually means

Meta’s framework breaks the catch-all term into concrete modules:

  1. Multimodal perception – image, video, audio and even touch encoders deliver a unified scene graph.

  2. Physical world model – predicts object dynamics and plans low- to high-level actions.

  3. Mental world model – tracks user goals, emotions and social context for better collaboration.

  4. Memory – fixed (weights), working and external stores that support life-long learning.

The paper contends that current generative LLMs waste compute by predicting every pixel or token. Instead, Meta is experimenting with transformer-based predictive models and JEPA-style latent learning to forecast just the state abstractions an agent needs to plan long-horizon tasks.

New benchmarks to keep them honest

To measure progress, the team proposes a suite of “world-model” stress tests—from Minimal Video Pairs for perceptual prediction to CausalVQA and the WorldPrediction benchmark that evaluates high-level procedural planning. Early results show humans near-perfect and SOTA multimodal models barely above chance, highlighting the gap Meta hopes to close.

Where they’re headed next

Two research directions top the agenda:

  • Embodied learning loops that pair System A (learning by passive observation) with System B (learning by physical action), each bootstrapping the other.

  • Multi-agent collaboration, where a family of specialized bodies—your glasses, a kitchen robot, and a home avatar—share a common world model and negotiate tasks.

Ethics is a running theme: privacy for always-on sensors and the risk of over-anthropomorphizing robots both get dedicated sections.

Why it matters

Meta isn’t open-sourcing code here; it’s setting the intellectual agenda. By declaring world models—not ever-larger GPTs—the “missing middle” of embodied AI, the company positions itself for a future where agents must act, not just talk. Expect the next iterations of Meta’s smart-glasses assistant (and perhaps its humanoid robot partners) to lean heavily on the blueprint sketched in this paper.

Paper link: arXiv 2506.22355 (PDF)

9.7.25

GPT-4o aces its multimodal classmates—but still can’t dethrone specialist vision models

 OpenAI’s GPT-4o may be the first flagship model to unify text, image and audio in a single stack, but a new EPFL benchmarking effort shows just how far even the best “everything” model still lags behind purpose-built computer-vision networks. In “How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks,” researchers tested GPT-4o alongside six other foundation models—o4-mini, Gemini 1.5 Pro, Gemini 2.0 Flash, Claude 3.5 Sonnet, Qwen2-VL and Llama 3.2—on six bread-and-butter CV tasks that every undergrad knows: ImageNet-style classification, MS-COCO object detection, semantic segmentation, instance grouping, monocular depth and surface-normal prediction.

Turning text-only giants into pixel workers

Most API-level models can’t output polygons or depth maps, so the team invented a prompt-chaining framework that decomposes each vision problem into a sequence of classification subtasks that any chatty LLM can answer. A recursive “zoom-and-vote” routine localises objects, SLIC superpixels stand in for pixels in segmentation, and pairwise ranking lets the models infer relative depth.

Key takeaways

FindingWhat happenedWhy it matters
Generalist, not specialistAll MFMs landed well below state-of-the-art CV models on every benchmark.Even massive cross-modal pre-training doesn’t yet replace task-specific supervision.
Semantic > geometricScores on classification, detection and segmentation were much higher than on depth or normals.MFMs learn semantics from caption data but have little innate 3-D understanding.
GPT-4o still best of breedGPT-4o topped the non-reasoning field in four of six tasks.Its larger context window and image-generation head translate into better pixel comprehension.
Reasoning helps with 3-DSmaller “o3” reasoning models edged GPT-4o on depth and normals.Structured chain-of-thought may compensate for weaker raw vision priors.
Prompt sensitivity drops with qualityHigher-capacity models varied less when the researchers tweaked prompt chains.Robustness could become a practical proxy for measuring model quality without labels.

The bigger picture

For product builders eyeing GPT-4o as a drop-in object detector, the study is a sobering reality check; you’ll still need a Mask R-CNN or SAM in the loop for pixel-perfect jobs. But the results also highlight the upside of super-general models: with zero fine-tuning and only clever prompting, GPT-4o can solve half a dozen vision tasks “well enough”—a compelling baseline for multimodal agents that prefer breadth over razor-edge accuracy.

The authors have open-sourced their fm-vision-evals framework so future models can be dropped into the same gauntlet—no weight access required. Expect the next wave of Gemini, Claude and Llama releases to cite these scores the way language-model papers brag about MMLU.

Paper link: arXiv 2507.01955 (PDF)

8.7.25

Context Engineering in AI: Designing the Right Inputs for Smarter, Safer Large-Language Models

 

What Is Context Engineering?

In classic software, developers write deterministic code; in today’s AI systems, we compose contexts. Context engineering is the systematic craft of designing, organizing and manipulating every token fed into a large-language model (LLM) at inference time—instructions, examples, retrieved documents, API results, user profiles, safety policies, even intermediate chain-of-thought. Well-engineered context turns a general model into a domain expert; poor context produces hallucinations, leakage or policy violations. 


Core Techniques

TechniqueGoalTypical Tools / Patterns
Prompt Design & TemplatesGive the model clear role, task, format and constraintsSystem + user role prompts; XML / JSON schemas; function-calling specs
Retrieval-Augmented Generation (RAG)Supply fresh, external knowledge just-in-timeVector search, hybrid BM25+embedding, GraphRAG
Context CompressionFit more signal into limited tokensSummarisation, saliency ranking, LLM-powered “short-former” rewriters
Chunking & WindowingPreserve locality in extra-long inputsHierarchical windows, sliding attention, FlashMask / Ring Attention
Scratchpads & CoT ScaffoldsExpose model reasoning for better accuracy and debuggabilitySelf-consistency, tree-of-thought, DST (Directed Self-Testing)
Memory & ProfilesPersonalise without retrainingVector memories, episodic caches, preference embeddings
Tool / API ContextLet models call and interpret external systemsModel Context Protocol (MCP), JSON-schema function calls, structured tool output
Policy & GuardrailsEnforce safety and brand styleContent filters, regex validators, policy adapters, YAML instruction blocks

Why It Matters

  1. Accuracy & Trust – Fact-filled, well-structured context slashes hallucination rates and citation errors.

  2. Privacy & Governance – Explicit control over what leaves the organisation or reaches the model helps meet GDPR, HIPAA and the EU AI Act.

  3. Cost Efficiency – Compressing or caching context can cut token bills by 50-80 %.

  4. Scalability – Multi-step agent systems live or die by fast, machine-readable context routing; good design tames complexity.


High-Impact Use Cases

SectorHow Context Engineering Delivers Value
Customer SupportRAG surfaces the exact policy paragraph and recent ticket history, enabling a single prompt to draft compliant replies.
Coding AgentsFunction-calling + repository retrieval feed IDE paths, diffs and test logs, letting models patch bugs autonomously.
Healthcare Q&AContext filters strip PHI before retrieval; clinically-approved guidelines injected to guide safe advice.
Legal AnalysisLong-context models read entire case bundles; chunk ranking highlights precedent sections for argument drafting.
Manufacturing IoTStreaming sensor data is summarised every minute and appended to a rolling window for predictive-maintenance agents.

Designing a Context Pipeline: Four Practical Steps

  1. Map the Task Surface
    • What knowledge is static vs. dynamic?
    • Which external tools or databases are authoritative?

  2. Define Context Layers
    Base prompt: role, format, policy
    Ephemeral layer: user query, tool results
    Memory layer: user or session history
    Safety layer: filters, refusal templates

  3. Choose Retrieval & Compression Strategies
    • Exact text (BM25) for short policies; dense vectors for semantic match
    • Summaries or selective quoting for large PDFs

  4. Instrument & Iterate
    • Log token mixes, latency, cost
    • A/B test different ordering, chunking, or reasoning scaffolds
    • Use self-reflection or eval suites (e.g., TruthfulQA-Context) to measure gains


Emerging Tools & Standards

  • MCP (Model Context Protocol) – open JSON schema for passing tool output and trace metadata to any LLM, adopted by Claude Code, Gemini CLI and IBM MCP Gateway.

  • Context-Aware Runtimes – vLLM, Flash-Infer and Infinity Lite stream 128 K-1 M tokens with optimized KV caches.

  • Context Observability Dashboards – Startups like ContextHub show token-level diff, attribution and cost per layer.


The Road Ahead

As context windows expand to a million tokens and multi-agent systems proliferate, context engineering will sit alongside model training and fine-tuning as a first-class AI discipline. Teams that master it will ship assistants that feel domain-expert-smart, honest and cost-efficient—while everyone else will chase unpredictable black boxes.

Whether you’re building a retrieval chatbot, a self-healing codebase or an autonomous research agent, remember: the model is only as good as the context you feed it.

AIRA shows how better operators — not just bigger models — turbo-charge AI research agents

 Large language models that write code have already stormed GitHub, but turning them into full-blown research agents—systems that iterate on entire ML pipelines until they medal on Kaggle—has proved trickier. The latest state-of-the-art, AIDE, could grab a medal on roughly 40 % of MLE-bench tasks. Now Meta AI and UCL push that rate to 47.7 % with AIRA, a rethink that says the secret isn’t a flashier LLM, it’s the operators and search policy you wrap around it. 

From one-shot “Draft, Debug, Improve” to a toolbox of surgical edits

AIRA introduces OAIRA, a new operator set that goes beyond AIDE’s three blunt actions. Scoped memory keeps prompts lean, “think tokens” force structured reasoning, and a prompt-adaptive complexity cue decides whether the agent should sketch a quick baseline or engineer a deep ensemble. The result: twice the reasoning tokens per call and far less mode collapse. 

Search policies finally get room to shine

When AIDE’s old operators were plugged into greedy, MCTS and evolutionary searches, the fancier algorithms gained zero ground—operator bottlenecks were that severe. Swap in OAIRA and those same policies leapfrog greedy search, proving that exploration muscle only pays off once edits are expressive enough. 

The scoreboard (MLE-bench Lite, 22 Kaggle tasks)

  • AIDE (o1-preview, greedy): 39.6 % medal rate

  • AIRA (greedy + OAIRA): 45.5 %

  • AIRA (MCTS + OAIRA): 47.7 %

  • AIRA (Evolutionary + OAIRA): 47.3 %
    All agents ran under identical 24-hour, single-GPU budgets inside AIRA-dojo, a new sandbox that hands each run a root-privileged H200 container yet isolates filesystem side effects. 

Mind the generalization gap

The study also spotlights a pitfall for auto-ML agents: validation scores routinely over-estimate test-set gains, steering greedy searches into dead ends. By examining thousands of runs, the team quantifies that “proxy-test gap” and urges future benchmarks to track it explicitly. 

Why it matters

  • Agent design ≠ model scale. The leap came without touching the underlying LLM (DeepSeek-R1 or GPT-4o). That’s good news for teams capped by API limits.

  • Composable recipe. OAIRA operators, MCTS search and the open-source aira-dojo testbed (GitHub link in the paper) can bolt onto any ReAct-style coding agent.

  • Toward autonomous ML ops. AIRA’s 24-hour, single-GPU constraint mirrors real-world hack-day budgets, making the findings immediately useful for startups chasing continuous Kaggle pipelines or internal model tuning bots.

Auto-ML agents are no longer judged solely by the size of their LLM brains; the tools they wield and the ways they explore the search space may count just as much. AIRA’s 8-point jump on MLE-bench suggests that the next frontier in agentic ML will be won with sharper scalpels, not bigger hammers.

Paper link: arXiv 2507.02554 (PDF)

DeepMesh makes artist-quality 3D meshes a one-click affair

 Triangle-mesh modelling is the CAD world’s equivalent of hand-drawn in-betweens: essential, mind-numbing and painfully slow. A new paper out of Tsinghua University, NTU and ShengShu AI says it can hand that job to an LLM-sized transformer without melting your GPU.

The team’s framework, DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning, marries a clever compression trick with a dose of RLHF to crank out clean, editable topology directly from point clouds or images. 


Why previous mesh LLMs hit the wall

Most auto-regressive mesh generators treat every vertex coordinate as a token. Feed them a high-poly model and the sequence balloons into tens of thousands of steps, torpedoing training stability and inference speed. Worse, their loss functions optimise geometry alone, so outputs pass numeric checks yet still look like Swiss cheese to artists.


Two upgrades, one big leap

PillarWhat they didWhy it matters
72 % shorter sequencesA hierarchical patch-based tokenization merges duplicate offsets and encodes connectivity inline, shrinking vertex strings by nearly three-quarters without dropping detail. Cuts pre-training FLOPs and lets the model scale to 30 k-face meshes on a single A100.
Human-aligned RLCollected 5 000 preference pairs scored with a hybrid of human rating and 3D metrics, then ran Direct Preference Optimization (DPO) on the base model. Removes holes and stray faces while nudging topology toward “artist-grade” layouts.

The researchers also trimmed an 800 k-mesh corpus to a cleaner 500 k set, tamping down the loss spikes that plague raw WebGL scrapes. 

Results: fewer faces, better faces

  • Up to 1 B parameters: two Hourglass-style transformer variants (500 M & 1 B) both converge in 100 k steps thanks to shorter sequences. 

  • Topology wins: DeepMesh’s large model eliminates 90 % of non-manifold edges that slip through MeshGPT and Nautilus, according to the authors’ “topology-valid” metric.

  • Visual quality: crowd-sourced raters picked DeepMesh over MeshGPT by 68 % on identical point-cloud prompts (exact numbers in paper’s Sec. 4.3).

  • Speed: a full 30 k-face generation takes ≈10 min, versus 20–25 min for LoRA-fine-tuned diffusion baselines reported in prior work.

A public demo gallery already shows clean Watertight dragons, furniture and stylised characters rendered straight from sparse point clouds. 


Why this is bigger than 3D fan art

Game studios, AR platforms and online-creator tools alike are sitting on troves of unoptimised 3D scans. A transformer that understands connectivity as well as shape could batch-convert those scans into lightweight, animation-ready assets—no retopology pass required. And because DeepMesh’s DPO loop is “just” another RLHF recipe, the same pipeline could teach a mesh LLM brand-specific style or IP-safe anatomy without touching the base weights.

The authors hint at scaling past one billion parameters and adding text-conditioned generation. Given how fast 3D GenAI is snowballing, don’t bet against DeepMesh—or its tokenization trick—showing up in the next wave of text-to-world engines.

Paper link: arXiv 2503.15265 (PDF)

7.7.25

ARAG puts a multi-agent brain inside your RAG stack — and Walmart’s numbers look eye-popping

 Retrieval-augmented generation (RAG) has become the go-to recipe for giving large language models real-world context, but most deployments still treat retrieval as a dumb, one-shot lookup. Researchers at Walmart Global Tech think that leaves serious money on the table — especially in e-commerce, where user intent shifts by the minute. Their new framework, ARAG (Agentic Retrieval-Augmented Generation), adds a four-agent reasoning layer on top of vanilla RAG and reports double-digit gains across every metric that matters.

Four specialists, one conversation

  1. User-Understanding Agent distills long-term history and the current session into a natural-language profile.

  2. NLI Agent performs sentence-level entailment to see whether each candidate item actually supports that intent.

  3. Context-Summary Agent compresses only the NLI-approved evidence into a focused prompt.

  4. Item-Ranker Agent fuses all signals and produces the final ranked list.

Each agent writes to — and reads from — a shared blackboard-style memory, so later agents can reason over earlier rationales rather than raw text alone.

How much better? Try 42 %

On three Amazon Review subsets (Clothing, Electronics, Home), ARAG beats both a recency heuristic and a strong cosine-similarity RAG baseline:

DatasetNDCG@5 ↑Hit@5 ↑
Clothing+42.1 %+35.5 %
Electronics+37.9 %+30.9 %
Home & Kitchen+25.6 %+22.7 %

An ablation test shows that yanking either the NLI or context-summary modules knocks as much as 14 points off NDCG, underlining how critical cross-agent reasoning is to the win.

Why it matters

  • Personalization that actually reasons. By turning retrieval and ranking into cooperative LLM agents, ARAG captures the nuance of why an item fits, not just whether embeddings are close.

  • No model surgery required. The team wraps any existing RAG stack; there’s no need to fine-tune the base LLM, making the upgrade cloud-budget friendly.

  • Explainability for free. Each agent logs its own JSON-structured evidence, giving product managers a breadcrumb trail for every recommendation.

The bigger picture

Agentic pipelines have taken off in code generation and web browsing; ARAG shows the same trick pays dividends in recommender systems, a multi-billion-dollar battleground where percent-level lifts translate into real revenue. Expect retailers and streaming platforms to test-drive multi-agent RAG as they chase post-cookie personalization.

Paper link: arXiv 2506.21931 (PDF)

6.7.25

LangGraph Rollout: how VeRL leveled-up multi-turn Agent RL

 

Why this matters

If you’ve ever tried to train an LLM-powered agent with many tool calls spread across a genuine back-and-forth conversation, you’ve probably discovered that “multi-turn” means different things to different frameworks. Yanbin Jiang’s latest post shows how the VeRL team punched through that ceiling by grafting LangGraph directly onto VeRL’s reinforcement-learning rollout engine. The result is a training loop that speaks the same language as production code. 


1. Where they started

  • Native VeRL multi-turn – great for quick experiments. You enable multi_turn: True, write a YAML schema for each tool, implement an async Python class, and you’re off; their GSM8K benchmark ran in two days. 

  • Pain points

    1. Double bookkeeping: every tool had to be declared twice (YAML + Python).

    2. Drift: schema and code fell out of sync, and prod tools (written for LangChain/LangGraph) diverged from the “training” clones. 


2. A quick stop-gap: automatic tool wrapping

Yanbin added BaseTool.from_callable(), which introspects any plain Python function with transformers.utils.get_json_schema, then fabricates a VeRL-compatible wrapper on the fly. One list of callables (tool_list = [multiply, add, …]) now powers both training and prod. 

My dev take: this is the same pattern I use in LangChain when I decorate business logic with @tool. Nice to see VeRL admit “if you can’t beat reflection, join it.”


3. The real blocker: orchestration power

Research quickly outgrew VeRL’s built-in rollout:

NeedWhy VeRL fell short
Dynamic branches & backtrackingNative graph was too rigid.
True multi-turn dialogue (user follow-ups)Any assistant message without tool calls ended the convo.
Per-node sampling / chat-template tweaksGlobal settings only.

Enter LangGraph: a lightweight DAG engine already shipping in production.

4. Architectural insight: separation of concerns

“Let VeRL manage actor weights & hardware; let LangGraph drive the conversation.” 

So they built a LangChain-compatible chat-model client for VeRL’s SGLang server. Training now works like this:

  1. VeRL hands the initial messages + model handle to the user’s LangGraph.

  2. The graph does its thing—branching, retrying, invoking tools—using the exact actor weights being optimized.

  3. When the graph stops, VeRL collects the message history and rewards. 

The PR shows a seven-line YAML snippet that swaps the old rollout for:

yaml
multi_turn:
chat_template_kwargs: {enable_thinking: false} langgraph: path: /path/to/graph.py graph_config: {recursion_limit: 100}

…and a 60-line example graph that binds tools, counts turns, and lets you vary temperature node-by-node. 


5. Why I’m excited

  • One graph to rule them all – deployment and training share code; no more “but it worked in prod!”

  • Easier ablations – want to test a new branch strategy? Edit the graph script; RL pipeline stays untouched.

  • Framework-agnostic future – the same bridge pattern could plug VeRL into OpenAI Function Calling, Microsoft’s AutoGen, or whatever framework wins next year.


My takeaway

VeRL just became a lot more attractive for serious agent RL work. By leaning on LangGraph instead of extending an in-house orchestration DSL, the team keeps VeRL laser-focused on fast rollouts, leaves graph logic to a dedicated library, and—crucially—lets devs iterate on one codebase. If you’re juggling duplicate tool definitions or fighting mismatch between training and production, clone Yanbin’s PR and breathe easier.

Explore it more here: https://jybsuper.github.io/posts/langgraph_rollout/ 

What Claude offers now From Anthropic’s announcements: Creates and edits real files directly in chats or the desktop app: Excel (.xlsx)...