13.7.25

PyVision lets multimodal models write their own vision tools—and the accuracy jump is eye-opening

Large language models have learned to call external tools, but in computer vision they still walk a narrow, hand-coded path: crop the image, run a captioner, answer the question, done. PyVision breaks out of that rut. The 26-page technical report shows GPT-4.1 and Claude-4 Sonnet literally writing Python code mid-conversation, executing it, checking the output and iterating until they solve the task. The result is an agent that treats PIL, NumPy and Matplotlib as an expandable toolbox rather than a fixed pipeline.

From static workflows to dynamic “code-as-tool”

A traditional vision agent might have 10 pre-defined ops; PyVision can spawn hundreds. The authors catalogue the emergent tools into four buckets: basic image processing, advanced processing, visual sketching and numerical analysis, plus a long tail of creative, task-specific snippets. On perception-heavy problems the model leans on cropping and contrast boosts; on math puzzles it sketches diagrams or counts pixels.
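
To make those buckets concrete, here is the kind of throwaway snippet such an agent might write for a perception-heavy question: a crop, a contrast boost, then a crude pixel count. This is an illustrative sketch, not code from the paper; the file name, crop box and darkness threshold are invented.

```python
# Illustrative "self-written tool": crop a region of interest, boost its
# contrast, then count dark pixels as a rough proxy for object area.
from PIL import Image, ImageEnhance
import numpy as np

img = Image.open("scene.jpg").convert("RGB")   # hypothetical input image

# Basic image processing: crop the region the model decided looks relevant.
roi = img.crop((120, 80, 520, 360))            # (left, upper, right, lower)

# Advanced processing: raise contrast to make faint structure visible.
roi = ImageEnhance.Contrast(roi).enhance(2.0)

# Numerical analysis: count "dark" pixels below an arbitrary threshold.
gray = np.asarray(roi.convert("L"))
dark_pixels = int((gray < 60).sum())
print(f"ROI size: {roi.size}, dark pixels: {dark_pixels}")
```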

Multi-turn loop under the hood

  1. System prompt primes the LLM to plan, code, run and reflect.

  2. Python sandbox executes each snippet and streams results back.

  3. Reflection step lets the model critique outputs, revise code or answer.

The dance repeats until the agent is confident—or it times out. Crucially, there’s no fixed library list; the model imports what it thinks it needs. 
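
The paper ships its own sandbox, but a minimal sketch of how such a plan-code-run-reflect loop might be wired up looks like the following. Here `call_model` and `extract_code` are hypothetical placeholders for the chat API and code-block parsing, `MAX_TURNS` stands in for the timeout, and a plain `exec` substitutes for the dockerised sandbox.

```python
# Minimal sketch of a plan-code-run-reflect loop (not the paper's implementation).
import io
import contextlib

MAX_TURNS = 8  # assumed turn budget standing in for the paper's timeout


def run_in_sandbox(code: str) -> str:
    """Execute a code snippet and return whatever it printed, or the error."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})                      # fresh namespace per snippet
    except Exception as exc:                    # failures are fed back to the model too
        return f"ERROR: {exc!r}"
    return buf.getvalue()


def solve(question: str, image_path: str, call_model, extract_code) -> str:
    """Loop: model plans and writes code, sandbox runs it, model reflects on the output."""
    history = [
        {"role": "system", "content": "Plan, write Python, inspect the output, reflect, repeat."},
        {"role": "user", "content": f"{question} (image: {image_path})"},
    ]
    for _ in range(MAX_TURNS):
        reply = call_model(history)             # hypothetical LLM call
        history.append({"role": "assistant", "content": reply})
        code = extract_code(reply)              # pull out a code block, if any
        if code is None:                        # no code means the model is answering directly
            return reply
        observation = run_in_sandbox(code)      # stream results back for reflection
        history.append({"role": "user", "content": f"Execution output:\n{observation}"})
    return "No confident answer within the turn budget."
```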

Benchmarks: big wins, bigger where it hurts most

Accuracy gains over each bare backend (percentage points):

Backend            MathVista ↑   Visual-Puzzles ↑   V* ↑   VLMsAreBlind-mini ↑
GPT-4.1            +1.8          +2.5               +7.8   +2.6
Claude-4 Sonnet    +3.3          +8.3               +0.3   +31.1

Claude-4’s massive jump on VLMsAreBlind-mini—a dataset designed to fool pattern-matchers—suggests PyVision’s code probes puncture spurious visual shortcuts. GPT-4.1, already strong at fine-grained perception, gains most on the V* visual-search test. 

Why this matters

  • Grounded answers, verifiable steps. The agent surfaces intermediate plots, masks and arrays, giving product teams a check-pointable audit trail.

  • Amplifier, not crutch. PyVision “dials up” whatever the base model is best at—perception for GPT-4.1, abstract reasoning for Claude-4—rather than papering over weaknesses.

  • Tool invention is the new frontier. Instead of waiting for human engineers to wire in functions, the LLM autogenerates them, inching closer to Benjamin Franklin’s “tool-making animal.”

What’s next

The paper’s GitHub repo ships inference code, a dockerised Python sandbox and demo notebooks. The authors hint at plugging reinforcement learning into the loop and expanding beyond vision to 3-D geometry and web interaction tooling. Expect startups to wrap this framework into agents that can diagnose X-ray anomalies, audit engineering schematics or spot product-label defects—without a human ever defining “defect detector.”

Paper link: arXiv 2507.07998 (PDF)
