Wandering Nomad

14.7.25

NeuralOS wants to deep-learn your desktop, window by window

Ask any LLM-first startup what the future of computing looks like and you’ll hear something about conversational agents buried inside 1980-era text terminals. Luke Rivard and colleagues think we can do better. In “NeuralOS: Towards Simulating Operating Systems via Neural Generative Models,” they present the first end-to-end system that predicts full-resolution screen frames—icons, windows, even cursor movements—from raw user input streams the way a video model predicts the next pixel.

How it works

Layer	Role	Rough analog in a real OS
Recurrent “kernel” (2-tier LSTM)	Ingests the last frame plus mouse / key events and updates a compact hidden state that remembers which apps are open, where the cursor is, and what happened a few seconds ago	Task manager & window server
Diffusion UNet renderer	Takes that hidden state—and an explicit cursor-position map—and paints the next 512 × 384 frame	GPU compositor

Running autoregressively, the pair turns a stream of clicks into a playable video that shows, say, a user double-clicking the Home icon, waiting for the file manager, then closing the window—no hard-coded widget logic, no X11 messages.

A purpose-built dataset

Training relied on tens of hours of Ubuntu XFCE recordings that mix random, scripted and AI-generated sessions. The team first pre-trained the RNN on the 2.8 % “hard transition” subset (where the screen changes a lot between frames), then joint-trained kernel + renderer and finally doubled the context window to 64 frames—all on a single H200 GPU.

What can it actually do?

Realistic mouse tracking. The model keeps the cursor glued to the icon or button the user is aiming for—even after long delays such as a Firefox launch.
State-aware transitions. It learns that double-clicking a folder spawns a window and that closing it removes the decoration, without seeing explicit OS messages.
Limits. Fine-grained keyboard input (think live typing) still trips it up, and rendering resolution is modest to keep diffusion latency reasonable.

Why it matters

From scripted to generative UIs. If a network can hallucinate a working desktop, future interfaces could be described in natural language instead of coded in Qt.
A fresh testbed for agent research. RL agents that currently learn Atari could learn “Ubuntu tasks” inside NeuralOS, no virtual machine needed.
GPU-native desktop pipelines. Because state and rendering both live in tensors, the whole stack parallelises naturally—handy for cloud streaming.

First step, not final word

NeuralOS doesn’t yet click buttons for you or compile your code; it draws what would happen if you did. But that alone hints at interfaces where the boundary between app, OS and model blurs into a single, adaptive canvas. The authors have open-sourced code, checkpoints and a live demo at neural-os.com; expect mash-ups with language agents—and, inevitably, AI-generated prank desktops—before long.

Paper link: arXiv 2507.08800 (PDF)

MetaStone-S1 shows how to scale ‘thinking time’ instead of parameter count

For the past year, the mantra in large-language-model land has been simple: bigger weights, better brains. A new paper from the University of Science and Technology of China, Nanjing University and collaborators argues there’s another dial to turn—reasoning time at inference—and it introduces a purpose-built architecture called MetaStone-S1 to prove the point.

A reflective twist on the policy-reward combo

Standard alignment pipelines bolt a separate process-reward model (PRM) onto a frozen policy network, adding hundreds of millions of parameters and latency. MetaStone-S1 bundles both roles into one backbone and sprinkles in two task-specific heads: one for next-token prediction, the other for step-level scoring. The resulting Self-supervised Process Reward Model (SPRM) weighs in at just 53 M parameters—99 % smaller than conventional PRMs.

Dial-a-brain at test time

Because reward scoring lives inside the model, MetaStone-S1 can stretch or shrink its chain-of-thought on the fly:

Mode	Avg. reasoning steps	Typical use
Low	~8 steps	latency-sensitive chat
Medium	~24 steps	balanced Q&A
High	up to 64 steps	Olympiad math, code generation

The team coins this knob Test-Time Scaling (TTS) and backs it with an empirical scaling law linking “thinking FLOPs” to quality gains.

Benchmark bump without parameter bloat

Running in high mode, the 32 B-parameter MetaStone-S1 matches or beats OpenAI o3-mini across AIME ’24/’25, LiveCodeBench and C-EVAL—despite using roughly half the weights.

Why it matters

Cheaper alignment. Folding the PRM inside the policy cuts training and inference costs.
User-controllable latency. Products can trade speed for depth without model swaps.
Open playground. All code, checkpoints (1.5 B→32 B) and the reasoning-length scheduler are on GitHub under an Apache-2 license.

MetaStone-S1 won’t end the parameter-scaling race, but it offers a reminder that when and how long a model thinks can count as much as how big it is. Expect TTS dials and reflective reward heads to surface quickly in next-gen open-source stacks.

Paper link: arXiv 2507.01951 (PDF)

13.7.25

PyVision lets multimodal models write their own vision tools—and the accuracy jump is eye-opening

Large language models have learned to call external tools, but in computer vision they still walk a narrow, hand-coded path: crop the image, run a captioner, answer the question—done. PyVision blows up that rut. The 26-page technical report shows GPT-4.1 and Claude-4 Sonnet literally writing Python code mid-conversation, executing it, checking the output and iterating until they solve the task. The result is an agent that treats PIL, NumPy and Matplotlib as an expandable toolbox rather than a fixed pipeline.

From static workflows to dynamic “code-as-tool”

A traditional vision agent might have 10 pre-defined ops; PyVision can spawn hundreds. The authors catalogue the emergent tools into four buckets—basic image processing, advanced processing, visual sketching and numerical analysis—plus a long-tail of creative task-specific snippets. On perception-heavy problems the model leans on cropping and contrast boosts; on math puzzles it sketches diagrams or counts pixels.

Multi-turn loop under the hood

System prompt primes the LLM to plan, code, run and reflect.
Python sandbox executes each snippet and streams results back.
Reflection step lets the model critique outputs, revise code or answer.

The dance repeats until the agent is confident—or it times out. Crucially, there’s no fixed library list; the model imports what it thinks it needs.

Benchmarks: big wins, bigger where it hurts most

Backend	MathVista ↑	Visual-Puzzles ↑	V* ↑	VLMsAreBlind-mini ↑
GPT-4.1	+1.8	+2.5	+7.8	+2.6
Claude-4 Sonnet	+3.3	+8.3	+0.3	+31.1

Claude-4’s massive jump on VLMsAreBlind-mini—a dataset designed to fool pattern-matchers—suggests PyVision’s code probes puncture spurious visual shortcuts. GPT-4.1, already strong at fine-grained perception, gains most on the V* visual-search test.

Why this matters

Grounded answers, verifiable steps. The agent surfaces intermediate plots, masks and arrays, giving product teams a check-pointable audit trail.
Amplifier, not crutch. PyVision “dials up” whatever the base model is best at—perception for GPT-4.1, abstract reasoning for Claude-4—rather than papering over weaknesses.
Tool invention is the new frontier. Instead of waiting for human engineers to wire in functions, the LLM autogenerates them, inching closer to Benjamin Franklin’s “tool-making animal.”

What’s next

The paper’s GitHub repo ships inference code, a dockerised Python sandbox and demo notebooks. The authors hint at plugging reinforcement learning into the loop and expanding beyond vision to 3-D geometry and web interaction tooling. Expect startups to wrap this framework into agents that can diagnose X-ray anomalies, audit engineering schematics or spot product-label defects—without a human ever defining “defect detector.”

Paper link: arXiv 2507.07998 (PDF)