Wandering Nomad

26.7.25

MegaScience formalizes science reasoning data—and smaller models suddenly look smarter

Open-source LLMs can do math and code, but ask them to reason through a physics word problem or a cell-biology puzzle and they wobble. The GAIR Lab at Shanghai Jiao Tong University thinks the culprit is data, not architecture. Their new paper introduces TextbookReasoning (650 k Q&A pulled from 12 k university textbooks) and MegaScience (a 1.25 M‑sample mix of cleaned public science sets), then shows that models post‑trained on these datasets outperform their own official instruct variants—while using far shorter responses.

The problem: bad science data, bad evals

Most “science” corpora rely on noisy web text, weak decontamination and multiple‑choice benchmarks that don’t probe true reasoning. The authors flag four pain points: unreliable benchmarks, flimsy leakage checks, low‑quality references and shallow CoT distillation.

Two datasets, one pipeline

TextbookReasoning – 650 k verified questions across seven disciplines (physics → economics), built via textbook digitization, QA pair extraction, deduping, refinement and LLM‑assisted decontamination.
MegaScience – 1.25 M high‑quality instances from NaturalReasoning, Nemotron‑Science and TextbookReasoning, curated with a three‑way selection scheme: response‑length, difficulty, and random sampling, plus solution annotation.

Notably, answers are short: 410 tokens (TextbookReasoning) and 721 tokens (MegaScience) on average—meaning cheaper training and inference than CoT-heavy rivals.
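
To make the three-way selection scheme concrete, here is a minimal Python sketch. It is not the authors' pipeline. The `Sample` fields, the direction of each ranking (longer responses and higher difficulty scores preferred) and the toy pool are all assumptions for illustration; the post only names the three strategies.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    question: str
    response_tokens: int  # hypothetical field: length of a reference answer
    difficulty: float     # hypothetical field: 0.0 (easy) to 1.0 (hard)


def select_by_length(samples, k):
    # Assumption: prefer questions that elicit longer responses.
    return sorted(samples, key=lambda s: s.response_tokens, reverse=True)[:k]


def select_by_difficulty(samples, k):
    # Assumption: prefer questions with higher difficulty scores.
    return sorted(samples, key=lambda s: s.difficulty, reverse=True)[:k]


def select_random(samples, k, seed=0):
    # Seeded random baseline so the subset is reproducible.
    rng = random.Random(seed)
    return rng.sample(samples, min(k, len(samples)))


pool = [
    Sample("q1", 120, 0.2),
    Sample("q2", 900, 0.9),
    Sample("q3", 400, 0.5),
    Sample("q4", 50, 0.7),
]

by_length = [s.question for s in select_by_length(pool, 2)]
by_difficulty = [s.question for s in select_by_difficulty(pool, 2)]
by_random = [s.question for s in select_random(pool, 2)]
```

In a real curation run each strategy would presumably be compared by training on its subset and keeping whichever scores best per source dataset; that comparison step is outside this sketch.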

Proof in the checkpoints

Fine‑tuning Llama3.1, Qwen2.5 and Qwen3 base models on MegaScience consistently beats their official instruct models across “general,” “specific,” and “math” categories. Example: Qwen3‑30B jumps from 55.66 → 61.12 average, with math rising to 89.33.

Ablations back the pipeline: drop refinement and performance collapses (58.33 % → 13.15 % overall); remove the extra CoT step and scores slide to 57.33 %. Decontamination matters too—without it, leakage inflates averages to 58.57 %.

Why this matters

Science is more than math/code. The field lacked open, verifiable, long‑form reasoning sets; MegaScience fills that gap.
Shorter CoT ≈ cheaper scaling. The datasets’ concise answers let bigger models benefit more from fine‑tuning—hinting at a “scaling law for data efficiency” in science domains.
Open everything. The team releases the full curation pipeline, eval system, seven trained models and all datasets, inviting the community to iterate.

If your lab is chasing AI scientists rather than chatty coders, MegaScience is a ready-made jumpstart—and a reminder that better questions and cleaner answers can beat another billion tokens of sludge.

Paper link: arXiv 2507.16812 (PDF)

RoGuard 1.0: Roblox’s Open-Source Guardrail LLM Raises the Bar for Safe Generation

When Roblox quietly pushed RoGuard 1.0 to Hugging Face, it wasn’t just another model drop—it was a statement that safety tooling can be both state-of-the-art and open. Built on top of Llama‑3.1‑8B‑Instruct, RoGuard is an instruction‑tuned classifier that decides whether a prompt or a model’s reply violates policy—covering both ends of the conversation loop.

Google, Meta, NVIDIA, OpenAI—pick your favorite heavyweight; Roblox claims RoGuard is beating their guardrail models on leading safety benchmarks, from Llama Guard and ShieldGemma to NeMo Guardrails and GPT‑4o. That’s a bold flex, backed by F1 scores across a mix of in‑domain and out‑of‑domain datasets.

Dual-layer defense, single lightweight core

Most moderation stacks bolt together multiple filters. RoGuard streamlines that: one 8B‑parameter model, two checkpoints of scrutiny—prompt and response. This dual‑level assessment matters because unsafe content doesn’t just come from users; it can leak from the model itself.

Data done right (and openly)

Roblox emphasizes no proprietary data—only synthetic and open-source corpora tuned to diverse safety taxonomies. They even sprinkle in chain‑of‑thought rationales so the model learns to justify its calls, not just spit out “violation” labels. The result: stronger generalization and clearer internal reasoning.

Benchmarks, but with context

RoGuard isn’t a single leaderboard cherry-pick. Roblox released RoGuard‑Eval, a 2,873‑example dataset spanning 25 safety subcategories, hand‑labeled by policy experts and adversarially probed by internal red teams. Reporting in binary F1 keeps things honest and comparable, and the model still leads.
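
For readers who want to reproduce this kind of score on their own labels, binary F1 can be computed in a few lines of Python. The sketch below assumes label 1 means "violation" and 0 means "safe"; the toy labels are made up and are not RoGuard-Eval data.

```python
def binary_f1(y_true, y_pred):
    """F1 for the positive class (1 = violation, by assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positives: define F1 as zero
    # Equivalent to the harmonic mean of precision and recall.
    return 2 * tp / (2 * tp + fp + fn)


# Toy example: 3 true violations, 3 safe items.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = binary_f1(y_true, y_pred)  # tp=2, fp=1, fn=1
```

Because F1 ignores true negatives, it stays informative even when safe examples heavily outnumber violations, which is common in moderation datasets.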

Why builders should care

If you’re wiring generative text into games, chatbots, or UGC platforms, moderation often becomes a patchwork of regexes, keyword lists, and black-box APIs. RoGuard’s openly licensed weights (released under an OpenRAIL license) let you self‑host a modern guardrail without vendor lock‑in, and fine‑tune it to your own taxonomy tomorrow.

Plug, play, and iterate

Weights live on Hugging Face; code and eval harness sit on GitHub. Spin up inference with any OpenAI‑compatible stack, or slot RoGuard in front of your generation model as a gating layer. Because it’s an 8B model, you can realistically serve it on a single high‑RAM GPU or even CPU clusters with batching.
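
The gating-layer idea can be sketched in Python as a wrapper that checks the prompt first and the reply second. Everything below is a stand-in: `guarded_generate`, `toy_is_unsafe`, `toy_generate` and the refusal string are hypothetical names, and the keyword check only takes the place of a call to a guardrail model. It is not RoGuard's actual interface.

```python
from typing import Callable, Optional

# Hypothetical signatures: a classifier returns True when content is unsafe.
Classifier = Callable[[str, Optional[str]], bool]
Generator = Callable[[str], str]

REFUSAL = "[blocked by guardrail]"


def guarded_generate(prompt: str, generate: Generator, is_unsafe: Classifier) -> str:
    # Layer 1: screen the user prompt before spending any generation compute.
    if is_unsafe(prompt, None):
        return REFUSAL
    reply = generate(prompt)
    # Layer 2: screen the model's reply in the context of the prompt.
    if is_unsafe(prompt, reply):
        return REFUSAL
    return reply


def toy_is_unsafe(prompt: str, response: Optional[str] = None) -> bool:
    # Stand-in for a guardrail model call; a real classifier replaces this.
    text = response if response is not None else prompt
    return "forbidden" in text.lower()


def toy_generate(prompt: str) -> str:
    # Stand-in generator that misbehaves on one specific input.
    return "forbidden reply" if "trick" in prompt else f"echo: {prompt}"


ok = guarded_generate("hello", toy_generate, toy_is_unsafe)
blocked_prompt = guarded_generate("forbidden request", toy_generate, toy_is_unsafe)
blocked_reply = guarded_generate("trick", toy_generate, toy_is_unsafe)
```

The third call shows why the response check matters: the prompt looks harmless, but the generated reply is caught on the way out.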

The bigger picture

We’re entering an era where “safety” can’t be an afterthought—especially as APIs enable unlimited text generation inside social and gaming ecosystems. By open‑sourcing both the toolkit and the yardstick, Roblox invites the community to audit, extend, and pressure-test what “safe enough” really means.

RoGuard 1.0 shows that thoughtful guardrails don’t have to be proprietary or flimsy. They can be transparent, benchmarked, and built to evolve—exactly what AI enthusiasts and responsible builders have been asking for. Now the ball’s in our court: fork it, test it, and make the open internet a bit less chaotic.

23.7.25

ThinkAct lets robots “think, then act” — and the payoff is new SOTA across embodied AI benchmarks

Anyone who has watched today’s end‑to‑end robot policies fail a complex kitchen task knows the weakness: they map pixels to motors with no explicit plan. ThinkAct flips that script. The NTU‑NVIDIA team behind the paper trains a multimodal LLM to write a high‑level reasoning plan, turns that plan into a compact visual‑plan latent, then hands it to a DiT‑based action model that executes at control‑loop speed. The result is an agent that deliberates like GPT‑4o yet moves with the reactivity of classic policies.

How ThinkAct pulls it off

Component	What it does	Why it matters
Reinforced visual latent planning	Rewards the reasoning LLM with goal‑completion and trajectory‑consistency signals derived from vision, forcing plans that actually work in the scene.	Bridges abstract language plans to pixel‑level feedback.
Visual‑plan latent	Compresses the entire chain‑of‑thought into a fixed‑size latent that conditions a frozen DiT policy.	Keeps the policy lightweight and allows asynchronous slow‑think / fast‑act loops.
Dual‑system inference	LLM thinks a few times per second; the action model ticks every 20 ms.	Yields real‑time control without sacrificing deliberation.
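
The slow-think / fast-act split in the table can be illustrated with a toy Python loop. This is a single-threaded simulation, not ThinkAct's code: the 20 ms action period comes from the table, while the 500 ms replanning period, the scalar "latent", and the functions `slow_think` and `fast_act` are assumptions chosen to keep the example small.

```python
ACTION_PERIOD_MS = 20    # from the post: the action model ticks every 20 ms
REPLAN_PERIOD_MS = 500   # assumption: "a few times per second"
GOAL_POSITION = 1.0      # hypothetical 1-D target for the toy task


def slow_think(observation: float) -> float:
    """Stand-in for the reasoning LLM: returns a 'plan latent' (here a target)."""
    return GOAL_POSITION


def fast_act(state: float, latent: float) -> float:
    """Stand-in for the action model: a small step toward the planned target."""
    return 0.1 * (latent - state)


def run(duration_ms: int, state: float = 0.0):
    latent = None
    replans = 0
    actions = 0
    for t in range(0, duration_ms, ACTION_PERIOD_MS):
        # The slow system refreshes the latent only occasionally.
        if t % REPLAN_PERIOD_MS == 0:
            latent = slow_think(state)
            replans += 1
        # The fast system acts on every tick using the most recent latent.
        state += fast_act(state, latent)
        actions += 1
    return state, replans, actions


final_state, replans, actions = run(1000)
```

Over one simulated second the planner runs twice while the controller runs fifty times, which is the point of the design: the expensive reasoning step never blocks the control loop. A real system would run the two parts asynchronously rather than in one loop.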

Benchmark sweep at two skill levels

Suite	Metric	Prev SOTA	ThinkAct
EgoPlan‑Bench2	Acc. ↑	Qwen 2.5‑VL* 66.3	71.4
RoboVQA	Acc. ↑	Qwen 2.5‑VL* 63.5	69.2
OpenEQA	Acc. ↑	OpenVLA 52.1	57.8
SimplerEnv (manip.)	Succ.% ↑	DiT‑Policy 45.2	62.7
LIBERO (manip.)	Succ.% ↑	OpenVLA 48.9	60.3

* Qwen 2.5‑VL numbers are the authors’ fine‑tuned baseline.

Few‑shot powers

With just 5–10 demos per LIBERO task, ThinkAct’s policy finetunes to new objects and layouts, beating OpenVLA by 9–12 points.

Why this matters

Plan‑centric embodied AI. ThinkAct shows that giving agents an explicit, reward‑aligned plan latent trumps opaque end‑to‑end policies for long‑horizon tasks.
Self‑reflection in the loop. The reasoning LLM can detect a failure mid‑episode, revise its latent plan, and rescue the run — a first for open‑source VLA systems.
Few‑shot deployment. Labs can adapt to a new kitchen or warehouse with handfuls of tele‑op traces instead of days of retraining.

ThinkAct’s code is coming soon, but the project page already hosts videos of robots closing drawers, shifting condiments and answering environment‑specific questions after reasoning out loud. The message is clear: future embodied agents won’t just map images to torque — they’ll think, decide why, then act.

Paper link: arXiv 2507.16815 (PDF)
