Wandering Nomad

15.8.25

Oracle Will Offer Google’s Gemini Models via OCI—A Pragmatic Shortcut to Agentic AI at Enterprise Scale

Oracle and Google Cloud have expanded their partnership so Oracle customers can tap Google’s latest Gemini family directly from Oracle Cloud Infrastructure (OCI) and across Oracle’s business applications. Announced on August 14, 2025, the deal aims squarely at “agentic AI” use cases—bringing planning, tool use, and multimodal generation into day-to-day enterprise workflows.

What’s new: Oracle says it will make “the entire range” of Google’s Gemini models available through OCI Generative AI, via new integrations with Vertex AI. That includes models specialized for text, image, video, speech and even music generation, with the initial rollout starting from Gemini 2.5. In other words, teams can compose end-to-end agents—retrieve data, reason over it, and produce rich outputs—without leaving Oracle’s cloud.

Enterprise reach matters here. Beyond developer access in OCI, Oracle notes that customers of its finance, HR, and supply-chain applications will be able to infuse Gemini capabilities into daily processes—think automated close packages, job-description drafting, supplier-risk summaries, or multimodal incident explainers. The practical promise: fewer swivel-chair handoffs between tools and more AI-assisted outcomes where people already work.

Buying and operating model: Reuters reports customers will be able to pay for Google’s AI tools using Oracle’s cloud credit system, preserving existing procurement and cost controls. That seemingly small detail removes a classic blocker (separate contracts and billing) and makes experimentation less painful for IT and finance.

Why this partnership, and why now?

• For Oracle, it broadens choice. OCI already aggregates multiple model providers; adding Gemini gives customers a top-tier, multimodal option for agentic patterns without forcing a provider switch.
• For Google Cloud, it’s distribution. Gemini lands in front of Oracle’s substantial enterprise base, expanding Google’s AI footprint in accounts where the “system of record” lives in Oracle apps.

What you can build first

• Multimodal service agents: ingest PDFs, images, and call transcripts from Oracle apps; draft actions and escalate with verifiable citations.
• Supply-chain copilots: analyze shipments, supplier news, and inventory images; generate risk memos with recommended mitigations.
• Finance and HR automations: summarize ledger anomalies, produce policy-compliant narratives, or generate job postings with skills mapping, then loop in a human approver before commit. (All of these benefit from Gemini’s text, image, audio/video understanding and generation.)

How it fits technically

The integration path leverages Vertex AI on Google Cloud as the model layer, surfaced to OCI Generative AI so Oracle developers and admins keep a single operational pane—policies, observability, and quotas—while calling Gemini under the hood. Expect standard SDK patterns, prompt templates, and agent frameworks to be published as the rollout matures.

Caveats and open questions

Availability timing by region, specific pricing tiers, and which Gemini variants (e.g., long-context or domain-tuned models) will be enabled first weren’t fully detailed in the initial announcements. Regulated industries will also look for guidance on data residency and cross-cloud traffic flows as deployments move from pilots to production. For now, the “pay with Oracle credits” and “build inside OCI” signals are strong green lights for proofs of concept.

The takeaway

By making Google’s Gemini models first-class citizens in OCI and Oracle’s application stack, both companies reduce friction for enterprises that want agentic AI without a multi-vendor integration slog. If your roadmap calls for multimodal assistants embedded in finance, HR, and supply chain—or developer teams building agents against Oracle data—this partnership lowers the barrier to getting real value fast.

DINOv3: Meta’s Next-Gen Self-Supervised Vision Backbone for Real-World Tasks

Meta has introduced DINOv3, a major step forward in self-supervised learning (SSL) for vision. Rather than relying on costly human labels, DINOv3 learns from raw images and produces features that transfer cleanly to downstream tasks like detection, segmentation, retrieval, and zero-shot classification. Alongside the research, Meta released a reference PyTorch implementation, pretrained backbones, and plug-and-play heads for popular benchmarks—giving practitioners a practical path from foundation features to production models.

What’s new and why it matters

1) A modern SSL recipe built for scale.
DINOv3 extends the DINO/DINOv2 line with a three-stage pipeline (pretraining, “Gram anchoring,” and high-resolution adaptation) to stabilize long runs and preserve fine-grained visual structure. The approach targets reliable, high-resolution features that work across tasks without supervised labels.

2) From backbone to task in one repo.
Beyond feature extractors, Meta ships torch.hub entries for task-ready heads: an object detector trained on COCO and a semantic segmentor trained on ADE20K, both driven by DINOv3 backbones. That means you can evaluate transfer performance quickly—no need to re-implement decoders or heads.

3) Text alignment for zero-shot use.
DINOv3 can be aligned to text (the “dino.txt” setup) to enable zero-shot classification and open-vocabulary tasks, following the DINOv2 Meets Text procedure. Meta’s repo includes configuration examples to train this alignment (with your choice of caption data), so teams can mix SSL visual features with lightweight text heads.

4) Scales from ImageNet to very large ViTs.
The codebase illustrates two ends of the spectrum: a ViT-L/16 recipe that reaches ~83.5% linear-probe accuracy on ImageNet-1k after ~14 hours (multi-GPU) and guidance for training a ViT-7B/16 backbone using the full three-stage pipeline. This shows DINOv3 is both practical for modest budgets and capable at frontier scale.

How DINOv3 compares

Earlier DINO work showed that SSL on ViTs yields representations with strong segmentation-like attention and excellent k-NN/linear-probe performance, often rivaling supervised counterparts while generalizing better out of distribution. DINOv3 continues this trend, packaging those benefits with clearer training recipes, large-model guidance, and ready-to-use task heads—reducing the gap between research features and deployable models.

What you can build today

Open-vocabulary detectors and segmentors. Start from the provided COCO/ADE20K heads and swap in your DINOv3 backbone to adapt to new domains (retail shelves, medical imagery, satellite scenes).
Zero-shot classifiers without full re-training. Use dino.txt alignment to attach a compact text head for open-set recognition or data exploration.
Fast baselines on standard GPUs. Reproduce the ImageNet-1k ViT-L/16 pretrain in hours, then linear-probe or k-NN for quick feasibility studies before scaling up.
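
The k-NN feasibility check mentioned above needs very little code once features are extracted. The sketch below assumes you already have one feature vector per image from a frozen backbone. The toy vectors, the labels, the function names, and the choice of cosine similarity with majority voting are illustrative assumptions, not code from the DINOv3 repository.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_predict(query, bank, labels, k=3):
    """Label a query by majority vote among its k most similar bank features."""
    ranked = sorted(range(len(bank)), key=lambda i: cosine(query, bank[i]), reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy stand-ins for frozen backbone features (hypothetical values).
bank = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.9, 0.0]]
labels = ["shelf", "shelf", "scan", "scan"]

print(knn_predict([0.95, 0.15, 0.05], bank, labels, k=3))
```

If a check like this already separates your classes on a few hundred labeled images, that is a reasonable signal the features transfer to your domain before you invest in training a task head.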

Notes on licensing and access

The repository provides code, checkpoints, and model cards under the DINOv3 License (read it before commercial use). Torch Hub entries simplify loading both backbones and task heads; example notebooks cover PCA of patch features, dense/sparse matching, and video tracking with non-parametric methods.

Limits and open questions

DINOv3’s text alignment requires additional data and compute; quality depends on captions or paired text. Very large backbones (e.g., ViT-7B/16) still demand cluster-scale training, and domain gaps (e.g., industrial inspection vs. natural images) may require brief adaptation or data filtering. Nonetheless, the release meaningfully lowers the barrier to robust, label-efficient vision systems.

Takeaway
DINOv3 turns self-supervised visual features into a practical foundation for real products. You get a scalable SSL recipe, big-model guidance, task-ready heads, and optional text alignment—so you can move from unlabeled images to detection, segmentation, and zero-shot classification with far less labeling and glue code than before. For teams seeking strong, transferable features without massive annotation budgets, DINOv3 is the most complete, production-minded DINO yet.

Gemini CLI GitHub Actions: Google’s Free AI Teammate for Issue Triage, PR Reviews, and On-Demand Coding

Google has rolled out Gemini CLI GitHub Actions, a new way to bring its AI directly into your repository’s workflows. Unlike a chat plug-in or IDE sidebar, this agent runs as part of your CI: it watches for events like new issues or pull requests, works asynchronously with the full context of your codebase, and posts results back to GitHub. It’s free in beta, with generous quotas through Google AI Studio, and supports Vertex AI and Gemini Code Assist tiers out of the box.

What it does—out of the box

Google is shipping three open-source workflows to start: intelligent issue triage (auto-label and prioritize new issues), accelerated PR reviews (quality, style, and correctness feedback), and on-demand collaboration via @gemini-cli mentions that can trigger tasks like “write tests for this bug,” “implement suggested changes,” or “fix this well-defined issue.” All are customizable to match your team’s conventions.

Under the hood, the action wraps the open-source Gemini CLI project—Google’s terminal-first agent that exposes Gemini 2.5 Pro with long context and tool use, plus MCP support—so you can get the same capabilities in automation that you have locally.

Security and control for enterprises

Google emphasizes three design pillars:

Credential-less auth with Workload Identity Federation (WIF) for Vertex AI and Gemini Code Assist Standard/Enterprise, removing long-lived API keys from your CI.
Granular permissions including command allowlisting and the ability to assign a dedicated service identity to the agent with least-privilege scopes.
Full observability via OpenTelemetry, so logs and metrics stream to your preferred platform (e.g., Cloud Monitoring) for auditing and debugging.
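
To make the idea of command allowlisting concrete, here is a minimal sketch of the general pattern: a command runs only if it starts with an approved prefix and contains no shell metacharacters. The prefixes and the function name are hypothetical, and this is not how Gemini CLI or the GitHub Action implements or configures its allowlist.

```python
import shlex

# Hypothetical allowlist: each entry is a command prefix the agent may run.
ALLOWED_PREFIXES = [
    ["git", "diff"],
    ["git", "log"],
    ["gh", "issue", "edit"],
]

# Characters that would let one command smuggle in another.
SHELL_METACHARACTERS = set(";|&`$><\n")

def is_allowed(command: str) -> bool:
    """Return True only if the command starts with an allowlisted prefix."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse failures
        return False
    return any(tokens[: len(prefix)] == prefix for prefix in ALLOWED_PREFIXES)
```

Matching on parsed tokens rather than raw string prefixes avoids accepting something like "git diffx" by accident, and rejecting metacharacters up front keeps a chained command from riding along with an approved one.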

Setup and availability

Getting started is straightforward: install Gemini CLI v0.1.18+ locally and run /setup-github to scaffold the workflows, or add the published action—google-github-actions/run-gemini-cli—to existing YAML. The launch is beta and worldwide, with no-cost usage for Google AI Studio (and free Code Assist for individual users “coming soon” per Google). Vertex AI as well as Gemini Code Assist Standard and Enterprise are supported from day one.

Where it helps right now

Backlog hygiene: Let the agent categorize, label, and prioritize a flood of inbound issues so humans focus on high-impact work.
PR quality gates: Automate first-pass reviews to catch obvious regressions, style drift, or missing tests before a human’s turn.
Burst capacity on demand: Mention @gemini-cli to generate tests, draft fixes, or brainstorm alternatives when the team is stretched.

Early coverage highlights precisely these collaborative patterns: an AI teammate that’s both autonomous (for routine tasks) and summonable (for specific requests).

Why this matters

By moving AI from the editor to the repository layer, Google is formalizing a new collaboration model: AI as a first-class project member. This reduces context switching, keeps code review throughput high, and turns repetitive maintenance into automation. Crucially, the security posture (WIF, allowlists, telemetry) acknowledges that enterprises won’t adopt repo-level agents without strict guardrails and visibility.

Takeaway

Gemini CLI GitHub Actions is a pragmatic step toward AI-assisted software development at team scale. If you’ve been trialing the open-source Gemini CLI locally, this release lets you standardize those gains across your org’s CI—with enterprise-ready auth, logging, and quotas that make early adoption low-risk. Start with triage and PR reviews, tune the workflows to your norms, and layer in @-mention tasks as your contributors get comfortable.

Gemma 3 270M: Google’s Tiny, Task-Tunable Model Built for On-Device Speed and Efficiency

Google has introduced Gemma 3 270M, a compact 270-million-parameter model designed specifically for task-focused fine-tuning and on-device deployment. Unlike general chat models, this release emphasizes reliable instruction-following, tight text structuring, and extremely low power draw—ideal for teams that want small, specialized models they can train and ship quickly.

What’s inside a “270M” Gemma

Gemma 3 270M splits its parameters into ~170M for embeddings and ~100M for transformer blocks. The unusually large 256k token vocabulary helps it handle rare and domain-specific tokens, making it a strong base for targeted tasks across languages and verticals. In Google’s IFEval tests, the model sets a new bar for instruction adherence in its size class.
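
A quick back-of-the-envelope check shows why the vocabulary dominates the parameter count. The sketch assumes "256k" means 262,144 tokens and assumes an embedding width of 640; neither figure is stated in this post, so treat both as assumptions to verify against the model card.

```python
VOCAB_SIZE = 256 * 1024   # "256k" read as 262,144 tokens (assumption)
HIDDEN_WIDTH = 640        # assumed embedding width, not stated in this post

embedding_params = VOCAB_SIZE * HIDDEN_WIDTH
transformer_params = 100_000_000  # the post's rounded figure

total = embedding_params + transformer_params
print(f"embedding parameters: {embedding_params / 1e6:.1f}M")
print(f"approximate total:    {total / 1e6:.1f}M")
print(f"embedding share:      {embedding_params / total:.0%}")
```

Under those assumptions the embedding table alone comes to about 168M parameters, which lines up with the "~170M" figure and means well over half of the model is vocabulary lookup rather than transformer compute.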

Built for batteries, browsers, and bare-metal

Efficiency is the headline: Google reports that an INT4-quantized build on a Pixel 9 Pro used roughly 0.75% battery over 25 conversations, making this the most power-frugal Gemma yet. Production-ready Quantization-Aware Training (QAT) checkpoints are available at launch, so developers can serve INT4 with minimal quality loss on phones, laptops, or small servers.

What it’s good at (and what it isn’t)

Out of the box, Google is shipping both a pre-trained and an instruction-tuned checkpoint. The tuned variant is not aimed at long, free-form conversations; instead, it excels at structured tasks—classification, entity extraction, routing, policy or compliance checks, and converting unstructured text into schema-bound outputs. This “right tool for the job” stance mirrors results seen when enterprises fine-tune larger Gemma models for narrow domains (e.g., Adaptive ML’s SK Telecom moderation project), but now at a fraction of the cost and latency.

Developer on-ramp

Getting started is intentionally trivial. You can download weights from Hugging Face, Ollama, Kaggle, LM Studio, or Docker Hub, try the model on Vertex AI, and run locally with llama.cpp / Gemma.cpp / LiteRT / Keras / MLX. For tuning, Google documents full fine-tuning recipes and points to Hugging Face, Unsloth, and JAX toolchains. The model inherits Gemma 3’s architecture, so existing Gemma-based pipelines and guardrails transfer cleanly.

Where it fits in your stack

If you’ve been defaulting to big models for every job, 270M argues for fleet thinking: deploy multiple tiny experts—one for routing, one for extraction, one for compliance—each fine-tuned on a few thousand examples. You gain latency, privacy, and cost wins (especially on devices), and you reduce failure modes tied to long prompts and brittle few-shot scaffolds. For retrieval pipelines, 270M can act as the fast, deterministic head that classifies queries or validates outputs before a heavier model is invoked.

Practical pointers

Quantize early. Start with the QAT INT4 checkpoint to match the power and memory profile you’ll ship with.
Constrain formats. Lean into schema-first prompting (JSON schemas) so the model’s instruction-following strengths show up in production logs.
Measure ROI. Compare a fine-tuned 270M against your current medium/large model on latency, accuracy for your narrow task, and unit cost per 1k requests.
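
The "constrain formats" pointer is easiest to enforce with a strict parser on the receiving side. Below is a minimal sketch using only the standard library; the field names, the queue values, and the priority range are a hypothetical contract invented for illustration, not anything Google specifies.

```python
import json

# Hypothetical output contract for a ticket-routing task.
SCHEMA = {"queue": str, "priority": int, "entities": list}
ALLOWED_QUEUES = {"billing", "technical", "compliance"}

def parse_routing(raw: str):
    """Return the parsed dict if it satisfies the contract, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(SCHEMA):
        return None
    for key, expected_type in SCHEMA.items():
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(data[key], bool) or not isinstance(data[key], expected_type):
            return None
    if data["queue"] not in ALLOWED_QUEUES or not 1 <= data["priority"] <= 5:
        return None
    return data
```

Counting how often this returns None on real traffic gives you a simple format-adherence metric to track alongside latency and task accuracy.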

The bigger Gemma picture

Gemma 3 spans from nano-class on-device models like 3n to larger multimodal variants. The 270M release fills a clear gap: a production-oriented “smallest useful” text model with first-party quantization and batteries-included docs, distribution, and tooling. For many workflows, that’s the difference between a cool demo and a service you can afford to run 24/7.

Takeaway: Gemma 3 270M is a pragmatic tool for shipping AI where efficiency, control, and privacy matter more than sheer breadth of capability. If your team needs fast, reliable, structured text handling on phones or low-cost servers—and wants to fine-tune in hours, not days—this tiny Gemma may be the new default.

13.8.25

Claude Sonnet 4 Now Handles 1M Tokens: Anthropic’s Big Leap in Long-Context Reasoning

Anthropic has expanded Claude Sonnet 4’s context window to a full 1,000,000 tokens, a five-fold jump that shifts what teams can do in a single request—from whole-repo code reviews to end-to-end research synthesis. In practical terms, that means you can feed the model entire codebases (75,000+ lines) or dozens of papers at once and ask for structured analysis without manual chunking gymnastics. The upgrade is live in public beta on the Anthropic API and Amazon Bedrock; support on Google Cloud’s Vertex AI is “coming soon.”

Why this matters: bigger context changes workflows, not just numbers. When prompts can carry requirements, source files, logs, and prior discussion all together, you get fewer lost references and more coherent plans. It also smooths multi-agent and tool-calling patterns where a planner, executor, and reviewer share one evolving, grounded workspace—without constant re-fetching or re-summarizing. Press coverage framed the jump as removing a major pain point: breaking big problems into fragile fragments.

What you can do today

• Audit whole repos: Ask for dependency maps, risky functions, and minimally invasive refactors across tens of thousands of lines—then request diffs.
• Digest literature packs: Load a folder of PDFs and prompt for a matrix of methods, datasets, and limitations, plus follow-up questions the papers don’t answer.
• Conduct long-form investigations: Keep logs, configs, and transcripts in the same conversation so the model can track hypotheses over hours or days.
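
Before sending a whole repository, it helps to estimate whether it fits. The sketch below uses a rough heuristic of about 4 characters per token, which is an assumption; real counts depend on the tokenizer and the mix of code and prose. The file extensions and the output reserve are also illustrative defaults.

```python
import os

CHARS_PER_TOKEN = 4          # rough heuristic (assumption); real tokenizers vary
CONTEXT_LIMIT = 1_000_000    # the window size discussed above

def estimate_repo_tokens(root, extensions=(".py", ".md", ".toml")):
    """Walk a directory and estimate tokens from character counts."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN

def fits_in_window(token_estimate, reserve_for_output=50_000):
    """Leave headroom for the prompt instructions and the model's reply."""
    return token_estimate + reserve_for_output <= CONTEXT_LIMIT
```

As a sanity check on the headline figure: at an assumed 40 characters per line, 75,000 lines is about 3 million characters, or roughly 750,000 tokens under this heuristic, which fits with room to spare.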

Where to run it

• Anthropic API: public beta with 1M-token support.
• Amazon Bedrock: available now in public preview.
• Google Vertex AI: listed as “coming soon.”

How to get the most from 1M tokens

• Keep retrieval in the loop. A giant window isn’t a silver bullet; relevant-first context still beats raw volume. Anthropic’s own research shows better retrieval reduces failure cases dramatically. Use hybrid search (BM25 + embeddings) and reranking to stage only what matters.
• Structure the canvas. With big inputs, schema matters: headings, file paths, and short summaries up top make it easier for the model to anchor its reasoning and cite sources accurately.
• Plan for latency and cost. Longer prompts mean more compute. Batch where you can, and use summaries or “table of contents” stubs for less-critical sections before expanding on demand. (Early reports note the upgrade targets real enterprise needs like analyzing entire codebases and datasets.)
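
The hybrid search idea can be sketched in pure Python. This version scores documents with Okapi BM25, scores them again with a vector similarity, and merges the two rankings with reciprocal rank fusion. Two things here are my assumptions rather than anything from Anthropic: the character-trigram vectors are only a crude stand-in for a real embedding model, and reciprocal rank fusion is one common way to combine rankings. The reranking step is left out for brevity.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in counts:
                continue
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = counts[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores

def trigram_vector(text):
    """Character-trigram counts: a crude stand-in for a real embedding model."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ranking(scores):
    """Document indices ordered from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def hybrid_search(query, docs, top_n=2, rrf_k=60):
    """Fuse the lexical and vector rankings with reciprocal rank fusion."""
    lexical = ranking(bm25_scores(query, docs))
    query_vec = trigram_vector(query)
    dense = ranking([cosine(query_vec, trigram_vector(d)) for d in docs])
    fused = Counter()
    for order in (lexical, dense):
        for rank, doc_index in enumerate(order, start=1):
            fused[doc_index] += 1.0 / (rrf_k + rank)
    return [docs[i] for i, _ in fused.most_common(top_n)]

docs = [
    "retry logic for the payment gateway client",
    "style guide for commit messages",
    "payment gateway timeout configuration and limits",
]
print(hybrid_search("payment gateway timeout", docs))
```

In a real pipeline you would swap the trigram vectors for model embeddings and pass the fused shortlist through a reranker before placing anything in the prompt.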

Competitive context

Anthropic’s 1M-token Sonnet 4 puts the company squarely in the long-context race that’s become table stakes for serious coding and document-intelligence workloads. Trade press called out the move as catching up with million-token peers, while emphasizing the practical benefit: fewer seams in real projects.

The bottom line

Claude Sonnet 4’s 1M-token window is less about bragging rights and more about coherence at scale. If your teams juggle sprawling repos, dense discovery packets, or multi-day investigations, this update lets you bring the full problem into one place—and keep it there—so plans, diffs, and decisions line up without constant re-stitching. With availability on the Anthropic API and Bedrock today (Vertex AI next), it’s an immediately useful upgrade for engineering and research-heavy organizations.

12.8.25

From Jagged Intelligence to World Models: Demis Hassabis’ Case for an “Omni Model” (and Why Evals Must Grow Up)

DeepMind’s cadence right now is wild—new drops practically daily. In this conversation, Demis Hassabis connects the dots: “thinking” models (Deep Think), world models that capture physics, and a path toward an omni model that unifies language, vision, audio, and interactive behavior. As an AI practitioner, I buy the core thesis: pure next-token prediction has hit diminishing returns; reasoning, tool-use, and grounded physical understanding are the new scaling dimensions.

I especially agree with the framing of thinking as planning: AlphaGo/AlphaZero DNA brought into the LLM era. The key is not the longest chain of thought, but the right amount of thought: explore parallel plans, prune, decide, iterate. That’s how strong engineers work, and it’s how models should spend compute. My caveat: “thinking budgets” still carry a real latency/energy cost. Until tool calls and sandboxed execution are bulletproof, deep reasoning will remain spiky in production.

The world model agenda resonates. If you want robust robotics or assistants like Astra/Gemini Live, you need spatiotemporal understanding, not just good text priors. Genie 3 is a striking signal: it can generate coherent worlds where objects persist and physics behaves sensibly. I’m enthusiastic—and I still want tougher tests than “looks consistent.” Sim-to-real is notorious; we’ll need evaluations for controllable dynamics, invariances (occlusion, lighting, continuity), and goal-conditioned behavior before I call it solved.

Hassabis is refreshingly blunt about jagged intelligence. Yes, models ace IMO-style math yet bungle simple logic or even chess legality. Benchmarks saturate (AIME hitting ~99%); we need new stressors. I like Game Arena with Kaggle—self-advancing tournaments give clear, leak-resistant signals and scale with capability. Where I push back: games aren’t the world. Outside well-specified payoffs, reward specification gets messy. The next wave of evals should be multi-objective and long-horizon—measuring planning, memory, tool reliability, and safety traits (e.g., deception) under distribution shift, not just single-shot accuracy.

Another point I applaud: tools as a scaling axis. Let models reason with search, solvers, and domain AIs (AlphaFold-class tools) during planning. The open question—what becomes a built-in capability versus an external tool—is empirical. Coding/math often lifts general reasoning; chess may or may not. My hesitation: as “models become systems,” provenance and governance get harder. Developers will need traceable tool chains, permissions, and reproducible runs—otherwise we ship beautifully wrong answers faster.

Finally, the omni model vision—converging Genie, Veo, and Gemini—feels inevitable. I’m aligned on direction, wary on product surface area. When base models upgrade every few weeks, app teams must design for hot-swappable engines, stable APIs, and eval harnesses that survive version churn.

Net-net: I’m excited by DeepMind’s trajectory—reasoning + tools + world modeling is the right stack. But to turn wow-demos into trustworthy systems, we must grow our evaluations just as aggressively as our models. Give me benchmarks that span days, not prompts; measure alignment under ambiguity; and prove sim-to-real. Do that, and an omni model won’t just impress us—it’ll hold up in the messy, physical, human world it aims to serve.

MolmoAct brings editable spatial plans to robot foundation models

Most robot FMs still map pixels + instructions straight to torques—a shortcut that crumbles on long-horizon tasks. MolmoAct proposes a cleaner recipe: an Action Reasoning Model (ARM) that explicitly separates perception, planning, and control so robots can reason about where to act before deciding how.

A three-stage pipeline you can steer

MolmoAct encodes images and instructions into depth-aware perception tokens, then produces a mid-level spatial plan as editable trajectory traces, and finally emits precise low-level actions. Because the plan lives as a manipulable trajectory, behavior is explainable—and steerable—without retraining.
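
To show why an explicit mid-level plan makes behavior steerable, here is a toy sketch of the three stages. Every name, the 2-D waypoint representation, and the straight-line planner are hypothetical stand-ins chosen for illustration; none of this is MolmoAct’s actual API or model code.

```python
from dataclasses import dataclass, field

@dataclass
class SpatialPlan:
    """Mid-level plan: an ordered list of (x, y) waypoints in image space."""
    waypoints: list = field(default_factory=list)

def perceive(image_id: str, instruction: str) -> dict:
    """Stage 1 stub: stand-in for depth-aware perception tokens."""
    return {"image": image_id, "instruction": instruction, "target": (40, 30)}

def plan(percept: dict, start=(0, 0), steps=4) -> SpatialPlan:
    """Stage 2 stub: straight-line trace from the start to the target."""
    tx, ty = percept["target"]
    sx, sy = start
    points = [(sx + (tx - sx) * i / steps, sy + (ty - sy) * i / steps) for i in range(steps + 1)]
    return SpatialPlan(points)

def act(spatial_plan: SpatialPlan) -> list:
    """Stage 3 stub: turn consecutive waypoints into (dx, dy) motion commands."""
    pts = spatial_plan.waypoints
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(pts, pts[1:])]

trace = plan(perceive("frame_001", "pick up the mug"))
before = act(trace)

# Steer without retraining: a person nudges one waypoint to avoid an obstacle.
trace.waypoints[2] = (20.0, 25.0)
after = act(trace)
```

Because the low-level commands are derived from the trace, editing a single waypoint changes only the two motion steps around it, which is the kind of local, inspectable correction an end-to-end pixels-to-torques policy does not offer.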

Numbers that move the needle

SimplerEnv (Visual Matching, zero-shot): 70.5%, beating closed models like Pi-0 and GR00T N1.
LIBERO (avg): 86.6% success, including a +6.3-point gain over ThinkAct on long-horizon tasks.
Real-world fine-tuning: additional +10% task progression on single-arm and +22.7% on bimanual setups vs Pi-0-FAST.
OOD generalization: +23.3% over baselines; also top human-preference scores for instruction following and trajectory steering.

An open blueprint, not just a model

The team releases MolmoAct-7B-D weights, training code, and—importantly—the MolmoAct Dataset, over 10,000 high-quality robot trajectories spanning diverse scenarios. Adding this mid-training set yields an average +5.5% performance lift over the base model, making it a practical plug-in for existing stacks.

Why it matters

By promoting spatial plans to first-class citizens, MolmoAct bridges the gap between language-level intent and controller-level execution. For labs and startups, that means debuggable policies, few-shot steerability, and a realistic path to explainable manipulation at scale, without being locked into a closed stack.

Paper link: arXiv 2508.07917 (PDF)

GLM-4.5 wants to be the open-source workhorse for agents, reasoning, and code

Zhipu AI just dropped GLM-4.5, a Mixture-of-Experts LLM built to juggle three hard modes at once: agentic tasks, deep reasoning, and real-world coding. The headline specs: 355B total parameters with 32B active per token, a 23-trillion-token training run, and a hybrid reasoning switch that flips between “think-out-loud” and terse answers based on task demands. There’s also a slimmer GLM-4.5-Air (106B/12B active) for teams who can’t babysit a mega-model.
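
The headline specs are easier to compare as a ratio. This small calculation uses only the parameter counts quoted above; the function name is just for illustration.

```python
# (total parameters, active parameters per token), in billions, from the post.
MODELS = {
    "GLM-4.5": (355, 32),
    "GLM-4.5-Air": (106, 12),
}

def active_fraction(total_b, active_b):
    """Share of the weights that participate in one token's forward pass."""
    return active_b / total_b

for name, (total_b, active_b) in MODELS.items():
    print(f"{name}: {active_fraction(total_b, active_b):.1%} of weights active per token")
```

Roughly 9% of GLM-4.5’s weights and 11% of Air’s are used per token. Keep in mind that all of the weights still have to be held in memory; the sparsity reduces compute per token, not storage.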

Why it stands out

ARC trifecta focus. “ARC” here stands for agentic, reasoning, and coding, the three capabilities the model targets. Across 12 benchmarks, GLM-4.5 places #3 overall and #2 on agentic suites, with marquee scores like 91.0 on AIME’24, 64.2 on SWE-bench Verified, and 70.1 on TAU-Bench. It also reports 26.4 on BrowseComp for web agents, near OpenAI’s o4-mini-high in the authors’ runs.
Parameter-efficient MoE. Compared to some giant peers, GLM-4.5 keeps active params modest while stacking deeper layers, 96 attention heads, partial RoPE, QK-Norm, and a built-in MTP layer for speculative decoding.
Hybrid reasoning as a product feature. Both GLM-4.5 and Air support thinking (for complex tool use) and non-thinking (instant replies) modes from the same checkpoint.

The training recipe (quick hits)

A two-stage pretraining + mid-training stack mixes high-quality web, multilingual, code, math/science, then adds repo-level code, synthetic reasoning, 128K-token long-context, and agent trajectories to push real software-engineering and planning skills. Post-training distills expert Reasoning, Agent, and General models into one hybrid generalist, followed by targeted RL (including a “pathology RL” cleanup pass).

What you can actually download

Zhipu has published code, evals, and model cards on GitHub; weights are also listed on Hugging Face. The team pitches GLM-4.5 as agent-first and ships a simple eval harness to reproduce scores.

Bottom line

Open-source has plenty of great single-skill models. GLM-4.5 is aiming for a different bullseye: one backbone that can browse, reason, and patch code without feeling second-tier. If the reported ARC numbers hold up in the wild, this could become the go-to open checkpoint for production-grade agents.

Paper link: arXiv 2508.06471 (PDF)

8.8.25

GPT-5 Arrives: A Quantum Leap or an Incremental Step Toward Everyday AGI?

OpenAI CEO Sam Altman opened the launch keynote with a statistic that still jolts me: 700 million weekly ChatGPT users. If accurate, that is the fastest adoption curve of any software platform in history. Altman framed GPT-5 as the model that finally feels like “talking to a PhD-level expert in anything,” capable of planning a birthday party, writing a full software stack, or parsing biopsy results in seconds. As someone who has lived through GPT-3’s flashes of brilliance and GPT-4o’s solid utility, I’m impressed by the live demos, particularly the on-the-fly 3-D castle game and the finance dashboard spun up in minutes. Yet part of me wonders how often real-world edge cases will still trip the model, PhD metaphors aside.

Reasoning + Speed = Default
One genuine breakthrough is that GPT-5 merges OpenAI’s slow “reasoning models” and fast “standard models” into a single pipeline. The system decides—dynamically—how much chain-of-thought to spend on each request. As a developer, I love the promise of no more model-picker gymnastics. But the skeptic in me notes that latency remains physics-bound; the keynote glossed over how much extra compute the “perfect amount of thinking” really burns.

Safer, but Still a Work in Progress
Safety lead Saachi emphasized safe completions: instead of the binary comply/refuse we’ve grown used to, GPT-5 offers partial, contextual answers plus policy pointers. I applaud the nuance (the potassium perchlorate fireworks example was spot-on), and early physician-audited benchmarks suggest lower hallucination rates. Still, bi-modal safety often fails at scale. Until we see longitudinal data from millions of prompts, I reserve judgment on whether “significantly less deceptive” translates into materially fewer bad outcomes.

Coding Superpowers—and Benchmarks That May Be Peaking
On SWE-bench Verified, GPT-5 posts 74.9%, which OpenAI presents as state-of-the-art, and Cursor’s integration shows real autonomy: the model searches code, patches errors after compiling, and writes explanatory READMEs. That’s developer candy. Yet I can’t ignore Michael Truell’s aside that models are saturating classic evals. When a leaderboard hits 99%, the next delta in usefulness won’t come from marginal accuracy boosts; it will come from deeper tool integration, live debugging, and sustained multi-day agent runs, areas GPT-5 only begins to address.

Health and Personalization
The on-stage story of Carolina using GPT-5 to weigh radiation options was moving and highlights the model’s strength as a patient advocate. Free-tier voice chat, Gmail/calendar integration, and memory all point toward a more personal assistant future. My worry is data consent and provenance: when GPT-5 merges personal email with medical queries, the privacy surface expands dramatically. OpenAI’s policies will need the same iterative care the model architecture received.

What I’m Excited About—and Watching Carefully
I love the 400K context window, the new “minimal reasoning” knob for latency-sensitive tasks, and regular-expression-constrained outputs. Those are practical, developer-driven wins. I’m less convinced by the AGI framing; Altman downplayed compute bottlenecks and energy costs, and benchmark fatigue is real. GPT-5 feels like the best general-purpose model we’ve seen, but whether it inaugurates a “team of experts in your pocket” or reveals the limits of current scaling will depend on how it behaves over the next billion prompts.

Overall, GPT-5 is a thrilling upgrade—smarter, faster, and more context-aware. Just remember: even PhD-level experts can be confidently wrong, and the same will be true for the most intuitive model yet.

6.8.25

OpenAI Unveils GPT-OSS: Two Apache-Licensed Open-Weight Models Aimed at Reasoning, Agents, and Real-World Deployment

OpenAI has released GPT-OSS, a pair of open-weight language models designed for strong reasoning and agentic workflows—gpt-oss-120b and gpt-oss-20b—marking the company’s most significant “open” move since GPT-2. Both models are distributed under Apache 2.0 (with an accompanying GPT-OSS usage policy), positioning them for commercial use, customization, and local deployment.

What’s in the release

Two sizes, one family. The larger gpt-oss-120b targets top-tier reasoning; gpt-oss-20b is a lighter option for edge and on-prem use. OpenAI says 120b achieves near-parity with o4-mini on core reasoning benchmarks, while 20b performs similarly to o3-mini—a notable claim for open-weight models.
Hardware footprint. OpenAI highlights efficient operation for the 120b model (single 80 GB GPU) and 20b running with as little as 16 GB memory in edge scenarios, enabling local inference and rapid iteration without costly infrastructure.
Licensing & model card. The company published a model card and licensing details (Apache 2.0 + usage policy), clarifying intended use, evaluations, and limitations.
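
The hardware claims are easier to reason about with a rough weight-memory estimate. The sketch below takes the nominal parameter counts from the model names and compares 16-bit weights with a roughly 4-bit format that carries a small per-block scale overhead (4.25 bits per parameter). The bit widths are my assumptions for illustration; the post does not state how the released weights are quantized, and the figures ignore KV cache and activations.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# Nominal sizes from the model names; bits per parameter are assumptions.
for name, params_b in [("gpt-oss-120b", 120), ("gpt-oss-20b", 20)]:
    for bits in (16, 4.25):
        print(f"{name} at {bits} bits/param: ~{weight_memory_gb(params_b, bits):.0f} GB")
```

At 16 bits the larger model would need about 240 GB for weights alone, so fitting on a single 80 GB GPU implies aggressive quantization; at about 4 bits the estimate drops to roughly 64 GB, and the smaller model to roughly 11 GB, which is consistent with the 16 GB claim.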

Why this matters

For years, OpenAI prioritized API-only access to frontier systems. GPT-OSS signals a strategic broadening toward open-weight distribution, meeting developers where they build—local, cloud, or hybrid—and competing more directly with leaders like Llama and DeepSeek. Early coverage underscores the shift: outlets note this is OpenAI’s first open-weight release since GPT-2 and frame it as both an ecosystem and competitive move.

Where you can run it (day one)

OpenAI launched with unusually wide partner support, making GPT-OSS easy to try in existing MLOps stacks:

Hugging Face: downloadable weights and a welcome post with implementation details.
AWS SageMaker JumpStart: curated deployment templates for gpt-oss-20b and gpt-oss-120b.
Azure AI Foundry & Windows AI Foundry: managed endpoints and tooling for fine-tuning and inference.
Databricks: native availability with 131k-context serving options and enterprise controls.
NVIDIA: performance tuning for GB200 NVL72 systems; NVIDIA cites up to ~1.5M tokens/sec rack-scale throughput for the 120B variant.

Developer ergonomics: Harmony & agents

OpenAI also published Harmony, a response format and prompt schema that GPT-OSS models are trained to follow. Harmony standardizes conversation structure, reasoning output, and function-calling/tool-use—useful for building agents that require predictable JSON and multi-step plans. If you’re serving via common runtimes (Hugging Face, vLLM, Ollama), the formatting is handled for you; custom servers can adopt the schema from the public repo.

Safety posture

OpenAI says GPT-OSS went through Preparedness Framework testing, including trials where a maliciously fine-tuned 120B model was evaluated for risky capabilities. The company reports that such variants did not reach high-capability thresholds, and it presents the release as a measured step forward in open-model safety practices.

How it stacks up (early read)

Early reports highlight the significance of the move and the headline performance claims—near-o4-mini for 120B and o3-mini-like results for 20B—alongside the practical win of local, customizable models under a permissive license. Analysts also point out the competitive context: GPT-OSS arrives as open-weight ecosystems (Llama, DeepSeek, Qwen, Kimi) surge in adoption.

What to build first

Agent backends that rely on structured tool use and local policy control (Harmony + Apache 2.0 helps here).
Sovereign/air-gapped deployments in regulated environments using on-prem GPUs or edge hardware, especially with the 20B model.
Cost-sensitive RAG and analytics where fine-tuning and local inference can beat per-token API economics—now supported across major clouds and MLOps platforms.

The takeaway

GPT-OSS is OpenAI’s clearest embrace of the open-weight ecosystem to date: credible reasoning performance, permissive licensing, broad partner availability, and practical tooling for agents. If your roadmap calls for customizable, locally deployable models with strong reasoning, GPT-OSS belongs on your shortlist—whether you’re targeting laptops, single-GPU servers, or GB200-class scale.

5.8.25

MLE-STAR: Google’s ML Engineering Agent Is Impressive—But Real-World Automation Still Needs Guardrails

Google Research just unveiled MLE-STAR, a machine-learning engineering agent that treats model building like a guided search-and-refine loop rather than a single shot of LLM codegen. The announcement (August 1, 2025) positions MLE-STAR as a state-of-the-art ML engineering agent capable of automating diverse tasks.

At a high level, the system does three things I really like:

Bootstraps from the web. Instead of relying purely on prior LLM knowledge (which often overfits to familiar libraries), MLE-STAR first uses web search to pull task-appropriate, modern model patterns and builds an initial solution from them. In other words, it goes looking for today’s best practice before writing code.
Refines the right part of the pipeline. Many agents rewrite whole scripts every iteration; MLE-STAR runs ablation studies to find the code block with the biggest performance impact (e.g., feature engineering vs. model vs. ensembling), then iteratively refines that block using feedback from prior runs. This targeted loop is far closer to how strong human MLEs work day-to-day.
Ensembles with intent. Rather than naive voting, the agent proposes and improves ensemble strategies to merge multiple candidate solutions into a single, better one.
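
The ablation-guided targeting described above can be sketched in a few lines. This is a toy illustration, not MLE-STAR's code: the block names, the `toy_evaluate` function, and its scores are hypothetical stand-ins for actually running a pipeline with one block removed.

```python
from typing import Callable, Dict, Set


def most_impactful_block(
    blocks: Set[str],
    evaluate: Callable[[Set[str]], float],
) -> str:
    """Return the block whose removal changes the validation score the most."""
    full_score = evaluate(blocks)
    impact: Dict[str, float] = {}
    for block in sorted(blocks):
        # Ablate one block at a time and measure the change in score.
        ablated_score = evaluate(blocks - {block})
        impact[block] = abs(full_score - ablated_score)
    return max(impact, key=lambda name: impact[name])


# Hypothetical per-block contributions to a validation metric.
CONTRIBUTION = {"feature_engineering": 0.08, "model": 0.03, "ensembling": 0.01}


def toy_evaluate(active_blocks: Set[str]) -> float:
    """Stand-in for a real training run: base score plus additive contributions."""
    return 0.70 + sum(CONTRIBUTION[name] for name in active_blocks)


target = most_impactful_block(set(CONTRIBUTION), toy_evaluate)
print(f"Refine next: {target}")
```

In a real loop, the agent would then spend its next iterations rewriting only the selected block and re-run the ablation once the gains flatten out. The caveat about validation quality applies directly: if `evaluate` is noisy or unrepresentative, the "most impactful" block can be the wrong one.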

The team also built pragmatic safety rails I’m thrilled to see in an autonomous coder: a debugging agent for traceback-driven fixes, a data-leakage checker to catch test-time contamination, and a data-usage checker so scripts don’t ignore provided modalities. These modules address common failure modes I’ve encountered with LLM-generated pipelines.

On benchmarks, the results are eye-catching. MLE-STAR won medals in ~63–64% of Kaggle competitions in MLE-Bench-Lite, a massive jump over prior agents; the blog cites 63.6% any-medal (with 36% gold), and the arXiv v2 reports 64%. Either way, it’s a big leap.

I also appreciate the ops mindset: there’s open-source code built with Google’s Agent Development Kit (ADK) so teams can reproduce the workflow and extend it.

Now, where I’m cautious:

Generalization. MLE-Bench-Lite is a valuable proxy, but medals on curated Kaggle tasks aren’t the same as long-lived production systems with shifting data, compliance constraints, and messy labels. The refinement loop may still need human “taste” to set success metrics and pick trade-offs (latency vs. accuracy, cost vs. recall). The paper itself stresses targeted refinement and web retrieval as the key innovations—not a claim that human MLEs are obsolete.
Licensing & provenance. Because the agent retrieves models and code from the web, verifying permissive licenses and acceptable usage is non-negotiable—Google explicitly flags MLE-STAR as research-only and expects users to check licensing of retrieved assets. That’s the right call, and enterprises should wire in policy checks before any auto-generated PRs land.
Evaluation drift. The ablation-guided focus is elegant, but it assumes your validation signal is representative. In many real datasets, weak labels or distribution shift can mislead the ablation and push the agent to overfit the “most impactful block.” Tight data splits and independent holdouts remain essential.

Bottom line: MLE-STAR advances the state of autonomous ML engineering—web-aware bootstrapping, ablation-driven targeted refinement, and smarter ensembling are exactly the techniques I want in an agentic MLE. I’m ready to use it as a co-engineer on well-scoped problems, with humans owning metrics, governance, and final review. If we pair this agent with robust eval harnesses and license compliance, the payoff could be faster iteration and stronger baselines—without losing the engineering discipline that production ML demands.

ReaGAN turns every node into an agent—with a plan, memory, and tools

Classical GNNs push messages with one global rule per layer—great for tidy graphs, brittle for messy ones. ReaGAN (Retrieval-augmented Graph Agentic Network) breaks that mold by treating each node as an autonomous agent that decides whether to aggregate locally, retrieve globally, predict now, or do nothing—based on its own memory and a plan drafted by a frozen LLM.

What’s new

Node-level autonomy. At every layer, a node queries the LLM for an action plan, executes it, and updates memory—no globally synchronized rulebook.
Local + global context. Beyond neighbors in the graph, nodes invoke RAG to retrieve semantically similar but structurally distant nodes, then fuse both sources.
Memory as glue. Nodes persist aggregated text snippets and few-shot (text, label) exemplars, enabling in-context prediction later.

Why it matters

Real-world graphs are sparse and noisy; uniform propagation amplifies junk. ReaGAN’s per-node planning and local-global retrieval adapt to informativeness imbalances and long-range semantics—key gaps in standard GNNs. In experiments, the authors report competitive few-shot performance using only a frozen LLM (no fine-tuning), highlighting a compute-friendly path for graph ML.

How it runs (at a glance)

Each node iterates a loop: perceive → plan → act (LocalAggregation / GlobalAggregation / Predict / NoOp) → update memory. A simple algorithmic skeleton formalizes the layer-wise cycle and action space.
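
To make the perceive, plan, act, update cycle concrete, here is a minimal sketch. It is not the authors' implementation: the rule-based `plan` function is a hypothetical stand-in for the frozen LLM planner, word overlap stands in for semantic retrieval, and the four-node graph is invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Action(Enum):
    LOCAL = "LocalAggregation"
    GLOBAL = "GlobalAggregation"
    PREDICT = "Predict"
    NOOP = "NoOp"


@dataclass
class Node:
    text: str
    label: Optional[str] = None
    neighbors: List[int] = field(default_factory=list)
    memory: List[Tuple[str, Optional[str]]] = field(default_factory=list)
    prediction: Optional[str] = None


def plan(node: Node) -> Action:
    """Rule-based stand-in for the frozen-LLM planner (hypothetical)."""
    if node.label is not None or node.prediction is not None:
        return Action.NOOP
    if any(lab is not None for _, lab in node.memory):
        return Action.PREDICT
    if node.neighbors and not node.memory:
        return Action.LOCAL
    return Action.GLOBAL


def overlap(a: str, b: str) -> int:
    """Crude similarity: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def act(idx: int, action: Action, graph: Dict[int, Node]) -> None:
    node = graph[idx]
    if action is Action.LOCAL:
        # Pull text and labels from direct graph neighbors into memory.
        node.memory.extend((graph[j].text, graph[j].label) for j in node.neighbors)
    elif action is Action.GLOBAL:
        # Retrieve the most similar node that is NOT a direct neighbor.
        distant = [j for j in graph if j != idx and j not in node.neighbors]
        if distant:
            best = max(distant, key=lambda j: overlap(node.text, graph[j].text))
            node.memory.append((graph[best].text, graph[best].label))
    elif action is Action.PREDICT:
        # In-context prediction reduced to a majority vote over stored exemplars.
        labels = [lab for _, lab in node.memory if lab is not None]
        node.prediction = Counter(labels).most_common(1)[0][0]


def run(graph: Dict[int, Node], num_layers: int = 3) -> Dict[int, Node]:
    for _ in range(num_layers):
        for idx, node in graph.items():
            act(idx, plan(node), graph)  # each node plans for itself
    return graph


graph = run({
    0: Node("graph neural networks message passing", label="ML", neighbors=[1]),
    1: Node("message passing on sparse graphs", neighbors=[0]),
    2: Node("protein folding structure prediction", label="Bio"),
    3: Node("protein structure with deep learning"),
})
print({i: n.prediction for i, n in graph.items() if n.label is None})
```

The isolated node 3 has no neighbors to aggregate from, so it falls back to global retrieval, which is the case the paper's local-plus-global design is meant to cover.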

Paper link: https://arxiv.org/pdf/2508.00429

4.8.25

The Agentic Web: when bots become the primary users of the internet

Search boxes and feeds defined the first two web eras. A new position paper proposes the third: the Agentic Web, where autonomous software agents—often LLM-powered—act on our behalf, coordinate with other agents, and execute long-horizon tasks across services. The authors offer a working definition and argue the shift is already visible in consumer assistants that can plan purchases and book reservations end-to-end.

A framework in three dimensions

The paper lays out a conceptual stack for this world: intelligence (reasoning, memory, planning), interaction (tools, APIs, multi-agent protocols), and economics (incentives, pricing, marketplaces). These dimensions, taken together, underpin capabilities like retrieval, recommendation, planning and collaboration that move beyond single-turn chat.

From retrieval to planning to coordination

Architecturally, the authors chart algorithmic transitions: user-issued queries give way to agentic retrieval; recommender systems evolve into agent planners; and isolated tools become multi-agent collectives able to decompose and delegate work. A worked example walks through agents co-planning a travel itinerary, highlighting orchestration and memory.

New pipes: MCP and agent-to-agent messaging

HTTP and RPC weren’t built for autonomous, negotiated workflows. The paper surveys emerging Model Context Protocol (MCP) interfaces and purpose-built agent-to-agent (A2A) messaging layers to support capability discovery, tool brokering and structured negotiations between services—foundational plumbing for an internet of bots.

The Agent Attention Economy

If algorithms once competed for human attention, services on the Agentic Web will compete to be selected by agents mid-plan. That reframes ranking, pricing and attribution around machine decision-makers—an attention market where tools, APIs and even other agents bid for inclusion in workflows.

What breaks (and who pays)

The authors predict “agent browsers” will disrupt today’s user-centric browsing model, shifting interfaces from manual clicks to delegated execution. They also flag a looming billing problem for complex, multi-step agent services that span providers and time windows—who gets paid, and how, when dozens of tools contribute to one outcome?

Risks, red teaming and defense

A full section maps threats across layers (prompt-/tool-injection, data exfiltration, compromised marketplaces), and compares human-in-the-loop versus automated red teaming for agent systems. The authors argue for hybrid approaches, inference-time guardrails, and controllable planning to keep autonomous workflows within safe bounds.

Why it matters

If the Agentic Web arrives, the primary “users” of the internet won’t be humans but agents negotiating with each other—demanding new protocols, marketplaces, governance and safety tooling. For startups, the opportunity is to build the pipes, policies and platforms that let those agents cooperate—and compete—reliably.

Paper link: arXiv 2507.21206 (PDF)

2.8.25

MetaStone-S1 makes “how long to think” a first-class dial—and it pays off

Frontier models are learning to trade more inference compute for better answers. MetaStone-S1 turns that trend into a clean architecture: a Reflective Generative Form where the policy and a process reward model live in the same network, adding a light 53M-parameter scoring head instead of a separate, heavyweight judge. The scoring head is trained self-supervised from outcome rewards—no step-by-step human labels—so the system can generate multiple chains of thought and select the best one efficiently.

Three “reasoning effort” modes, one model

Because the verifier is built-in, MetaStone-S1 exposes controllable thinking lengths—low, medium, high—implemented via different candidate counts (k = 2/8/32) at inference. That makes test-time scaling a product feature rather than a research trick.
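
The effort dial amounts to best-of-k selection with a built-in scorer. The sketch below shows only that selection logic; `toy_generate` and `toy_score` are hypothetical stand-ins, and the toy scorer peeks at a known answer, which a real scoring head cannot do.

```python
import random
from typing import Callable, Tuple

# Candidate counts per effort mode, as described in the post.
EFFORT_TO_K = {"low": 2, "medium": 8, "high": 32}


def best_of_k(
    prompt: str,
    effort: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Sample k candidates and return the one the scorer rates highest."""
    k = EFFORT_TO_K[effort]
    candidates = [generate(prompt, seed) for seed in range(k)]
    scored = [(score(prompt, cand), cand) for cand in candidates]
    best_score, best_candidate = max(scored, key=lambda pair: pair[0])
    return best_candidate, best_score


def toy_generate(prompt: str, seed: int) -> str:
    """Hypothetical sampler: a seeded random guess between 0 and 100."""
    return str(random.Random(seed).randint(0, 100))


def toy_score(prompt: str, candidate: str) -> float:
    """Hypothetical scorer: closeness to a known target of 42 (illustration only)."""
    return -abs(int(candidate) - 42)


for effort in ("low", "medium", "high"):
    answer, quality = best_of_k("pick a number", effort, toy_generate, toy_score)
    print(f"{effort:>6} (k={EFFORT_TO_K[effort]:>2}): answer={answer}, score={quality}")
```

Because each higher mode samples a superset of the lower mode's candidates here, the selected score can only stay the same or improve as k grows. The cost grows linearly with k, which is why a cheap scoring head matters.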

Benchmarks: o3-mini territory at 32B

Across AIME’24/’25 (math), LiveCodeBench (code), and C-Eval (Chinese reasoning), the 32B MetaStone-S1 variants lift accuracy over a strong 32B baseline and land in the same range as OpenAI o3-mini (medium). In the example table slice (Pass@1), MetaStone-S1-32B-high scores 85.2 on AIME’24, 73.6 on AIME’25, 64.2 on LiveCodeBench, and 89.7 on C-Eval, against 79.6 / 74.8 / 67.4 / 75.9 for o3-mini-medium. On these numbers the high mode leads clearly on AIME’24 and C-Eval and trails slightly on AIME’25 and LiveCodeBench.

At smaller scales, the 1.5B and 7B versions also beat peer open models (e.g., R1-Distill 7B/8B) on AIME and LiveCodeBench, showing the approach is not just a big-model hack.

Why this matters

Unified policy+PRM = cheaper selection. Sharing the backbone removes a second giant model from the loop and still delivers strong gains from external test-time scaling (TTS).
Label-free verifier training. The self-supervised process reward model (SPRM) head learns step scoring from outcome signals, sidestepping costly, noisy process annotations.
Production-ready knob. Teams can ship speed/quality dials (k=2/8/32) instead of maintaining separate models for different latency tiers.
Open release. Code and checkpoints are public, inviting replication and adaptation.

MetaStone-S1’s take-home: reasoning power isn’t only about bigger weights or longer chains—it’s about selecting the right trajectory at inference, with a verifier you can actually afford to run.

Paper link: arXiv 2507.01951 (PDF)

Computing Changes How We Think—But Creativity, Not Just GPUs, Will Decide AI’s Next Decade

In a wide-ranging Bloomberg interview, Dr. Wang Jian (founder of Alibaba Cloud) makes a forceful case that the era of AI “toy problems” is over. I agree. The last two years moved us from brittle demos to systems that reliably draft code, analyze documents, and support human decision-making. His analogy that more compute is like upgrading from a bicycle to a rocket is compelling: when the cost and scale of computation change, the feasible solution space—and our mental models—change with it.

Where I especially align is his view that markets are not just places to sell, but living testbeds where technology matures under real constraints. This resonates with best practices in ML ops: no benchmark, however well chosen, substitutes for deployment feedback. China’s dense competitive landscape, as he notes, creates short iteration loops—startups push features, rivals answer, users vote—accelerating collective learning. In ML terms, it’s a virtuous cycle of data, gradient steps, and evaluation at production scale.

I also appreciate his skepticism about tidy labels like AI → AGI → ASI. In practice, capability is a continuum: larger context windows, better tool use, richer memory, and planning—these blur categorical boundaries. Treating progress as increasing capability across tasks avoids false thresholds and keeps builders focused on measurable gains.

That said, I diverge on several points.

First, Dr. Wang downplays compute as a long-term bottleneck. I’m not fully convinced. While creativity and product insight absolutely dominate value creation, frontier training remains capital- and energy-intensive. Export controls, supply chain variability, and power availability still shape who can train or serve the most advanced models. For many labs, clever data curation and distillation help—but they don’t erase the physics and economics of scaling laws.

Second, on robotics, he frames AI as a new “engine” for an existing vehicle. Conceptually useful—but today’s embodied intelligence also requires tight integration across perception, control, simulation, and safety, not just swapping motors. Progress is real (foundation models for vision and language transfer surprisingly well), yet reliable grasping, long-horizon autonomy, and recovery from edge cases remain research frontiers. The “AI engine” metaphor risks underestimating those system-level challenges.

Third, the notion that no current advantage forms a durable moat is directionally optimistic and healthy for competition; still, moats can emerge from datasets with verified provenance, reinforcement-learning pipelines at scale, distribution, and compliance. Even if individual components commoditize, the orchestration (agents, tools, retrieval, evals, and workflow integration) can compound into real defensibility.

Finally, I agree with his emphasis that creativity is the scarcest input. Where I’d extend the argument is execution discipline: teams need evaluation harnesses, safety checks, and shipping cadences so creativity feeds a measurable loop. In other words, pair inspired ideas with ruthless metrics.

The upshot: Dr. Wang’s thesis—compute reshapes thinking, markets mature tech, creativity drives breakthroughs—captures much of what’s powering AI right now. My caveats don’t negate his vision; they refine it. The winners will be those who marry inventive product design with pragmatic engineering and acknowledge that, even in a marathon, hardware, data, and distribution still set the course.

Hierarchical Reasoning Model: a tiny, brain-inspired model that out-reasons giant CoT LLMs

Most frontier models “reason” by narrating token-by-token chains of thought. Sapient Intelligence’s Hierarchical Reasoning Model (HRM) argues you don’t need that narration—or billions of parameters—to solve hard puzzles. The 27 M-parameter model runs two coupled recurrent modules at different timescales (a slow H-module for abstract planning and a fast L-module for detailed computation) to perform deep latent reasoning in a single forward pass. Trained from scratch with no pretraining and no CoT supervision, HRM hits standout scores across inductive-reasoning and search-heavy tasks.

Why it works: depth without the usual pain

HRM’s core trick is hierarchical convergence: the fast L-module iterates to a local equilibrium, then the slow H-module updates once and “resets” the context for the next refinement cycle. This stacks many effective computation steps without the recurrence settling prematurely into a fixed point. To train it efficiently, the authors derive a one-step gradient approximation that avoids backpropagation-through-time, cutting memory from O(T) to O(1) per sequence.
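
The two-timescale loop can be sketched with scalar states. This is a toy under loud assumptions: the tanh updates and their coefficients are arbitrary choices of mine, not the paper's architecture, and the sketch covers only the forward pass, not the one-step gradient approximation.

```python
import math
from typing import List, Tuple


def l_step(z_l: float, z_h: float, x: float) -> float:
    """Fast module: refines its state given the current slow state and input."""
    return math.tanh(0.5 * z_l + z_h + x)


def h_step(z_h: float, z_l: float) -> float:
    """Slow module: updates once per cycle from the fast module's result."""
    return math.tanh(0.5 * z_h + z_l)


def hierarchical_forward(
    x: float, n_cycles: int = 4, t_steps: int = 8
) -> Tuple[float, int, List[Tuple[float, float]]]:
    """Run n_cycles slow updates, each preceded by t_steps fast updates."""
    z_h, z_l = 0.0, 0.0
    fast_updates = 0
    trace: List[Tuple[float, float]] = []
    for _cycle in range(n_cycles):
        for _step in range(t_steps):
            # With z_h held fixed, z_l iterates toward a local equilibrium.
            z_l = l_step(z_l, z_h, x)
            fast_updates += 1
        # The slow update shifts the target the fast module converges to,
        # so the next inner loop starts a fresh convergence phase.
        z_h = h_step(z_h, z_l)
        trace.append((z_h, z_l))
    return z_h, fast_updates, trace


output, depth, trace = hierarchical_forward(x=0.3)
print(f"output={output:.4f}, effective fast-module steps={depth}")
```

The point of the nesting is that effective depth is the product n_cycles × t_steps rather than a single recurrence that converges once and then stops doing useful work.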

There’s also an adaptive halting head (a small Q-learner) that decides whether to stop or continue another reasoning segment, enabling “think-more-if-needed” behavior at inference time—useful when a problem demands longer planning.
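
A halting rule of this kind can be sketched as a loop that compares two scores after each reasoning segment. The sketch is illustrative only: `toy_segment` and `toy_q_head` are hypothetical, and the simple "halt when the halt score exceeds the continue score" rule is my simplification, not the paper's exact training or inference recipe.

```python
from typing import Callable, Tuple


def run_with_halting(
    state: float,
    segment: Callable[[float], float],
    q_head: Callable[[float], Tuple[float, float]],
    max_segments: int = 8,
) -> Tuple[float, int]:
    """Run reasoning segments until the head prefers halting or the cap is hit."""
    used = 0
    for used in range(1, max_segments + 1):
        state = segment(state)
        q_halt, q_continue = q_head(state)
        if q_halt > q_continue:
            break
    return state, used


def toy_segment(error: float) -> float:
    """Hypothetical reasoning segment: each pass halves the remaining error."""
    return error / 2


def toy_q_head(error: float) -> Tuple[float, float]:
    """Hypothetical halting head: prefers to stop once the error is small."""
    return 1.0 - error, error


easy_state, easy_segments = run_with_halting(1.0, toy_segment, toy_q_head)
hard_state, hard_segments = run_with_halting(8.0, toy_segment, toy_q_head)
print(f"easy problem: {easy_segments} segments, hard problem: {hard_segments} segments")
```

Raising `max_segments` at inference time is the "think more if needed" dial: easy inputs still stop early, while hard ones are allowed to keep going.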

The receipts

With roughly 1,000 training examples per task, HRM posts numbers that would make far larger CoT systems blush:

ARC-AGI-1: 40.3 %, beating o3-mini-high (34.5), Claude-3.7 8K (21.2) and DeepSeek-R1 (21.0); a Transformer trained directly on IO pairs manages 15.8.
ARC-AGI-2: HRM reaches 5.0 % where strong CoT baselines hover near zero—consistent with the benchmark’s step-up in compositional difficulty.
Sudoku-Extreme (9×9, 1k ex.): 55.0 % accuracy; on Sudoku-Extreme-Full (3.83 M puzzles), HRM reaches near-perfect accuracy.
Maze-Hard (30×30, 1k ex.): 74.5 % optimal-path success—where CoT baselines flatline.

What this means for builders

Latent > linguistic reasoning: HRM shows you can get deep, backtracking-style reasoning inside hidden states—no verbose CoT, fewer tokens, lower latency.
Tiny models, big compute depth: By recycling computation through nested recurrent cycles, HRM attains “depth” that standard Transformers don’t, even when you stack layers.
Knob for “thinking time”: The halting mechanism effectively scales compute at inference—handy for tasks like Sudoku where a few extra cycles pay off more than on ARC-style transformations.

Dataset & evaluation notes

Sudoku-Extreme combines easier Kaggle-style puzzles with community “forum-hard” sets; difficulty is measured by average backtracks (≈22 per puzzle on the new subset—much tougher than common datasets). Maze-Hard requires optimal 30×30 paths; ARC-AGI results follow the official challenge protocols with standard augmentations.

If the open-sourced code (the paper links a GitHub repo) spurs replication, expect a wave of BPTT-free recurrent designs and “reason-more-on-demand” controls to show up in lightweight agents, especially where token budgets and latency matter more than eloquent chains of thought.

Paper link: arXiv 2506.21734 (PDF)