21.6.26

Agent Loops, Explained Simply (And Why They're Not Magic)

There's a lot of buzz online lately about "agent loops" and "loop engineering," with some people claiming you shouldn't even prompt your AI anymore — you should build systems that prompt it for you. A recent video does a nice job cutting through the hype and explaining what this actually means. Here's a plain-English version, plus an honest take on what's useful and where the hype gets ahead of reality.

What's an agent loop?

Strip away the jargon and a loop is simple. Instead of asking the AI to do something once and accepting whatever comes back, you let it repeat a cycle: reason (figure out what to do), act (do it), and observe (check the result). It keeps going — adjusting each time — until it hits a goal you've defined, then stops and tells you it's done.

The video compares it to a smart intern you don't micromanage. You hand them a goal and a clear definition of "finished," and they figure out the steps, check their own work, and only come back when it's actually done.

A loop really comes down to three things: a trigger (what starts it), an action (what it does), and a stop condition (how it knows to quit). And two pillars matter most — a clear goal and a way to verify progress.
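
To make the trigger / action / stop-condition idea concrete, here's a minimal sketch in Python. It isn't from the video: `run_loop`, `toy_act`, and `has_three_points` are hypothetical names, and the "action" is a toy function standing in for a real model call. The point is only the shape: act, check against a measurable goal, and stop either when the check passes or when the iteration budget runs out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    output: str
    iterations: int
    done: bool


def run_loop(goal_check: Callable[[str], bool],
             act: Callable[[str, int], str],
             start: str,
             max_iterations: int = 10) -> LoopResult:
    """Repeat act -> observe until goal_check passes or the budget is spent."""
    output = start
    for i in range(1, max_iterations + 1):
        output = act(output, i)      # act: produce a new attempt
        if goal_check(output):       # observe: verify against the goal
            return LoopResult(output, i, True)
    # Stop condition of last resort: give up instead of running forever.
    return LoopResult(output, max_iterations, False)


def toy_act(draft: str, step: int) -> str:
    """Toy stand-in for a model call: each pass adds one more sentence."""
    return (draft + f" Point {step}.").strip()


def has_three_points(draft: str) -> bool:
    """A measurable 'done' check: the draft has at least three sentences."""
    return draft.count(".") >= 3


result = run_loop(has_three_points, toy_act, start="")
print(result.iterations, result.done)  # 3 True
```

The `max_iterations` cap matters as much as the goal check: without it, a loop whose check never passes just keeps running.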

Why this is genuinely useful

The big insight is that AI rarely nails something on the first try. Normally you would look at the output, give feedback, and ask for changes — over and over until it's good enough. A loop just hands that tedious back-and-forth to the AI itself, so it arrives much closer to "good" before it ever shows you the result.

In the video, the creator builds things this way: thumbnails scored against a rubric, a 3D plane that the AI repeatedly renders and inspects, even video edits where the AI cuts pauses and checks its own timing. The magic of "I did this in one prompt" is usually just a well-designed loop with verification baked in.

Where the hype gets ahead of reality

Here's the refreshing part of the video: most tasks don't need elaborate loops. You don't need five agents running 24/7 commanding other agents. The creator admits that for his own knowledge work, a single agent with a good prompt does the job most of the time, and that round-the-clock "swarms" mostly make sense for full-time coders building software.

The honest takeaways worth remembering:

Loops aren't meant to deliver 100% perfect output — they just get you much closer, faster. In the video, the AI's attempt to recreate a famous photo in code still came out looking nothing like the original.
A loop is only as good as its "done" check. Vague criteria like "until you're satisfied" lead to vague results; the best loops aim for something measurable.
Just because an expert online does it doesn't mean you should. AI shows up differently in every job — staying informed isn't the same as copying someone else's setup.

The part that stays human

Notice what you still own in all of this. You define the goal, and you decide how "done" gets verified. The video is blunt that AI is never perfect and a loop without a good verification check just produces polished-looking mistakes faster. Some runs even went 12 hours and weren't useful.

So the human role doesn't disappear — it moves up a level. Instead of correcting every draft, you're designing clear goals, sensible stop conditions, and real checks — and reviewing the final result before trusting it.

The bottom line

Agent loops are a genuinely smart idea: let the AI handle its own feedback-and-iterate cycle so you get better results sooner. But they're a tool, not a miracle. Start simple, define "done" clearly, and keep checking the work — because a loop will happily repeat a mistake until you tell it not to.

Anthropic Just Made Claude Agents Boring. That's the Whole Point.

The flashy AI announcements get the headlines — new model, higher benchmark, longer context. But if you've ever tried to actually deploy an agent inside a company with a security team, you know the model was never the hard part. The hard part is the question every CISO asks in the first meeting: how does this thing touch our systems without becoming the breach we read about next quarter?

On June 18, Anthropic answered the last open piece of that question. And the answer is delightfully unglamorous.

The news: identity finally caught up

Anthropic shipped Enterprise-Managed Authorization for Claude's MCP connectors. In plain terms: an admin provisions a connector once through the company's identity provider, and every employee inherits access automatically on first login. No individual OAuth consent screens. No "click allow" forty thousand times across the org. When someone leaves or changes roles, their connector access gets revoked alongside every other app — because it's governed by the same identity rules.

Okta is the first identity provider, using its Cross App Access plumbing. The connectors live at launch: Asana, Atlassian, Canva, Figma, Granola, Linear, and Supabase, with Slack rolling out. It's in beta, Team and Enterprise plans only.

If you've never run IT for a large org, this sounds like a footnote. If you have, you know it's the difference between "we piloted it with five people" and "it's live for the whole company."

The two pieces this builds on

Here's the context most of the recap posts are getting wrong: this isn't a standalone drop. It's the capstone on two features Anthropic shipped a month earlier, on May 19. Together the three are the actual enterprise story.

Self-hosted sandboxes (public beta) moved tool execution out of Anthropic's infrastructure and into an environment you control — your own infra, or a managed provider like Cloudflare, Daytona, Modal, or Vercel. The clever bit is the split: the agent loop that handles orchestration, context, and error recovery stays on Anthropic's side, but the code actually runs inside your perimeter. Your files don't leave. Your network policies and audit logging already apply. You set the compute.

MCP tunnels (research preview) solved the other half. Your agents can now reach MCP servers sitting inside your private network without exposing them to the public internet. A lightweight gateway you deploy makes a single outbound connection — no inbound firewall rules, no public endpoint, encrypted end to end. Your internal Postgres, your private APIs, your ticketing system become tools the agent can call, and none of them ever face the open web.

Why the trio matters

Line them up and you can see the strategy. The classic enterprise objection to AI agents has always been some version of "we can't let an external service into our internal systems." Anthropic just dismantled that objection at three different layers at once.

Tunnels mean no public endpoint and no VPN exceptions. Sandboxes mean code execution never leaves your walls. And enterprise-managed auth means access is provisioned and revoked through the identity system you already trust. Each one removes a specific veto that a security team can throw. Stack them and the "no" gets very hard to justify.

That's the real shift here. The bottleneck for enterprise AI was never reasoning quality. It was governance, and governance is exactly what this release is about.

The honest caveats

I'd be doing you a disservice if I made this sound finished. MCP tunnels is still a research preview: you request access, and it isn't broadly available. Self-hosted sandboxes is in public beta, which means it's real but you should expect rough edges. And enterprise-managed auth is in beta, Team and Enterprise only, with Okta as the sole identity provider for now. If your stack runs on a different IdP, you're waiting.

So this is a direction, not a finished product. But it's the right direction, and it's further along than anything comparable from the other labs.

The takeaway

This release won't trend the way a new model does. There's no benchmark to screenshot. But if your job is getting agents past a security review and into production, this is the most important thing Anthropic has shipped this quarter. They made the deployment story boring — predictable, governable, auditable — and boring is precisely what enterprise buyers have been waiting for.

The model was always good enough. Now the plumbing is catching up.

20.6.26

Building an Affiliate Marketing Business with AI: An Honest, Friendly Look

There's a video making the rounds where someone claims to build an entire affiliate marketing business in about an hour — a website, Pinterest pins, an email system, even the emails themselves — using Claude plus an AI tool called GenSpark. It looks almost magical. So is it real, and should you try it?

Here's a plain-English take on what's genuinely great about the idea, what's harder than it looks, and the one habit you can't skip.

The idea in a nutshell

Affiliate marketing just means promoting someone else's product and earning a commission when people buy through your link. The video's plan is simple: pick a niche (say, kitchen gadgets), build a clean website with AI, add an email signup with a free guide, and create eye-catching Pinterest pins that send curious people to your site. AI does most of the heavy lifting — writing, designing, and even building the website from a single prompt.

What's genuinely good about it

The biggest win is speed. Things that used to take days — designing a website, writing emails, making pins in Canva — can now come back in minutes. For someone starting out with no budget for a designer or developer, that's a real head start.

It's also more approachable than ever. You describe what you want in normal language and watch the website build itself, with no code to touch. And the underlying strategy is sound: sending people to your own site and capturing emails (so you "own" your audience) is smarter than dropping raw affiliate links on social media and hoping.

Finally, it's easy to experiment. Once you've built one funnel, you can repeat it across niches and see what sticks.

What's harder than the video makes it sound

A polished website is the easy 10%. The hard 90% is getting actual people to visit — and that part the video mostly skips. Pinterest, traffic, and steady sales take time, consistency, and a bit of luck. Most affiliate sites earn little or nothing for a long while.

There are also rules you have to follow, not optional extras. Amazon and other programs require you to clearly disclose that your links are affiliate links, and they have strict terms you can get banned for breaking. AI won't handle that compliance for you.

And be skeptical of the "people are making money with this" framing. Real money is possible, but these videos rarely show the failures, the months of effort, or the fact that the person most likely to make money is often the one selling you the tools.

The rule you can't skip: check everything yourself

This is the part to underline. AI makes mistakes, and a human always needs to review the work before it goes live.

AI will confidently invent product details, quote wrong prices, recommend items that are out of stock, or write claims about a product that simply aren't true — and it sounds just as sure when it's wrong. In affiliate marketing, that's not just embarrassing; misleading claims can break platform rules or even consumer-protection laws.

So treat every output as a first draft. Before anything is published, verify each product, price, and link is real and current, read every email and pin for accuracy and honest claims, and make sure your affiliate disclosures are clearly visible. You are the editor and the one responsible for what your audience sees — not the AI.

The bottom line

The tools really can collapse hours of work into minutes, and that's exciting, especially if you're not technical. But building the site is the beginning, not the business. Go in with realistic expectations, follow the disclosure rules, and keep a human firmly in the loop. AI can do the building — you do the checking.

Can You Really Build a Whole AI Marketing Team? A Friendly Look at the Idea

There's a popular video going around that promises something pretty wild: turn Claude into a full marketing team — five "agents" and a dozen "skills" — all working together to research, write, design, and even build landing pages for you. And the best part of the pitch? "Even if you're not technical, let's go."

It's a genuinely exciting idea. But before you dive in, here's an honest, plain-English take on what's great about it, what's tricky, and the one rule you should never skip.

What's the big idea?

Think of it like hiring a small team that never sleeps. You give the AI some "skills" (reusable instructions for tasks you do all the time, like making a branded slide deck or writing a blog post) and a few "agents" (specialists, like a data analyst or a content writer). Then you hand it a job — "launch our summer campaign" — and it produces research, social posts, images, and a landing page, mostly on its own.

In the video, it works impressively well. The decks follow the brand template, the images match the style, and the whole package looks professional.

What's genuinely good about this

The most appealing part is the time saved. Tasks that used to eat a whole afternoon — drafting posts, pulling a report together, mocking up a deck — can come back in minutes. For a small business or a solo marketer, that's a real gift.

It's also more approachable than it used to be. You're mostly talking to the AI in normal language, not writing code. And the idea of building reusable "skills" is smart: you teach it your style once, and it remembers. That consistency is hard to get when you're rushing.

Finally, it lowers the barrier to trying things. Want three versions of a campaign to compare? You can have them quickly, then pick what actually works.

What's not so easy (especially if you're non-technical)

Here's the honest part. The video says "even if you're not technical," but the setup involves downloading VS Code, installing tools, connecting "MCPs," and editing configuration files. That's a fair bit more technical than the friendly framing suggests. None of it is impossible to learn, but expect a real learning curve, not a five-minute setup.

There's also a cost to maintaining all this. Skills and agents need updating as your brand and goals change. And the more you pile on, the more you have to keep organized.

The rule you should never skip: check everything

This is the most important point, so I'll say it plainly. AI makes mistakes, and a human always needs to check the work.

Even in the video, the creator admits the result is "90% done" and that some charts still need fixing. That last 10% matters enormously. AI can confidently state facts that are wrong, invent statistics, misquote a source, or get your pricing or product details subtly off. It won't always tell you when it's unsure — it often sounds just as confident when it's wrong as when it's right.

So treat every output as a first draft, never a finished product. Before anything goes public:

Fact-check the claims and numbers against a trusted source.
Read it for tone and accuracy — does it actually sound like your brand, and is everything true?
Double-check names, prices, links, and dates, which AI gets wrong surprisingly often.

You are the editor-in-chief. The AI is a fast, tireless assistant — not a replacement for your judgment.

The bottom line

Building an "AI marketing team" is a powerful idea, and the tools really can save you hours. If you're non-technical, go in knowing the setup is more involved than it looks, and start small with one or two simple skills.

Most of all, keep a human in the loop. AI can do the heavy lifting, but a person should always have the final read before anything reaches your audience.

5.4.26

I'm Stealing This AI Researcher's Workflow for My Own Projects

Karpathy doesn't use a fancy app to manage his research. He uses a folder, Obsidian, and an AI — and I want to copy it.

He posted about it last week. The short version: he dumps raw material — articles, notes, papers, images — into a folder, then lets a large language model (LLM — the AI brain behind tools like Claude or ChatGPT) build a wiki from it automatically. The LLM writes the summaries, creates the links between ideas, organizes everything into categories. He barely touches the wiki himself. When it gets big enough, he asks it questions and gets answers pulled from his own research.

I've been sitting with this for a few days, thinking about what it would look like for my work.

---

What My Work Actually Looks Like

I build things. Agents, content apps, Claude Code workflows, automation scripts. A lot of what I do involves figuring something out — what tool does what, how to wire two things together, what prompt pattern produces the right output, what broke last time and why.

Most of that knowledge lives in my head, or in scattered notes, or in past conversations I can't find anymore.

That's the problem. Every time I start something new, I spend time re-learning things I already know. What flags to use in Claude Code. What agent structure works for what kind of task. What API response format caused everything to break last month.

Karpathy's idea is simple: stop keeping that knowledge in your head. Dump it in a folder. Let the AI organize it. Ask it back when you need it.

---

The Specific Thing I Keep Thinking About

He mentioned that his wiki grows and gets more useful with every question he asks. He asks something, the AI goes through his notes and answers it — and then he saves that answer back into the wiki. So every session adds something. Nothing gets lost.

That hit me because the opposite is true for how I work right now. Every build session ends and most of the small things I figured out just disappear. The next session starts almost from scratch on some of the same ground.

If I had a knowledge base for my Claude Code workflows alone — prompts that worked, structures that didn't, patterns I figured out, error fixes — and an AI that could surface the right piece when I needed it, I'd stop repeating myself.

---

The Part That Actually Excited Me

He also runs "health checks" on his wiki. He asks the AI to find gaps, spot inconsistencies, and find connections between ideas he hadn't noticed yet. The AI suggests new things to look into.

That's the part I can't stop thinking about.

Not just a system that stores what I know. A system that notices what's missing. For someone building content automation apps, that means the system isn't just remembering what tools I've used — it's noticing when two things I built separately could be connected. It's pointing to the next piece.

That changes how building feels. Less like starting from zero every time, more like picking up a thread.

---

What I'm Going to Test

I'm starting with one folder. My Claude Code workflows — the scripts, prompts, notes, fixes, things that broke and how I solved them.

I'll ask Claude to read through everything and build an index: summaries of each file, links between related ideas, a map of what I already know.
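
Here's a rough sketch of what that index could look like as data, so I know what I'm asking for. To be clear about the assumptions: this is not Karpathy's setup and it doesn't call an LLM. The "summary" is just each note's first line, and "related" notes are found by crude keyword overlap, both stand-ins for what the model would actually write. `build_index` and the sample file names are made up for illustration.

```python
import re
import tempfile
from pathlib import Path


def build_index(folder: Path, min_shared: int = 2) -> dict:
    """Map each .md note to a placeholder summary and a list of related notes."""
    notes = {p.name: p.read_text(encoding="utf-8")
             for p in sorted(folder.glob("*.md"))}
    # Crude keywords: lowercase words of five or more letters.
    words = {name: set(re.findall(r"[a-z]{5,}", text.lower()))
             for name, text in notes.items()}
    index = {}
    for name, text in notes.items():
        stripped = text.strip()
        summary = stripped.splitlines()[0] if stripped else ""
        related = [other for other in notes
                   if other != name
                   and len(words[name] & words[other]) >= min_shared]
        index[name] = {"summary": summary, "related": related}
    return index


# Demo on three throwaway notes in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "streaming.md").write_text(
        "Streaming output fix\nAgent dropped chunks when streaming output.",
        encoding="utf-8")
    (folder / "agents.md").write_text(
        "Multi-step agent pattern\nPlanner agent plus workers; streaming output per step.",
        encoding="utf-8")
    (folder / "billing.md").write_text(
        "Invoice notes\nSend invoices monthly.", encoding="utf-8")
    index = build_index(folder)

for name, entry in index.items():
    print(name, "->", entry)
```

In the real version, the summary and the links would come from the model reading each file. The useful part of the sketch is the target shape: one entry per file, a short summary, and explicit links I can follow.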

From there, I'll ask it questions mid-project. "What pattern did I use last time for a multi-step agent?" "What was the issue I kept hitting with streaming output?" Instead of digging through old files or trying to remember, I just ask.

I'm not building the full Karpathy setup yet. I'm testing whether the core idea holds: does having a searchable, AI-organized version of my own work actually save time and reduce the re-learning?

14.3.26

I Told an AI to Build a Full Newsletter. One Sentence. No Code.

I used to spend half a day on client newsletters. Last week I did it in 15 minutes — and I'll show you exactly what I typed.

🤖 What Claude Code Actually Is

Claude Code is an extension you install inside VS Code — a free code editor. Once it's set up, you have an AI agent sitting inside your project that can read files, write code, call APIs, fix its own errors, and run automations — all through a chat window.

You talk to it. It builds things. That's the whole loop. You don't need to know how to code. You need to know what you want.

🧪 What I Actually Tested

I set up a project, gave the agent a brand logo and a color guide, connected a few API keys (one for research, one for Gmail), and typed one prompt. That was it.

What came back:

Research pulled from live sources
Three branded infographics generated with AI images
Full HTML newsletter formatted to the brand's colors and fonts
Email sent directly to a list via Gmail
Sources linked at the bottom, clickable

It hit errors along the way. An API endpoint had changed. A Gmail formatting issue broke the layout. Both times, it found the problem, explained what happened, fixed the tool, and kept running. I didn't touch anything.

🏗️ How the Whole Thing Is Organized

The system behind this has three components:

Component	What It Is
Pipeline	The instructions, written in plain English — like a step-by-step recipe
Tools	The specific actions the agent can take (research, generate image, send email)
Orchestrator	Claude Code itself, reading the pipeline and running each tool in order

You also give it a file called CLAUDE.md — a standing brief that tells the agent what the project is, where things live, and what it's supposed to do. Every time you open the project, it reads that first.

Once you've built and tested a pipeline, you can schedule it to run on its own — every Monday morning, every time a form is submitted, whatever you need. At that point the agent isn't running live. The code it built is. It runs predictably, like any other automation.
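
If it helps to see the pipeline / tools / orchestrator split as code, here's a toy sketch. It is not what Claude Code generates. The tool functions are hypothetical stand-ins that call no real APIs, and `orchestrate` is my own name for the idea of "run each step in order."

```python
from typing import Callable, Dict, List

# Tools: the specific actions available. These are toy stand-ins;
# none of them touches a real research, image, or email API.
def research(state: dict) -> dict:
    state["facts"] = ["fact A", "fact B"]
    return state

def write_html(state: dict) -> dict:
    items = "".join(f"<li>{fact}</li>" for fact in state["facts"])
    state["html"] = f"<ul>{items}</ul>"
    return state

def send_email(state: dict) -> dict:
    state["sent"] = True  # a real tool would call an email API here
    return state

TOOLS: Dict[str, Callable[[dict], dict]] = {
    "research": research,
    "write_html": write_html,
    "send_email": send_email,
}

# Pipeline: the recipe, as an ordered list of step names.
PIPELINE: List[str] = ["research", "write_html", "send_email"]

def orchestrate(pipeline: List[str], tools: Dict[str, Callable[[dict], dict]]) -> dict:
    """Orchestrator: run each pipeline step in order, passing state along."""
    state: dict = {}
    for step in pipeline:
        if step not in tools:
            raise KeyError(f"pipeline step has no matching tool: {step}")
        state = tools[step](state)
    return state

result = orchestrate(PIPELINE, TOOLS)
print(result["html"])  # <ul><li>fact A</li><li>fact B</li></ul>
```

This is also why the scheduled version runs predictably: once the steps and tools are fixed, nothing in the run depends on the agent improvising.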

💼 What This Means for VAs and Small Business Owners

You don't need to learn to code. You need to be able to describe a process clearly.

If you manage newsletters, reports, client updates, or social content for clients — build a pipeline, test it once, let it run. If you spend hours every week on the same repetitive tasks in your own business, same idea.

What makes this different from something like Zapier is what happens during the build. The agent adapts. It hits an error, investigates, fixes the tool, and updates the instructions so the same thing doesn't break again. The final automation is more reliable — and you got there faster than mapping every node by hand.

💰 If You're Thinking About Offering This as a Service

Businesses aren't paying for a pipeline or a blueprint. They're paying for a solved problem.

Don't open with "do you want AI automation?" Ask: where are you losing the most time or money right now? Find that, build the thing that fixes it, and price it on what it's worth — hours saved, errors cut, costs gone — not on how long you spent building it.

A pipeline that saves a client 20 hours a week isn't a half-day job. Over a year, that's tens of thousands of dollars in time they're not paying someone else to cover. Price it that way.
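
The arithmetic behind that, as a quick sketch. The $25 hourly rate is purely my assumption for illustration; plug in whatever the client actually pays for that time.

```python
def annual_value(hours_saved_per_week: float,
                 hourly_rate: float,
                 weeks_per_year: int = 52) -> float:
    """Yearly value of time saved, at a given hourly cost."""
    return hours_saved_per_week * hourly_rate * weeks_per_year

value = annual_value(20, 25)  # 20 hours/week at an assumed $25/hour
print(f"${value:,.0f} per year")  # $26,000 per year
```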

🚀 Where to Start


Tool	Claude Code — runs inside VS Code (free to download)
Cost	Paid Claude subscription — $17/month on the Pro plan
Setup time	~1 hour, most of it connecting API keys

Pick one task you do every single week. Something repetitive, something predictable. Let the agent build the first version, watch it run, fix what doesn't land, and build from there.

The hardest part isn't the tool. It's deciding what to automate first.

This Paper Says We've Been Fine-Tuning the Hard Way

I was reading a paper that dropped in March 2026 and about three paragraphs in, I had to stop. The claim seemed too simple: you can adapt a large AI model to a specific task without gradient descent at all. Just random sampling and a vote.

What Fine-Tuning Actually Is

When an AI model gets trained, it learns general patterns from a massive dataset. Fine-tuning is the step where you take that general model and push it toward a specific task (coding, reasoning, following instructions) using supervised training, reinforcement learning, or other optimization algorithms.

Methods like PPO and GRPO (types of reinforcement learning commonly used to fine-tune large language models) work well. But they're expensive, require careful setup, and involve a lot of iteration.

The Paper's Core Claim

Title: Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
Authors: Yulu Gan and Phillip Isola
Published: March 12, 2026 · arXiv: 2603.12228

The idea: in large, well-pretrained models, the weight space around the original parameters is already densely packed with useful task-specific solutions. The authors call these clusters "neural thickets."

In smaller models, good solutions are scattered — you need gradient-based search to find them. In large models, they're close to where you already are.

The Method — Remarkably Simple

They call it RandOpt. Here's how it works:

Take the pretrained model weights
Randomly sample N small perturbations of those weights
Evaluate each perturbation on your task
Keep the top K performers
Combine them with a majority vote

No gradients. No reward model. No RL training loop. Perturb, evaluate, vote.
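
To get a feel for the five steps, I wrote a toy version. Big caveats: this is not the authors' code (that lives in their repo), and it uses a tiny three-weight linear classifier instead of a language model. The "pretrained" weights, `sigma`, `n_samples`, and `top_k` are all values I picked for illustration. It only shows the perturb, evaluate, keep, vote shape.

```python
import random

random.seed(0)  # fixed seed so the toy run is repeatable

TRUE_W = [1.0, -2.0, 0.5]  # hidden rule that defines the toy task

def predict(w, x):
    """Binary prediction from a linear model with weights w."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def make_data(n):
    """Random inputs, labeled by the hidden rule."""
    X = [[random.gauss(0, 1) for _ in TRUE_W] for _ in range(n)]
    return X, [predict(TRUE_W, x) for x in X]

def accuracy(w, X, y):
    return sum(predict(w, x) == label for x, label in zip(X, y)) / len(y)

def rand_opt(w0, X, y, n_samples=50, top_k=5, sigma=0.5):
    """Sample random perturbations of w0 and keep the top_k by task score."""
    candidates = [[wi + random.gauss(0, sigma) for wi in w0]
                  for _ in range(n_samples)]
    candidates.sort(key=lambda w: accuracy(w, X, y), reverse=True)
    return candidates[:top_k]

def majority_vote(experts, x):
    """Predict 1 only if more than half of the experts predict 1."""
    votes = sum(predict(w, x) for w in experts)
    return 1 if 2 * votes > len(experts) else 0

X_train, y_train = make_data(200)
X_test, y_test = make_data(200)

# Stand-in for pretrained weights: near the task, but not tuned for it.
pretrained_w = [1.8, -1.2, -0.3]

experts = rand_opt(pretrained_w, X_train, y_train)
ensemble_acc = sum(majority_vote(experts, x) == label
                   for x, label in zip(X_test, y_test)) / len(y_test)
print(f"pretrained: {accuracy(pretrained_w, X_test, y_test):.2f}  "
      f"ensemble: {ensemble_acc:.2f}")
```

Nothing here computes a gradient. Whether the ensemble beats the starting weights in a toy like this depends on how close the starting point already is, which is the paper's whole argument about large pretrained models.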

💡 Code at github.com/sunrainyg/RandOpt

What the Results Showed

RandOpt kept up with PPO, GRPO, and evolution strategies (ES) on the tasks they tested, and this held on large-scale contemporary models.

That stopped me. These optimization methods have entire research communities, years of papers, and significant infrastructure built around them. The idea that randomly sampling neighbors of the original weights and taking a vote can match them says something about what's already sitting inside a well-trained model — waiting, not absent.

The way they frame it: small models have sparse expert solutions, so you need search to find them. Large models have dense ones. The pretrained weights aren't just a starting point. They're already rich.

What I Took From This

This paper shifted something in how I think about pretraining. We usually assume the pretrained model is a rough starting point and fine-tuning is where the real work happens. This flips that. If the solution space is already dense around the initial weights, the pretrained model is doing more work than we give it credit for.

There's also a question I don't have a full answer to yet: if random sampling with ensemble voting matches expensive RL fine-tuning, what does that mean for how we should be spending compute? I'm still working through the paper and the code, but it's the kind of result that sits with you.

Full paper: arXiv 2603.12228

12.3.26

Google Just Replaced Five AI Search Tools With One

Have you ever tried searching through a client’s content library where videos are in one folder, PDFs in another, and audio recordings scattered everywhere?

That’s the reality for most content libraries.

Until now, AI search tools struggled with this kind of setup because each type of content needed a different system to process it.

But that may be changing.

Gemini Embedding 2 — recently released by Google — can search across text, images, audio, video, and PDFs at the same time, without converting everything first.

For anyone managing knowledge bases, course content, research archives, or client media libraries, this could be a major shift.

What an Embedding Model Actually Does

Before explaining why this matters, it helps to understand what an embedding model is.

When AI systems search through content, they don’t read information the same way humans do. Instead, they convert content into numerical representations that capture meaning.

For example:

A sentence about a cat
A photo of a cat

Both produce similar number patterns, which allows the AI to recognize that they are related.

That’s how modern AI-powered search works.

The tool responsible for converting content into these numerical representations is called an embedding model.
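
Here is a small sketch of what "similar number patterns" means in practice. The three vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions and come from the model, not from me), but the comparison itself, cosine similarity, is the standard way to measure how close two embeddings are.

```python
import math

def cosine_similarity(a, b):
    """Close to 1.0 means same direction (similar meaning); near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings, three dimensions only, for illustration.
sentence_about_cat = [0.9, 0.1, 0.3]
photo_of_cat = [0.8, 0.2, 0.4]
invoice_pdf = [0.1, 0.9, 0.2]

cat_vs_cat = cosine_similarity(sentence_about_cat, photo_of_cat)
cat_vs_invoice = cosine_similarity(sentence_about_cat, invoice_pdf)
print(f"sentence vs photo:   {cat_vs_cat:.2f}")
print(f"sentence vs invoice: {cat_vs_invoice:.2f}")
```

The sentence and the photo score much higher with each other than either does with the invoice, and that gap is what a search system ranks on.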

The Problem With Older AI Search Systems

Until recently, every type of content required a different embedding system.

Typical setups looked like this:

Text → processed by a text embedding model
Images → processed by image models such as CLIP or SigLIP
Audio → first transcribed using systems like Whisper
Video → broken into frames or transcripts
PDFs → converted into plain text

This created several issues:

Multiple models to manage
Several conversion steps
More chances for things to break
Slower search performance

In many cases, five different pipelines were required just to search one content library.

What Gemini Embedding 2 Changes

Gemini Embedding 2 solves this by creating one shared search space for multiple content types.

Instead of converting everything separately, the model processes different media formats directly and places them into the same semantic search system.

That means a single query can return results from:

Documents
Images
Audio clips
Video files
PDFs

All at once.

For example, you could:

Upload a photo and find related videos
Submit a voice recording and find matching documents
Search inside PDF files without converting them
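
A rough sketch of why one shared space simplifies things: once every file has a vector in the same space, search is a single ranking step regardless of media type. The file names and vectors below are invented for illustration, and nothing here calls the Gemini API. In a real system the vectors would come from the embedding model.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# One index for every media type: (file name, media type, made-up embedding).
INDEX = [
    ("intro_video.mp4", "video", [0.8, 0.1, 0.5]),
    ("pricing.pdf", "pdf", [0.1, 0.9, 0.2]),
    ("episode_12.mp3", "audio", [0.7, 0.2, 0.6]),
    ("team_photo.png", "image", [0.2, 0.3, 0.9]),
]

def search(query_vector, index, top_k=2):
    """Rank every item by similarity to the query, whatever its media type."""
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(query_vector, item[2]),
                    reverse=True)
    return [(name, media) for name, media, _ in ranked[:top_k]]

query = [0.9, 0.1, 0.4]  # stand-in for an embedded photo or voice note
results = search(query, INDEX)
print(results)  # [('intro_video.mp4', 'video'), ('episode_12.mp3', 'audio')]
```

Compare that with the older setup, where each media type would need its own index and its own query path before results could be merged.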

Supported Input Types

Gemini Embedding 2 currently supports multiple media types in one system:

Text
Up to 8,192 tokens (roughly 6,000 English words)

Images
Up to six images in one request

Audio
Raw audio files — no transcription required

Video
Clips up to two minutes long

PDFs
Original files can be processed without converting to plain text

All of this works through one model instead of multiple specialized ones.

Combining Multiple Inputs in One Search

One interesting feature is the ability to combine different types of input into a single query.

For example, you might have:

A photo of a product
A text description of what you want

Both can be submitted together, and the system generates one combined embedding representing the meaning of both inputs.

This allows searches that were previously impossible using single-modality tools.

Easy Integration for Developers

Another surprising detail is how quickly developers can start using it.

Gemini Embedding 2 launched with support for popular AI development frameworks and vector databases, including:

LangChain
LlamaIndex
ChromaDB
Qdrant

Because many AI applications are already built on these tools, developers can integrate the model without building new infrastructure from scratch.

It’s available through:

Google AI Studio (free tier for experimentation)
Vertex AI (enterprise deployment)

Why This Matters for Virtual Assistants and Content Managers

Think about the kinds of content many clients manage.

A podcast brand might have:

Audio episodes
Show notes
PDFs
Promotional images

A course creator may have:

Video lessons
Slide decks
Written summaries

A consultant might maintain:

Recorded calls
Presentations
Research reports

Searching across all of that in a single step has been extremely difficult.

With models like Gemini Embedding 2, developers can build search tools where one query instantly returns:

the right video segment
the correct slide
the relevant document section

All from one search bar.

The Bigger Picture

You probably won’t interact with Gemini Embedding 2 directly.

Instead, it will power the next generation of search tools used in:

knowledge management systems
research databases
course platforms
internal company search tools

But knowing that technology like this exists helps you understand what’s becoming possible.

That knowledge can make a big difference when clients start asking about AI-powered search, automation, or content organization systems.

If you manage content libraries, research archives, or client knowledge bases, this is a technology worth paying attention to.

The tools many teams will rely on in the near future are already being built on models like this.

28.2.26

Google Just Had a Huge Week — Nano Banana 2, Opal's Agent Step and a New Developer Ecosystem for ADK

There's been a lot happening at Google lately, and honestly, the updates from this past week alone are worth talking about. In just a few days, Google dropped a new default image model, upgraded their Opal workflow tool with real AI agent capabilities, and announced a brand-new integrations ecosystem for developers building AI agents. That's a lot to unpack — and I've been digging into all three.

I've been paying close attention to how Google is quietly (and sometimes not so quietly) building out their AI stack. These three updates feel connected in a bigger way. They're all pushing toward the same idea: faster, smarter, more autonomous AI tools that don't require you to be an engineer to use — or that supercharge you if you are one.

Nano Banana 2 — Pro-Quality Images at Flash Speed

One of the things that's always frustrated me about AI image generation is the trade-off. You either get fast and mediocre, or slow and beautiful. Nano Banana 2 — Google's new default image model built on Gemini 3.1 Flash Image — is trying to change that entirely.

What makes this interesting is the combination it's pulling off. It's generating at Flash speed while delivering what Google is calling Pro-level quality. To put that in practical terms: you're not waiting around for results, but you're also not getting blurry, inconsistent images. And the resolution range is wide — from 512px all the way up to 4K. That means the same model works whether you're making a quick thumbnail or something that needs to look polished and production-ready.

Here's the detail that really caught my attention: it can consistently handle up to 5 characters and 14 objects in a single image. If you've ever tried to generate a scene with multiple people or items and watched AI completely fall apart — characters blending together, objects disappearing — you know why this matters. Consistency across complex compositions has been a real weak spot in image generation, so this feels like a genuine step forward.

Nano Banana 2 is rolling out across the Gemini app and Google Workspace, and it's also available for enterprise use through Google Cloud. It's already the new default — which means if you're using Gemini for image generation, you're already on it.

Google Labs Opal Gets an Agent Step — Workflows Just Got Smarter

I've been curious about Opal since Google Labs introduced it as a workflow builder. The idea is that you set up a series of steps — like a recipe — and Opal runs through them to help you create something, whether that's a video, a piece of content, or a research brief. It's been useful, but the steps were static. You'd set it up once and it would just follow the same path every time.

The new agent step changes that completely. Instead of a fixed sequence, Opal now has a step where an actual AI agent takes over — it understands your goal, picks the right tools (like Veo for video generation, or web search for research), manages memory across the workflow, and routes dynamically based on what's needed. Think of it like the difference between following a printed map and having a navigation app that can reroute you in real time when things change.

This is part of a bigger shift we're seeing across the AI space, where "agentic" capabilities — AI that can reason, decide, and act rather than just respond — are becoming the new baseline. Google adding this to Opal means even people without a coding background can now build workflows that genuinely adapt. You don't have to anticipate every scenario upfront; the agent figures it out.

You can try Opal right now at opal.google — the agent step is available for all users.

Google ADK's New Integrations Ecosystem — Big News for Developers

This one is more developer-focused, but it's worth knowing about even if you're not building apps yourself. Google announced a new integrations ecosystem for their Agent Development Kit, or ADK — which is the framework developers use to build AI agents.

The idea is to make it easier for developers to connect their AI agents with external tools and services, so agents can actually do useful things in the real world instead of just talking about them. It's similar to how apps on your phone connect to different services — except here, we're talking about AI agents that can go off and complete tasks on your behalf.

What I find fascinating is how this fits into the broader picture. Between Nano Banana 2 (powerful image generation, accessible to everyone), Opal's agent step (autonomous workflows without code), and the ADK ecosystem (tools for developers to build custom agents), Google is building out every layer of the stack at once. There's something for the casual user, the content creator, and the professional developer — all in the same week.

21.2.26

AI Is Finally Fighting Back — And Anthropic Just Made It Official

I've been watching the cybersecurity space for a while now, and I have to be honest — it's one of those areas that used to feel completely out of reach for someone like me. No coding background, no deep technical knowledge of exploits or patches. Just a person who's curious about what AI can actually do in the real world.

But here's the thing I kept noticing over the years: the good guys were always playing catch-up.

Think back to how security worked — and honestly, how it still works for most teams. You'd have a tool scan your code, it would match against a list of known bad patterns, and spit out a report. The problem? The sneaky stuff, the subtle logic flaws, the vulnerabilities that had been hiding in open-source code for decades — those never showed up. Because rule-based tools can't reason. They can only recognize what they've already been told to look for.

Meanwhile, attackers got smarter. And faster.

That gap — between what automated tools could catch and what skilled human researchers could catch — was always the weak point. And there just aren't enough human security researchers to close it. That's not a criticism, that's just math. The attack surface keeps growing. The backlogs keep piling up.

This is why what Anthropic announced on February 20, 2026 actually stopped me mid-scroll.

Claude Code Security is now in limited research preview, and what it does is genuinely different from what I'd seen before. Instead of scanning for known patterns, Claude reads your code the way a human security researcher would — tracing how data moves, understanding how different parts of an application talk to each other, and catching the complex, context-dependent vulnerabilities that traditional tools walk right past.

What really got me is the verification layer. Claude doesn't just flag something and move on. It goes back and tries to disprove its own findings, filtering out false positives before anything reaches a developer. Every validated finding comes with a severity rating and a confidence score, so teams know what to prioritize. And nothing gets applied automatically — a human always has to approve the fix. I love that. It's AI as a sharp, tireless assistant, not a rogue decision-maker.
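Here is a rough sketch of what that triage flow could look like as data. The field names and severity labels are my own assumptions for illustration, not Anthropic's actual schema, and the "recheck" flag simply stands in for the step where Claude tries to disprove its own finding.

```python
from dataclasses import dataclass

# Assumed severity labels; lower number means more urgent.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}

@dataclass
class Finding:
    title: str
    severity: str               # one of the SEVERITY_ORDER keys
    confidence: float           # 0.0 to 1.0
    survived_recheck: bool      # False if the second pass disproved it
    approved_by_human: bool = False

def triage(findings):
    """Drop disproved findings, then order by severity and then confidence."""
    validated = [f for f in findings if f.survived_recheck]
    return sorted(validated, key=lambda f: (SEVERITY_ORDER[f.severity], -f.confidence))

def fixes_to_apply(findings):
    """Only findings a person has signed off on are eligible for a patch."""
    return [f for f in triage(findings) if f.approved_by_human]

# Hypothetical findings for illustration.
sample = [
    Finding("Possible SQL injection", "high", 0.9, True, approved_by_human=True),
    Finding("Suspicious string concat", "medium", 0.4, False),
    Finding("Auth bypass in session check", "critical", 0.7, True),
]
queue = triage(sample)
```

In this sketch the disproved finding never reaches the queue, the critical one lands at the top, and only the approved one would ever be patched.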

But here's the connection I keep thinking about: Anthropic's Frontier Red Team has been quietly building toward this for over a year. They entered Claude in cybersecurity competitions. They partnered with the Pacific Northwest National Laboratory to test AI on critical infrastructure defense. They used Claude to review their own internal code. This wasn't a product announcement that came from nowhere — it's the result of real, careful work testing what Claude could actually do before putting it in the hands of others.

And the results of that work? Using Claude Opus 4.6, their team found over 500 vulnerabilities in production open-source codebases. Bugs that had survived years of expert human review, undetected.

That's the part that really lands for me. These weren't theoretical vulnerabilities. They were sitting in real code, in real projects that real people depend on — sometimes for decades.

The reason I find this so meaningful isn't just the technology. It's the timing and the intent. Anthropic is releasing this in a limited preview specifically because the same capabilities that help defenders could help attackers. They're being deliberate about who gets access first — Enterprise and Team customers, plus open-source maintainers who can apply for free expedited access. They're working with the community to get this right before it scales.

That's a different posture than "ship it and see what happens."

We're at a point where AI is going to scan a significant share of the world's code — that's not speculation anymore, it's the direction things are clearly heading. The question has always been who benefits from that first. Attackers who use AI to find weaknesses faster? Or defenders who use it to find and patch those same weaknesses before they're exploited?

Claude Code Security is Anthropic's answer to that question.

20.2.26

Gemini 3.1 Pro Changed How I Actually Use AI — And It Has Nothing to Do With Benchmarks

I've been using AI tools almost every day for the past couple of years now, and if there's one thing I've learned, it's that a smarter model doesn't automatically mean you're getting smarter results. How you use the model matters just as much as what the model can do. And with Google's Gemini 3.1 Pro, there's something that I think is being really underrated in all the release coverage — the thinking level system, and what it actually means for the way you work.

This isn't a conversation about whether 77% on ARC-AGI beats 31%. It's about something more practical: the moment you realize you've been using these tools wrong, and how this new release hands you a dial you probably didn't know you needed.

What Even Is a "Thinking Level"?

Here's the quick version for anyone who hasn't come across this yet. Gemini 3.1 Pro lets you set how much thinking the model does before it responds. With the previous Gemini 3 Pro, you had two choices: low or high. With 3.1 Pro, there are now three — low, medium, and high.

Think of it like choosing between a quick gut reaction, a considered opinion, and a deep research session. The model isn't just choosing between "fast" and "slow" — it's choosing how much internal reasoning to do before it gives you an answer. At high, the model essentially behaves like a mini version of Gemini Deep Think, which is Google's most powerful reasoning model. That's a significant thing to have access to in a general-purpose assistant.

What surprised me when I started playing around with this is how much the choice actually matters. For a tricky math problem, setting thinking to high produced the right answer after several minutes of work. Setting it to low gave a fast but wrong answer. Same prompt, same model, completely different outcome.

The Problem Nobody Is Talking About

Here's what I find really fascinating about this. Most people who use AI tools have never really thought about what mode they should be in for a given task. We all tend to just fire off a prompt and expect the model to figure it out. But the thinking level system kind of forces you to be intentional, and that intentionality is where the real upgrade lives.

I started thinking about all the times I've used AI for tasks that fell into two completely different buckets. There's the stuff I need quickly — drafting a quick reply, summarizing a short article, brainstorming a list of ideas, generating a social post. And then there's the stuff where I actually want the model to sit with a problem — writing a full script outline, analyzing something complex, working through a nuanced question. Those two categories have always existed. What's new is that now I have a setting that actually reflects that difference.

Before 3.1 Pro, running everything at the same compute level was a bit like always driving in the same gear. Sometimes it worked. Sometimes it didn't. Now there's a gearshift.

How This Actually Changes My Workflow

When I started being intentional about thinking levels, a few things shifted for me pretty quickly.

For anything where I need a fast creative spark — like coming up with a hook for a video, finding synonyms, or doing a quick rewrite of a sentence — low thinking is more than enough. It's snappy, it's responsive, and frankly it's exactly what you want when you're in flow and don't want to wait. Speed matters there.

For medium tasks — things like drafting a structured outline, explaining a concept clearly, or building a content calendar — medium thinking has become my go-to. It takes a little longer, but the output feels more considered. Less surface-level. Like the model actually thought about the structure before it started writing.

And then there's high. I've started reserving high thinking for the things that actually deserve it. Complex analysis, tricky research questions, anything where getting the answer wrong would cost me time. The wait is longer — we're talking several minutes in some cases — but the quality of what comes back is on a different level. It's not just more text. It's more thoughtful text.
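If you like having this written down, the habit above fits in a small lookup table. This is only a sketch of my own routine: the task names are labels I made up, and the function just returns the word "low", "medium", or "high" for you to set wherever your tool exposes the thinking level. It does not call any Gemini API.

```python
# Hypothetical task labels mapped to the three thinking levels described above.
LEVEL_BY_TASK = {
    "video_hook": "low",
    "synonyms": "low",
    "sentence_rewrite": "low",
    "structured_outline": "medium",
    "concept_explainer": "medium",
    "content_calendar": "medium",
    "complex_analysis": "high",
    "research_question": "high",
}

def pick_thinking_level(task_type, default="medium"):
    """Return the level for a known task type, or a middle-of-the-road default."""
    return LEVEL_BY_TASK.get(task_type, default)

print(pick_thinking_level("video_hook"))
print(pick_thinking_level("complex_analysis"))
print(pick_thinking_level("something_new"))
```

Falling back to medium for unknown tasks is a judgment call: it avoids both the long wait of high and the shallow answers of low when you haven't decided yet.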

Why This Matters Even If You're Not Technical

I know a lot of people who use AI tools but feel like they're not getting as much out of them as they should. And honestly, after thinking about this thinking level system, I wonder if part of that frustration is just a mismatch between the task and the mode.

If you've ever asked an AI a complicated question and gotten a shallow answer, it might not be a model quality problem. It might be a compute budget problem. The model didn't spend enough time thinking. And now, with three levels instead of two, you have finer control over that.

That's actually a pretty big shift. Instead of just hoping the model figures out when to try harder, you get to tell it. It puts a little more responsibility on the user, sure. But it also puts a lot more power in your hands.

The Bigger Picture

What I keep coming back to is this: Gemini 3.1 Pro isn't just a smarter model. It's a model that respects the fact that not every question deserves the same amount of effort. And it's the first time I've felt like a general-purpose AI assistant is actually designed around how I naturally work — some things fast, some things slow, some things in between.

The AI tools that stick around aren't always the ones with the highest benchmark numbers. They're the ones that fit into how people actually think and work. This thinking level system feels like a step in that direction — and it's one I don't think enough people are paying attention to yet.

Grok 4.20: What Nobody Is Actually Talking About

There's been a lot of noise about Grok 4.20 this week. The four agents, the multi-agent setup, the "AI brains arguing with each other" framing. I get it — it's a fun story. But the more I dug into this release, the more I felt like everyone was covering the surface and missing the stuff underneath that's actually worth paying attention to.

So here's what caught my eye.

The model solved a real math problem nobody had solved before

I came across this one almost by accident and it stopped me completely.

Paata Ivanisvili, a math professor at UC Irvine, had been working on an open problem in harmonic analysis — trying to find the exact maximum of something called a Bellman function. He and his student had already published a paper on it. They had a partial answer. The problem was still technically unsolved.

He got early access to the Grok 4.20 beta, fed it the problem, and within five minutes it handed him an explicit formula — a cleaner, sharper result than what he and his student had managed. His reaction on X was basically just: "Wow."

What makes this different from the usual "AI is smart" story is that benchmarks test problems with known answers. This problem had no known answer. Grok wasn't pulling a solution from somewhere; it was reasoning toward something that hadn't been written down yet. That's a genuinely different category of thing, and it happened because the Benjamin agent (the math and logic one) is specifically built for this kind of step-by-step proof reasoning.

One result doesn't prove everything. But it happened, it's documented, and it made me think differently about what "AI for research" actually looks like in practice.

It was already winning competitions before anyone knew it existed

Here's something that got buried in the launch coverage: Grok 4.20 didn't launch when the public beta dropped on February 17. It had been running — quietly, under fake names — for weeks before that.

On LMArena it showed up as "Theta-hat" and "Slateflow." On DesignArena it appeared as "Pearl" and "Obsidian." And in Alpha Arena's live AI stock trading competition, a "Mystery Model" was quietly the only entry making money while every other AI lost. That Mystery Model was Grok 4.20.

xAI's last official model announcement on their news page is still Grok 4.1 from November 2025. There was no blog post for this launch. No announcement thread. Just posts from employees on X and a new option appearing quietly in the grok.com dropdown. By the time most people heard about it, it had already been competing for weeks.

The "four brains" framing is wrong — and the real version is more interesting

Everyone's been describing this as four separate AI models arguing with each other. That's not quite what's happening, and I think the actual explanation is more impressive.

Grok 4.20 is one model — estimated around 3 trillion parameters — with four different roles running from the same base weights. Grok the coordinator, Harper the real-time researcher, Benjamin the math and logic brain, Lucas the contrarian wildcard. They're not four separate models. They're four ways of prompting the same model, each with different tool access and context framing.

Why does this matter? Cost. Running four truly separate models would be four times as expensive. Because these agents share weights and the same input context, xAI puts the overhead at 1.5x to 2.5x — not 4x. That's what makes this available at a consumer price point rather than Grok Heavy's $300/month tier. The agents don't have long debates either. The rounds are short, structured, trained through reinforcement learning to be efficient. They send targeted corrections and move on.
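The cost argument is easy to check with quick arithmetic. In this sketch the baseline of 1.0 is an arbitrary unit for "one model answering one request"; the 4x, 1.5x, and 2.5x multipliers are the figures quoted above.

```python
BASELINE = 1.0             # arbitrary unit: one model, one request
SEPARATE_MODELS = 4.0      # four truly separate models
SHARED_LOW = 1.5           # quoted lower bound for the shared-weights setup
SHARED_HIGH = 2.5          # quoted upper bound for the shared-weights setup

def relative_cost(multiplier, baseline=BASELINE):
    """Cost of a setup expressed in baseline units."""
    return baseline * multiplier

def savings_vs_separate(multiplier):
    """Fraction saved compared with running four separate models."""
    return 1 - relative_cost(multiplier) / relative_cost(SEPARATE_MODELS)

worst_case_savings = savings_vs_separate(SHARED_HIGH)
best_case_savings = savings_vs_separate(SHARED_LOW)
print(f"{worst_case_savings:.1%} to {best_case_savings:.1%} cheaper")
```

So if the quoted overhead range holds, the shared-weights design comes out somewhere between 37.5% and 62.5% cheaper than four separate models.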

I found it interesting that nobody in the coverage this week mentioned this. "Four brains" is a better headline, sure, but one brain wearing four hats is actually a harder technical trick to pull off.

The trading competition result deserves more scrutiny than it got

Alpha Arena Season 1.5 was a live stock trading competition — real transactions, verified on the blockchain. GPT-5, Gemini, Claude, open-source models, Chinese models — everyone participated in multiple strategy variants over several weeks. When it ended, every single configuration of every other model finished in the red. Grok 4.20's four variants were the only ones in profit, with the best posting around 35% returns from a $10,000 start.

A few people mentioned this. Nobody really asked why.

My read: it wasn't that Grok out-thought everyone on finance. It's that Harper — the agent that pulls from X's real-time feed of roughly 68 million English tweets per day — gave it a live information edge the other models didn't have. The competition fed other models summarized news prompts. Grok was processing raw market sentiment as it happened.

That's worth being honest about, because it means the result says more about data access than raw intelligence. But here's the thing — that data access is the architecture. That's what the Harper agent was built to do. So the win was real, and it was structural, and it came from a deliberate design choice.

Grok 4.20 is the proof of concept. Grok 5 is the actual bet.

While everyone's been testing the 4.20 beta, xAI is already training its successor. Grok 5 is reportedly sitting around 6 trillion parameters — roughly double the size of 4.20 — with a target launch somewhere between April and June 2026.

For context, Grok 4.20 arrived in a crowded month. OpenAI shipped GPT-5.3-Codex on February 5. Anthropic released Claude Sonnet 5 on February 3. By that measure, this was a delayed point release catching up while competitors were already moving to the next generation. Even xAI treated it like a soft launch.

Grok 5 will not be a soft launch. xAI is spending roughly $1 billion a month. SpaceX formally acquired them weeks ago. An IPO is on the table. A 6-trillion-parameter release is exactly the kind of moment you build that story around.

What I keep thinking about is that the ecosystem is already being built before most people have noticed. Perplexity is reportedly running Grok 4.20 under the hood for a new search mode. There are signs of something called Grok Build — a coding tool with up to 8 parallel agents. The multi-agent architecture isn't a feature. It's the foundation, and xAI is building on top of it fast.

Grok 4.20 showed the approach works. The math problem, the trading competition, the benchmark trajectory — all of it points in the same direction. Grok 5 is where we find out if xAI can actually deliver on the scale of that promise.

I'll have a full video on this on YouTube soon — I want to actually get hands-on time with it first before I say anything definitive. But if you're following AI closely, this release is worth paying more attention to than the "four agents" headline suggests.

19.2.26

Claude Sonnet 4.6 Is Here — And There's Way More Going On Than the Benchmarks Show

There's been a lot of buzz this week about Claude Sonnet 4.6, Anthropic's latest model release. And honestly? Most of the coverage I've seen is stopping at the surface level — benchmarks, pricing, computer use scores. All of that is interesting, sure. But when I started looking closer at what actually shipped alongside this model, I found four things that nobody's really talking about that I think matter a lot more for people who are actually using Claude day to day.

That 1 Million Token Context Window Has a Catch

Okay, so everyone is excited about the 1 million token context window. And I get it — the number sounds incredible. A million tokens is enough to hold entire codebases, thousands of documents, months of conversations. But here's what I didn't realize until I looked into this more carefully: a big context window doesn't automatically mean a useful context window.

There's a real issue in AI called context rot. The basic idea is that as you keep filling up a model's memory during a conversation, its ability to actually use older information starts to degrade. It's a bit like trying to hold too many things in your head at once — the stuff from earlier starts slipping away. Anthropic's previous model, Sonnet 4.5, scored just 18.5% on a benchmark called MRCR v2, which specifically tests whether a model can find information buried deep in a long context. That's... not great.

Sonnet 4.6 is genuinely better at this. But there are two things I think people need to know before they get too excited. First, the 1 million token window is still in beta: it isn't available to everyone by default, and you need to explicitly turn it on. Second, and this one surprised me, the pricing changes dramatically once you cross 200,000 tokens. Below that threshold you pay the standard rate. Above it, a higher rate kicks in and applies to your entire request, not just the extra tokens. So if you're planning to throw huge amounts of text at this model, make sure you understand what that's going to cost before you build a workflow around it.
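The "entire request" detail is the part that bites, so here is a small sketch of the difference. The $3 standard rate is the Sonnet 4.6 input price quoted later in this post. The long-context rate of $6 is a placeholder I picked for illustration; check the current pricing page for the real figure before relying on it.

```python
THRESHOLD_TOKENS = 200_000

def input_cost_usd(input_tokens, standard_rate=3.0, long_context_rate=6.0):
    """Whole-request pricing: past the threshold, every token gets the higher rate.

    Rates are USD per million input tokens. long_context_rate is a placeholder.
    """
    rate = long_context_rate if input_tokens > THRESHOLD_TOKENS else standard_rate
    return input_tokens / 1_000_000 * rate

def marginal_cost_usd(input_tokens, standard_rate=3.0, long_context_rate=6.0):
    """What you might wrongly expect: only the tokens past the threshold cost more."""
    base = min(input_tokens, THRESHOLD_TOKENS)
    extra = max(input_tokens - THRESHOLD_TOKENS, 0)
    return base / 1_000_000 * standard_rate + extra / 1_000_000 * long_context_rate

at_threshold = input_cost_usd(200_000)
just_over = input_cost_usd(250_000)
naive_guess = marginal_cost_usd(250_000)
print(f"{at_threshold:.2f} {just_over:.2f} {naive_guess:.2f}")
```

With these placeholder numbers, going from 200,000 to 250,000 input tokens takes the request from about $0.60 to $1.50, not the $0.90 you would get if only the extra tokens were billed at the higher rate.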

Here's the Part That Actually Impressed Me

Now here's where it gets interesting, because Anthropic didn't just train a better model and call it a day. They shipped two features alongside Sonnet 4.6 that directly address the context rot problem — and almost nobody is talking about them.

The first is dynamic web search filtering. When Claude searches the web, it now writes and runs its own code to filter search results before loading them into context. Instead of pulling in a full HTML page and reasoning over all of it — ads, navigation menus, irrelevant paragraphs and everything — it strips out the noise first and only keeps what's relevant. Anthropic tested this across two benchmarks and found it improved accuracy by an average of 11% while actually cutting input tokens by 24%. That's a really meaningful result. Quora's Poe platform tested it against all the major frontier models and said Opus 4.6 with this feature "achieved the highest accuracy on our internal evals" — specifically because it approaches research the way a human researcher would, filtering information programmatically instead of reasoning over raw noise.
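To show the general idea of filtering before loading, here is a minimal sketch using only Python's standard library. It is not Anthropic's implementation: the list of "noise" tags, the keyword match, and the sample page are all my own assumptions for illustration.

```python
from html.parser import HTMLParser

# Assumed list of tags whose contents are usually noise for a research question.
NOISE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class TextExtractor(HTMLParser):
    """Collects visible text, skipping anything inside a noise tag."""

    def __init__(self):
        super().__init__()
        self.noise_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.noise_depth > 0:
            self.noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.noise_depth == 0:
            self.chunks.append(text)

def filter_page(html, keywords):
    """Keep only visible text chunks that mention at least one keyword."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    wanted = [k.lower() for k in keywords]
    return [c for c in parser.chunks if any(k in c.lower() for k in wanted)]

# Hypothetical page for illustration.
PAGE = """
<html><body>
<nav>Home | About | Subscribe</nav>
<p>Sonnet 4.6 input pricing is $3 per million tokens.</p>
<p>Our team loves hiking on weekends.</p>
<script>trackPricingClicks();</script>
<footer>Pricing subject to change. Contact sales.</footer>
</body></html>
"""

kept = filter_page(PAGE, ["pricing"])
print(kept)
```

Only the one relevant paragraph survives. The navigation, the script, the footer, and the off-topic paragraph never reach the model, which is where the token savings would come from.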

The second is programmatic tool calling, which just hit general availability. This one is more for developers, but the idea is fascinating. When Claude needs to call multiple tools — say, querying a database five times to compare regions, then sorting and summarizing the results — it can now write code that does all of that inside a sandboxed container without bringing each intermediate result back into the conversation context. The result only shows up once everything is done. Think of it like doing all your rough math on a scratch pad before writing the final answer on the paper — the scratch work never clutters the main context.
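Here is a toy version of the scratch-pad idea. Everything in it is hypothetical: the fake database, the `query_sales` tool, and the region names are stand-ins I invented, and the sketch does not use the real Anthropic API. It only shows the shape of the pattern, where several tool calls run inside one piece of code and a single compact summary comes back.

```python
# Hypothetical stand-in for a database tool; a real one would hit a live system.
FAKE_DB = {
    "north": [120, 95, 130],
    "south": [80, 110, 70],
    "east": [200, 150, 175],
    "west": [60, 65, 90],
    "central": [140, 100, 160],
}

def query_sales(region):
    """Pretend tool call: returns the raw rows for one region."""
    return FAKE_DB[region]

def compare_regions(regions):
    """Run every tool call here and return only the compact summary."""
    totals = {region: sum(query_sales(region)) for region in regions}
    ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
    return {"top_region": ranked[0][0], "totals_ranked": ranked}

summary = compare_regions(["north", "south", "east", "west", "central"])
print(summary)
```

The five sets of raw rows stay inside `compare_regions`. The only thing that would travel back into the conversation is the short summary dictionary.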

Together, these two features tell a really clear story: Anthropic's answer to context rot isn't just a bigger bucket. It's smarter filtering so less noise goes in.

Should You Actually Cancel Opus?

This is the question I keep seeing pop up and I wanted to think through it properly. Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens. Opus 4.6 costs roughly 1.7x as much on both input and output. For most individual users that difference feels abstract, but if you're building on the API and processing a lot of requests, it compounds fast.
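To see how it compounds, here is a quick back-of-the-envelope sketch. The Sonnet rates and the 1.7x multiplier are the figures above; the workload (100,000 requests a month at 2,000 input and 500 output tokens each) is a number I made up for illustration.

```python
SONNET_INPUT_RATE = 3.0     # USD per million input tokens
SONNET_OUTPUT_RATE = 15.0   # USD per million output tokens
OPUS_MULTIPLIER = 1.7       # rough multiplier quoted above

def request_cost(input_tokens, output_tokens, multiplier=1.0):
    """Cost of one request in USD, scaled by a price multiplier."""
    cost = (input_tokens / 1_000_000 * SONNET_INPUT_RATE
            + output_tokens / 1_000_000 * SONNET_OUTPUT_RATE)
    return cost * multiplier

# Hypothetical monthly workload.
REQUESTS_PER_MONTH = 100_000
sonnet_monthly = REQUESTS_PER_MONTH * request_cost(2_000, 500)
opus_monthly = REQUESTS_PER_MONTH * request_cost(2_000, 500, OPUS_MULTIPLIER)
print(f"Sonnet: ${sonnet_monthly:,.0f}  Opus: ${opus_monthly:,.0f}")
```

For this made-up workload that works out to about $1,350 a month on Sonnet versus about $2,295 on Opus, a gap of nearly $1,000 that grows in step with volume.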

What made me look at this differently is where Sonnet 4.6 actually beats Opus. On agentic financial analysis — think tasks like researching stock data, calculating ratios, pulling together a market summary — Sonnet 4.6 scored 63.3% versus Opus 4.6's 60.1%. On office tasks, Sonnet leads again. On computer use benchmarks, the gap is almost nothing: 72.5% for Sonnet versus 72.7% for Opus. For everyday knowledge work, Sonnet 4.6 is genuinely at the same level for a meaningfully lower price.

The places where Opus still wins are specific: novel problem-solving, deep codebase work, complex situations where getting it exactly right is more important than getting it fast and cheap. Anthropic themselves describe Opus as the right choice for "codebase refactoring, coordinating multiple agents in a workflow, and problems where getting it just right is paramount." That's a real distinction — but it's a narrower one than it used to be.

And here's one more thing that matters for API developers. With programmatic tool calling, the intermediate tool results that Claude's code handles inside the container don't count toward your token usage; only the final output does. So if you have workflows that currently make eight or ten tool calls in sequence, each one loading results back into context, you may be spending significantly more than you need to. That changes the cost math even further in Sonnet's favor for the right use cases.

The Security Problem Nobody Mentioned

I want to talk about prompt injection because every video I watched mentioned it in one sentence and moved on. I think it deserves more attention than that — especially now that Claude has computer use and can take real actions on your behalf.

Here's what prompt injection actually means in practice. Imagine you ask Claude to read through your emails and schedule any meeting requests it finds. One of those emails was crafted by someone who knew an AI would read it. Inside that email, hidden in the text, are instructions that tell Claude to forward any email with the word "confidential" to an external address before it drafts your replies. You'd never see it happen. That's the attack. Anthropic has described exactly this scenario in their own research on browser-agent security, and it's not hypothetical — it's an active concern for anyone building or using agentic AI.

Anthropic says Sonnet 4.6 is significantly better at detecting and resisting these attacks compared to its predecessor, and their evaluations show it performs similarly to Opus 4.6 in this area. That's meaningful progress. But independent testing on earlier Claude models found that while the model actively resists simple injection attempts, it can still be confused when the malicious instructions are buried inside what looks like a legitimate document or data structure.

What I didn't see anyone mention is a second security issue that comes with programmatic tool calling. When Claude calls your tools programmatically and gets results back, those results come back as raw strings — and those strings can contain anything, including code snippets that might get processed by the execution environment. If your tools are pulling data from external sources or user inputs, there's a real code injection risk if you're not validating what comes back before acting on it. This is separate from prompt injection — it's a layer deeper, and it's something every developer building agentic workflows needs to think about before shipping.
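Here is a minimal sketch of what "validating what comes back" can look like. The ticket format, field names, and allowed statuses are hypothetical; the pattern is what matters: parse the raw string as data, check every field you plan to use, and never hand the string to anything that executes it.

```python
import json

# Hypothetical set of values this workflow is prepared to handle.
ALLOWED_STATUSES = {"open", "closed", "pending"}

def parse_ticket(raw):
    """Turn a raw tool result into a checked dict, or raise ValueError."""
    try:
        data = json.loads(raw)  # parses data only; never eval() or exec() a tool result
    except json.JSONDecodeError as exc:
        raise ValueError("tool result is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    ticket_id = data.get("id")
    status = data.get("status")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(ticket_id, int) or isinstance(ticket_id, bool):
        raise ValueError("id must be an integer")
    if not isinstance(status, str) or status not in ALLOWED_STATUSES:
        raise ValueError("unexpected status value")
    # Return only the checked fields; anything else in the payload is dropped.
    return {"id": ticket_id, "status": status}

clean = parse_ticket('{"id": 7, "status": "open", "extra": "rm -rf /"}')
print(clean)
```

Returning a fresh dictionary with only the validated fields is a deliberate choice in this sketch: unexpected extras in the tool result never travel any further into the workflow.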

The honest summary is this: Sonnet 4.6 is more secure than what came before. But "more secure" and "fully solved" are very different things. The more autonomous you make your agents, the more carefully you need to think about what they can be tricked into doing.

What This All Means

I find the Sonnet 4.6 release genuinely exciting — not just because of the model improvements, but because of what the surrounding features tell us about where Anthropic is heading. They're building a system where Claude reasons over less noise, not more. Dynamic filtering, programmatic tool calling, context compaction, memory tools — these are all solving the same underlying problem. And the fact that accuracy went up while token usage went down on web search benchmarks suggests this approach is actually working.

If you're a knowledge worker using Claude through the app, the takeaway is pretty straightforward: this model is fast, capable, and meaningfully better at the kind of office work most people actually do. If you're a developer building on the API, there are real architectural decisions to make now — about when to use Sonnet versus Opus, how to take advantage of programmatic tool calling, and how to think about security in agentic workflows.

What I'm watching next is how the 1 million token context window performs in real production use cases over the next few weeks. The beta label is still on it for a reason. But the direction is clear — and it's more interesting than most of the coverage I've seen this week.

Sources: Anthropic Claude Sonnet 4.6 announcement · Dynamic Web Search Filtering · Programmatic Tool Calling Docs