20.2.26

Gemini 3.1 Pro Changed How I Actually Use AI — And It Has Nothing to Do With Benchmarks

I've been using AI tools almost every day for the past couple of years now, and if there's one thing I've learned, it's that a smarter model doesn't automatically mean you're getting smarter results. How you use the model matters just as much as what the model can do. And with Google's Gemini 3.1 Pro, there's something that I think is being really underrated in all the release coverage — the thinking level system, and what it actually means for the way you work.

This isn't a conversation about whether 77% on ARC-AGI beats 31%. It's about something more practical: the moment you realize you've been using these tools wrong, and how this new release hands you a dial you probably didn't know you needed.

What Even Is a "Thinking Level"?

Here's the quick version for anyone who hasn't come across this yet. Gemini 3.1 Pro lets you set how much thinking the model does before it responds. With the previous Gemini 3 Pro, you had two choices: low or high. With 3.1 Pro, there are now three — low, medium, and high.

Think of it like choosing between a quick gut reaction, a considered opinion, and a deep research session. The model isn't just choosing between "fast" and "slow" — it's choosing how much internal reasoning to do before it gives you an answer. At high, the model essentially behaves like a mini version of Gemini Deep Think, which is Google's most powerful reasoning model. That's a significant thing to have access to in a general-purpose assistant.

What surprised me when I started playing around with this is how much the choice actually matters. For a tricky math problem, setting thinking to high produced the right answer after several minutes of work. Setting it to low gave a fast but wrong answer. Same prompt, same model, completely different outcome.
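
If you're curious what that looks like in practice, here's a rough sketch using the google-genai Python SDK. I'm assuming 3.1 Pro keeps the same thinking_level field that Gemini 3 exposes, and the model id string below is a placeholder rather than a confirmed identifier:

```python
# Sketch: same prompt at two thinking levels. Assumes the google-genai SDK and
# that Gemini 3.1 Pro reuses Gemini 3's `thinking_level` field; the model id
# "gemini-3.1-pro" is a placeholder, not a confirmed identifier.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

def ask(prompt: str, level: str) -> str:
    response = client.models.generate_content(
        model="gemini-3.1-pro",  # placeholder id
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=level),
        ),
    )
    return response.text

print(ask("What is the 1000th prime?", level="low"))   # fast, may be wrong
print(ask("What is the 1000th prime?", level="high"))  # slower, more reliable
```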

The Problem Nobody Is Talking About

Here's what I find really fascinating about this. Most people who use AI tools have never really thought about what mode they should be in for a given task. We all tend to just fire off a prompt and expect the model to figure it out. But the thinking level system kind of forces you to be intentional, and that intentionality is where the real upgrade lives.

I started thinking about all the times I've used AI for tasks that fell into two completely different buckets. There's the stuff I need quickly — drafting a quick reply, summarizing a short article, brainstorming a list of ideas, generating a social post. And then there's the stuff where I actually want the model to sit with a problem — writing a full script outline, analyzing something complex, working through a nuanced question. Those two categories have always existed. What's new is that now I have a setting that actually reflects that difference.

Before 3.1 Pro, running everything at the same compute level was a bit like always driving in the same gear. Sometimes it worked. Sometimes it didn't. Now there's a gearshift.

How This Actually Changes My Workflow

When I started being intentional about thinking levels, a few things shifted for me pretty quickly.

For anything where I need a fast creative spark — like coming up with a hook for a video, finding synonyms, or doing a quick rewrite of a sentence — low thinking is more than enough. It's snappy, it's responsive, and frankly it's exactly what you want when you're in flow and don't want to wait. Speed matters there.

For medium tasks — things like drafting a structured outline, explaining a concept clearly, or building a content calendar — medium thinking has become my go-to. It takes a little longer, but the output feels more considered. Less surface-level. Like the model actually thought about the structure before it started writing.

And then there's high. I've started reserving high thinking for the things that actually deserve it. Complex analysis, tricky research questions, anything where getting the answer wrong would cost me time. The wait is longer — we're talking several minutes in some cases — but the quality of what comes back is on a different level. It's not just more text. It's more thoughtful text.
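
If you want to make that habit stick, it can be as simple as a lookup table. Here's a tiny sketch of how I'd encode the buckets above; the categories and defaults are my own, not an official recommendation:

```python
# Sketch: route a task to a thinking level based on the buckets described above.
# The task categories and their defaults are illustrative choices.
TASK_LEVELS = {
    "hook_ideas": "low",          # fast creative sparks
    "quick_rewrite": "low",
    "outline": "medium",          # structured drafting
    "concept_explainer": "medium",
    "complex_analysis": "high",   # worth the multi-minute wait
    "research_question": "high",
}

def thinking_level_for(task_type: str) -> str:
    # Fall back to medium when a task doesn't fit a known bucket.
    return TASK_LEVELS.get(task_type, "medium")

print(thinking_level_for("outline"))  # -> "medium"
```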

Why This Matters Even If You're Not Technical

I know a lot of people who use AI tools but feel like they're not getting as much out of them as they should. And honestly, after thinking about this thinking level system, I wonder if part of that frustration is just a mismatch between the task and the mode.

If you've ever asked an AI a complicated question and gotten a shallow answer, it might not be a model quality problem. It might be a compute budget problem. The model didn't spend enough time thinking. And now, for the first time, you have direct control over that.

That's actually a pretty big shift. Instead of just hoping the model figures out when to try harder, you get to tell it. It puts a little more responsibility on the user, sure. But it also puts a lot more power in your hands.

The Bigger Picture

What I keep coming back to is this: Gemini 3.1 Pro isn't just a smarter model. It's a model that respects the fact that not every question deserves the same amount of effort. And it's the first time I've felt like a general-purpose AI assistant is actually designed around how I naturally work — some things fast, some things slow, some things in between.

The AI tools that stick around aren't always the ones with the highest benchmark numbers. They're the ones that fit into how people actually think and work. This thinking level system feels like a step in that direction — and it's one I don't think enough people are paying attention to yet.

Grok 4.20: What Nobody Is Actually Talking About

There's been a lot of noise about Grok 4.20 this week. The four agents, the multi-agent setup, the "AI brains arguing with each other" framing. I get it — it's a fun story. But the more I dug into this release, the more I felt like everyone was covering the surface and missing the stuff underneath that's actually worth paying attention to.

So here's what caught my eye.


The model solved a real math problem nobody had solved before

I came across this one almost by accident and it stopped me completely.

Paata Ivanisvili, a math professor at UC Irvine, had been working on an open problem in harmonic analysis — trying to find the exact maximum of something called a Bellman function. He and his student had already published a paper on it. They had a partial answer. The problem was still technically unsolved.

He got early access to the Grok 4.20 beta, fed it the problem, and within five minutes it handed him an explicit formula — a cleaner, sharper result than what he and his student had managed. His reaction on X was basically just: "Wow."

What makes this different from the usual "AI is smart" story is that benchmarks test problems with known answers. This didn't. Grok wasn't pulling a solution from somewhere — it was reasoning toward something that hadn't been written down yet. That's a genuinely different category of thing, and it happened because the Benjamin agent (the math and logic one) is specifically built for this kind of step-by-step proof reasoning.

One result doesn't prove everything. But it happened, it's documented, and it made me think differently about what "AI for research" actually looks like in practice.


It was already winning competitions before anyone knew it existed

Here's something that got buried in the launch coverage: Grok 4.20 didn't launch when the public beta dropped on February 17. It had been running — quietly, under fake names — for weeks before that.

On LMArena it showed up as "Theta-hat" and "Slateflow." On DesignArena it appeared as "Pearl" and "Obsidian." And in Alpha Arena's live AI stock trading competition, a "Mystery Model" was quietly the only entry making money while every other AI lost. That Mystery Model was Grok 4.20.

xAI's last official model announcement on their news page is still Grok 4.1 from November 2025. There was no blog post for this launch. No announcement thread. Just posts from employees on X and a new option appearing quietly in the grok.com dropdown. By the time most people heard about it, it had already been competing for weeks.


The "four brains" framing is wrong — and the real version is more interesting

Everyone's been describing this as four separate AI models arguing with each other. That's not quite what's happening, and I think the actual explanation is more impressive.

Grok 4.20 is one model — estimated around 3 trillion parameters — with four different roles running from the same base weights. Grok the coordinator, Harper the real-time researcher, Benjamin the math and logic brain, Lucas the contrarian wildcard. They're not four separate models. They're four ways of prompting the same model, each with different tool access and context framing.

Why does this matter? Cost. Running four truly separate models would be four times as expensive. Because these agents share weights and the same input context, xAI puts the overhead at 1.5x to 2.5x — not 4x. That's what makes this available at a consumer price point rather than Grok Heavy's $300/month tier. The agents don't have long debates either. The rounds are short, structured, trained through reinforcement learning to be efficient. They send targeted corrections and move on.
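
To make the "one brain, four hats" idea concrete, here's a rough conceptual sketch of the pattern as I understand it from the description above: one base model, four role configurations that differ only in framing and tool access. The role prompts and the call_model helper are hypothetical, not xAI's actual setup:

```python
# Conceptual sketch of "one model wearing four hats": each role reuses the same
# base model and differs only in system framing and tool access. The prompts
# and the call_model() helper are placeholders, not xAI's implementation.
from dataclasses import dataclass, field

@dataclass
class Role:
    name: str
    system_prompt: str
    tools: list[str] = field(default_factory=list)

ROLES = [
    Role("Grok", "Coordinate the other roles and merge their answers."),
    Role("Harper", "Research in real time.", tools=["x_search", "web_search"]),
    Role("Benjamin", "Work step by step through math and logic.", tools=["python"]),
    Role("Lucas", "Challenge the emerging consensus and look for flaws."),
]

def run_round(question: str, call_model) -> dict[str, str]:
    # Same weights, same question; only the framing and tools change per role.
    return {
        role.name: call_model(system=role.system_prompt, tools=role.tools, user=question)
        for role in ROLES
    }
```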

I found it interesting that nobody in the coverage this week mentioned this. "Four brains" is a better headline, sure, but one brain wearing four hats is actually a harder technical trick to pull off.


The trading competition result deserves more scrutiny than it got

Alpha Arena Season 1.5 was a live stock trading competition — real transactions, verified on the blockchain. GPT-5, Gemini, Claude, open-source models, Chinese models — everyone participated in multiple strategy variants over several weeks. When it ended, every single configuration of every other model finished in the red. Grok 4.20's four variants were the only ones in profit, with the best posting around 35% returns from a $10,000 start.

A few people mentioned this. Nobody really asked why.

My read: it wasn't that Grok out-thought everyone on finance. It's that Harper — the agent that pulls from X's real-time feed of roughly 68 million English tweets per day — gave it a live information edge the other models didn't have. The competition fed other models summarized news prompts. Grok was processing raw market sentiment as it happened.

That's worth being honest about, because it means the result says more about data access than raw intelligence. But here's the thing — that data access is the architecture. That's what the Harper agent was built to do. So the win was real, and it was structural, and it came from a deliberate design choice.


Grok 4.20 is the proof of concept. Grok 5 is the actual bet.

While everyone's been testing the 4.20 beta, xAI is already training its successor. Grok 5 is reportedly sitting around 6 trillion parameters — roughly double the size of 4.20 — with a target launch somewhere between April and June 2026.

For context, Grok 4.20 arrived in a crowded week. OpenAI shipped GPT-5.3-Codex on February 5. Anthropic released Claude Sonnet 5 on February 3. By that measure, this was a delayed point release catching up while competitors were already moving to the next generation. Even xAI treated it like a soft launch.

Grok 5 will not be a soft launch. xAI is spending roughly $1 billion a month. SpaceX formally acquired them weeks ago. An IPO is on the table. A 6-trillion-parameter release is exactly the kind of moment you build that story around.

What I keep thinking about is that the ecosystem is already being built before most people have noticed. Perplexity is reportedly running Grok 4.20 under the hood for a new search mode. There are signs of something called Grok Build — a coding tool with up to 8 parallel agents. The multi-agent architecture isn't a feature. It's the foundation, and xAI is building on top of it fast.

Grok 4.20 showed the approach works. The math problem, the trading competition, the benchmark trajectory — all of it points in the same direction. Grok 5 is where we find out if xAI can actually deliver on the scale of that promise.

I'll have a full video on this on YouTube soon — I want to actually get hands-on time with it first before I say anything definitive. But if you're following AI closely, this release is worth paying more attention to than the "four agents" headline suggests.


19.2.26

Claude Sonnet 4.6 Is Here — And There's Way More Going On Than the Benchmarks Show

There's been a lot of buzz this week about Claude Sonnet 4.6, Anthropic's latest model release. And honestly? Most of the coverage I've seen is stopping at the surface level — benchmarks, pricing, computer use scores. All of that is interesting, sure. But when I started looking closer at what actually shipped alongside this model, I found four things that nobody's really talking about that I think matter a lot more for people who are actually using Claude day to day.

That 1 Million Token Context Window Has a Catch

Okay, so everyone is excited about the 1 million token context window. And I get it — the number sounds incredible. A million tokens is enough to hold entire codebases, thousands of documents, months of conversations. But here's what I didn't realize until I looked into this more carefully: a big context window doesn't automatically mean a useful context window.

There's a real issue in AI called context rot. The basic idea is that as you keep filling up a model's memory during a conversation, its ability to actually use older information starts to degrade. It's a bit like trying to hold too many things in your head at once — the stuff from earlier starts slipping away. Anthropic's previous model, Sonnet 4.5, scored just 18.5% on a benchmark called MRCR v2, which specifically tests whether a model can find information buried deep in a long context. That's... not great.

Sonnet 4.6 is genuinely better at this. But there are two things I think people need to know before they get too excited. First, the 1 million token window is still in beta: it isn't switched on for everyone by default, and you need to explicitly enable it. Second, and this one surprised me, the pricing changes dramatically once you cross 200,000 tokens. Below that threshold you pay the standard rate. Above it, the pricing structure shifts significantly and applies to your entire request, not just the extra tokens. So if you're planning to throw huge amounts of text at this model, make sure you understand what that's going to cost before you build a workflow around it.
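
One practical habit if you do go down the long-context road: check how big a request actually is before you send it. Here's a minimal sketch using Anthropic's token-counting endpoint; the model id is a placeholder, and I haven't hard-coded the premium rates because you should confirm those yourself:

```python
# Sketch: pre-flight check for the 200K-token long-context threshold using
# Anthropic's token-counting endpoint. The model id is a placeholder.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
LONG_CONTEXT_THRESHOLD = 200_000

def preflight(messages: list[dict], model: str = "claude-sonnet-4-6") -> int:
    count = client.messages.count_tokens(model=model, messages=messages)
    if count.input_tokens > LONG_CONTEXT_THRESHOLD:
        # Above 200K the premium rate applies to the whole request,
        # not just the overflow, so this is worth catching up front.
        print(f"Warning: {count.input_tokens:,} input tokens triggers long-context pricing.")
    return count.input_tokens

tokens = preflight([{"role": "user", "content": "Summarize this codebase: ..."}])
```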


Here's the Part That Actually Impressed Me

Now here's where it gets interesting, because Anthropic didn't just train a better model and call it a day. They shipped two features alongside Sonnet 4.6 that directly address the context rot problem — and almost nobody is talking about them.

The first is dynamic web search filtering. When Claude searches the web, it now writes and runs its own code to filter search results before loading them into context. Instead of pulling in a full HTML page and reasoning over all of it — ads, navigation menus, irrelevant paragraphs and everything — it strips out the noise first and only keeps what's relevant. Anthropic tested this across two benchmarks and found it improved accuracy by an average of 11% while actually cutting input tokens by 24%. That's a really meaningful result. Quora's Poe platform tested it against all the major frontier models and said Opus 4.6 with this feature "achieved the highest accuracy on our internal evals" — specifically because it approaches research the way a human researcher would, filtering information programmatically instead of reasoning over raw noise.

The second is programmatic tool calling, which just hit general availability. This one is more for developers, but the idea is fascinating. When Claude needs to call multiple tools — say, querying a database five times to compare regions, then sorting and summarizing the results — it can now write code that does all of that inside a sandboxed container without bringing each intermediate result back into the conversation context. The result only shows up once everything is done. Think of it like doing all your rough math on a scratch pad before writing the final answer on the paper — the scratch work never clutters the main context.
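
To make the scratch-pad idea concrete, here's the kind of code the model might write and run inside that sandbox. The query_sales tool, the regions, and the result shapes are all hypothetical, and this isn't Anthropic's API surface; the point is just that only the final summary string flows back into the conversation:

```python
# Conceptual sketch of the scratch-pad pattern: everything below runs inside the
# sandboxed container, and only `summary` re-enters the conversation context.
# query_sales() stands in for whatever tool the agent has been given; it, the
# regions, and the result shapes are hypothetical.
def compare_regions(query_sales) -> str:
    regions = ["NA", "EMEA", "APAC", "LATAM", "ANZ"]

    # Five tool calls whose raw results never get loaded into the model's context.
    results = {region: query_sales(region=region) for region in regions}

    # Intermediate work stays on the scratch pad too.
    ranked = sorted(results.items(), key=lambda kv: kv[1]["revenue"], reverse=True)

    # Only this final string is surfaced back to the conversation.
    summary = ", ".join(f"{name}: {data['revenue']:,}" for name, data in ranked)
    return f"Revenue by region (high to low): {summary}"
```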

Together, these two features tell a really clear story: Anthropic's answer to context rot isn't just a bigger bucket. It's smarter filtering so less noise goes in.


Should You Actually Cancel Opus?

This is the question I keep seeing pop up and I wanted to think through it properly. Sonnet 4.6 costs $3 per million input tokens and $15 per million output tokens. Opus 4.6 is roughly 1.7x more expensive on input and output. For most individual users that difference feels abstract — but if you're building on the API and processing a lot of requests, it compounds fast.

What made me look at this differently is where Sonnet 4.6 actually beats Opus. On agentic financial analysis — think tasks like researching stock data, calculating ratios, pulling together a market summary — Sonnet 4.6 scored 63.3% versus Opus 4.6's 60.1%. On office tasks, Sonnet leads again. On computer use benchmarks, the gap is almost nothing: 72.5% for Sonnet versus 72.7% for Opus. For everyday knowledge work, Sonnet 4.6 is genuinely at the same level for a meaningfully lower price.

The places where Opus still wins are specific: novel problem-solving, deep codebase work, complex situations where getting it exactly right is more important than getting it fast and cheap. Anthropic themselves describe Opus as the right choice for "codebase refactoring, coordinating multiple agents in a workflow, and problems where getting it just right is paramount." That's a real distinction — but it's a narrower one than it used to be.

And here's one more thing that matters for API developers. Because of programmatic tool calling, tool results from multi-step workflows don't count toward your token usage at all — only the final output does. So if you have workflows that currently make eight or ten tool calls in sequence, each one loading results back into context, you may be spending significantly more than you need to. That changes the cost math even further in Sonnet's favor for the right use cases.
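
If you want to put rough numbers on that, here's a quick back-of-the-envelope using the rates quoted in this post: $3/$15 per million tokens for Sonnet 4.6 and the standard $5/$25 for Opus. The example workload is made up:

```python
# Quick cost math using the per-million-token rates quoted in this post:
# Sonnet 4.6 at $3 in / $15 out, standard Opus at $5 in / $25 out.
# The example workload below is made up for illustration.
RATES = {
    "sonnet-4.6": {"in": 3.00, "out": 15.00},
    "opus-4.6": {"in": 5.00, "out": 25.00},
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    r = RATES[model]
    return (input_tokens / 1e6) * r["in"] + (output_tokens / 1e6) * r["out"]

# A hypothetical agent doing 50M input / 5M output tokens a month:
for model in RATES:
    print(model, f"${monthly_cost(model, 50_000_000, 5_000_000):,.2f}")
# sonnet-4.6: $225.00 vs opus-4.6: $375.00 -> the gap compounds with volume.
```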


The Security Problem Nobody Mentioned

I want to talk about prompt injection because every video I watched mentioned it in one sentence and moved on. I think it deserves more attention than that — especially now that Claude has computer use and can take real actions on your behalf.

Here's what prompt injection actually means in practice. Imagine you ask Claude to read through your emails and schedule any meeting requests it finds. One of those emails was crafted by someone who knew an AI would read it. Inside that email, hidden in the text, are instructions that tell Claude to forward any email with the word "confidential" to an external address before it drafts your replies. You'd never see it happen. That's the attack. Anthropic has described exactly this scenario in their own research on browser-agent security, and it's not hypothetical — it's an active concern for anyone building or using agentic AI.

Anthropic says Sonnet 4.6 is significantly better at detecting and resisting these attacks compared to its predecessor, and their evaluations show it performs similarly to Opus 4.6 in this area. That's meaningful progress. But independent testing on earlier Claude models found that while the model actively resists simple injection attempts, it can still be confused when the malicious instructions are buried inside what looks like a legitimate document or data structure.

What I didn't see anyone mention is a second security issue that comes with programmatic tool calling. When Claude calls your tools programmatically and gets results back, those results come back as raw strings — and those strings can contain anything, including code snippets that might get processed by the execution environment. If your tools are pulling data from external sources or user inputs, there's a real code injection risk if you're not validating what comes back before acting on it. This is separate from prompt injection — it's a layer deeper, and it's something every developer building agentic workflows needs to think about before shipping.
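
A minimal defensive step is to validate tool results before anything acts on them. Here's a sketch of what that might look like; the patterns it flags are illustrative only, and real validation should lean on strict schemas and allow-lists rather than pattern matching:

```python
# Minimal sketch of validating tool results before acting on them. The flagged
# patterns are illustrative; production validation should prefer strict schemas
# and allow-lists over pattern matching.
import json
import re

SUSPICIOUS = [
    re.compile(r"ignore (all |any )?previous instructions", re.I),
    re.compile(r"<script\b", re.I),
    re.compile(r"\beval\(|\bexec\(", re.I),
]

def validate_tool_result(raw: str) -> dict:
    for pattern in SUSPICIOUS:
        if pattern.search(raw):
            raise ValueError(f"Tool result rejected: matched {pattern.pattern!r}")
    # Require structured output instead of free text wherever possible.
    try:
        return json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError("Tool result was not valid JSON") from err
```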

The honest summary is this: Sonnet 4.6 is more secure than what came before. But "more secure" and "fully solved" are very different things. The more autonomous you make your agents, the more carefully you need to think about what they can be tricked into doing.


What This All Means

I find the Sonnet 4.6 release genuinely exciting — not just because of the model improvements, but because of what the surrounding features tell us about where Anthropic is heading. They're building a system where Claude reasons over less noise, not more. Dynamic filtering, programmatic tool calling, context compaction, memory tools — these are all solving the same underlying problem. And the fact that accuracy went up while token usage went down on web search benchmarks suggests this approach is actually working.

If you're a knowledge worker using Claude through the app, the takeaway is pretty straightforward: this model is fast, capable, and meaningfully better at the kind of office work most people actually do. If you're a developer building on the API, there are real architectural decisions to make now — about when to use Sonnet versus Opus, how to take advantage of programmatic tool calling, and how to think about security in agentic workflows.

What I'm watching next is how the 1 million token context window performs in real production use cases over the next few weeks. The beta label is still on it for a reason. But the direction is clear — and it's more interesting than most of the coverage I've seen this week.


Sources: Anthropic Claude Sonnet 4.6 announcement · Dynamic Web Search Filtering · Programmatic Tool Calling Docs

13.2.26

Google's Gemini 3 Deep Think Just Got a Major Upgrade — And It's Designed for Real Science

There's been this interesting trend in AI lately where models are getting better at reasoning through complex problems. We've seen it with OpenAI's o1 and o3, DeepSeek's R1, and now Google is making a serious push into this space with a major update to Gemini 3 Deep Think.

What makes this update different from the reasoning models we've seen before is that Google specifically built it for scientists, researchers, and engineers working on real-world problems. This isn't just about solving math competitions anymore—though it still does that incredibly well. It's about tackling messy, incomplete data and problems without clear solutions, which is what actual research looks like.

I've been following the development of reasoning models closely, and Deep Think's focus on practical scientific applications is a shift I find really interesting. This is part of a larger movement where AI is moving from being a general-purpose tool to something more specialized for specific domains.

Google AI Ultra subscribers can access the updated Deep Think in the Gemini app starting today, and for the first time, researchers and enterprises can apply for early access to use it via the Gemini API.



Why Deep Think Focuses on Science and Engineering

The way Google approached this update is pretty clever. Instead of just making a model that's good at abstract reasoning, they worked directly with scientists and researchers to understand what kinds of problems they actually face in their work.

Real research isn't like solving textbook problems. You're dealing with incomplete data, messy information, and questions that don't have single right answers. Traditional AI models often struggle with this kind of ambiguity, but Deep Think was specifically trained to handle it.

What caught my attention in the announcement were the real-world examples. A mathematician at Rutgers University used Deep Think to review a highly technical paper and it found a logical flaw that had passed through human peer review. At Duke University, researchers used it to optimize crystal growth methods for semiconductor materials, hitting a precise target that previous methods couldn't achieve.

These aren't just impressive demos—they're solving actual research bottlenecks.

The Numbers Are Genuinely Impressive

Deep Think continues to push what's possible on academic benchmarks. It scored 48.4% on Humanity's Last Exam, a benchmark specifically designed to test the limits of frontier models. That's without using any external tools, just pure reasoning.

It also achieved 84.6% on ARC-AGI-2, which tests abstract reasoning abilities that supposedly indicate progress toward artificial general intelligence. The ARC Prize Foundation verified this result, which gives it more credibility.

On Codeforces, a competitive programming platform, Deep Think reached an Elo rating of 3455. To put that in perspective, that's gold medal territory at international programming competitions.

The really interesting part is that Deep Think now also excels at chemistry and physics olympiad problems, achieving gold medal-level performance on both. It scored 50.5% on CMT-Benchmark, which tests advanced theoretical physics understanding.

Built for Practical Engineering Applications

Beyond benchmarks, what makes Deep Think stand out is how it's being used in practice. Google designed it to interpret complex data and model physical systems through code, which means engineers can actually use it for real work.

One example they showed is turning a sketch into a 3D-printable file. You draw something, Deep Think analyzes it, models the complex shape, and generates a file ready for 3D printing. That's the kind of practical application that makes this more than just an impressive reasoning model—it's a tool people can actually use.

Google's also making this available through the Gemini API for researchers and enterprises, which is significant. Previous versions of Deep Think were mostly limited to the consumer app, but opening it up via API means developers can integrate it into their own workflows and tools.

What This Means for AI Reasoning Models

This release is part of a broader competition happening right now in the reasoning model space. OpenAI has o1 and o3, DeepSeek released R1, Anthropic has been working on extended thinking capabilities, and now Google is pushing hard with Deep Think.

What's interesting is how these companies are differentiating their approaches. OpenAI focuses on general reasoning, DeepSeek emphasizes efficiency and open-source access, and Google is positioning Deep Think as the model for scientific and engineering work.

The practical difference here is that Deep Think isn't trying to be everything to everyone. It's specialized for domains where deep reasoning through complex, messy problems actually matters—research, engineering, advanced mathematics, theoretical physics.

For anyone working in these fields, having a model that understands the nuances of scientific work rather than just being good at logic puzzles could be genuinely transformative.

The fact that Google worked directly with scientists to build this, and that early testers are already finding real research applications, suggests this is more than just benchmark chasing. It's an attempt to make AI actually useful for advancing human knowledge in concrete ways.

If you're a researcher, engineer, or working in a technical field, Deep Think might be worth keeping an eye on—especially if you can get into the early access program for the API. This could be one of those tools that changes how certain kinds of work get done.



 

Google Chrome's WebMCP is About to Change How AI Agents Browse the Web

 There's been this ongoing challenge with AI agents: when they visit a website, they're basically tourists who don't speak the language. Whether you're using LangChain, Claude Code, or tools like OpenClaw, your agent is stuck guessing which buttons to press, scraping HTML, or processing thousands of tokens worth of screenshots just to figure out what's on a page. If you've been building with agents for a while, you know exactly how painful this is.

That's what makes Google Chrome's new WebMCP preview so interesting. Earlier this week, the Chrome team shipped an early version of what could be the most important change to how agents interact with the web in years. Instead of treating every website like a foreign language that needs translation, WebMCP lets websites expose structured tools directly to AI agents. No more scraping. No more processing endless screenshots. Your agent just calls functions.

This is part of a bigger shift we're seeing where the web itself is becoming more agent-friendly, not just more human-friendly. And honestly, it's about time.

I've been following the development of browser-based agents, and WebMCP caught my attention because it solves problems most people aren't even talking about yet. Watch my YouTube video on it below.

WebMCP Begins Rollout


Why Current Web Interaction Is So Inefficient

Right now, agents interact with websites in two main ways. The first is through screenshots—you take an image of the page, feed it to a multimodal model, and hope it can identify buttons, form fields, and interactive elements. The problem? You're burning through thousands of tokens for every single image you process.

The second approach is accessing the DOM directly and parsing raw HTML and JavaScript code. While this uses fewer tokens than images, you're still translating from one language to another. The agent has to sift through paragraph tags, CSS styling, and all sorts of presentation markup that doesn't actually matter for understanding what actions it can take.

Both methods feel like working through a translator when you could just speak the same language.

How WebMCP Actually Works

The idea behind WebMCP is beautifully simple: let each webpage act like an MCP server that agents can query directly. The page basically tells the agent, "Here's what you can read. Here's what you can click. Here's what you can fill in."

This isn't entirely new—academics and companies have been proposing versions of this for a while. But in the second half of last year, Microsoft and Google actually got together to build a real spec for how this would work. The timing makes sense too—this was right around when we saw Perplexity release Comet and OpenAI release Atlas, when web interaction was clearly heating up.

What makes Chrome's approach interesting is that it's designed for human-in-the-loop workflows first. The agent works with the user, not just autonomously. So normal people still use websites normally, but agents can help speed things up and improve the experience.

Google presented three core pillars at the Web AI Summit: context (understanding what the user is doing beyond just the current screen), capabilities (taking actions on the user's behalf), and coordination (managing the handoff between agent and user when needed).

The Two APIs You Need to Know

Chrome has structured WebMCP around two main APIs. The Declarative API handles standard actions—think HTML forms with added tool names and descriptions. If you've already got well-structured forms on your site, you're apparently about 80% of the way there.

The Imperative API is for more complex, dynamic interactions that require JavaScript execution. This is where you'd define custom tools, similar to how you'd structure function calls for OpenAI or Anthropic's API endpoints.

The practical difference here is huge. Instead of dozens of interactions clicking through filters and scrolling pages, a single tool call could return structured results. Imagine your agent calling a "search products" function and getting back organized data instead of trying to parse a visual search interface.

What This Means Going Forward

While WebMCP is still behind a flag in Chrome, it's already in the browser. This isn't a theoretical spec anymore—it's actually happening. Google will likely roll this out fully at Google Cloud Next or Google I/O in the coming months, and I expect things to move quickly from there.

We'll probably see tools and maybe even Claude skills that help convert existing websites to expose their own WebMCPs. For anyone building AI agents or websites that want agents to use them, this is definitely something to have on your radar.

The shift from agents guessing their way through the web to websites speaking the agent's language directly? That's the kind of change that makes everything else possible.

8.2.26

Claude Opus 4.6 Fast Mode — Up to 2.5x Faster Responses at Premium Pricing

Anthropic launched Fast Mode for Claude Opus 4.6 in research preview. The feature delivers up to 2.5x higher output tokens per second from the same model at a higher cost per token.

Fast Mode is available now in Claude Code for users with extra usage enabled and through a waitlist for API access. The feature is also rolling out to GitHub Copilot, Cursor, and other platforms.

How Fast Mode Works

Fast Mode is not a different model. It uses the same Opus 4.6 with a different API configuration that prioritizes speed over cost efficiency. You get identical quality and capabilities, just faster responses.

The speed improvement focuses on output tokens per second, not time to first token. The same model weights and behavior remain unchanged.

Accessing Fast Mode

In Claude Code, toggle Fast Mode on or off by typing /fast in the CLI or VS Code extension. You can also enable it in your user settings file by setting "fastMode": true. Fast Mode persists across sessions.

When enabled, Claude Code automatically switches to Opus 4.6 if you're on a different model. A small ↯ icon appears next to the prompt while Fast Mode is active.

For API users, set speed: "fast" in your API request to enable Fast Mode. The feature is currently in limited research preview with waitlist access.
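
Here's roughly what that looks like with the Anthropic Python SDK. Since Fast Mode is a limited research preview, I'm assuming the speed field gets passed through extra_body rather than a dedicated SDK parameter, and the model id is a placeholder:

```python
# Sketch: enabling Fast Mode on an API request via the speed field described
# above. Assumes the preview field is passed through extra_body; the SDK may
# surface it differently, and the model id is a placeholder.
import anthropic

client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-opus-4-6",       # placeholder id
    max_tokens=1024,
    messages=[{"role": "user", "content": "Refactor this function for clarity: ..."}],
    extra_body={"speed": "fast"},  # research-preview field; requires waitlist access
)
print(message.content[0].text)
```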

Pricing and Availability

Fast Mode pricing starts at $30 per million input tokens and $150 per million output tokens. This is 6x the standard Opus pricing of $5 per million input and $25 per million output.

A 50% discount is available for all plans until February 16, 2026, bringing the cost to 3x standard pricing during the discount period.

Fast Mode usage is billed directly to extra usage, even if you have remaining usage on your plan. Fast Mode tokens do not count against your plan's included usage.

Requirements and Limitations

Fast Mode requires extra usage enabled on your account. For individual accounts, enable this in Console billing settings. For Teams and Enterprise, an admin must enable both extra usage and Fast Mode for the organization.

Fast Mode is not available on third-party cloud providers including Amazon Bedrock, Google Vertex AI, or Microsoft Azure Foundry. It's only available through the Anthropic Console API and Claude subscription plans using extra usage.

Fast Mode has separate rate limits from standard Opus 4.6. When you hit the rate limit or run out of extra usage credits, Fast Mode automatically falls back to standard Opus 4.6.

When to Use Fast Mode

Fast Mode works best for interactive workflows where speed matters more than cost. Use it for rapid iteration, live debugging, or real-time agent interactions.

Toggle it off when cost efficiency is more important than latency. You can combine Fast Mode with lower effort levels for maximum speed on straightforward tasks.

For API users, note that switching between fast and standard speed invalidates the prompt cache. Requests at different speeds do not share cached prefixes.

5.2.26

Google DeepMind's Evo-Memory Redefines AI Agent Memory — Cutting Task Steps by 50% Without Retraining


The gap between how AI agents remember and how they actually learn from experience has long been a fundamental limitation. While chatbots can recall what you said in a previous conversation, they typically can't leverage that experience to solve similar problems faster or smarter. A new research collaboration between Google DeepMind and the University of Illinois Urbana-Champaign proposes a solution: "Test-Time Evolution" — where agents actively Search, Synthesize, and Evolve their memory after every interaction.

This isn't just another benchmark paper. Evo-Memory introduces a comprehensive streaming evaluation framework alongside ReMem, a think-act-refine memory pipeline that fundamentally changes how we think about agent memory. The results are striking: active memory refinement reduced task completion steps by roughly 50% on ALFWorld (from 22.6 steps down to 11.5), and smaller models like Gemini Flash achieved gains that often rivaled larger static models. The success hinges not on storing more information, but on the agent's ability to refine and delete irrelevant experiences.

For anyone building AI agents, personal assistants, or autonomous systems, this research signals a shift in how we should approach memory architecture. Current RAG systems and long-context models excel at passive retrieval, but they don't learn from what worked and what didn't. Evo-Memory closes that gap by treating memory as something that evolves during deployment rather than remaining frozen after training.

The Core Problem: Remembering vs. Learning

The paper identifies a critical distinction that often gets overlooked. Current LLM memory systems focus on conversational recall — retrieving facts from dialogue history to answer queries. But this misses the more valuable capability of experience reuse, where agents abstract reasoning strategies from past tasks to improve future performance.

Think about it this way: if you ask a math tutor the same type of problem twice, they shouldn't solve it from scratch the second time. They should recognize the pattern and apply the successful strategy faster. Yet most AI agents today do exactly that — they recall context but fail to adapt across sessions. The researchers demonstrate this limitation persists even in sophisticated systems using retrieval-augmented generation, hierarchical memory, and workflow-based approaches.

The benchmark transforms static datasets into streaming task sequences, explicitly testing whether LLMs can accumulate knowledge and refine strategies during deployment. This reframing from isolated task evaluation to continuous adaptation assessment reveals significant weaknesses in current memory architectures.

ReMem: The Think-Act-Refine Loop



The proposed solution introduces a three-operation framework that goes beyond traditional ReAct-style agents. At each step, the agent chooses between Think (internal reasoning traces), Act (execute an operation or output a response), and Refine (meta-reasoning over memory to exploit useful experiences, prune noise, and reorganize stored knowledge).

This creates what the researchers describe as a Markov decision process where memory becomes an adaptive component that interacts with reasoning in real time rather than remaining passive context. The agent can loop between Think and Refine arbitrarily before committing to an action, forming a lightweight but powerful paradigm for continual adaptation.

A concrete example from the paper: when solving a household task like "put a hot apple in the fridge," the ReMem agent thinks about needing a heat source, searches memory for relevant experiences with microwaves, prunes an obsolete entry about stoves, executes the microwave action, then creates a new memory entry capturing the successful "hot→fridge = cooldown" strategy. This completed in 9 steps versus 19 for vanilla ReAct.
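
Here's a schematic sketch of that loop, just to make the control flow concrete. This is not the authors' implementation; the policy, environment, and memory store are placeholders for the LLM, the task environment, and the stored experience text:

```python
# Schematic sketch of the Think / Act / Refine loop described above. `policy`,
# `env`, and `memory` are placeholders, not the paper's code.
def remem_episode(task, env, policy, memory, max_steps=30):
    trajectory = []
    for _ in range(max_steps):
        # The agent may loop between Think and Refine before committing to Act.
        choice = policy.choose(task, memory, trajectory)  # "think" | "refine" | "act"
        if choice == "think":
            trajectory.append(("think", policy.reason(task, memory, trajectory)))
        elif choice == "refine":
            # Meta-reasoning over memory: reuse what helped, prune what didn't.
            memory.prune_irrelevant(task)
            memory.promote_relevant(task)
        else:  # act
            action = policy.act(task, memory, trajectory)
            observation, done = env.step(action)
            trajectory.append(("act", action, observation))
            if done:
                break
    # After the episode, store a condensed version of what worked.
    memory.add(policy.summarize(task, trajectory))
    return trajectory
```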

Benchmark Results That Challenge Assumptions

The research evaluated over ten representative memory modules across 10 diverse datasets spanning embodied reasoning (ALFWorld, BabyAI, PDDL, ScienceWorld) and single-turn tasks (AIME-24/25, GPQA, MMLU-Pro, ToolBench). The results reveal several important findings.

ReMem on Claude 3.7 Sonnet achieved 0.92 success rate and 0.96 progress on ALFWorld, 0.83 success and 0.95 progress on PDDL planning tasks. On Gemini 2.5 Flash, the average success reached 0.50 with 0.64 progress, consistently outperforming history baselines and ReAct-style approaches across all four multi-turn environments.

Perhaps most notably, the performance gains correlate strongly with task similarity within datasets. The researchers found a Pearson correlation of 0.72 on Gemini 2.5 Flash and 0.56 on Claude 3.7 Sonnet between ReMem's improvement margin and within-dataset coherence. Structured domains like PDDL and ALFWorld with higher intra-task similarity showed larger improvements, while diverse datasets like AIME-25 or GPQA showed smaller gains.

Step efficiency improvements proved equally significant. In ALFWorld, average steps to complete tasks dropped from 22.6 for history baselines to 11.5 for ReMem. ScienceWorld showed similar gains, going from 20.5 steps down to 14.0. The researchers note this represents a direct compute-cost win without any fine-tuning.



The Surprising Power of Simple Approaches

One unexpected finding deserves attention: ExpRAG, a simple retrieval-based baseline, outperformed several more complex designs. This baseline stores each task interaction as structured experience text and retrieves similar experiences for new tasks using basic embedding similarity.

Even ExpRecent, which simply maintains condensed traces of recent task trajectories, performed competitively. This suggests that explicit task-level utilization during test-time evolution represents a promising and underexplored direction, and that architectural complexity isn't always the answer.
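
For reference, the ExpRAG idea really is about as simple as it sounds. A minimal sketch, assuming you already have an embedding function on hand:

```python
# Minimal sketch of the ExpRAG idea: store each completed task as experience
# text and retrieve the nearest ones by embedding similarity. `embed()` is a
# placeholder for whatever embedding model you already use.
import numpy as np

class ExperienceStore:
    def __init__(self, embed):
        self.embed = embed
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, experience_text: str) -> None:
        self.texts.append(experience_text)
        self.vectors.append(self.embed(experience_text))

    def retrieve(self, task_description: str, k: int = 3) -> list[str]:
        if not self.texts:
            return []
        query = self.embed(task_description)
        sims = [
            float(np.dot(query, v) / (np.linalg.norm(query) * np.linalg.norm(v)))
            for v in self.vectors
        ]
        top = np.argsort(sims)[::-1][:k]  # indices of the k most similar experiences
        return [self.texts[i] for i in top]
```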

The research also tested how agents handle both successful and failed experiences in memory. Baseline methods experienced clear performance drops when exposed to unfiltered failures, indicating that naive memory accumulation introduces noise. ReMem remained robust by actively refining stored experiences, achieving the highest overall success rates under both Claude and Gemini backbones when fed mixed feedback.

Why This Matters for AI Development

The implications extend beyond benchmark scores. Evo-Memory demonstrates that test-time evolution — the ability to retrieve, integrate, and update memory continuously during deployment — represents a viable path to more capable AI agents without additional training.

Smaller models particularly benefit from self-evolving memory, suggesting this approach could democratize access to more sophisticated agent capabilities. The correlation between task similarity and memory effectiveness provides practical guidance: domains with structured, recurring task patterns stand to gain the most from implementing these techniques.



For developers building production AI systems, the key insight is that memory architecture matters as much as model capability. Simply increasing context windows or adding retrieval doesn't capture the adaptive, self-improving behavior that humans naturally exhibit when learning from experience.

The researchers have indicated plans to release all code and configurations for reproducibility, making this a practical resource for the AI community rather than just a research contribution. As we move toward agents that operate autonomously over extended periods, the shift from static recall to dynamic evolution may prove foundational for the next generation of AI systems.

PaperBanana: The AI That's Automating Academic Illustration (And It's Kind of Mind-Blowing)

If you've ever written a research paper, you know the pain: you've done the hard work, written thousands of words explaining your groundbreaking methodology, and then... you need to create diagrams. Beautiful, publication-ready diagrams that somehow capture your complex ideas in a single visual. For many researchers, this becomes the most time-consuming part of the entire process.

Enter PaperBanana, a revolutionary framework from researchers at Peking University and Google Cloud AI Research that's tackling this exact bottleneck. And yes, they named it PaperBanana because even serious AI research deserves a smile.



What Makes PaperBanana Special?

Think of PaperBanana as your personal illustration team, but instead of humans, it's five specialized AI agents working together. Each agent has a specific role: the Retriever finds relevant reference examples from existing papers, the Planner translates your research context into detailed visual descriptions, the Stylist ensures everything looks professionally polished, the Visualizer creates the actual diagrams, and the Critic reviews and refines the output until it meets publication standards.
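
If it helps to picture the flow, here's a schematic sketch of that five-role pipeline. Every callable is a placeholder for whatever model backs that role; this is not PaperBanana's actual code:

```python
# Schematic sketch of the five-role pipeline described above. Each callable
# (retriever, planner, stylist, visualizer, critic) is a placeholder, and the
# planner is assumed to accept optional critic feedback on revision passes.
def generate_figure(paper_context, retriever, planner, stylist, visualizer, critic,
                    max_revisions=3):
    references = retriever(paper_context)          # similar published figures
    plan = planner(paper_context, references)      # visual description + layout
    figure = visualizer(stylist(plan))             # polish the style, then render
    for _ in range(max_revisions):
        feedback = critic(figure, paper_context)   # faithfulness, readability, looks
        if feedback.acceptable:
            break
        plan = planner(paper_context, references, feedback=feedback)
        figure = visualizer(stylist(plan))
    return figure
```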

This isn't just about slapping together some boxes and arrows. PaperBanana generates diagrams that are faithful to your research, concise enough to be readable, aesthetically pleasing, and sophisticated enough to appear in top-tier conferences like NeurIPS.

PaperBanana's architecture: Five specialized AI agents collaborate to transform research content into publication-ready illustrations.

The Secret Sauce: Reference-Driven Intelligence

What sets PaperBanana apart is its reference-driven approach. Instead of generating illustrations from scratch with no context, it learns from the visual language already established in academic publishing. The system analyzes methodology diagrams from recent NeurIPS papers, understanding not just what makes a diagram functional, but what makes it beautiful and publication-ready.

The results speak for themselves. In comprehensive testing against leading baselines, PaperBanana consistently outperformed competitors across all evaluation dimensions: faithfulness, conciseness, readability, and aesthetics. It's not just good—it's setting a new standard.

Beyond Methodology Diagrams

But here's where it gets even more interesting: PaperBanana doesn't just do methodology diagrams. It also generates high-quality statistical plots. The researchers tested both code-based and image generation approaches for creating visualizations, revealing fascinating trade-offs. Image generation creates more visually appealing plots, but code-based methods maintain better content fidelity. Understanding these nuances helps researchers choose the right approach for their needs.

The Benchmark That Changes Everything

To properly evaluate automated illustration generation, the team created PaperBananaBench—a rigorous benchmark comprising 292 test cases curated from NeurIPS 2025 publications. This benchmark captures the sophisticated aesthetics and diverse logical compositions of modern AI research, spanning multiple research domains and illustration styles.

The average source context contains over 3,000 words, proving that PaperBanana can handle the complexity of real research papers, not just simplified examples.

PaperBananaBench statistics showing 292 test cases with average source context of 3,020 words per diagram.

PaperBanana consistently outperforms baselines across all evaluation dimensions: faithfulness, conciseness, readability, and aesthetics.

Real-World Applications

The practical applications extend beyond just generating new diagrams. PaperBanana can enhance the aesthetics of existing human-drawn diagrams, applying automatically summarized style guidelines to elevate visual quality. Imagine taking a rough sketch and having it instantly transformed into a polished, publication-ready illustration that maintains your original intent while looking professionally designed.

Before and after: PaperBanana transforms verbose, outdated diagrams into concise, aesthetically modern illustrations while maintaining accuracy.

The Road Ahead

Of course, no system is perfect. The researchers openly acknowledge failure modes, particularly around connection errors in complex diagrams. But this transparency is refreshing—they're not claiming to have solved everything, just to have made a significant leap forward.

For AI researchers, content creators, and anyone involved in scientific communication, PaperBanana represents something bigger than just a tool. It's a glimpse into a future where the tedious parts of research communication are automated, freeing scientists to focus on what they do best: pushing the boundaries of knowledge.

The code is available on GitHub, the paper is on arXiv, and the framework is ready to explore. As AI continues to augment scientific workflows, tools like PaperBanana remind us that automation isn't about replacing human creativity—it's about amplifying it, one beautifully generated diagram at a time.

4.2.26

Qwen Just Dropped a Coding AI That Runs on Your Laptop — And It's Competing with Models 20x Larger

Okay, I need to tell you about something that just happened in the AI coding world that has me genuinely excited. The Qwen team just released Qwen3-Coder-Next, and if you've been following the whole "local AI coding assistant" conversation, this one's a big deal.


Here's why: This model has 80 billion parameters but only uses 3 billion at a time. And somehow, it's matching the performance of models with 10-20x more active parameters. Yeah, you read that right.

Let me break down what this actually means for people like us who want powerful AI coding tools that don't require sending all our code to the cloud.

What Makes This Different from Other Coding Models?

Most coding AI models you've heard about—like GitHub Copilot, ChatGPT for coding, or Claude—run on massive cloud servers. They're great, but you're always dependent on an internet connection, you're sharing your code with a third party, and there are costs involved.

Qwen3-Coder-Next is built specifically for local development and coding agents. That means it's designed to run on your own machine (yes, even a beefy laptop or desktop), keep your code private, and work with tools like Claude Code, Cline, and other IDE integrations.

But here's where it gets interesting: unlike traditional models that need all their billions of parameters active to work well, Qwen3-Coder-Next uses something called a Mixture-of-Experts (MoE) architecture with sparse activation.

Think of it like having a team of 80 billion specialists, but for any given task, you only need to consult 3 billion of them. This makes it incredibly efficient—you get the intelligence of a huge model with the speed and memory requirements of a much smaller one.

The Architecture: Hybrid Attention That Actually Makes Sense

Now, I know "hybrid attention" and "sparse MoE" sound like buzzwords, but stick with me because this is actually pretty clever.

Traditional transformer models have a problem: as you give them more context (like a large codebase), the computational cost of attention grows quadratically with the length of the input. It's called the "quadratic scaling problem," and it's why most models struggle when you try to feed them an entire repository's worth of code.

Qwen3-Coder-Next solves this by combining three different types of attention mechanisms:

  • Gated DeltaNet (for efficient linear attention)
  • Gated Attention (for focused reasoning)
  • Mixture-of-Experts layers (for specialized knowledge)


The model has 48 layers total, and the layout repeats a pattern: three DeltaNet-MoE blocks followed by one Attention-MoE block. Each MoE layer has 512 expert networks, but only 10 experts plus 1 shared expert activate for each token you process.

What this means practically: You can give this model a 256,000 token context window (that's roughly 200,000 words or a massive codebase), and it won't choke. It'll keep reasoning through your entire project without slowing to a crawl.
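
To see how sparse that really is, here's a quick back-of-the-envelope based on the numbers above: 48 layers in repeating blocks of three DeltaNet-MoE plus one Attention-MoE, 512 experts per MoE layer, and 10 routed plus 1 shared expert active per token:

```python
# Back-of-the-envelope sketch of the layout described above: 48 layers built
# from repeated [DeltaNet-MoE x3, Attention-MoE x1] blocks, 512 experts per
# MoE layer, 11 of them (10 routed + 1 shared) active per token.
TOTAL_LAYERS = 48
PATTERN = ["deltanet_moe"] * 3 + ["attention_moe"]

layers = [PATTERN[i % len(PATTERN)] for i in range(TOTAL_LAYERS)]
assert layers.count("attention_moe") == 12  # one full-attention block per group of 4

EXPERTS_PER_LAYER = 512
ACTIVE_EXPERTS = 10 + 1  # routed + shared

active_fraction = ACTIVE_EXPERTS / EXPERTS_PER_LAYER
print(f"{active_fraction:.1%} of experts fire per token per MoE layer")  # ~2.1%
# That sparsity is how 80B total parameters collapse to roughly 3B active ones.
```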

Trained Like an Actual Coding Agent, Not Just a Code Generator

Here's where Qwen3-Coder-Next really stands out from other coding models: how it was trained.

Most coding AI is trained on static code snippets—just reading code and learning patterns. Qwen3-Coder-Next went through what the team calls "agentic training at scale."

They created 800,000 executable coding tasks with real environments. These weren't simple "write a function" exercises. They were actual bug-fixing scenarios pulled from GitHub, complete with test suites, containerized environments, and the ability to execute code and see if it works.

During training, the model:

  1. Receives a coding task
  2. Writes code to solve it
  3. Runs the code in a real environment
  4. Gets feedback if it fails
  5. Learns to recover from errors and try again

This is reinforcement learning applied to real-world coding workflows. The model learned to plan, use tools, run tests, and recover from failures—not just spit out code and hope for the best.
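
Conceptually, the loop looks something like the sketch below. To be clear, this is the idea, not Qwen's training code; the sandbox, the reward, and the update step are all placeholders:

```python
# Conceptual sketch of the execute-and-learn loop described above. `model`,
# `sandbox`, and the reward/update step are placeholders, not Qwen's code.
def agentic_training_step(model, task, sandbox):
    attempt_log = []
    result = None
    for attempt in range(task.max_attempts):
        patch = model.generate_patch(task.description, history=attempt_log)
        result = sandbox.run_tests(task.repo, patch)   # real execution, real feedback
        attempt_log.append((patch, result.stderr))
        if result.passed:
            break
    # Reward the trajectory based on whether the final attempt passed its tests.
    model.reinforce(trajectory=attempt_log, reward=1.0 if result and result.passed else 0.0)
```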

The team even trained specialized "expert models" for specific domains:

  • A Web Development Expert for full-stack UI work (tested by actually rendering pages in a browser)
  • A User Experience Expert for CLI tool interactions across different frameworks

This training approach is why Qwen3-Coder-Next excels at long-horizon coding tasks—the kind where you need to make multiple changes across several files, run tests, fix errors, and iterate until everything works.



The Benchmarks: Punching Way Above Its Weight Class

Let me show you where this gets really impressive. On SWE-Bench Verified (a benchmark that tests how well models can solve real GitHub issues), here's how Qwen3-Coder-Next compares:

  • Qwen3-Coder-Next (3B active params): 70.6%
  • DeepSeek-V3.2 (671B total params): 70.2%
  • GLM-4.7 (358B total params): 74.2%


So a model with only 3 billion active parameters is matching or beating models with hundreds of billions of total parameters. That's insane efficiency.

On SWE-Bench Pro (an even harder benchmark):

  • Qwen3-Coder-Next: 44.3%
  • DeepSeek-V3.2: 40.9%
  • GLM-4.7: 40.6%

And on Terminal-Bench 2.0 (testing CLI agent capabilities) and Aider (a coding assistant benchmark), it continues to perform at the level of much larger models.

The takeaway: You're getting elite coding assistant performance in a package that can actually run locally on consumer hardware.

You Can Actually Run This on Your Own Machine

This is where things get practical. The Qwen team didn't just release model weights and say "good luck." They've made this genuinely deployable for real people.

For Server Deployment:

  • Works with SGLang and vLLM (industry-standard inference engines)
  • Supports OpenAI-compatible API endpoints
  • Can handle the full 256K token context window
  • Requires multiple GPUs for full performance (2-4 GPUs with tensor parallelism)

For Local Deployment:

  • Unsloth provides GGUF quantizations (compressed versions)
  • 4-bit quantization: Needs about 46 GB of RAM
  • 8-bit quantization: Needs about 85 GB of RAM
  • Works with llama.cpp and llama-server
  • Compatible with Apple Silicon unified memory (yes, you can run this on an M-series Mac)

The Unsloth team has even created guides showing how to plug Qwen3-Coder-Next into frameworks that mimic OpenAI Codex and Claude Code, but running entirely on your local machine.

For most modern development machines—especially those with 64GB+ of unified memory or dedicated GPUs—this is totally feasible. You can have a production-grade coding assistant running locally.
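
Once you have it served locally, talking to it is the familiar OpenAI-compatible dance. Here's a sketch against llama-server; the port, the model name, and the GGUF filename in the comment are assumptions you'd adjust for your own setup:

```python
# Sketch: talk to a locally served Qwen3-Coder-Next through llama-server's
# OpenAI-compatible endpoint. Assumes you've already started the server with a
# GGUF build, e.g.: llama-server -m Qwen3-Coder-Next-Q4_K_M.gguf --port 8080
# (the filename, port, and model name are placeholders for your setup).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # llama-server's OpenAI-compatible API
    api_key="not-needed-locally",
)

response = client.chat.completions.create(
    model="qwen3-coder-next",  # whatever name your server registers
    messages=[
        {"role": "system", "content": "You are a careful coding assistant."},
        {"role": "user", "content": "Write a pytest for a function that parses ISO dates."},
    ],
)
print(response.choices[0].message.content)
```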

What You Can Actually Do With It

Qwen3-Coder-Next isn't just for autocompleting code. It's designed for agentic workflows, meaning it can:

  • Understand entire codebases (thanks to the 256K context window)
  • Plan multi-step refactoring tasks across multiple files
  • Execute code and interpret results
  • Call external tools (linters, formatters, test runners, debuggers)
  • Recover from errors by analyzing stack traces and trying different approaches
  • Work with different IDE scaffolds (Claude Code, Qwen Code, Cline, Kilo, etc.)

It supports tool calling natively, meaning it can interact with your development environment like a junior developer would—running commands, reading outputs, and making decisions based on results.

One important note: This model does NOT use "thinking" mode (it doesn't generate <think> tags). It goes straight to action. This makes it more predictable for agent workflows where you want direct tool calls and responses.

Who Should Care About This?

If you're a developer who:

  • Works with sensitive codebases that can't go to cloud APIs
  • Wants a powerful coding assistant without monthly subscription fees
  • Has a decent development machine (64GB+ RAM or multiple GPUs)
  • Prefers open-source tools over proprietary solutions
  • Wants to experiment with AI coding agents

Then Qwen3-Coder-Next is worth checking out.

If you're a company that:

  • Needs coding assistance for proprietary code
  • Wants to avoid data leaving your infrastructure
  • Has the compute resources to host models locally or on-premises
  • Wants to customize the model for specific frameworks or languages

This is a compelling option under the Apache 2.0 license (a permissive license that lets you use it commercially).

The Bigger Picture: Local AI Coding Assistants Are Here

What excites me most about Qwen3-Coder-Next isn't just the model itself—it's what it represents.

For the past couple of years, the best AI coding tools have been cloud-only. You had to use GitHub Copilot, Claude, or ChatGPT, all of which require sending your code to external servers.

But in just the past few weeks, we've seen an explosion of local coding assistants:

  • Anthropic's Claude Code
  • OpenAI's Codex app
  • Various open-source frameworks like OpenClaw
  • And now Qwen3-Coder-Next

The tech has matured to the point where you can have GPT-4-class coding assistance running entirely on your own hardware. For privacy-conscious developers and companies working on proprietary systems, this is huge.

How to Get Started

If you want to try Qwen3-Coder-Next:

  1. Check the model weights on Hugging Face: Look for Qwen/Qwen3-Coder-Next
  2. Read the technical report on GitHub: QwenLM/Qwen3-Coder
  3. Follow the Unsloth guide for local deployment: unsloth.ai/docs/models/qwen3-coder-next
  4. Try it with your favorite agent framework: Claude Code, Cline, or Qwen Code all support it

The setup isn't quite as simple as installing a VSCode extension (yet), but the documentation is solid, and the community is already building tooling around it.

Final Thoughts

I think we're at an inflection point with AI coding tools. The cloud-based options are still excellent and will continue to improve, but now we have legitimate alternatives that run locally, respect privacy, and don't require subscriptions.

Qwen3-Coder-Next proves that you don't need to activate hundreds of billions of parameters to get strong coding assistance. With clever architecture (sparse MoE with hybrid attention) and smart training (agentic training on executable tasks), you can build something powerful enough to rival the big proprietary models.

For me, this opens up possibilities for experimentation, customization, and building coding tools that work the way I want them to—without worrying about API costs or data privacy.

If you've been curious about local AI coding assistants, now's the time to dive in. The tech is finally here.
