20.2.26

Grok 4.20: What Nobody Is Actually Talking About

There's been a lot of noise about Grok 4.20 this week. The four agents, the multi-agent setup, the "AI brains arguing with each other" framing. I get it — it's a fun story. But the more I dug into this release, the more I felt like everyone was covering the surface and missing the stuff underneath that's actually worth paying attention to.

So here's what caught my eye.


The model solved a real math problem nobody had solved before

I came across this one almost by accident and it stopped me completely.

Paata Ivanisvili, a math professor at UC Irvine, had been working on an open problem in harmonic analysis — trying to find the exact maximum of something called a Bellman function. He and his student had already published a paper on it. They had a partial answer. The problem was still technically unsolved.

He got early access to the Grok 4.20 beta, fed it the problem, and within five minutes it handed him an explicit formula — a cleaner, sharper result than what he and his student had managed. His reaction on X was basically just: "Wow."

What makes this different from the usual "AI is smart" story is that benchmarks test problems with known answers. This didn't. Grok wasn't pulling a solution from somewhere — it was reasoning toward something that hadn't been written down yet. That's a genuinely different category of thing, and it happened because the Benjamin agent (the math and logic one) is specifically built for this kind of step-by-step proof reasoning.

One result doesn't prove everything. But it happened, it's documented, and it made me think differently about what "AI for research" actually looks like in practice.


It was already winning competitions before anyone knew it existed

Here's something that got buried in the launch coverage: Grok 4.20's story didn't start when the public beta dropped on February 17. It had been running quietly, under fake names, for weeks before that.

On LMArena it showed up as "Theta-hat" and "Slateflow." On DesignArena it appeared as "Pearl" and "Obsidian." And in Alpha Arena's live AI stock trading competition, a "Mystery Model" was quietly the only entry making money while every other AI lost. That Mystery Model was Grok 4.20.

xAI's last official model announcement on their news page is still Grok 4.1 from November 2025. There was no blog post for this launch. No announcement thread. Just posts from employees on X and a new option appearing quietly in the grok.com dropdown. By the time most people heard about it, it had already been competing for weeks.


The "four brains" framing is wrong — and the real version is more interesting

Everyone's been describing this as four separate AI models arguing with each other. That's not quite what's happening, and I think the actual explanation is more impressive.

Grok 4.20 is one model — estimated around 3 trillion parameters — with four different roles running from the same base weights. Grok the coordinator, Harper the real-time researcher, Benjamin the math and logic brain, Lucas the contrarian wildcard. They're not four separate models. They're four ways of prompting the same model, each with different tool access and context framing.

Why does this matter? Cost. Running four truly separate models would be four times as expensive. Because these agents share weights and the same input context, xAI puts the overhead at 1.5x to 2.5x — not 4x. That's what makes this available at a consumer price point rather than Grok Heavy's $300/month tier. The agents don't have long debates either. The rounds are short, structured, trained through reinforcement learning to be efficient. They send targeted corrections and move on.
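To make the "one brain, four hats" idea concrete, here's a minimal sketch of the pattern as I understand it. Everything here is an assumption for illustration: the role prompts, the tool lists, and the stub model function are mine, not xAI's actual configuration. The point is just that each "agent" is the same underlying model called with a different system prompt and tool set, so there's one set of weights, not four.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentRole:
    name: str
    system_prompt: str   # hypothetical framing, invented for this sketch
    tools: tuple         # names of tools this role would be allowed to call

def base_model(system_prompt: str, context: str) -> str:
    # Stand-in for the single shared model. In the real system this
    # would be one forward pass through the same weights; only the
    # prompt and tool access differ per role.
    role_label = system_prompt.split(":")[0]
    return f"[{role_label}] responding to: {context}"

# The four roles described in the coverage, with made-up prompts/tools.
ROLES = (
    AgentRole("Grok",     "Coordinator: merge answers and arbitrate", ()),
    AgentRole("Harper",   "Researcher: pull and cite live sources",   ("x_search",)),
    AgentRole("Benjamin", "Prover: reason step by step",              ("calculator",)),
    AgentRole("Lucas",    "Contrarian: attack the consensus",         ()),
)

def debate_round(context: str) -> dict:
    # One short, structured round: every role queries the SAME model
    # with its own framing. Four calls, one set of weights.
    return {r.name: base_model(r.system_prompt, context) for r in ROLES}

replies = debate_round("Is this claim supported?")
```

Under this framing the cost story also falls out naturally: the expensive thing (the weights, and in a real serving stack the shared prompt prefix) is loaded once, and each role adds an incremental decode, which is how you land between 1.5x and 2.5x instead of 4x.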

I found it interesting that nobody in the coverage this week mentioned this. "Four brains" is a better headline, sure, but one brain wearing four hats is actually a harder technical trick to pull off.


The trading competition result deserves more scrutiny than it got

Alpha Arena Season 1.5 was a live stock trading competition — real transactions, verified on the blockchain. GPT-5, Gemini, Claude, open-source models, Chinese models — everyone participated in multiple strategy variants over several weeks. When it ended, every single configuration of every other model finished in the red. Grok 4.20's four variants were the only ones in profit, with the best posting around 35% returns from a $10,000 start.

A few people mentioned this. Nobody really asked why.

My read: it wasn't that Grok out-thought everyone on finance. It's that Harper — the agent that pulls from X's real-time feed of roughly 68 million English tweets per day — gave it a live information edge the other models didn't have. The competition fed other models summarized news prompts. Grok was processing raw market sentiment as it happened.

That's worth being honest about, because it means the result says more about data access than raw intelligence. But here's the thing — that data access is the architecture. That's what the Harper agent was built to do. So the win was real, and it was structural, and it came from a deliberate design choice.


Grok 4.20 is the proof of concept. Grok 5 is the actual bet.

While everyone's been testing the 4.20 beta, xAI is already training its successor. Grok 5 is reportedly sitting around 6 trillion parameters — roughly double the size of 4.20 — with a target launch somewhere between April and June 2026.

For context, Grok 4.20 arrived in a crowded week. OpenAI shipped GPT-5.3-Codex on February 5. Anthropic released Claude Sonnet 5 on February 3. By that measure, this was a delayed point release catching up while competitors were already moving to the next generation. Even xAI treated it like a soft launch.

Grok 5 will not be a soft launch. xAI is spending roughly $1 billion a month. SpaceX formally acquired them weeks ago. An IPO is on the table. A 6-trillion-parameter release is exactly the kind of moment you build that story around.

What I keep thinking about is that the ecosystem is already being built before most people have noticed. Perplexity is reportedly running Grok 4.20 under the hood for a new search mode. There are signs of something called Grok Build — a coding tool with up to 8 parallel agents. The multi-agent architecture isn't a feature. It's the foundation, and xAI is building on top of it fast.

Grok 4.20 showed the approach works. The math problem, the trading competition, the benchmark trajectory — all of it points in the same direction. Grok 5 is where we find out if xAI can actually deliver on the scale of that promise.

I'll have a full video on this on YouTube soon — I want to actually get hands-on time with it first before I say anything definitive. But if you're following AI closely, this release is worth paying more attention to than the "four agents" headline suggests.

