22.7.25

Building Startups at the Speed of AI: Key Takeaways from Andrew Ng’s Startup School Talk

 

1 Speed Is the Leading Indicator of Success

At AI Fund, Andrew Ng’s venture studio, teams launch roughly one startup a month. After hundreds of “in-the-weeds” reps, Ng sees a clear pattern: the faster a founding team can execute and iterate, the higher its survival odds. Speed compounds—small delays in shipping, learning, or pivoting quickly snowball into lost market share.



2 The Biggest Opportunities Live in the Application Layer

Much of the media hype sits with semiconductors, hyperscalers, or foundation-model vendors. Yet the lion’s share of value will ultimately accumulate at the application layer—products that generate revenue and, in turn, pay the upstream providers. For AI enthusiasts, building real workflows that users love is still the clearest path to outsized impact.

3 Agentic AI Unlocks Quality (at the Cost of Raw Latency)

Traditional prompting forces a language model to produce output linearly, “from the first word to the last without backspace.” Agentic AI flips that paradigm: outline → research → draft → critique → revise. The loop is slower but consistently yields far more reliable results—crucial for domains such as compliance review, medical triage, or legal reasoning. Ng sees an entire orchestration layer emerging to manage these multi-step agents.
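The outline → research → draft → critique → revise loop Ng describes can be sketched in a few lines. This is a minimal illustration, not any particular framework: `call_model` is a hypothetical stand-in for a real LLM API call.

```python
# Minimal sketch of an agentic writing loop: outline -> research -> draft -> critique -> revise.
# `call_model` is a stand-in for any LLM API; here it just echoes the stage for illustration.

def call_model(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}]"

def agentic_write(task: str, max_revisions: int = 2) -> str:
    outline = call_model(f"Outline an answer to: {task}")
    notes = call_model(f"Research key facts for this outline: {outline}")
    draft = call_model(f"Write a draft from outline {outline} and notes {notes}")
    for _ in range(max_revisions):
        critique = call_model(f"Critique this draft: {draft}")
        draft = call_model(f"Revise the draft {draft} using critique {critique}")
    return draft

result = agentic_write("Summarize the compliance risks in this contract")
```

Each pass through the critique/revise loop trades latency for reliability, which is exactly the quality-for-speed swap described above.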

4 Concrete Ideas Trump Grand Generalities

“Use AI to optimize healthcare assets” sounds visionary but is impossible to execute. “Let hospitals book MRI slots online to maximize scanner utilization” is concrete—an engineer can sprint on it this afternoon, gather user feedback, and prove or disprove the hypothesis fast. Vague ideas feel safe because they’re rarely wrong; concrete ideas create momentum because they’re immediately testable.

5 AI Coding Assistants Turn One-Way Doors into Two-Way Doors

With tools like Claude Code, Cursor, and GitHub Copilot, rapid prototyping is 10× faster and radically cheaper. Entire codebases can be rebuilt in days—a shift that converts many architecture decisions from irreversible “one-way doors” into reversible “two-way doors.” The result: startups can afford to explore 20 proofs of concept, discard 18, and double down on the two that resonate.

6 Product Management Becomes the New Bottleneck

When engineering accelerates, the slowest link becomes deciding what to build. Ng’s teams now experiment with PM-to-engineer ratios as high as 2 PMs per 1 engineer. Tactics for faster feedback range from gut checks and coffee-shop usability tests to 100-user beta cohorts and AB tests—each slower but richer in insight than the last. Crucially, teams should use every data point not just to pick a variant but to sharpen their intuition for the next cycle.

7 Everyone Should Learn to Code—Yes, Everyone

Far from replacing programmers, AI lowers the barrier to software creation. Ng’s CFO, recruiters, and even front-desk staff all write code; each role levels up by automating its own drudgery. The deeper you can “tell a computer exactly what you want,” the more leverage you unlock—regardless of your title.

8 Stay Current or Chase Dead Ends

AI is moving so quickly that a half-generation lag in tools can cost months. Knowing when to fine-tune versus prompt, when to swap models, or how to mix RAG, guardrails, and evals often spells the difference between a weekend fix and a three-month rabbit hole. Continuous learning—through courses, experimentation, and open-source engagement—remains a decisive speed advantage.


Bottom line: In the age of agentic AI, competitive moats are built around execution velocity, not proprietary algorithms alone. Concrete ideas, lightning-fast prototypes, disciplined feedback loops, and a culture where everyone codes form the core playbook Andrew Ng uses to spin up successful AI startups today.

Qwen3-235B-A22B-Instruct-2507: Alibaba’s New Open-Weight Flagship Redefines Efficient Megamodels

 When the Qwen team hit “post” on X announcing Qwen3-235B-A22B-Instruct-2507—plus a lightweight FP8 variant—the tweet felt less like routine release notes and more like a thunderclap across AI Twitter. The thread promised “better across the board” performance and immediate open-weights access, positioning Qwen as the most aggressive big-model vendor in the open ecosystem. 



Inside the Model

Under the hood, the new model keeps the mixture-of-experts (MoE) recipe that made earlier Qwen3 builds special: 128 experts, but only 8 fire on each forward pass, so just 22 B parameters are active even though the full network tops out at 235 B. That efficiency allows 256 K tokens of native context and enables consumer-grade deployments that once demanded datacenter GPUs. 
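The routing idea behind that efficiency is easy to sketch. The toy below is illustrative only (random numbers stand in for a learned router's logits); the 235 B total / 22 B active figures come from the release.

```python
# Toy illustration of top-k mixture-of-experts routing: 128 experts, only 8 active per token.
# Router scores are illustrative random numbers, not real model logits.
import random

NUM_EXPERTS, TOP_K = 128, 8
PARAMS_TOTAL_B, PARAMS_ACTIVE_B = 235, 22  # figures from the release

def route(token_scores):
    # Select the k experts with the highest router score for this token.
    ranked = sorted(range(NUM_EXPERTS), key=lambda i: token_scores[i], reverse=True)
    return ranked[:TOP_K]

random.seed(0)
scores = [random.random() for _ in range(NUM_EXPERTS)]
active = route(scores)

print(len(active), "experts active;",
      f"{PARAMS_ACTIVE_B / PARAMS_TOTAL_B:.0%} of weights used per token")
```

Only the selected experts run a forward pass, which is why the compute cost tracks the 22 B active parameters rather than the 235 B total.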

Benchmark Shockwaves

Numbers published with the release show why the community’s jaw dropped. On the notoriously tricky ARC-AGI benchmark, Qwen3-235B-A22B-Instruct-2507 scores 41.8 %, eclipsing Moonshot’s freshly minted Kimi K2 by nearly 29 points and edging ahead of Claude Opus 4 in non-thinking mode. Coding (LiveCodeBench v6) jumps to 51.8 %, and reasoning tasks like AIME25 leap to 70.3 %. In most rows of the evaluation table, the new Qwen flagship sits comfortably ahead of DeepSeek-V3, o3-mini, and OpenAI’s o1 reference. 

Why an FP8 Build Matters

Alongside the bf16 release, Alibaba published a fully FP8-quantised version. Dropping to eight-bit floats slashes VRAM by roughly 40 % while preserving accuracy, paving the way for single-GPU inference or even multi-GPU laptop rigs. Apache-2.0 licensing means startups can bake the FP8 weights directly into commercial products without costly negotiations. 
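A back-of-the-envelope calculation makes the quantisation win concrete. Note this counts weights only, where dropping from 2-byte bf16 to 1-byte FP8 halves memory; the ~40 % end-to-end figure cited above is smaller because activations and KV cache typically stay in higher precision.

```python
# Back-of-the-envelope VRAM estimate for weights only (activations and KV cache excluded).
# 235 B parameters at bf16 (2 bytes/param) vs FP8 (1 byte/param).

def weight_gib(n_params_b: float, bytes_per_param: float) -> float:
    return n_params_b * 1e9 * bytes_per_param / 1024**3

bf16 = weight_gib(235, 2)
fp8 = weight_gib(235, 1)
saving = 1 - fp8 / bf16
print(f"bf16 ~ {bf16:.0f} GiB, fp8 ~ {fp8:.0f} GiB, weights-only saving {saving:.0%}")
```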

Community Reception: K2 Who?

Reddit’s r/singularity lit up within minutes: “Kimi K2 is already irrelevant,” read the top-voted post, linking to the Qwen tweet and highlighting that the model is 4.2× smaller in total size yet posts a broader win rate. Analysts on Interconnects echoed the sentiment, framing the drop as part of a summer in which Chinese labs “continue to dominate” the open-weight leaderboard and openly court Western builders. 

Beyond Benchmarks: Agentic DNA

Qwen3’s team stresses that the instruct model is tuned for tool-calling and agent workflows. The official model card shows code snippets for integrating with Qwen-Agent and MCP config files, underscoring Alibaba’s push toward practical automation at 262 K-token scale—think mega-docs, legal contracts or multi-day chat histories without windowing hacks. 
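A tool-calling loop of the kind such agent stacks wire around the model can be sketched generically. The JSON shape and the `word_count` tool below are invented for illustration; they are not Qwen-Agent's actual API, whose snippets live in the official model card.

```python
# Generic tool-calling dispatch loop of the kind agentic stacks build around such models.
# The model's structured call is hard-coded here for illustration; a real system would
# parse it out of the model's response.
import json

TOOLS = {
    "word_count": lambda text: len(text.split()),
}

def dispatch(tool_call_json: str):
    call = json.loads(tool_call_json)
    # Look up the named tool and invoke it with the model-supplied arguments.
    return TOOLS[call["name"]](**call["arguments"])

model_output = '{"name": "word_count", "arguments": {"text": "long legal contract"}}'
result = dispatch(model_output)
```

The dispatch result is fed back to the model as a tool message, and the loop repeats until the model answers directly.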

Why It Matters

Qwen3-235B-A22B-Instruct-2507 sets a new bar for “open yet frontier-grade.” By decoupling “thinking” and “non-thinking” modes into separate models, Alibaba embraced community feedback while sidestepping latency complaints. The result is a release that:

  • outperforms larger proprietary models on knowledge, reasoning, and multilingual tests;

  • ships under a permissive license;

  • arrives in both bf16 and FP8 flavors for hobbyists and enterprises alike;

  • proves that giant MoEs can be resource-friendly—and, crucially, available today.

For AI enthusiasts and builders, the message is clear: grab the weights, spin up your agent stack, and see how far 22 B active parameters can take you. The open-source race just found a new pacesetter.

Gemini “Deep Think” Hits Gold-Medal Performance at the International Mathematical Olympiad

 

From Silver to Gold in Twelve Months

Last year, DeepMind’s AlphaGeometry and AlphaProof systems collectively solved four of six IMO problems, earning a silver-medal equivalent. In July 2025 the research team leap-frogged that result: an advanced version of Gemini running in “Deep Think” mode solved five of six problems for 35 points—crossing the 2025 gold-medal threshold and setting a new AI milestone.

International coordinators graded Gemini’s written solutions using the same rubric applied to student competitors. According to IMO President Gregor Dolinar, the proofs were “clear, precise, and, in several cases, easy to follow”.


What Makes Deep Think Different?

| Technique | Purpose | Impact on performance |
| --- | --- | --- |
| Parallel Thinking | Explores multiple proof avenues simultaneously, then merges the strongest ideas. | Avoids dead-end, single-thread chains of thought. |
| Reinforcement-Learning Fine-Tune | Trains on curated theorem-proving and problem-solving data with reward signals for conciseness and rigor. | Raises success rate on multi-step reasoning challenges. |
| High-Quality Solution Corpus | Ingests expertly written IMO proofs plus heuristic “tips & tricks.” | Gives the model stylistic and structural templates for clearer presentation. |

These upgrades let Gemini run longer “scratch-pads” internally while staying within a feasible compute budget—no multi-day cluster runs were required, unlike earlier systems.
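The branch-and-merge pattern behind parallel thinking can be sketched schematically. `attempt` and `score` are placeholders; a real system would call the model for each branch and a verifier for scoring.

```python
# Schematic of "parallel thinking": explore several proof attempts, score them,
# then merge the strongest. `attempt` and `score` are stand-ins for model calls
# and a rigor/completeness verifier.

def attempt(problem: str, strategy: str) -> str:
    return f"proof of {problem} via {strategy}"

def score(candidate: str) -> int:
    return len(candidate)  # placeholder scoring heuristic

def parallel_think(problem: str, strategies: list[str], keep: int = 2) -> str:
    branches = [attempt(problem, s) for s in strategies]   # explore in parallel
    best = sorted(branches, key=score, reverse=True)[:keep]
    return " + ".join(best)                                # merge the strongest ideas

merged = parallel_think("IMO P4", ["induction", "contradiction", "invariant"])
```

Unlike a single chain of thought, a dead-end branch here costs one candidate, not the whole attempt.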

Benchmark Significance

  • 35 / 42 points → comparable to a top-25-percent human gold medalist.

  • Perfect scores on five problems; only one combinatorics task eluded the model.

  • Order-of-magnitude speed-up vs. AlphaGeometry 2 + AlphaProof, which needed days of inference in 2024.

While specialized theorem solvers have mastered narrow domains, Gemini Deep Think is a general LLM—capable of chat, code, and multimodal tasks—now showing elite mathematical reasoning.


Broader Implications

  1. Curriculum Design for AI
    Gemini’s success underscores the value of domain-targeted reinforcement learning on top of large-scale pre-training.

  2. Parallel Thinking as a New Primitive
    Instead of a single “chain of thought,” future models may default to branch-and-merge reasoning, akin to how human teams brainstorm proofs.

  3. Human–AI Collaboration
    DeepMind notes the technique could become a “proof assistant” for mathematicians—surfacing lemmas or counter-examples at gold-medal quality within minutes.

  4. Educational Outreach
    Publishing the solutions provides a free study resource for aspiring IMO contestants and teachers, potentially leveling the global playing field.


Limitations & Next Steps

  • Interpretability: Despite clearer written proofs, the internal decision tree remains opaque—researchers are now probing why certain branches survive the merge.

  • Generalization: Performance on under-represented areas (e.g., functional equations) still lags; future training will widen topic coverage.

  • Trust & Verification: Formal proof checkers like Lean are being integrated to machine-verify each Gemini output before publication.

DeepMind plans to open selected Deep Think capabilities via its Gemini API later this year, with safeguards to prevent misuse in academic competitions.


Key Takeaway

Gemini Deep Think’s gold-medal performance doesn’t just raise the bar for AI mathematics—it redefines what general-purpose language models can achieve when armed with structured parallel reasoning and tailored RL training. The achievement brings researchers a step closer to AI systems that can tackle longstanding open problems and act as partner mathematicians rather than mere calculators.

ParaStudent teaches a 7-B LLM to “struggle” like a freshman coder

Large language models ace coding contests, but they rarely mimic the process of bumbling through a CS-101 assignment. With ParaStudent, Mihran Miroyan and colleagues at UC Berkeley show how to make an LLM act less like Stack Overflow and more like a sleep-deprived undergrad. The team fine-tuned Qwen2.5-Coder-7B on 60,000 timestamped submissions from four semesters of an introductory Python course, then built an evaluation suite that scores outputs on semantics, functional correctness, and style.

Why “student-like” code matters

Personalised tutoring agents, auto-graders and curriculum-design tools need more than perfect solutions; they must anticipate syntax errors, awkward variable names and half-fixed bugs so they can give pedagogically useful feedback. Synthetic data that faithfully captures those quirks could unblock privacy-constrained research or bootstrap new courses with thin enrolment.

Three pillars of ParaStudent

| Component | What it does |
| --- | --- |
| Fine-tuned model (qwen-student) | Learns error patterns, verbose style and incremental edits by ingesting full submission streams. |
| Low- vs high-resolution tests | Snapshot evaluation (first/middle/final attempt) and frame-by-frame trajectory tracking reveal where models drift from real learners. |
| Multi-dimensional metrics | Combines code-embedding distance, unit-test pass rate, AST edit distance and style vectors to judge realism beyond “does it run?”. |
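Two of those signals are easy to approximate in a few lines. The sketch below uses difflib similarity as a crude stand-in for AST edit distance, plus a toy `mx` exercise with hand-written test cases; none of this is the paper's actual implementation.

```python
# Illustrative versions of two ParaStudent-style signals: functional pass rate and
# edit size between consecutive submissions (difflib stands in for AST edit distance).
import difflib

def pass_rate(func, cases):
    return sum(func(x) == y for x, y in cases) / len(cases)

def edit_similarity(code_a: str, code_b: str) -> float:
    return difflib.SequenceMatcher(None, code_a, code_b).ratio()

attempt1 = "def mx(a,b):\n    return a\n"                 # buggy first try
attempt2 = "def mx(a,b):\n    return a if a>b else b\n"   # small incremental fix

# A small edit yields high similarity between attempts -- the "messy middle".
sim = edit_similarity(attempt1, attempt2)

ns = {}
exec(attempt2, ns)  # run the student's final attempt against unit tests
rate = pass_rate(lambda x: ns["mx"](*x), [((1, 2), 2), ((5, 3), 5)])
```

A real-student trajectory shows many high-similarity steps before `rate` reaches 1.0; one-shot models jump straight there.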

Key results

  • Closer trajectories. In the shared feature space Φ, qwen-student’s path hugs the real-student curve; GPT-4.1 and instruction-tuned Qwen jump straight from buggy to perfect, skipping the messy middle.

  • More human errors. Fine-tuning boosts coverage of common novice mistakes (off-by-one, misuse of max, stray print) by 2-3× versus prompting alone.

  • Style diversity. Edit-distance plots show qwen-student makes smaller, more frequent fixes, mirroring midnight-crunch behaviour, while GPT-4.1 rewrites whole files in one sweep.

  • Open & lightweight. Training ran on a single A100; code and evaluation scripts are on GitHub.

Take-aways for ed-tech builders

  1. Fine-tune, don’t prompt. Prompt-only models default to polished, one-shot answers—great for Stack Overflow, bad for teaching loops.

  2. Grade more than tests. Functional pass rate alone misses stylistic growth; ParaStudent’s metrics catch whether a learner’s code looks like a novice even when it finally works.

  3. Synthetic data is feasible. A 7 B open model can generate realistic class-size corpora without enterprise GPUs or proprietary APIs.

The authors release all data processing pipelines under a permissive licence, inviting researchers to port the approach to other languages or higher-level courses. Next on the roadmap: privacy-preserving fine-tuning and fully autoregressive “semester simulators” that could stress-test tutoring agents before they ever meet a real student.

Paper link: arXiv 2507.12674 (PDF)

WebShaper turns data generation for web agents into a set-theory science

 LLM-powered web agents nibble at problems once reserved for human researchers, but they’re starving for the one thing that matters—clean, diverse question-answer trajectories. Most teams still scrape pages first and dream up queries later, a workflow that tangles reasoning paths and spawns hallucinated answers. Alibaba’s Tongyi Lab says it has a better recipe: WebShaper, a “formalization-driven” data factory that starts with mathematics, not HTML. 

From ad-hoc scraping to knowledge projections

At the heart of WebShaper is a set-theoretic vocabulary called Knowledge Projections (KP): each KP is the set of entities linked by a single relation (bornIn, playsFor, etc.). Two operations—union and intersection—let the authors compose arbitrarily deep queries and guarantee that every synthetic problem has a fully specified reasoning graph. The formal spec acts as a skeleton; only then does an agentic “Expander” venture onto the open web to fetch evidence that satisfies each KP node. 
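Plain Python sets capture the KP idea. The entities and relations below are invented; the point is that union and intersection compose single-relation sets into multi-constraint questions with a fully determined answer set.

```python
# Toy Knowledge Projections as Python sets: each KP is the set of entities tied by
# one relation. Union and intersection compose them into harder queries.
# Entities and relations here are invented for illustration.

born_in_france = {"mbappe", "benzema", "curie"}
plays_for_madrid = {"benzema", "modric", "vinicius"}
footballers = {"mbappe", "benzema", "modric", "vinicius"}

# "French footballers who play for Madrid" = intersection of three KPs
answer = born_in_france & plays_for_madrid & footballers

# "People born in France or playing for Madrid" = union of two KPs
broader = born_in_france | plays_for_madrid
```

Because the answer set is derived from the spec, a generated question can never drift away from its ground-truth answer.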

A multi-step agent that grows harder questions

WebShaper starts with 18 k seed Q&A pairs distilled from an offline Wikipedia crawl, then pushes them through n-step expansions. At each step, the Expander retrieves fresh pages, validates candidates, and rewrites the KP tree into a tougher query—controlling complexity like a curriculum designer rather than a random crawler. 

Why it matters

  • Broader coverage – formal specs explore search patterns unconstrained by whatever a scraper happened to collect.

  • Structural consistency – answers align with the reasoning graph, slashing mismatched Q–A pairs.

  • Dial-a-difficulty – KP depth and branching let teams script “easy” or “nightmare” tasks on demand. 

State-of-the-art results with leaner data

Training a 72 B agent on the new dataset catapulted WebShaper-72B to 60.2 % on GAIA’s information-seeking subset, beating Claude-Sonnet, GPT-4.1 and Gemini 2.5 Pro when all models shared the same two browsing tools. Even the 32 B version tops WebDancer and SimpleDR. 

| Model | GAIA ↑ | Notes |
| --- | --- | --- |
| WebShaper-72B | 60.2 % | new SOTA |
| Claude-Sonnet * | 58.3 % | proprietary |
| WebShaper-32B | 55.4 % | open |
| WebSailor | 55.3 % | open |
| GPT-4.1 * | 48.5 % | proprietary |

* scores reported using the same browsing APIs

Because the formal spec eliminates redundant retrieval, WebShaper needs ~42 % of the tokens consumed by earlier pipelines such as WebDancer, yet still outperforms them on WebWalkerQA. 

Open kits for builders

All resources are public:

  • Dataset: on Hugging Face and ModelScope

  • Code: GitHub/Alibaba-NLP/WebAgent, including the Expander scripts

  • Checkpoints: 32 B & 72 B SFT models ready for RL fine-tuning 

The bigger picture

WebShaper reframes web-agent training as data geometry rather than brute-force scraping. By baking reasoning patterns into the data itself, it closes the loop between question design and answer verification—an approach that could spill over into multi-hop RAG, legal search and even agentic code auditors. The message is simple: if you can formalize the hunt, you can synthesize the bounty.

Paper link: arXiv 2507.15061 (PDF)

Archer shows “smart” RL beats brute force for small-scale reasoning models

 Modern RLVR post-training treats every output token the same, even though factual snippets (“Euler’s number is …”) and logical connectors (“therefore …”) serve wildly different purposes. Enter Archer, short for Adaptive Entropy-Aware RLVR, a new technique that groups tokens by entropy and then trains them under dual constraints:

  • Knowledge tokens (low entropy): strong KL regularization + tight PPO clip to preserve facts.

  • Reasoning tokens (high entropy): weaker KL + looser clip to encourage exploration and richer chains of thought. 

Crucially, the update is synchronous—no gradient masking or asynchronous passes that risk breaking sentence-level dependencies.
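The dual-constraint split can be sketched as follows. The entropy threshold and both sets of hyperparameters below are illustrative placeholders, not the paper's tuned values.

```python
# Sketch of Archer's token split: low-entropy "knowledge" tokens get a strong KL pull
# and a tight PPO clip; high-entropy "reasoning" tokens get a weak KL and a loose clip.
# Threshold and hyperparameter values are illustrative, not the paper's.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def constraints_for(token_entropy, threshold=1.0):
    if token_entropy < threshold:                      # knowledge token
        return {"kl_weight": 0.01, "clip": 0.1}        # strong KL, tight clip
    return {"kl_weight": 0.001, "clip": 0.3}           # weak KL, loose clip

fact_token = entropy([0.97, 0.01, 0.01, 0.01])  # model nearly certain -> low entropy
connector = entropy([0.3, 0.3, 0.2, 0.2])       # many plausible next tokens -> high

c_fact = constraints_for(fact_token)
c_conn = constraints_for(connector)
```

Both groups are updated in the same synchronous pass; only the per-token constraints differ.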


Fewer GPUs, bigger gains

On a single H800 slice, Archer fine-tunes a 1.5 B DeepSeek-R1-distilled model in one stage (520 steps, roughly 1,900 GPU-hours), yet leaps past multi-round rivals that burned 3–8× the compute. 

| Benchmark | Base (DAPO) | Archer | Δ |
| --- | --- | --- | --- |
| AIME 2024 Pass@1 | 23.5 % | 30.1 % | +6.6 |
| AIME 2025 Pass@1 | 27.6 % | 32.8 % | +5.2 |
| LiveCodeBench v5 Avg@8 | 26.0 % | 29.4 % | +3.4 |
| LiveCodeBench v6 Avg@16 | 27.6 % | 30.2 % | +2.6 |

The math-tuned variant also edges out specialist models like FastCuRL-1.5B and DeepScaleR-1.5B, while the code-tuned edition tops DeepCoder and Nemotron in head-to-head comparisons. 

Why it works

Analysis shows the dual-token policy stabilizes entropy and slashes n-gram repetition—avoiding collapse when KL is too weak and under-training when it’s too strong. Optimal KL weight (0.001) and asymmetric clip thresholds kept first-token latency low and reasoning diversity high. 


Why it matters

  • Smarter, not bigger: Archer turns a lightweight 1.5 B checkpoint into a math-and-code contender without billions of extra tokens or exotic reward models.

  • Template-free recipe: Any PPO-style RLVR loop can drop in the entropy classifier and dual constraints.

  • Open & ready: Code and configs are live on GitHub (wizard-III/ArcherCodeR), so teams can replicate the gains on their own domains today. 

As LLM builders hunt for cheaper paths to robust reasoning, Archer’s “treat knowledge gently, push reasoning hard” mantra may become standard practice—especially for edge-sized models that can’t afford brute-force scaling.

Paper link: arXiv 2507.15778 (PDF)

Mono-InternVL-1.5 makes monolithic multimodal LLMs cheap (and fast) enough for real workloads

 Modular multimodal models bolt a vision encoder onto a language model—simple but memory-hungry. Monolithic MLLMs promise sleeker deployment by folding both roles into one network, yet they struggle with catastrophic forgetting and GPU burn. Mono-InternVL-1.5—unveiled this week by OpenGVLab, Shanghai AI Lab and Tsinghua collaborators—takes a big step toward solving both problems.

How they rebuilt the brain

  • Standalone visual parameter space. Instead of retraining the whole LLM, the team delta-tunes a fresh set of visual parameters—packed as a multimodal Mixture-of-Experts—so language weights stay frozen and stable.

  • EViP → EViP++. Their Endogenous Visual Pre-training pipeline now adds visual-attention experts and a progressive schedule that learns from noisy web data without wiping language skills.

  • Fused CUDA kernel for MoE inference. A custom kernel collapses expert routing into one GPU call, trimming real-time latency.
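Delta-tuning with frozen language weights reduces to masking which parameters receive updates. This is a conceptual sketch only: plain dicts stand in for tensors, and the parameter names are invented.

```python
# Conceptual sketch of delta-tuning: language weights stay frozen; only the new
# visual-expert parameters receive gradient updates. Dicts stand in for tensors.

params = {
    "lm.attn.w": 1.0, "lm.mlp.w": 2.0,                        # frozen language weights
    "visual_expert.attn.w": 0.0, "visual_expert.mlp.w": 0.0,  # new, trainable
}
trainable = {k for k in params if k.startswith("visual_expert")}

def sgd_step(params, grads, lr=0.1):
    for name, g in grads.items():
        if name in trainable:          # gradients to frozen weights are discarded
            params[name] -= lr * g
    return params

grads = {name: 1.0 for name in params}  # pretend every weight received a gradient
params = sgd_step(params, grads)
```

Because the language weights never move, text-only quality cannot regress during visual pre-training.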

Numbers that matter

| Metric | Mono-InternVL | Mono-InternVL-1.5 | Δ |
| --- | --- | --- | --- |
| Pre-training data | 1.1 B tokens | 0.5 B tokens | −58 % |
| Inference speed | 61 tok/s | 77 tok/s | +26 % |
| VQA Bench | 70.1 | 70.4 | +0.3 |
| MLLM Bench | 53.7 | 55.6 | +1.9 |

Across 15 public benchmarks the older Mono-InternVL already led on 12; the new model keeps that edge while slashing first-token latency by up to 69 % against the modular InternVL-1.5 baseline. It even lands a headline-grabbing +114-point jump over Emu-3 on OCRBench.

Why it matters

  1. Design simplicity meets deployment thrift. One model now sees and talks without an external vision tower, fits in fewer VRAM GBs, and spools responses faster—handy for edge boxes or consumer GPUs.

  2. Delta-tuning shows its muscle. Freezing language weights while grafting “visual experts” offers a clean recipe other labs can copy to preserve text quality.

  3. Open weights, real code. Checkpoints, the fused CUDA kernel and training scripts are live on GitHub, inviting startups to fine-tune for retail search, doc-QA or AR glasses.

Mono-InternVL-1.5 won’t end the debate between modular and monolithic designs, but it proves you don’t need billion-token budgets or exotic hardware to get state-of-the-art multimodal accuracy—and you might even gain a few milliseconds back for the user.

Paper link: arXiv 2507.12566 (PDF)

21.7.25

Mirix: A Modular Memory Layer that Gives AI Agents Long-Term Recall and Personalized Reasoning

 

1 | Why “Memory” Is the Next AI Bottleneck

Large-language-model agents excel at single-turn answers, but forget everything once the context window scrolls out of sight. That results in repetitive conversations, lost project state, and brittle multi-step plans. Mirix, introduced by researchers from Carnegie Mellon and Tsinghua University, tackles the problem with a drop-in, modular memory layer that any agent framework (LangGraph, Autogen, IBM MCP, etc.) can call.


2 | How Mirix Works under the Hood

| Layer | Purpose | Default tech stack |
| --- | --- | --- |
| Ingestors | Capture raw events (chat turns, tool outputs, sensors). | Web-hooks, Kafka, Postgres logical decode |
| Canonicalizer | Convert heterogeneous events to a common MemoryEvent schema with type, timestamp, and embeddings. | Pydantic, OpenAI embeddings-3-small |
| Memory Stores | Pluggable persistence engines; ships with VectorDB (FAISS / Milvus), Knowledge Graph (Neo4j), Document Store (Weaviate hybrid). | Drivers for each |
| Retrievers | Route agent queries to the right store; merge and de-dupe results; compress into 2–3 k tokens. | Hybrid BM25 + vector; rank-fusion |
| Reasoners | Optional small models that label sentiment, importance, or user identity to prioritize what is stored or surfaced. | DistilRoBERTa sentiment, MiniLM ranker |

Key insight: memory need not live in a single DB; Mirix treats it as an orchestrated ensemble of stores, each optimised for a particular signal (facts vs. tasks vs. social cues).
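A Retriever's rank-fusion step might look like the reciprocal-rank-fusion sketch below. The document IDs are invented, and RRF is one common choice; Mirix's exact fusion formula isn't specified beyond "rank-fusion".

```python
# Minimal reciprocal-rank-fusion sketch for merging BM25 and vector rankings,
# as a Mirix-style Retriever might. Document IDs and rankings are invented.

def rrf(rankings, k=60):
    scores = {}
    for ranked in rankings:
        for pos, doc in enumerate(ranked):
            # Each list contributes 1/(k + rank); documents high in both lists win.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_migration", "doc_meeting", "doc_faq"]
vector_hits = ["doc_migration", "doc_faq", "doc_roadmap"]

merged = rrf([bm25_hits, vector_hits])
```

The merged list is then de-duplicated and compressed into the 2–3 k-token context slice the agent actually sees.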

3 | What It Enables

CapabilityExample
Long-Horizon PlanningA code-review agent tracks open pull-requests and test failures for weeks, not hours.
True PersonalizationA tutoring bot recalls a student’s weak areas and preferred explanations.
Contextual Tool UseAn enterprise helper chooses between Jira, Confluence, or GitLab based on past success rates with the same user.

Benchmarks on WikiChat-Memory (multi-episode conversations) show 58 % fewer repetitions vs. vanilla RAG and 3.4× higher success on 15-step task chains.

4 | Plugging Mirix into an Existing Agent


from mirix.memory import MemoryClient
from agentic import Agent

mem = MemoryClient(
    stores=[
        "faiss://embeddings",
        "neo4j://graph",
        "weaviate://docs",
    ]
)

agent = Agent(llm="mistral-small-3.2", memory=mem)
response = agent.chat("Where did we leave the migration script last week?")
print(response)

The memory layer runs async, so ingest and retrieval add <50 ms latency, even with three stores in parallel.


5 | Governance & Cost Controls

  • Policy Filters: PII redaction rules determine what is persisted.

  • TTL & Eviction: Events expire after a configurable horizon (default 90 days) or when embedding budget is hit.

  • Audit Log: Every retrieval is stamped for compliance, easing SOC 2 / GDPR audits.
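The TTL policy reduces to a simple age filter. The 90-day horizon is the stated default; the event shape and timestamps below are synthetic.

```python
# Toy TTL eviction as described above: events older than the horizon are dropped.
# 90 days is the stated default; the event records here are synthetic.
from datetime import datetime, timedelta

def evict(events, now, ttl=timedelta(days=90)):
    return [e for e in events if now - e["ts"] <= ttl]

now = datetime(2025, 7, 21)
events = [
    {"id": 1, "ts": now - timedelta(days=10)},   # recent -> kept
    {"id": 2, "ts": now - timedelta(days=120)},  # past the horizon -> evicted
]
kept = evict(events, now)
```

In practice the same filter would run alongside the embedding-budget check, whichever limit is hit first.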


6 | Limitations & Roadmap

  • Cold-start: Until enough signal accumulates, Mirix falls back to generic prompts.

  • Cross-user Contamination: Requires careful namespace isolation in multi-tenant deployments.

  • Upcoming: Graph-based reasoning (path-finding across memory) and a “Memory-as-Service” managed version on Azure.


Final Takeaway

Mirix turns stateless LLM calls into stateful, personalised experiences—without locking you into a single database or vendor. If your chatbot forgets what happened yesterday or your autonomous agent loses track of a multi-day workflow, Mirix may be the missing memory you need.

The rise of Context Engineering: why LLM performance now lives and dies on what you feed it

 Prompt tricks and vector databases used to feel like nice-to-have extras for chatbots. A sprawling new study argues they have matured into a discipline of their own. Titled “A Survey of Context Engineering for Large Language Models,” the 165-page report from the Chinese Academy of Sciences, UC Merced and seven other universities positions context selection, shaping and storage as the primary lever for squeezing more capability out of ever-larger models. The team sifted through 1,400-plus research papers to build the first comprehensive roadmap of the space.

From prompt hacks to a three-pillar stack

The authors split Context Engineering into three foundational components:

  1. Context retrieval & generation – everything from classic prompt templates to dynamic external-knowledge acquisition.

  2. Context processing – long-sequence handling, self-refinement loops and multimodal or structured context fusion.

  3. Context management – memory hierarchies, compression schemes and token-budget optimisation.

These pillars support four dominant system archetypes: Retrieval-Augmented Generation (RAG), long-lived memory agents, tool-integrated reasoning (function calling, code execution) and fully fledged multi-agent frameworks.
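One concrete context-management primitive is trimming history to a token budget, newest-first. This is a crude stand-in for the compression schemes the survey catalogues; whitespace word counts substitute for real tokenisation.

```python
# Sketch of a context-management primitive: keep the most recent messages that fit
# a token budget. Word counts stand in for a real tokenizer.

def fit_to_budget(messages, budget_tokens, count=lambda m: len(m.split())):
    kept, used = [], 0
    for msg in reversed(messages):       # walk newest-first; oldest dropped first
        cost = count(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = ["old question about setup", "long tool output " * 5, "latest user question"]
window = fit_to_budget(history, budget_tokens=20)
```

Real systems layer summarisation or embedding-based recall on top, so dropped turns remain reachable instead of simply vanishing.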

Why the stakes keep rising

  • Bigger models, harsher limits. Even GPT-class contexts choke on enterprise-scale corpora; smarter pruning and compression decide whether answers stay on-topic or derail.

  • Agents need persistence. As LLM agents stretch across hours or days, hierarchical memory and context-refresh policies become as critical as the policy network itself.

  • Tool use explodes token demand. Function calls and code snippets are powerful but verbose; context engineering keeps them from crowding out the original question.

A looming research gap

Despite dramatic gains in understanding long and complex contexts, models remain weak at generating equally long, logically coherent outputs—a mismatch the survey brands the field’s “defining priority for future research.”

Practical takeaways for builders

  • Treat context like a first-class system resource—budget, cache and monitor it the way you would GPU memory.

  • Mix retrieval styles. Hybrid pipelines (keyword, dense, graph) outperform single-method RAG on complex queries.

  • Plan for multi-layer memory. Short-term windows, episodic buffers and long-term stores each have distinct TTLs and compression trade-offs.

Published July 17 2025 with an accompanying GitHub “awesome list,” the survey is already circulating among infra and agent teams looking to squeeze more mileage out of existing checkpoints before the next trillion-parameter beast lands.

Paper link: arXiv 2507.13334 (PDF)

 Anthropic has expanded Claude Sonnet 4’s context window to a full 1,000,000 tokens, a five-fold jump that shifts what teams can do in a sin...