22.7.25

Mono-InternVL-1.5 makes monolithic multimodal LLMs cheap (and fast) enough for real workloads

 Modular multimodal models bolt a vision encoder onto a language model—simple but memory-hungry. Monolithic MLLMs promise sleeker deployment by folding both roles into one network, yet they struggle with catastrophic forgetting and GPU burn. Mono-InternVL-1.5—unveiled this week by OpenGVLab, Shanghai AI Lab and Tsinghua collaborators—takes a big step toward solving both problems.

How they rebuilt the brain

  • Standalone visual parameter space. Instead of retraining the whole LLM, the team delta-tunes a fresh set of visual parameters—packed as a multimodal Mixture-of-Experts—so language weights stay frozen and stable.

  • EViP → EViP++. Their Endogenous Visual Pre-training pipeline now adds visual-attention experts and a progressive schedule that learns from noisy web data without wiping language skills.

  • Fused CUDA kernel for MoE inference. A custom kernel collapses expert routing into one GPU call, trimming real-time latency.
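The first bullet's idea, frozen language weights next to a separate set of trainable visual parameters, can be illustrated with a toy sketch. It assumes hard routing by modality, which the summary above does not spell out, and uses one-parameter "experts". The `Expert` class, `route` function and all numbers are hypothetical illustrations, not the OpenGVLab implementation.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    """Toy one-parameter expert computing y = weight * x."""
    weight: float
    trainable: bool

    def forward(self, x: float) -> float:
        return self.weight * x

    def sgd_step(self, grad: float, lr: float = 0.1) -> None:
        # Frozen experts ignore the update, so their weights never drift.
        if self.trainable:
            self.weight -= lr * grad


def route(tokens, text_expert, visual_expert):
    """Send each (modality, value) token to the expert for its modality."""
    outputs = []
    for modality, value in tokens:
        expert = visual_expert if modality == "image" else text_expert
        outputs.append(expert.forward(value))
    return outputs


# Assumption: the text expert stands in for the frozen LLM weights,
# the visual expert for the newly added visual parameter space.
text_expert = Expert(weight=2.0, trainable=False)
visual_expert = Expert(weight=0.5, trainable=True)

tokens = [("text", 1.0), ("image", 4.0), ("text", 3.0)]
outputs = route(tokens, text_expert, visual_expert)

# One pretend training step: both experts get a gradient, only one moves.
text_expert.sgd_step(grad=1.0)
visual_expert.sgd_step(grad=1.0)
print(outputs, text_expert.weight, visual_expert.weight)
```

Because the text expert never updates, whatever it computed for text tokens before visual training is exactly what it computes afterwards. That is the property the delta-tuning recipe relies on to avoid forgetting.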

Numbers that matter

Metric | Mono-InternVL | Mono-InternVL-1.5 | Δ
Pre-training data | 1.1 B tokens | 0.5 B tokens | −58 %
Inference speed | 61 tok/s | 77 tok/s | +26 %
VQA Bench | 70.1 | 70.4 | +0.3
MLLM Bench | 53.7 | 55.6 | +1.9

Across 15 public benchmarks the older Mono-InternVL already led on 12; the new model keeps that edge while slashing first-token latency by up to 69 % against the modular InternVL-1.5 baseline. It even lands a headline-grabbing +114-point jump over Emu-3 on OCRBench.

Why it matters

  1. Design simplicity meets deployment thrift. One model now sees and talks without an external vision tower, needs less VRAM, and streams responses faster, which is handy for edge boxes or consumer GPUs.

  2. Delta-tuning shows its muscle. Freezing language weights while grafting “visual experts” offers a clean recipe other labs can copy to preserve text quality.

  3. Open weights, real code. Checkpoints, the fused CUDA kernel and training scripts are live on GitHub, inviting startups to fine-tune for retail search, doc-QA or AR glasses.

Mono-InternVL-1.5 won’t end the debate between modular and monolithic designs, but it proves you don’t need billion-token budgets or exotic hardware to get state-of-the-art multimodal accuracy—and you might even gain a few milliseconds back for the user.

Paper link: arXiv 2507.12566 (PDF)

21.7.25

Mirix: A Modular Memory Layer that Gives AI Agents Long-Term Recall and Personalized Reasoning

 

1 | Why “Memory” Is the Next AI Bottleneck

Large-language-model agents excel at single-turn answers, but forget everything once the context window scrolls out of sight. That results in repetitive conversations, lost project state, and brittle multi-step plans. Mirix, introduced by researchers from Carnegie Mellon and Tsinghua University, tackles the problem with a drop-in, modular memory layer that any agent framework (LangGraph, Autogen, IBM MCP, etc.) can call.


2 | How Mirix Works under the Hood

Layer | Purpose | Default Tech Stack
Ingestors | Capture raw events (chat turns, tool outputs, sensors). | Web-hooks, Kafka, Postgres logical decode
Canonicalizer | Convert heterogeneous events to a common MemoryEvent schema with type, timestamp, and embeddings. | Pydantic, OpenAI embeddings-3-small
Memory Stores | Pluggable persistence engines. Ship with: VectorDB (FAISS / Milvus); Knowledge Graph (Neo4j); Document Store (Weaviate hybrid). | Drivers for each
Retrievers | Route agent queries to the right store; merge and de-dupe results; compress into 2-3 k tokens. | Hybrid BM25 + vector; Rank-fusion
Reasoners | Optional small models that label sentiment, importance, or user identity to prioritize what is stored or surfaced. | DistilRoBERTa sentiment, MiniLM ranker

Key insight: memory need not live in a single DB; Mirix treats it as an orchestrated ensemble of stores, each optimised for a particular signal (facts vs. tasks vs. social cues).
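To make the Retrievers row concrete, here is a minimal sketch of merging and de-duplicating ranked hits from several stores. It uses reciprocal rank fusion (RRF), a common rank-fusion method. The table only says "Rank-fusion", so the choice of RRF, the constant k=60, the function name and the event ids are all assumptions for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=3):
    """Fuse ranked lists of event ids into one list.

    Each id earns 1 / (k + rank) per list it appears in, so an id returned
    by several stores collapses into a single, higher-scored entry.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, event_id in enumerate(ranking, start=1):
            scores[event_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [event_id for event_id, _ in ordered[:top_n]]


# Hypothetical per-store results, best hit first.
vector_hits = ["e7", "e2", "e9"]
graph_hits = ["e2", "e4"]
doc_hits = ["e2", "e7", "e5"]

merged = reciprocal_rank_fusion([vector_hits, graph_hits, doc_hits])
print(merged)  # ['e2', 'e7', 'e4']
```

RRF needs only ranks, not comparable scores, which suits an ensemble where a vector index, a graph and a document store each score results on a different scale.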

3 | What It Enables

Capability | Example
Long-Horizon Planning | A code-review agent tracks open pull-requests and test failures for weeks, not hours.
True Personalization | A tutoring bot recalls a student’s weak areas and preferred explanations.
Contextual Tool Use | An enterprise helper chooses between Jira, Confluence, or GitLab based on past success rates with the same user.

Benchmarks on WikiChat-Memory (multi-episode conversations) show 58 % fewer repetitions vs. vanilla RAG and 3.4 × higher success on 15-step task chains.

4 | Plugging Mirix into an Existing Agent


from mirix.memory import MemoryClient
from agentic import Agent

# One URI per backing store: vector index, knowledge graph, document store.
mem = MemoryClient(
    stores=[
        "faiss://embeddings",
        "neo4j://graph",
        "weaviate://docs",
    ]
)

# Attach the memory layer to the agent, then ask about earlier work.
agent = Agent(llm="mistral-small-3.2", memory=mem)
response = agent.chat("Where did we leave the migration script last week?")
print(response)

The memory layer runs async, so ingest and retrieval add <50 ms latency, even with three stores in parallel.
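The latency claim depends on querying the stores concurrently rather than one after another. The sketch below shows that pattern with `asyncio.gather`. The `query_store` coroutine and its sleep delays are hypothetical stand-ins for real store drivers, not Mirix code.

```python
import asyncio
import time


async def query_store(name: str, delay_s: float, query: str) -> list[str]:
    """Hypothetical store lookup; the sleep stands in for I/O latency."""
    await asyncio.sleep(delay_s)
    return [f"{name}:{query}"]


async def retrieve_all(query: str) -> list[str]:
    lookups = [
        query_store("faiss", 0.02, query),
        query_store("neo4j", 0.03, query),
        query_store("weaviate", 0.01, query),
    ]
    # gather runs the lookups concurrently and returns results in call order.
    per_store = await asyncio.gather(*lookups)
    return [hit for hits in per_store for hit in hits]


start = time.perf_counter()
hits = asyncio.run(retrieve_all("migration script"))
elapsed = time.perf_counter() - start
print(hits, f"{elapsed * 1000:.0f} ms")
```

With concurrent lookups the wall-clock cost tracks the slowest store rather than the sum of all three, which is why adding a store does not necessarily add latency.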


5 | Governance & Cost Controls

  • Policy Filters: PII redaction rules determine what is persisted.

  • TTL & Eviction: Events expire after a configurable horizon (default 90 days) or when the embedding budget is hit.

  • Audit Log: Every retrieval is stamped for compliance, easing SOC 2 / GDPR audits.
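The first two controls can be sketched in a few lines: redact before persisting, and drop events older than the TTL. This assumes a single email regex as the only PII rule and an in-memory list as the store. `StoredEvent`, `persist` and `evict_expired` are hypothetical names, and only the 90-day default comes from the text above.

```python
import re
from dataclasses import dataclass

# Deliberately simple email pattern; real PII policies need far more rules.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
DAY = 24 * 3600
TTL_SECONDS = 90 * DAY  # default horizon mentioned in the text


@dataclass
class StoredEvent:
    text: str
    created_at: float  # Unix seconds


def redact(text: str) -> str:
    """Replace anything that looks like an email address."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def persist(store: list, text: str, now: float) -> None:
    """Policy filter runs before the event is written."""
    store.append(StoredEvent(redact(text), now))


def evict_expired(store: list, now: float, ttl: float = TTL_SECONDS) -> list:
    """Keep only events that are no older than the TTL."""
    return [event for event in store if now - event.created_at <= ttl]


store: list = []
persist(store, "Ping ana@example.com about the rollout", now=0)
persist(store, "Tests green on main", now=60 * DAY)

live = evict_expired(store, now=100 * DAY)
print(store[0].text)
print([event.text for event in live])
```

Redacting at write time rather than at read time means the sensitive value never reaches any store, which simplifies later audits.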


6 | Limitations & Roadmap

  • Cold-start: Until enough signal accumulates, Mirix falls back to generic prompts.

  • Cross-user Contamination: Requires careful namespace isolation in multi-tenant deployments.

  • Upcoming: Graph-based reasoning (path-finding across memory) and a “Memory-as-Service” managed version on Azure.


Final Takeaway

Mirix turns stateless LLM calls into stateful, personalised experiences—without locking you into a single database or vendor. If your chatbot forgets what happened yesterday or your autonomous agent loses track of a multi-day workflow, Mirix may be the missing memory you need.

The rise of Context Engineering: why LLM performance now lives and dies on what you feed it

 Prompt tricks and vector databases used to feel like nice-to-have extras for chatbots. A sprawling new study argues they have matured into a discipline of their own. Titled “A Survey of Context Engineering for Large Language Models,” the 165-page report from the Chinese Academy of Sciences, UC Merced and seven other universities positions context selection, shaping and storage as the primary lever for squeezing more capability out of ever-larger models. The team sifted through 1,400-plus research papers to build the first comprehensive roadmap of the space.

From prompt hacks to a three-pillar stack

The authors split Context Engineering into three foundational components:

  1. Context retrieval & generation – everything from classic prompt templates to dynamic external-knowledge acquisition.

  2. Context processing – long-sequence handling, self-refinement loops and multimodal or structured context fusion.

  3. Context management – memory hierarchies, compression schemes and token-budget optimisation.

These pillars support four dominant system archetypes: Retrieval-Augmented Generation (RAG), long-lived memory agents, tool-integrated reasoning (function calling, code execution) and fully fledged multi-agent frameworks.

Why the stakes keep rising

  • Bigger models, harsher limits. Even GPT-class contexts choke on enterprise-scale corpora; smarter pruning and compression decide whether answers stay on-topic or derail.

  • Agents need persistence. As LLM agents stretch across hours or days, hierarchical memory and context-refresh policies become as critical as the policy network itself.

  • Tool use explodes token demand. Function calls and code snippets are powerful but verbose; context engineering keeps them from crowding out the original question.

A looming research gap

Despite dramatic gains in understanding long and complex contexts, models remain weak at generating equally long, logically coherent outputs—a mismatch the survey brands the field’s “defining priority for future research.”

Practical takeaways for builders

  • Treat context like a first-class system resource—budget, cache and monitor it the way you would GPU memory.

  • Mix retrieval styles. Hybrid pipelines (keyword, dense, graph) outperform single-method RAG on complex queries.

  • Plan for multi-layer memory. Short-term windows, episodic buffers and long-term stores each have distinct TTLs and compression trade-offs.
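As one way to read the first takeaway, the sketch below treats the context window as a fixed budget and greedily packs the highest-scored snippets that still fit. The word-count token estimate, the relevance scores and the snippets are illustrative assumptions. A real system would use the model's tokenizer and its own ranking signal.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def pack_context(snippets, budget):
    """Greedily keep the highest-scored snippets that fit within the budget.

    snippets: list of (relevance_score, text) pairs.
    Returns the chosen texts and the number of budget units used.
    """
    chosen, used = [], 0
    for _score, text in sorted(snippets, key=lambda item: -item[0]):
        cost = estimate_tokens(text)
        if used + cost <= budget:
            chosen.append(text)
            used += cost
    return chosen, used


# Hypothetical retrieved snippets with relevance scores.
snippets = [
    (0.9, "migration script lives in tools/migrate.py"),
    (0.4, "the team had lunch on friday"),
    (0.8, "last run failed on the users table index"),
    (0.7, "rollback plan is documented"),
]

context, used = pack_context(snippets, budget=12)
print(context, used)
```

Logging `used` against `budget` on every call is a cheap way to monitor context the same way one would monitor GPU memory.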

Published July 17 2025 with an accompanying GitHub “awesome list,” the survey is already circulating among infra and agent teams looking to squeeze more mileage out of existing checkpoints before the next trillion-parameter beast lands.

Paper link: arXiv 2507.13334 (PDF)
