
1.9.25

RAG needs better tests, not just better metrics—Amadeus ships a privacy-first data generator

Retrieval-augmented generation (RAG) is everywhere, but most teams still grade it on shaky ground: ad-hoc question sets that don't reflect real-world variety or privacy constraints. A new paper from Amadeus lays out a pragmatic fix: a multi-agent framework that synthesizes diverse, privacy-safe QA datasets specifically for evaluating RAG systems. The system consistently beats common synthetic baselines on diversity while delivering robust PII masking, a requirement that's fast becoming table stakes under regimes like the EU AI Act.

How the pipeline works

The framework splits the job across three agents, orchestrated with LangGraph and Azure OpenAI:

  • Diversity Agent – clusters source docs with embeddings and picks representative spans to maximize topical coverage.

  • Privacy Agent – detects and pseudonymizes sensitive entities, emitting a structured privacy report.

  • QA Curation Agent – generates evaluation-ready QA pairs (plus a generation report) from the privacy-scrubbed text.

Under the hood: GPT-4o powers diversity and QA; GPT-4.1 handles the heavier reasoning/tooling for privacy; embeddings use text-embedding-3-small; chunking is 256 tokens with k-means for clustering. Temperatures are locked at 0 for reproducibility. 
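The three-agent flow can be sketched as plain Python, with each stub standing in for the corresponding LLM agent. The paper orchestrates real LLM calls with LangGraph and Azure OpenAI; every function name and heuristic below is illustrative, not taken from the paper's code:

```python
# Minimal sketch of the three-agent pipeline as chained functions.
# Each "agent" is a toy stand-in so the end-to-end data flow is visible.
import re

def diversity_agent(docs, k=2):
    # Stand-in for embedding + k-means: sample spread-out docs
    # to approximate "representative spans".
    step = max(1, len(docs) // k)
    return docs[::step][:k]

def privacy_agent(text):
    # Stand-in for LLM-based pseudonymization: regex-mask emails
    # and emit a small privacy report.
    masked = re.sub(r"\b[\w.]+@[\w.]+\b", "[EMAIL]", text)
    report = {"masked_entities": text.count("@")}
    return masked, report

def qa_curation_agent(text):
    # Stand-in for GPT-4o QA generation: one template QA pair per span.
    return [{"question": f"What does this passage state? {text[:40]}...",
             "context": text}]

def pipeline(docs):
    qa_pairs, reports = [], []
    for span in diversity_agent(docs):
        clean, report = privacy_agent(span)
        qa_pairs += qa_curation_agent(clean)
        reports.append(report)
    return qa_pairs, reports

qa, reports = pipeline(["Contact alice@example.com about Article 5.",
                        "The Act classifies systems by risk tier."])
```

The key design point survives the simplification: privacy scrubbing happens strictly before QA generation, so sensitive surface forms never reach the question-writing step.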

Does it actually help?

On diversity, the authors compare against (1) an evolutionary generator à la RAGAS and (2) direct prompting with GPT-4o. Using an LLM-as-a-judge (GPT-4.1) plus an embedding-based CosineSimilarity-to-Diversity metric, their sets win across sizes—with judge scores climbing from 7.8 → 9.0 as sample counts scale from 10 → 100, and cosine-similarity trending toward zero (more semantic spread). They use the EU AI Act as a challenging, high-variety testbed. 
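The embedding-based half of that evaluation is easy to reproduce in miniature: average the pairwise cosine similarity over question embeddings, where values near zero indicate wide semantic spread. The toy vectors below stand in for text-embedding-3-small outputs:

```python
import numpy as np

def mean_pairwise_cosine(embs):
    # Normalize rows, then average the off-diagonal cosine similarities.
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = e @ e.T
    n = len(e)
    return (sim.sum() - n) / (n * (n - 1))

spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # diverse set
bunched = np.array([[1.0, 0.1], [1.0, 0.0], [0.9, 0.1]])  # near-duplicates

# Lower mean similarity -> more semantic spread.
d_spread, d_bunched = mean_pairwise_cosine(spread), mean_pairwise_cosine(bunched)
```

A generated question set whose score trends toward zero, as reported in the paper, is covering the corpus rather than paraphrasing one corner of it.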

On privacy, they evaluate the masking agent on three AI4Privacy suites—PHI, PWI, PII—after concatenating items into longer, domain-specific paragraphs. Label-wise accuracies typically land 0.75–0.94, with standouts like JOBTYPE 0.94, DISABILITYSTATUS 0.91, LASTNAME 0.91 and several categories at 0.86–0.90 across datasets. Translation: strong, granular masking across healthcare, workplace and generic PII. 
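The metric behind those 0.75–0.94 figures is label-wise accuracy: for each PII label, the fraction of gold entities that no longer survive in the masked text. A minimal sketch, with synthetic examples rather than actual AI4Privacy items:

```python
from collections import defaultdict

def labelwise_accuracy(gold, masked_text):
    # gold: list of (label, surface_string) that must not survive masking.
    hits, totals = defaultdict(int), defaultdict(int)
    for label, surface in gold:
        totals[label] += 1
        if surface not in masked_text:   # entity was successfully removed
            hits[label] += 1
    return {lbl: hits[lbl] / totals[lbl] for lbl in totals}

gold = [("LASTNAME", "Silva"), ("LASTNAME", "Nkemelu"), ("JOBTYPE", "nurse")]
masked = "Patient [LASTNAME] was seen by Nkemelu, a [JOBTYPE]."
acc = labelwise_accuracy(gold, masked)   # Silva and nurse masked, Nkemelu leaked
```

Reporting per-label rather than aggregate accuracy is what makes the result auditable: a 0.91 on LASTNAME and a 0.75 on some rarer category are very different risk profiles.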

Why this matters for builders

  • Evaluation data ≫ metric tweaks. Better RAG scores start with representative questions and privacy-safe contexts, not another rubric. This pipeline produces both—and logs reports you can hand to auditors. 

  • Regulatory alignment. With the EU AI Act explicitly encouraging synthetic data in audits, a privacy-first generator isn’t just nice—it’s compliance-friendly. 

  • Drop-in ops. Clustering, masking and QA generation are modular; teams can swap models, change PII taxonomies, or point the pipeline at their own corpora. 

What’s next

The authors want tighter agent-to-agent coordination (e.g., via Model Context Protocol), adaptive PII discovery beyond static lists, and stress-tests against privacy attacks, pushing the framework toward fully auditable, enterprise-grade RAG evals.

Paper link: arXiv 2508.18929 (PDF)

28.6.25

Google AI’s Gemma 3n Brings Full Multimodal Intelligence to Low-Power Edge Devices

 

A Mobile-First Milestone

Google has released Gemma 3n, a compact multimodal language model engineered to run entirely offline on resource-constrained hardware. Unlike its larger Gemma-3 cousins, the 3n variant was rebuilt from the ground up for edge deployment, performing vision, audio, video and text reasoning on devices with as little as 2 GB of RAM.

Two Ultra-Efficient Flavors

| Variant | Activated Params* | Typical RAM | Claimed Throughput | Target Hardware |
|---|---|---|---|---|
| E2B | ≈ 2 B (per token) | 2 GB | 30 tokens/s | Entry-level phones, micro-PCs |
| E4B | ≈ 4 B | 4 GB | 50 tokens/s | Laptops, Jetson-class boards |

*Mixture-of-Experts routing keeps only a subset of the full network active, giving E2B speeds comparable to 5 B dense models and E4B performance near 8 B models.
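The RAM column lines up with a back-of-envelope weight-memory estimate, assuming roughly 4-bit (Q4_K_M-style) quantization; the bytes-per-weight figure is an assumption for illustration, not a published spec:

```python
# ~0.5 bytes per weight at 4-bit quantization, so ~2 B active
# parameters fit in roughly 1 GB of weight memory, leaving headroom
# for KV cache and runtime within the 2 GB E2B budget.
def active_weight_gb(active_params_b, bits_per_weight=4):
    return active_params_b * 1e9 * bits_per_weight / 8 / 1e9

e2b = active_weight_gb(2)   # ~1.0 GB of weights
e4b = active_weight_gb(4)   # ~2.0 GB of weights
```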

Key Technical Highlights

  • Native Multimodality – Single checkpoint accepts combined image, audio, video and text inputs and produces grounded text output.

  • Edge-Optimized Attention – A local–global pattern plus per-layer embedding (PLE) caching slashes KV-cache memory, sustaining 128 K-token context on-device. 

  • Low-Precision Friendly – Ships with Q4_K_M quantization recipes and TensorFlow Lite / MediaPipe build targets for Android, iOS, and Linux SBCs.

  • Privacy & Latency – All computation stays on the device, eliminating round-trip delays and cloud-data exposure—critical for regulated or offline scenarios.
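The attention bullet above can be made concrete with a rough KV-cache estimate; every dimension below (layer count, KV heads, head size, window length) is an illustrative assumption rather than Gemma 3n's published config:

```python
# KV-cache memory: 2 tensors (K and V) per layer, fp16 by default.
def kv_cache_gb(ctx, n_layers, n_kv_heads, head_dim, bytes_per=2):
    return 2 * ctx * n_layers * n_kv_heads * head_dim * bytes_per / 1e9

full = kv_cache_gb(131072, 32, 8, 128)   # global attention in every layer
local = kv_cache_gb(4096, 32, 8, 128)    # layers capped at a 4 K sliding window
```

At a naive full-context cache, 128 K tokens would need tens of gigabytes, which is why mixing in local-attention layers and PLE caching is what makes long context feasible on phone-class RAM.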

Early Benchmarks

| Task | 3n-E2B | 3n-E4B | Gemma 3-4B-IT | Llama-3-8B-Instruct |
|---|---|---|---|---|
| MMLU (few-shot) | 60.1 | 66.7 | 65.4 | 68.9 |
| VQAv2 (zero-shot) | 57.8 | 61.2 | 60.7 | 58.3 |
| AudioQS (ASR, WER ↓) | 14.3 | 11.6 | 12.9 | 17.4 |

Despite the tiny footprint, Gemma 3n matches or outperforms many 4-8 B dense models across language, vision and audio tasks. 

Developer Experience

  • Open Weights (Apache 2.0) – Available on Hugging Face, Google AI Studio and Android AICore.

  • Gemma CLI & Vertex AI – Same tooling as larger Gemma 3 models; drop-in replacement for cloud calls when bandwidth or privacy is a concern.

  • Reference Apps – Google has published demos for offline voice assistants, real-time captioning, and hybrid AR experiences that blend live camera frames with text-based reasoning. 

Why It Matters

  1. Unlocks Edge-First Use Cases – Wearables, drones, smart-home hubs and industrial sensors can now run frontier-level AI without the cloud.

  2. Reduces Cost & Carbon – Fewer server cycles and no data egress fees make deployments cheaper and greener.

  3. Strengthens Privacy – Keeping raw sensor data on-device helps meet GDPR, HIPAA and other compliance regimes.

Looking Ahead

Google hints that Gemma 3n is just the first in a “nano-stack” of forthcoming sub-5 B multimodal releases built to scale from Raspberry Pi boards to flagship smartphones. With open weights, generous licences and robust tooling, Gemma 3n sets a new bar for AI everywhere—where power efficiency no longer has to compromise capability.
