Retrieval-augmented generation (RAG) is everywhere, but most teams still grade it on shaky ground: ad-hoc question sets that neither reflect real-world variety nor respect privacy constraints. A new paper from Amadeus lays out a pragmatic fix: a multi-agent framework that synthesizes diverse, privacy-preserving QA datasets specifically for evaluating RAG systems. The system consistently beats common synthetic baselines on diversity while delivering robust PII masking, a requirement that’s fast becoming table stakes under regimes like the EU AI Act.
How the pipeline works
The framework splits the job across three agents, orchestrated with LangGraph and Azure OpenAI:
- Diversity Agent – clusters source docs with embeddings and picks representative spans to maximize topical coverage.
- Privacy Agent – detects and pseudonymizes sensitive entities, emitting a structured privacy report.
- QA Curation Agent – generates evaluation-ready QA pairs (plus a generation report) from the privacy-scrubbed text. A minimal orchestration sketch follows the list.
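The paper doesn’t ship code, but the orchestration is straightforward to picture. Below is a minimal sketch, assuming a shared LangGraph state and placeholder node bodies (the real agents call Azure OpenAI models); the node names, state fields, and helper logic are illustrative, not the authors’ implementation.

```python
# Hypothetical wiring of the three agents as sequential LangGraph nodes.
# State fields and node names are assumptions for illustration only.
from typing import TypedDict, List, Dict
from langgraph.graph import StateGraph, END


class EvalState(TypedDict):
    documents: List[str]        # raw source corpus
    selected_chunks: List[str]  # Diversity Agent output
    masked_chunks: List[str]    # Privacy Agent output
    privacy_report: Dict        # structured masking audit trail
    qa_pairs: List[Dict]        # final evaluation QA pairs


def diversity_agent(state: EvalState) -> dict:
    # Real version: embed 256-token chunks, k-means-cluster, pick representatives
    # (see the clustering sketch below). Placeholder passes documents through.
    return {"selected_chunks": state["documents"]}


def privacy_agent(state: EvalState) -> dict:
    # Real version: LLM-driven PII detection + pseudonymization with a report.
    return {"masked_chunks": state["selected_chunks"], "privacy_report": {}}


def qa_curation_agent(state: EvalState) -> dict:
    # Real version: generate QA pairs (and a generation report) from masked text.
    return {"qa_pairs": []}


graph = StateGraph(EvalState)
graph.add_node("diversity", diversity_agent)
graph.add_node("privacy", privacy_agent)
graph.add_node("qa_curation", qa_curation_agent)
graph.set_entry_point("diversity")
graph.add_edge("diversity", "privacy")
graph.add_edge("privacy", "qa_curation")
graph.add_edge("qa_curation", END)
pipeline = graph.compile()  # pipeline.invoke({"documents": [...]}) runs end to end
```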
Under the hood: GPT-4o powers diversity and QA; GPT-4.1 handles the heavier reasoning/tooling for privacy; embeddings use text-embedding-3-small; chunking is 256 tokens with k-means for clustering. Temperatures are locked at 0 for reproducibility.
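That diversity step maps onto a few lines of standard tooling. Here is a minimal sketch following the stated recipe (256-token chunks, text-embedding-3-small, k-means); the helper names, cluster count, and use of the plain OpenAI client instead of Azure OpenAI are assumptions.

```python
# Sketch of the Diversity Agent's selection step: chunk, embed, cluster,
# then keep the chunk closest to each k-means centroid as a representative.
import numpy as np
import tiktoken
from openai import OpenAI
from sklearn.cluster import KMeans

client = OpenAI()  # the paper uses Azure OpenAI; plain client here for brevity
enc = tiktoken.get_encoding("cl100k_base")


def chunk_text(text: str, max_tokens: int = 256) -> list[str]:
    """Split text into consecutive ~256-token chunks."""
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def select_representative_chunks(docs: list[str], k: int = 10) -> list[str]:
    """Embed all chunks, cluster them, return one chunk per cluster centroid."""
    chunks = [c for d in docs for c in chunk_text(d)]
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    X = np.array([item.embedding for item in resp.data])
    km = KMeans(n_clusters=min(k, len(chunks)), n_init="auto", random_state=0).fit(X)
    reps = []
    for c in range(km.n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        reps.append(chunks[members[np.argmin(dists)]])
    return reps
```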
Does it actually help?
On diversity, the authors compare against (1) an evolutionary generator à la RAGAS and (2) direct prompting with GPT-4o. Using an LLM-as-a-judge (GPT-4.1) plus an embedding-based CosineSimilarity-to-Diversity metric, their sets win across sizes—with judge scores climbing from 7.8 → 9.0 as sample counts scale from 10 → 100, and cosine-similarity trending toward zero (more semantic spread). They use the EU AI Act as a challenging, high-variety testbed.
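The embedding-side score is easy to approximate in practice. As a rough sketch (the paper’s exact CosineSimilarity-to-Diversity formula isn’t reproduced here), a question set can be scored by its mean pairwise cosine similarity, where values closer to zero indicate more semantic spread:

```python
# Mean pairwise cosine similarity over question embeddings as a diversity proxy
# (an assumed formulation; lower values mean a more semantically spread-out set).
import numpy as np


def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """embeddings: (n, d) array, one row per generated question."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T                                # full cosine-similarity matrix
    off_diag = sims[~np.eye(len(X), dtype=bool)]  # drop self-similarity (1.0s)
    return float(off_diag.mean())
```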
On privacy, they evaluate the masking agent on three AI4Privacy suites—PHI, PWI, PII—after concatenating items into longer, domain-specific paragraphs. Label-wise accuracies typically land 0.75–0.94, with standouts like JOBTYPE 0.94, DISABILITYSTATUS 0.91, LASTNAME 0.91 and several categories at 0.86–0.90 across datasets. Translation: strong, granular masking across healthcare, workplace and generic PII.
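For readers replicating the setup, label-wise accuracy of this kind is usually computed per gold entity; the sketch below assumes a simplified schema (the AI4Privacy format and the paper’s scoring script are not reproduced) and counts an entity as masked when its surface form no longer appears in the agent’s output:

```python
# Hedged sketch of label-wise masking accuracy over illustrative data structures.
from collections import defaultdict


def labelwise_accuracy(examples: list[dict]) -> dict[str, float]:
    """examples: [{"gold_entities": [{"text": str, "label": str}, ...],
                   "masked_text": str}, ...]"""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        for ent in ex["gold_entities"]:
            totals[ent["label"]] += 1
            if ent["text"] not in ex["masked_text"]:  # surface form was removed
                hits[ent["label"]] += 1
    return {label: hits[label] / totals[label] for label in totals}
```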
Why this matters for builders
- Evaluation data ≫ metric tweaks. Better RAG scores start with representative questions and privacy-safe contexts, not another rubric. This pipeline produces both, and it logs reports you can hand to auditors.
- Regulatory alignment. With the EU AI Act explicitly encouraging synthetic data in audits, a privacy-first generator isn’t just nice to have; it’s compliance-friendly.
- Drop-in ops. Clustering, masking and QA generation are modular; teams can swap models, change PII taxonomies, or point the pipeline at their own corpora. A rough configuration sketch follows the list.
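To make that concrete, the knobs you would expect such a pipeline to expose look roughly like this; every field name is an illustrative assumption, not the paper’s configuration schema:

```python
# Illustrative configuration surface for swapping models, taxonomies, and corpora.
from dataclasses import dataclass


@dataclass
class PipelineConfig:
    diversity_model: str = "gpt-4o"               # generation for diversity + QA
    privacy_model: str = "gpt-4.1"                # heavier reasoning for masking
    embedding_model: str = "text-embedding-3-small"
    chunk_tokens: int = 256
    num_clusters: int = 10                        # assumed knob, not from the paper
    pii_labels: tuple[str, ...] = ("LASTNAME", "JOBTYPE", "DISABILITYSTATUS")
    temperature: float = 0.0                      # locked for reproducibility
```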
What’s next
The authors want tighter agent-to-agent coordination (e.g., via Model Context Protocol), adaptive PII discovery beyond static lists, and stress tests against privacy attacks, pushing the framework toward fully auditable, enterprise-grade RAG evals.
Paper link: arXiv 2508.18929 (PDF)