Most test-time scaling schemes still walk a single, linear chain of thought—great until an early mistake snowballs. MIRAGE (Multi-chain Inference with Retrieval-Augmented Graph Exploration) swaps that for many chains in parallel, each grounded in a medical knowledge graph and then cross-checked before answering. Think of it as ToT’s breadth, Search-o1’s retrieval, and GraphRAG’s structure—rolled into one pipeline.
How it works (and why it’s different)
- Entity-grounded decomposition. The system splits a clinical question into sub-questions tied to concrete entities (symptoms, diseases, treatments); each sub-question spawns its own reasoning chain (see the decomposition sketch after this list).
- Graph-based retrieval, two modes (see the retrieval sketch below):
  - Anchor mode: query the KG around a single entity (local neighborhood).
  - Bridge mode: search paths between entity pairs to surface multi-hop relations.
- Adaptive evidence streaming. Chains iteratively expand neighbors and multi-hop trails, keeping only deduplicated, directionally relevant facts.
- Cross-chain verification (see the synthesis sketch below). An answer synthesizer reconciles sub-answers, prefers explanations backed by broader, independent chains, and normalizes clinical terms, which cuts contradictions and hallucinations. Outputs are serialized with full provenance traces for audit.
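To make the first step concrete, here is a minimal Python sketch of entity-grounded decomposition: extract clinical entities from the question, then emit one sub-question per entity (plus entity pairs to seed bridge-mode chains). The lexicon, question templates, and `n_q` cap are illustrative assumptions, not the paper’s implementation.

```python
# A minimal sketch of entity-grounded decomposition: extract clinical
# entities, then emit one sub-question (and hence one reasoning chain)
# per entity or entity pair. Lexicon and templates are assumptions.
import re

ENTITY_LEXICON = {"fever", "cough", "influenza", "oseltamivir"}  # toy lexicon

def decompose(question: str, n_q: int = 4) -> list[str]:
    """Split a clinical question into entity-tied sub-questions."""
    tokens = re.findall(r"[a-z0-9\-]+", question.lower())
    entities = [t for t in tokens if t in ENTITY_LEXICON]
    subqs = [f"What does the KG say about '{e}'?" for e in entities]
    # Entity pairs seed bridge-mode chains (multi-hop relations).
    pairs = [(a, b) for i, a in enumerate(entities) for b in entities[i + 1:]]
    subqs += [f"How are '{a}' and '{b}' related?" for a, b in pairs]
    return subqs[:n_q]  # cap at the decomposition threshold Nq

print(decompose("Patient has fever and cough; is oseltamivir indicated?"))
```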
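Next, a sketch of the two retrieval modes over a toy KG, with networkx standing in for the graph store; the graph contents and relation labels are invented for illustration. Collecting facts into a set also gives the deduplication the streaming step relies on.

```python
# A sketch of MIRAGE's two retrieval modes over a toy KG, with networkx
# standing in for the graph store. Graph contents are invented.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("fever", "influenza", relation="symptom_of")
kg.add_edge("cough", "influenza", relation="symptom_of")
kg.add_edge("influenza", "oseltamivir", relation="treated_by")

def anchor_retrieve(graph, entity, hops=1):
    """Anchor mode: collect facts from one entity's local neighborhood."""
    facts, frontier = set(), {entity}
    for _ in range(hops):
        nxt = set()
        for node in frontier:
            for _, nbr, data in graph.out_edges(node, data=True):
                facts.add((node, data["relation"], nbr))  # set() deduplicates
                nxt.add(nbr)
        frontier = nxt
    return facts

def bridge_retrieve(graph, source, target, cutoff=3):
    """Bridge mode: surface multi-hop paths linking an entity pair."""
    facts = set()
    for path in nx.all_simple_paths(graph, source, target, cutoff=cutoff):
        for a, b in zip(path, path[1:]):
            facts.add((a, graph.edges[a, b]["relation"], b))
    return facts

print(anchor_retrieve(kg, "fever"))
print(bridge_retrieve(kg, "fever", "oseltamivir"))
```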
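Finally, a sketch of the cross-chain verification idea: normalize answer surface forms, then prefer the candidate backed by the most independent chains and the broadest evidence. The synonym table and the scoring rule are assumptions standing in for the paper’s answer synthesizer.

```python
# A sketch of cross-chain verification: normalize answers, then prefer the
# candidate backed by the most independent chains and broadest evidence.
# The synonym table and scoring rule are assumptions.
from collections import defaultdict

# Each chain yields (answer, supporting KG facts); values are illustrative.
chains = [
    ("influenza", {("fever", "symptom_of", "influenza")}),
    ("flu",       {("cough", "symptom_of", "influenza")}),
    ("covid-19",  {("fever", "symptom_of", "covid-19")}),
]

SYNONYMS = {"flu": "influenza"}  # hypothetical clinical-term normalizer

def synthesize(chains):
    support = defaultdict(lambda: {"chains": 0, "facts": set()})
    for answer, facts in chains:
        canon = SYNONYMS.get(answer, answer)  # normalize surface forms
        support[canon]["chains"] += 1
        support[canon]["facts"] |= facts      # dedup evidence across chains
    # More independent chains wins; break ties on breadth of evidence.
    best, info = max(support.items(),
                     key=lambda kv: (kv[1]["chains"], len(kv[1]["facts"])))
    return best, sorted(info["facts"])        # answer + provenance trace

answer, provenance = synthesize(chains)
print(answer, provenance)  # "influenza", backed by two chains and two facts
```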
Benchmarks: consistent wins over strong baselines
Evaluated on GenMedGPT-5k, CMCQA, and ExplainCPE (each paired with a medical KG), MIRAGE tops GPT-4o, GPT-4o+ToT, QWQ-32B, MindMap (GraphRAG), and Search-o1 on GPT-4o ranking and/or accuracy. Highlights:
- GenMedGPT-5k: best GPT-4o rank, 1.8 (lower is better).
- CMCQA: rank 2.8, edging ToT, MindMap, and Search-o1.
- ExplainCPE: 84.8% accuracy vs. 77.8% for GPT-4o, 80.7% for Search-o1, and 84.6% for MindMap.
Swapping the backbone to DeepSeek-R1-32B preserves the lift (ExplainCPE 84.4%), suggesting MIRAGE is model-agnostic. In a human study on GenMedGPT-5k, evaluators prefer MIRAGE over all baselines, mirroring GPT-4o’s ranking.
What moved the needle
- Structured retrieval beats flat text. Graph-aware exploration is more stable than BM25/dense retrieval and less noisy than web-first Search-o1 on medical tasks.
- Right-sizing the knobs. Increasing the decomposition threshold (Nq) and retrieval depth (Nr) improves rank/accuracy up to a point, useful guidance for real deployments (see the configuration sketch after this list).
- Ablations matter. Removing the Question Decomposer or Answer Synthesizer drops win rates in GPT-4o pairwise tests, confirming both stages carry weight.
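For the knobs, a hypothetical configuration sketch; the names, defaults, and sweep ranges below are illustrative, since the paper’s finding is only that gains from raising Nq/Nr flatten past moderate values.

```python
# Hypothetical knobs for a deployment sweep; names, defaults, and ranges
# are illustrative. The takeaway is that gains from raising Nq/Nr
# flatten past moderate values, so sweep small grids first.
from dataclasses import dataclass

@dataclass
class MirageConfig:
    n_q: int = 4  # decomposition threshold: max sub-questions per query
    n_r: int = 2  # retrieval depth: max hops per graph expansion

for n_q in (2, 4, 6):
    for n_r in (1, 2, 3):
        cfg = MirageConfig(n_q=n_q, n_r=n_r)
        # run_eval(cfg)  # hypothetical evaluation hook
```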
Why it matters
Linear chains waste compute on dead ends; MIRAGE parallelizes exploration, grounds every claim in KG paths, and verifies across chains before speaking—exactly the traits clinicians and auditors want. The approach is plug-and-play with modern LRMs (QWQ-32B, DeepSeek-R1) and slots cleanly into safety-critical, knowledge-heavy domains beyond medicine.
Paper link: arXiv 2508.18260 (PDF)