arXiv: 2506.22405, June 2025
Authors: Harsha Nori*, Mayank Daswani*, Christopher Kelly*, et al. – Microsoft AI
(* Equal contribution)
Blog: https://microsoft.ai/new/the-path-to-medical-superintelligence/
Sequential Diagnosis Benchmark (SDBench) arXiv: https://arxiv.org/html/2506.22405v1
──────────────────────────────────────────────────────
1. Executive Summary
• The paper introduces SDBench (Sequential Diagnosis Benchmark) and MAI-DxO (MAI Diagnostic Orchestrator), a model-agnostic agent that guides language models through an iterative, cost-aware diagnostic workflow.
• When paired with OpenAI’s o3 model, MAI-DxO attains 80 % diagnostic accuracy—roughly 4× that of practicing generalist physicians—while cutting test costs by ~70 % relative to o3 used “out of the box.”
• Gains persist across six LLM families (OpenAI, Gemini, Claude, Grok, DeepSeek, Llama) and on 56 never-seen NEJM cases published after model-training cut-offs, suggesting real generalization and minimal memorization.
──────────────────────────────────────────────────────
2. Main Themes
A. Realistic Benchmarking
– Moving from single-shot, multiple-choice exams to interactive, multi-turn encounters that mirror real practice.
B. Cost-Conscious AI Reasoning
– Explicitly pricing every test and visit to penalize “test-everything” behavior.
C. Multi-Agent Orchestration
– Simulating a “virtual panel” of five physician personas to reduce cognitive biases (anchoring, premature closure) and to weigh marginal value vs. cost.
D. Human-AI Comparison
– Side-by-side evaluation shows LLM-based systems can already exceed human generalists on difficult cases when given well-designed reasoning scaffolds.
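To make Theme B concrete, the following is a minimal sketch of how a per-encounter cost ledger could work. The visit fee, test names, and prices are hypothetical placeholders chosen for illustration; they are not the benchmark's actual price table, and the class is not the authors' implementation.

```python
from dataclasses import dataclass, field

# Hypothetical prices for illustration only; not taken from the paper.
VISIT_COST = 300.0
TEST_PRICES = {"cbc": 30.0, "chest_xray": 120.0, "ct_abdomen": 1200.0}


@dataclass
class EncounterLedger:
    """Accumulates the monetary cost of one simulated diagnostic encounter."""
    visits: int = 0
    tests: list[str] = field(default_factory=list)

    def ask_questions(self) -> None:
        # In this sketch, each round of questioning is billed as one visit.
        self.visits += 1

    def order_test(self, name: str) -> None:
        if name not in TEST_PRICES:
            raise KeyError(f"no price listed for test {name!r}")
        self.tests.append(name)

    @property
    def total_cost(self) -> float:
        return self.visits * VISIT_COST + sum(TEST_PRICES[t] for t in self.tests)


# A targeted work-up versus a "test-everything" work-up.
targeted = EncounterLedger()
targeted.ask_questions()
targeted.order_test("cbc")
targeted.order_test("chest_xray")

shotgun = EncounterLedger()
shotgun.ask_questions()
for name in TEST_PRICES:
    shotgun.order_test(name)

print(targeted.total_cost)  # 450.0
print(shotgun.total_cost)   # 1650.0
```

Because every order adds to the running total, an agent that reaches the same diagnosis with fewer or cheaper tests scores better on the cost axis, which is the behavior the benchmark is described as rewarding.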
─────────────────────────────────────────────────────
3. Key Findings & Metrics
• Dataset: 304 NEJM Clinicopathological Conference (CPC) cases recast into interactive simulations; latest 56 cases held out.
• Human Baseline: 21 practicing physicians (median 12 years in practice) – 19.9 % accuracy, $2,963 average cost per case.
• Off-the-Shelf LLMs:
– GPT-4o: 49.3 % @ $2,745
– o3: 78.6 % @ $7,850 (best accuracy, worst cost)
• MAI-DxO (o3 backbone):
– “No-Budget” mode: 81.9 % @ $4,735
– “Budgeted” mode: 79.9 % @ $2,397 (≈ 20 % cheaper than physicians, 70 % cheaper than raw o3)
– “Ensemble” mode: 85.5 % @ $7,184 (highest accuracy reported)
• Cross-Model Gains: +11 pp mean accuracy uplift and large cost savings for all non-OpenAI models; statistically significant (p < 0.005).
• Gatekeeper validation: Physicians reviewed 508 responses; only 8 flagged, none leaked answers—supports benchmark integrity.
• Judge reliability: Automated rubric (Likert 1-5) vs. human graders: κ = 0.70–0.87.
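The relative-cost claims and the accuracy-cost trade-off can be checked directly from the figures listed above. The sketch below recomputes the percentage savings and identifies which configurations are not dominated on both axes (a simple Pareto check). The helper names are illustrative, and the analysis uses only the numbers reported in this summary.

```python
# (accuracy in %, average cost per case in USD), as listed in the findings above.
RESULTS = {
    "Physicians": (19.9, 2963),
    "GPT-4o": (49.3, 2745),
    "o3": (78.6, 7850),
    "MAI-DxO no-budget": (81.9, 4735),
    "MAI-DxO budgeted": (79.9, 2397),
    "MAI-DxO ensemble": (85.5, 7184),
}


def pct_cheaper(cost: float, reference: float) -> float:
    """Percentage by which `cost` is lower than `reference`."""
    return 100.0 * (reference - cost) / reference


def dominates(a: tuple, b: tuple) -> bool:
    """True if a is at least as accurate and as cheap as b, and strictly better on one axis."""
    (acc_a, cost_a), (acc_b, cost_b) = a, b
    return acc_a >= acc_b and cost_a <= cost_b and (acc_a > acc_b or cost_a < cost_b)


def pareto_frontier(results: dict) -> list:
    """Names of configurations that no other configuration dominates."""
    return sorted(
        name
        for name, point in results.items()
        if not any(dominates(other, point) for other in results.values())
    )


budgeted_acc, budgeted_cost = RESULTS["MAI-DxO budgeted"]
print(round(pct_cheaper(budgeted_cost, RESULTS["Physicians"][1]), 1))  # 19.1
print(round(pct_cheaper(budgeted_cost, RESULTS["o3"][1]), 1))          # 69.5
print(round(budgeted_acc / RESULTS["Physicians"][0], 1))               # 4.0
print(pareto_frontier(RESULTS))
# ['MAI-DxO budgeted', 'MAI-DxO ensemble', 'MAI-DxO no-budget']
```

On these numbers, the budgeted configuration dominates the physician baseline, GPT-4o, and off-the-shelf o3, while the three MAI-DxO modes trade accuracy against cost among themselves.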
──────────────────────────────────────────────────────
4. Novel Capabilities in MAI-DxO
1. Virtual Physician Panel (5 roles)
• Dr. Hypothesis – Bayesian differential maintenance.
• Dr. Test-Chooser – Up to 3 tests per round maximizing information gain.
• Dr. Challenger – Devil’s advocate for anchoring bias.
• Dr. Stewardship – Cost vigilance; suggests cheaper or less invasive alternatives.
• Dr. Checklist – Consistency & formatting QA.
2. Budget Tracker – Optional live cost accounting; can veto or cancel orders.
3. Ensemble Selector – Runs multiple independent panels and aggregates their outputs into a single best answer.
4. Synthetic Answer Generation – Gatekeeper fabricates clinically consistent results for unlisted tests, preventing “missing-data clues.”
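The interplay between test selection, budget tracking, and ensembling can be sketched roughly as follows. This is an illustrative toy, not the authors' code: the information-gain scores and prices are hypothetical, the real system derives its choices from LLM personas rather than fixed numbers, and majority voting is an assumed stand-in for however the ensemble actually aggregates answers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateTest:
    name: str
    info_gain: float  # hypothetical score a Test-Chooser-like role might assign
    cost: float       # hypothetical price in USD


@dataclass
class BudgetTracker:
    limit: float
    spent: float = 0.0

    def approve(self, cost: float) -> bool:
        """Approve an order only if it fits within the remaining budget."""
        if self.spent + cost > self.limit:
            return False
        self.spent += cost
        return True


def choose_tests(candidates, tracker, max_tests=3):
    """Shortlist the highest-gain tests, then let the tracker veto unaffordable ones."""
    shortlist = sorted(candidates, key=lambda c: c.info_gain, reverse=True)[:max_tests]
    return [c.name for c in shortlist if tracker.approve(c.cost)]


def ensemble_select(final_diagnoses):
    """Assumed aggregation rule: the most common diagnosis across panel runs."""
    return Counter(final_diagnoses).most_common(1)[0][0]


candidates = [
    CandidateTest("mri_brain", 0.9, 1500.0),
    CandidateTest("cbc", 0.4, 30.0),
    CandidateTest("lumbar_puncture", 0.7, 600.0),
    CandidateTest("esr", 0.2, 20.0),
]
tracker = BudgetTracker(limit=1000.0)

print(choose_tests(candidates, tracker))  # ['lumbar_puncture', 'cbc']
print(tracker.spent)                      # 630.0
print(ensemble_select(["sarcoidosis", "lymphoma", "sarcoidosis"]))  # sarcoidosis
```

In this toy run the MRI is shortlisted for its high score but vetoed because it alone exceeds the budget, which mirrors the described tension between the Test-Chooser and Stewardship roles.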
──────────────────────────────────────────────────────
5. Representative Quotes
• “MAI-DxO achieves 80 % diagnostic accuracy—four times higher than the 20 % average of generalist physicians.”
• “Off-the-shelf o3 achieved 78.6 % accuracy at a cost of $7,850, whereas MAI-DxO achieved 79.9 % at just $2,397.”
• “These performance gains generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families.”
• “Structured reasoning mitigates the accuracy-cost trade-off present in off-the-shelf models and physicians.”
──────────────────────────────────────────────────────
6. Potential Business Applications
• Clinical Decision Support (CDS) – Embed MAI-DxO as a triage or second-opinion engine in hospitals/telemedicine portals.
• Utilization Management – Insurers can leverage cost-aware diagnostics to minimize unnecessary testing.
• Remote / Low-Resource Settings – Provide “specialist-level” reasoning where specialists are scarce; could run on smaller local models with orchestration benefits.
• Medical Education – Interactive, cost-conscious simulators for students and residents.
• Diagnostics SaaS – Vendor-neutral orchestration layer that “plugs into” whichever frontier model is cheapest/best at deployment time, reducing version-chase overhead.
• Consumer Symptom Checkers – Direct-to-consumer tools for initial work-ups, provided regulatory and safety requirements are satisfied.
──────────────────────────────────────────────────────
7. Risks & Limitations
• Dataset Bias – NEJM CPCs over-represent rare, didactic cases; false-positive rates on everyday complaints unknown.
• Cost Model – Uses 2023 US CPT pricing; geography-specific and excludes intangibles (wait time, patient discomfort).
• Human Baseline – Only generalists were evaluated, and they were not allowed to use external tools; this may understate clinician performance in realistic environments.
• No Visual Modalities – Imaging interpretation left to text description; future multimodal integration needed.
─────────────────────────────────────────────────────
8. Forward-Looking Statements
• The authors plan to release SDBench publicly (pending peer review & licensing).
• Future work: multimodal inputs (imaging, auscultation), expanded real-world prevalence datasets, deployment pilots in hospitals and rural clinics, and investigation of patient-facing conversational safety (empathy, explainability).
──────────────────────────────────────────────────────


