Srikanth Bhakthan
AI on AI Podcast
Sequential Diagnosis with Language Models


#microsoft

arXiv: 2506.22405, June 2025

Authors: Harsha Nori*, Mayank Daswani*, Christopher Kelly*, et al. – Microsoft AI

(* Equal contribution)

Blog: https://microsoft.ai/new/the-path-to-medical-superintelligence/

Sequential Diagnosis Benchmark (SDBench) on arXiv: https://arxiv.org/html/2506.22405v1



──────────────────────────────────────────────────────

1. Executive Summary

• The paper introduces SDBench (Sequential Diagnosis Benchmark) and MAI-DxO (MAI Diagnostic Orchestrator), a model-agnostic agent that guides language models through an iterative, cost-aware diagnostic workflow.

• When paired with OpenAI’s o3 model, MAI-DxO attains 80 % diagnostic accuracy—roughly 4× that of practicing generalist physicians—while cutting test costs by ~70 % relative to o3 used “out of the box.”

• Gains persist across six LLM families (OpenAI, Gemini, Claude, Grok, DeepSeek, Llama) and on 56 never-seen NEJM cases published after model-training cut-offs, suggesting real generalization and minimal memorization.
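The headline ratios in this summary can be reproduced from the accuracy and cost figures listed in Section 3. The following is a quick arithmetic check in Python; the variable names are illustrative and do not come from the paper.

```python
# Figures as reported in Section 3 of these notes.
physician_acc, physician_cost = 19.9, 2963   # % accuracy, USD per case
o3_acc, o3_cost = 78.6, 7850                 # off-the-shelf o3
dxo_acc, dxo_cost = 79.9, 2397               # MAI-DxO, budgeted mode

# How many times more accurate is budgeted MAI-DxO than the physicians?
accuracy_ratio = dxo_acc / physician_acc

# Relative cost reduction: 1 - (new cost / reference cost).
saving_vs_o3 = 1 - dxo_cost / o3_cost
saving_vs_physicians = 1 - dxo_cost / physician_cost

print(f"accuracy ratio vs physicians: {accuracy_ratio:.2f}x")
print(f"cost saving vs raw o3:        {saving_vs_o3:.1%}")
print(f"cost saving vs physicians:    {saving_vs_physicians:.1%}")
```

The ratio comes out at about 4×, the saving against raw o3 at a little under 70 %, and the saving against physicians at a little under 20 %, which matches the rounded claims in the summary.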

──────────────────────────────────────────────────────

2. Main Themes

A. Realistic Benchmarking

– Moving from single-shot, multiple-choice exams to interactive, multi-turn encounters that mirror real practice.

B. Cost-Conscious AI Reasoning

– Explicitly monetizing every test and visit to penalize “test-everything” behavior.

C. Multi-Agent Orchestration

– Simulating a “virtual panel” of five physician personas to reduce cognitive biases (anchoring, premature closure) and to weigh marginal value vs. cost.

D. Human-AI Comparison

– Side-by-side evaluation shows LLM-based systems can already exceed human generalists on difficult cases when given well-designed reasoning scaffolds.
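To make themes A and B concrete, here is a minimal sketch of an interactive encounter with per-action cost accounting. Everything in it is an assumption made for illustration: the `Gatekeeper` class, the action names, the toy case, and all prices are hypothetical and are not taken from SDBench.

```python
from dataclasses import dataclass


@dataclass
class Gatekeeper:
    """Hypothetical stand-in for the benchmark's information gatekeeper."""
    findings: dict               # hidden case details, revealed only on request
    prices: dict                 # assumed price list in USD
    visit_fee: float = 100.0     # assumed flat fee per visit
    default_price: float = 50.0  # assumed price for tests missing from the list
    spent: float = 0.0

    def ask(self, question):
        # In this sketch, questions are covered by the visit fee.
        return self.findings.get(question, "No additional information.")

    def order(self, test):
        # Every ordered test is billed, whether or not it is informative.
        self.spent += self.prices.get(test, self.default_price)
        return self.findings.get(test, "Unremarkable.")


def run_encounter(gatekeeper, actions):
    """Replay a list of (kind, payload) actions and return the transcript."""
    gatekeeper.spent += gatekeeper.visit_fee
    transcript = []
    for kind, payload in actions:
        if kind == "ask":
            transcript.append((payload, gatekeeper.ask(payload)))
        elif kind == "order":
            transcript.append((payload, gatekeeper.order(payload)))
        elif kind == "diagnose":
            transcript.append(("final diagnosis", payload))
            break  # committing to a diagnosis ends the encounter
        else:
            raise ValueError(f"unknown action kind: {kind}")
    return transcript


case = {
    "fever history": "Two weeks of intermittent fever and night sweats.",
    "chest x-ray": "Cavitary lesion in the right upper lobe.",
}
prices = {"chest x-ray": 120.0, "whole-body MRI": 3000.0}

targeted = Gatekeeper(case, prices)
run_encounter(targeted, [("ask", "fever history"),
                         ("order", "chest x-ray"),
                         ("diagnose", "tuberculosis")])

shotgun = Gatekeeper(case, prices)
run_encounter(shotgun, [("order", "chest x-ray"),
                        ("order", "whole-body MRI"),
                        ("diagnose", "tuberculosis")])

print(targeted.spent, shotgun.spent)
```

Both scripted agents reach the same diagnosis, but the one that orders every test pays far more, which is the behavior a cost-aware benchmark is meant to penalize.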

─────────────────────────────────────────────────────

3. Key Findings & Metrics

• Dataset: 304 NEJM Clinicopathological Conference (CPC) cases recast into interactive simulations; latest 56 cases held out.

• Human Baseline: 21 practicing physicians (median 12 years in practice) – 19.9 % accuracy, $2,963 average cost per case.

• Off-the-Shelf LLMs:

– GPT-4o: 49.3 % @ $2,745

– o3: 78.6 % @ $7,850 (best accuracy, worst cost)

• MAI-DxO (o3 backbone):

– “No-Budget” mode: 81.9 % @ $4,735

– “Budgeted” mode: 79.9 % @ $2,397 (≈ 20 % cheaper than physicians, 70 % cheaper than raw o3)

– “Ensemble” mode: 85.5 % @ $7,184 (highest accuracy reported)

• Cross-Model Gains: +11 pp mean accuracy uplift and large cost savings for all non-OpenAI models; statistically significant (p < 0.005).

• Gatekeeper validation: Physicians reviewed 508 responses; only 8 flagged, none leaked answers—supports benchmark integrity.

• Judge reliability: Automated rubric (Likert 1-5) vs. human graders: κ = 0.70–0.87.
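One way to read the accuracy and cost figures above is to ask which configurations are Pareto-optimal, meaning no other configuration is both more accurate and cheaper. The sketch below uses only the numbers listed in this section; the function name and the tuple layout are illustrative choices, not something from the paper.

```python
# (label, accuracy %, average cost in USD) as listed in this section.
results = [
    ("Physicians", 19.9, 2963),
    ("GPT-4o", 49.3, 2745),
    ("o3", 78.6, 7850),
    ("MAI-DxO no-budget", 81.9, 4735),
    ("MAI-DxO budgeted", 79.9, 2397),
    ("MAI-DxO ensemble", 85.5, 7184),
]


def pareto_frontier(points):
    """Return labels of points that no cheaper-or-equal point beats on accuracy."""
    frontier = []
    best_acc = float("-inf")
    # Walk from cheapest to most expensive; keep a point only if it is
    # more accurate than everything that costs less.
    for label, acc, cost in sorted(points, key=lambda p: (p[2], -p[1])):
        if acc > best_acc:
            frontier.append(label)
            best_acc = acc
    return frontier


frontier = pareto_frontier(results)
print(frontier)
```

On these six data points, only the three MAI-DxO modes remain on the frontier; the physicians, GPT-4o, and raw o3 are each dominated by at least one MAI-DxO configuration.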

──────────────────────────────────────────────────────

4. Novel Capabilities in MAI-DxO

1. Virtual Physician Panel (5 roles)

• Dr. Hypothesis – Bayesian differential maintenance.

• Dr. Test-Chooser – Up to 3 tests per round maximizing information gain.

• Dr. Challenger – Devil’s advocate for anchoring bias.

• Dr. Stewardship – Cost vigilance; suggests cheaper or less invasive alternatives.

• Dr. Checklist – Consistency & formatting QA.

2. Budget Tracker – Optional live cost accounting; can veto or cancel orders.

3. Ensemble Selector – Runs multiple independent panels and aggregates their outputs into a single best answer.

4. Synthetic Answer Generation – Gatekeeper fabricates clinically consistent results for unlisted tests, preventing “missing-data clues.”
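The interplay between test selection, budget tracking, and ensembling can be sketched in a few lines. This is not MAI-DxO's implementation: in the paper these roles are played by a language model, whereas the sketch below substitutes hand-set numbers and simple rules. The gain-per-dollar ranking, the majority vote, the function names, and all gain and price values are assumptions made for illustration.

```python
from collections import Counter


def choose_tests(candidates, remaining_budget, max_tests=3):
    """Greedy stand-in for Dr. Test-Chooser combined with the Budget Tracker.

    candidates: list of (name, estimated_information_gain, price_usd),
    with every price greater than zero. Tests are ranked by gain per
    dollar (an assumed heuristic) and vetoed if they exceed the budget.
    """
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    chosen, vetoed = [], []
    for name, gain, price in ranked:
        if len(chosen) == max_tests:
            break  # the panel orders at most max_tests per round
        if price <= remaining_budget:
            chosen.append(name)
            remaining_budget -= price
        else:
            vetoed.append(name)  # unaffordable, so the order is cancelled
    return chosen, vetoed, remaining_budget


def ensemble_vote(panel_answers):
    """Stand-in for the Ensemble Selector: the most common diagnosis wins."""
    return Counter(panel_answers).most_common(1)[0][0]


# Hypothetical candidate tests: (name, estimated gain, price in USD).
candidates = [
    ("CBC", 0.30, 30.0),
    ("Chest CT", 0.60, 600.0),
    ("Blood culture", 0.40, 80.0),
    ("PET scan", 0.70, 2500.0),
]
chosen, vetoed, left = choose_tests(candidates, remaining_budget=500.0)
print(chosen, vetoed, left)

# Three hypothetical independent panel runs on the same case.
final = ensemble_vote(["tuberculosis", "sarcoidosis", "tuberculosis"])
print(final)
```

Ranking by gain per dollar folds Dr. Stewardship's cost concern into the selection step. A closer imitation of the described panel would keep selection and cost review as separate stages so that the reviewer can propose a cheaper alternative instead of only cancelling an order.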

──────────────────────────────────────────────────────

5. Representative Quotes

• “MAI-DxO achieves 80 % diagnostic accuracy—four times higher than the 20 % average of generalist physicians.”

• “Off-the-shelf o3 achieved 78.6 % accuracy at a cost of $7,850, whereas MAI-DxO achieved 79.9 % at just $2,397.”

• “These performance gains generalize across models from the OpenAI, Gemini, Claude, Grok, DeepSeek, and Llama families.”

• “Structured reasoning mitigates the accuracy-cost trade-off present in off-the-shelf models and physicians.”

──────────────────────────────────────────────────────

6. Potential Business Applications

• Clinical Decision Support (CDS) – Embed MAI-DxO as a triage or second-opinion engine in hospitals/telemedicine portals.

• Utilization Management – Insurers can leverage cost-aware diagnostics to minimize unnecessary testing.

• Remote / Low-Resource Settings – Provide “specialist-level” reasoning where specialists are scarce; could run on smaller local models with orchestration benefits.

• Medical Education – Interactive, cost-conscious simulators for students and residents.

• Diagnostics SaaS – Vendor-neutral orchestration layer that “plugs into” whichever frontier model is cheapest/best at deployment time, reducing version-chase overhead.

• Consumer Symptom Checkers – Direct-to-consumer tools for initial work-ups, provided regulatory and safety requirements are satisfied.

──────────────────────────────────────────────────────

7. Risks & Limitations

• Dataset Bias – NEJM CPCs over-represent rare, didactic cases; false-positive rates on everyday complaints unknown.

• Cost Model – Uses 2023 US CPT pricing; geography-specific and excludes intangibles (wait time, patient discomfort).

• Human Baseline – Only generalists were tested, and they were not allowed to use external tools; this may understate clinician performance in realistic environments.

• No Visual Modalities – Imaging interpretation left to text description; future multimodal integration needed.

─────────────────────────────────────────────────────

8. Forward-Looking Statements

• Plans to release SDBench publicly (pending peer review & licensing).

• Future work: multimodal inputs (imaging, auscultation), expanded real-world prevalence datasets, deployment pilots in hospital and rural clinics, investigation into patient-facing conversational safety (empathy, explainability).

──────────────────────────────────────────────────────