Artificial intelligence holds great promise for improving medical diagnostics, but current evaluation methods often fail to capture the intricacies of real-world clinical reasoning. Traditional assessments typically rely on static, straightforward scenarios, which do not reflect the dynamic, stepwise process doctors use to refine diagnoses—asking targeted questions, weighing test costs, and updating hypotheses based on new information.
While language models have demonstrated impressive results on structured medical exams, these tests rarely simulate the complexities physicians face in practice, such as avoiding premature diagnostic conclusions or unnecessary testing. Early AI approaches to medical problem-solving, including Bayesian frameworks used in disciplines like pathology and trauma care, required extensive expert input and did not scale. More recent initiatives, such as AMIE and the NEJM Clinical Problem Solving Challenge, introduced richer case materials but still relied on fixed vignettes rather than interactive workflows.
Introducing SDBench and MAI-DxO: Towards Interactive, Cost-Conscious Clinical AI
To bridge this gap, Microsoft AI researchers developed SDBench, a sequential diagnosis benchmark designed to mirror realistic clinical decision-making. SDBench draws upon 304 complex cases from the New England Journal of Medicine (2017–2025), transforming them into interactive simulations where AI agents or physicians must sequentially ask questions, request diagnostic tests, and decide on a final diagnosis. Information is controlled by a language-model-powered Gatekeeper that only provides case details when prompted, replicating how doctors gather data in practice.
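The interaction loop can be pictured with a small sketch. This is purely illustrative: in SDBench the Gatekeeper is a language model reasoning over the full case record, whereas here a plain dictionary stands in for the case file, and `toy_policy` is a hypothetical stand-in for a diagnostic agent.

```python
from dataclasses import dataclass, field

@dataclass
class Gatekeeper:
    """Holds the full case record but reveals a finding only when asked.

    Hypothetical sketch: the real SDBench Gatekeeper is an LLM; a dict
    of findings stands in for the case file here.
    """
    case_facts: dict
    revealed: list = field(default_factory=list)

    def ask(self, query: str) -> str:
        self.revealed.append(query)
        return self.case_facts.get(query, "Not available for this case.")

def run_episode(gatekeeper, agent_policy, max_turns=10):
    """Sequential diagnosis: query, update, and eventually commit."""
    history = []
    for _ in range(max_turns):
        action = agent_policy(history)
        if action["type"] == "diagnose":
            return action["diagnosis"], history
        finding = gatekeeper.ask(action["query"])
        history.append((action["query"], finding))
    return None, history

# Toy agent: asks two scripted questions, then commits to a diagnosis.
def toy_policy(history):
    script = ["presenting complaint", "chest x-ray"]
    if len(history) < len(script):
        return {"type": "request", "query": script[len(history)]}
    return {"type": "diagnose", "diagnosis": "community-acquired pneumonia"}

gk = Gatekeeper({"presenting complaint": "fever and productive cough",
                 "chest x-ray": "right lower lobe consolidation"})
diagnosis, transcript = run_episode(gk, toy_policy)
print(diagnosis)        # community-acquired pneumonia
print(len(transcript))  # 2
```

The key structural point the sketch captures is that the agent never sees the whole case up front: each finding must be paid for with an explicit query, which is what makes stopping too early or ordering too much both measurable.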
Building on this framework, the team created MAI-DxO, an orchestrator system developed in collaboration with medical professionals. MAI-DxO acts like a virtual panel of diverse medical experts, prioritizing high-value, cost-effective testing strategies. Paired with advanced language models such as OpenAI's o3, MAI-DxO has reached diagnostic accuracy as high as 85.5% while substantially reducing spending on unnecessary tests.
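The cost-conscious part of this strategy can be caricatured as a budgeted value-per-cost selection. This is not MAI-DxO's actual mechanism, which relies on language-model panel deliberation rather than fixed scores; the numeric values, test names, and budget below are invented for illustration.

```python
def pick_tests(candidates, budget):
    """Greedy sketch of cost-aware test selection: sort candidate tests
    by (diagnostic value / dollar cost) and take them while the budget
    allows. Illustrative only; real orchestration is model-driven."""
    chosen, spent = [], 0.0
    for name, value, cost in sorted(candidates,
                                    key=lambda t: t[1] / t[2],
                                    reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

# Hypothetical tests as (name, diagnostic value, cost in dollars).
tests = [("CBC", 0.3, 50), ("MRI", 0.5, 1500),
         ("biopsy", 0.9, 800), ("CT", 0.6, 600)]
chosen, spent = pick_tests(tests, budget=1500)
print(chosen, spent)  # ['CBC', 'biopsy', 'CT'] 1450.0
```

The point of the caricature is the trade-off itself: an expensive test can lose to two cheaper ones that jointly narrow the differential more per dollar, which is exactly the kind of reasoning the benchmark rewards.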
The evaluation of multiple AI diagnostic agents on SDBench revealed that MAI-DxO consistently outperforms both standard language models and expert physicians in balancing accuracy and diagnostic cost. For example, at one cost-accuracy operating point, MAI-DxO achieved 81.9% accuracy at an average cost of $4,735 per case, compared with an off-the-shelf o3 model's 78.6% accuracy at $7,850. These gains held across different underlying models and held-out test cases, indicating robust generalizability. The system also improved the diagnostic efficiency of weaker AI models and encouraged resource-conscious reasoning in stronger ones.
At its core, SDBench introduces a realistic, interactive challenge for AI and clinicians alike by requiring active questioning, strategic test ordering, and cost-aware diagnosis—steps critical to patient care but missing in prior static benchmarks. Meanwhile, MAI-DxO’s ability to simulate a multidisciplinary medical team addresses the need for nuanced judgment in complex cases. Although the benchmark focuses on challenging clinical scenarios sourced from NEJM and excludes some common conditions, it represents a significant advance toward AI tools applicable to real-world healthcare settings.
Future development plans include deploying these systems in clinical environments and low-resource regions, where optimized diagnostics could have substantial global health benefits. Additionally, tools like SDBench and MAI-DxO have promising applications in medical education by providing interactive case simulations for training practitioners.
For researchers and developers focused on AI-driven healthcare innovations, SDBench offers a valuable new standard for testing clinical reasoning beyond static exams. MAI-DxO exemplifies how integrating physician expertise with advanced AI can enhance both accuracy and cost-effectiveness in medical diagnostics.
How this approach will influence the broader adoption of AI in clinical workflows remains a key question as these technologies evolve.