A new study just ran the most rigorous test of clinical AI accuracy I've seen, and the way they did it matters as much as what they found.

Here's the setup: A consortium from UCSF, Harvard, University of Washington, and Stanford, with JAMA's lead statistical editor, evaluated how well AI answers real clinical questions. They didn't use exam questions or vignettes. They used 620 point-of-care queries physicians actually asked, then had 149 practicing physicians grade the answers, blind, head-to-head, each scoring questions in their own specialty across 30 fields.

Here are three things that made this study stand out:

1. Real questions, not benchmarks. The queries came from what physicians genuinely ask at the point of care, not standardized test items that strip out the mess of real practice.
2. Expert, specialty-matched grading. Instead of models grading models, subspecialists judged answers in their own field, on accuracy, clinical utility, source quality, verifiability, and completeness.
3. Blinded and head-to-head. Vendor identities were hidden and answers rendered identically, so raters judged the content, not the brand.

Here's what they found:

The specialized clinical tool (OpenEvidence) outperformed the general-purpose models on every axis. On accuracy, its win margin was +24.7%, while GPT-5.5 sat at -21.1%, with Claude and Gemini close to even. The result held whether or not the physician had used the tool before.

When frontier models were used to grade the answers instead of humans, Claude and Gemini roughly tracked the expert rankings, but GPT-5.5 rated its own answers the winner almost every time. Even a panel of model judges barely correlated with the physicians. Using AI to grade AI carries real, hard-to-remove bias.

This is what fit-for-purpose evaluation looks like: real queries, expert judges matched to the specialty, blinded comparison. It's slower and more expensive than a benchmark, and far closer to the truth.
If you rely on clinical AI, do you actually know how the tool you trust was tested?
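
For readers wondering how a number like "+24.7%" can come out of head-to-head grading: the figures above are reported without a definition, so what follows is only one plausible reading, not the study's confirmed method. It assumes a win margin is the share of comparisons a tool won minus the share it lost, with ties counting toward the total. The function name `win_margin` and the sample outcomes are hypothetical and are not data from the study.

```python
from collections import Counter

VALID_OUTCOMES = {"win", "loss", "tie"}

def win_margin(outcomes):
    """Return (wins - losses) / total comparisons, as a percentage.

    Assumed definition for illustration; the study may compute it differently.
    """
    counts = Counter(outcomes)
    unknown = set(counts) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unexpected outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons to score")
    return 100.0 * (counts["win"] - counts["loss"]) / total

# Made-up grades for one tool across ten blinded comparisons.
example = ["win"] * 5 + ["loss"] * 3 + ["tie"] * 2
print(f"{win_margin(example):+.1f}%")  # prints "+20.0%"
```

Under this reading, a tool that wins as often as it loses lands at zero, which would match the description of Claude and Gemini as "close to even."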

Digital Health Space

Friday, July 3, 2026

The Issues with Physicians and A.I.

