A new study just ran the most rigorous test of clinical AI accuracy I've seen, and the way they did it matters as much as what they found.
Here's the setup: A consortium from UCSF, Harvard, University of Washington, and Stanford, with JAMA's lead statistical editor, evaluated how well AI answers real clinical questions. They didn't use exam questions or vignettes. They used 620 point-of-care queries physicians actually asked, then had 149 practicing physicians grade the answers, blind, head-to-head, each scoring questions in their own specialty across 30 fields.
Here are three things that made this study stand out:
1. Real questions, not benchmarks. The queries came from what physicians genuinely ask at the point of care, not standardized test items that strip out the mess of real practice.
2. Expert, specialty-matched grading. Instead of models grading models, subspecialists judged answers in their own field, on accuracy, clinical utility, source quality, verifiability, and completeness.
3. Blinded and head-to-head. Vendor identities were hidden and answers rendered identically, so raters judged the content, not the brand.
Here's what they found:
The specialized clinical tool (OpenEvidence) outperformed the general-purpose models on every axis. On accuracy, its win margin was +24.7%, while GPT-5.5 sat at -21.1%, with Claude and Gemini close to even. The result held whether or not the physician had used the tool before.
When frontier models were used to grade the answers instead of humans, Claude and Gemini roughly tracked the expert rankings, but GPT-5.5 rated its own answers the winner almost every time. Even a panel of model judges barely correlated with the physicians. Using AI to grade AI carries real, hard-to-remove bias.
This is what fit-for-purpose evaluation looks like: real queries, expert judges matched to the specialty, blinded comparison. It's slower and more expensive than a benchmark, and far closer to the truth.
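If you're wondering where a number like a "+24.7%" win margin could come from, here is a minimal sketch of one common way to score blinded head-to-head comparisons. This is an assumption on my part, not the study's published method: I'm treating win margin as (wins minus losses) divided by total comparisons, with ties counted in the total. The function name `win_margins` and the tool labels are hypothetical.

```python
from collections import defaultdict


def win_margins(judgments):
    """Compute a net win margin (in percent) for each tool.

    `judgments` is an iterable of (tool_a, tool_b, winner) tuples, where
    `winner` is tool_a, tool_b, or None for a tie.

    Assumed definition (not taken from the study):
        margin = 100 * (wins - losses) / comparisons
    Ties count toward comparisons but toward neither wins nor losses.
    """
    wins = defaultdict(int)
    losses = defaultdict(int)
    comparisons = defaultdict(int)

    for tool_a, tool_b, winner in judgments:
        if winner not in (tool_a, tool_b, None):
            raise ValueError(f"winner {winner!r} is not one of the compared tools")
        comparisons[tool_a] += 1
        comparisons[tool_b] += 1
        if winner is None:
            continue  # a tie moves neither tool's wins nor losses
        loser = tool_b if winner == tool_a else tool_a
        wins[winner] += 1
        losses[loser] += 1

    return {
        tool: 100.0 * (wins[tool] - losses[tool]) / comparisons[tool]
        for tool in comparisons
    }


# Hypothetical blinded ratings: the rater only ever sees "answer 1 vs answer 2";
# the tool labels are re-attached after grading.
example = [
    ("tool_a", "tool_b", "tool_a"),
    ("tool_a", "tool_b", "tool_a"),
    ("tool_a", "tool_b", None),
    ("tool_a", "tool_b", "tool_b"),
]
margins = win_margins(example)
print(margins)
```

With these four made-up ratings, tool_a wins twice and loses once, so its margin is +25.0 and tool_b's is -25.0. A margin near zero, like the one reported for Claude and Gemini, would mean wins and losses roughly cancel out.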
If you rely on clinical AI, do you actually know how the tool you trust was tested?