A.I. Chatbots Defeated Doctors at Diagnosing Illness
A small study found ChatGPT outdid human physicians when assessing medical case histories, even when those doctors were using a chatbot.
In an experiment, doctors who were given ChatGPT to diagnose illness did only slightly better than doctors who did not. But the chatbot alone outperformed all the doctors.
By Gina Kolata
Nov. 17, 2024
Adam Rodman, an expert in internal medicine at Beth Israel Deaconess Medical Center in Boston, confidently expected that chatbots built to use artificial intelligence would help doctors diagnose illnesses.
He was wrong.
Instead, in a study Dr. Rodman helped design, doctors who were given ChatGPT-4 along with conventional resources did only slightly better than doctors who did not have access to the bot. And, to the researchers’ surprise, ChatGPT alone outperformed the doctors.
“I was shocked,” Dr. Rodman said.
The chatbot, from the company OpenAI, scored an average of 90 percent when diagnosing a medical condition from a case report and explaining its reasoning. Doctors randomly assigned to use the chatbot got an average score of 76 percent. Those randomly assigned not to use it had an average score of 74 percent.
The study showed more than just the chatbot’s superior performance.
It revealed doctors’ sometimes unwavering belief in a diagnosis they had made, even when a chatbot suggested a potentially better one.
The study also illustrated that while doctors are being exposed to artificial intelligence tools in their work, few know how to exploit the abilities of chatbots. As a result, the doctors failed to take advantage of A.I. systems’ ability to solve complex diagnostic problems and explain their diagnoses.
A.I. systems should be “doctor extenders,” Dr. Rodman said, offering valuable second opinions on diagnoses.
But it looks as if there is a way to go before that potential is realized.
Case History, Case Future
The experiment involved 50 doctors, a mix of residents and attending physicians, recruited through a few large American hospital systems, and was published last month in the journal JAMA Network Open.
The test subjects were given six case histories and were graded on their ability to suggest diagnoses and explain why they favored or ruled them out. Their grades also included getting the final diagnosis right.
The graders were medical experts who saw only the participants’ answers, without knowing whether they were from a doctor with ChatGPT, a doctor without it or from ChatGPT by itself.
The case histories used in the study were based on real patients and are part of a set of 105 cases that researchers have used since the 1990s. The cases have intentionally never been published, so that medical students and others can be tested on them without any foreknowledge. That also meant that ChatGPT could not have been trained on them.
But, to illustrate what the study involved, the investigators published one of the six cases the doctors were tested on, along with answers to the test questions on that case from a doctor who scored high and from one whose score was low. The published case described a man who had been treated with the blood thinner heparin for 48 hours after a procedure.
The man complained that he felt feverish and tired. His cardiologist had done lab studies that indicated a new onset of anemia and a buildup of nitrogen and other kidney waste products in his blood. The man had had bypass surgery for heart disease a decade earlier.
The case vignette went on to describe the man’s physical exam and then provided his lab test results.
The correct diagnosis was cholesterol embolism — a condition in which shards of cholesterol break off from plaque in arteries and block blood vessels.
Participants were asked for three possible diagnoses, with supporting evidence for each. For each possible diagnosis, they were also asked to name findings that did not support it or that were expected but absent.
The participants also were asked to provide a final diagnosis. Then they were to name up to three additional steps they would take in their diagnostic process.
Like the diagnosis for the published case, the diagnoses for the other five cases in the study were not easy to figure out. But neither were they so rare as to be almost unheard of. Yet the doctors on average did worse than the chatbot.
What, the researchers asked, was going on?
The answer seems to hinge on questions of how doctors settle on a diagnosis, and how they use a tool like artificial intelligence.
A key ingredient is how doctors use prompts to query ChatGPT.