Without cracking a single textbook, without spending a day in medical school, the co-author of a preprint study correctly answered enough practice questions that it would have passed the real US Medical Licensing Examination.
But the test-taker wasn’t a member of Mensa or a medical savant; it was the artificial intelligence ChatGPT.
The tool, which was created to answer user questions in a conversational manner, has generated so much buzz that doctors and scientists are trying to determine what its limitations are – and what it could do for health and medicine.
ChatGPT, or Chat Generative Pre-trained Transformer, is a natural language-processing tool driven by artificial intelligence.
The technology, created by San Francisco-based OpenAI and launched in November 2022, is not like a well-spoken search engine. It isn’t even connected to the internet. Rather, a human programmer feeds it a vast amount of online data that’s kept on a server.
It can answer questions even if it has never seen a particular sequence of words before, because ChatGPT’s algorithm is trained to predict what word will come up in a sentence based on the context of what comes before it. It draws on knowledge stored on its server to generate its response.
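To make the idea of predicting the next word from what came before concrete, here is a toy sketch in Python. It is not how ChatGPT is built: the real system uses a large neural network and looks at far more context than one word. The function names and the tiny sample text are invented purely for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical training text; a real model learns from vastly more data.
corpus = "the patient has a fever the patient has a cough the doctor has a plan"
model = train_bigram_model(corpus)

print(predict_next(model, "patient"))  # most common word after "patient"
print(predict_next(model, "the"))      # most common word after "the"
```

Even this simple counter can continue a phrase it has never seen in full, because it relies only on patterns in the words that came before.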
ChatGPT can also answer follow-up questions, admit mistakes and reject inappropriate questions, the company says. It’s free to try while its makers are testing it.
Artificial intelligence programs have been around for a while, but this one generated so much interest that medical practices, professional associations and medical journals have created task forces to see how it might be useful and to understand what limitations and ethical concerns it may bring.
Dr. Victor Tseng’s practice, Ansible Health, has set up a task force on the issue. The pulmonologist is a medical director of the California-based group and a co-author of the study in which ChatGPT demonstrated that it could probably pass the medical licensing exam.
Tseng said his colleagues started playing around with ChatGPT last year and were intrigued when it accurately diagnosed pretend patients in hypothetical scenarios.
“We were just so impressed and truly flabbergasted by the eloquence and sort of fluidity of its response that we decided that we should actually bring this into our formal evaluation process and start testing it against the benchmark for medical knowledge,” he said.
That benchmark was the three-part test that US med school graduates have to pass to be licensed to practice medicine. It’s generally considered one of the toughest exams of any profession because it doesn’t ask straightforward questions with answers that can easily be found on the internet.
The exam tests basic science and medical knowledge and case management, but it also assesses clinical reasoning, ethics, critical thinking and problem-solving skills.
The study team used 305 publicly available test questions from the June 2022 sample exam. None of the answers or related context was indexed on Google before January 1, 2022, so they would not be a part of the information on which ChatGPT trained. The study authors removed sample questions that had visuals and graphs, and they started a new chat session for each question they asked.
Students often spend hundreds of hours preparing, and medical schools typically give them time away from class just for that purpose. ChatGPT had to do none of that prep work.
The AI performed at or near passing for all the parts of the exam without any specialized training, showing “a high level of concordance and insight in its explanations,” the study says.
Tseng was impressed.
“There’s a lot of red herrings,” he said. “Googling or trying to even intuitively figure out with an open-book approach is very difficult. It might take hours to answer one question that way. But ChatGPT was able to give an accurate answer about 60% of the time with cogent explanations within five seconds.”
Dr. Alex Mechaber, vice president of the US Medical Licensing Examination at the National Board of Medical Examiners, said ChatGPT’s passing results didn’t surprise him.
“The input material is really largely representative of medical knowledge and the type of multiple-choice questions which AI is most likely to be successful with,” he said.
Mechaber said the board is also testing ChatGPT with the exam. The members are especially interested in the answers the technology got wrong, and they want to understand why.
“I think this technology is really exciting,” he said. “We were also pretty aware and vigilant about the risks that large language models bring in terms of the potential for misinformation, and also potentially having harmful stereotypes and bias.”
He believes the technology has potential.
“I think it’s going to get better and better, and we are excited and want to figure out how do we embrace it and use it in the right ways,” he said.
Already, ChatGPT has entered the discussion around research and publishing.
The results of the medical licensing exam study were even written up with the help of ChatGPT. The technology was originally listed as a co-author of the draft, but Tseng says that when the study is published in the journal PLOS Digital Health this year, ChatGPT will not be listed as an author because it would be a distraction.
Last month, the journal Nature created guidelines that said no such program could be credited as an author because “any attribution of authorship carries with it accountability for the work, and AI tools cannot take such responsibility.”
But an article published Thursday in the journal Radiology was written almost entirely by ChatGPT. It was asked whether it could replace a human medical writer, and the program listed many of its possible uses, including writing study reports, creating documents that patients will read and translating medical information into a variety of languages.
Still, it does have some limitations.
“I think it definitely is going to help, but everything in AI needs guardrails,” said Dr. Linda Moy, the editor of Radiology and a professor of radiology at the NYU Grossman School of Medicine.
She said ChatGPT’s article was pretty accurate, but it made up some references.
One of Moy’s other concerns is that the AI could fabricate data. It’s only as good as the information it’s fed, and with so much inaccurate information available online about things like Covid-19 vaccines, it could draw on that material to generate inaccurate results.
Moy’s colleague Artie Shen, a graduating Ph.D. candidate at NYU’s Center for Data Science, is exploring ChatGPT’s potential as a kind of translator for other AI programs for medical imaging analysis. For years, scientists have studied AI programs from startups and larger operations, like Google, that can recognize complex patterns in imaging data. The hope is that these could provide quantitative assessments that could potentially uncover diseases, possibly more effectively than the human eye.
“AI can give you a very accurate diagnosis, but they will never tell you how they reach this diagnosis,” Shen said. He believes that ChatGPT could work with the other programs to capture their rationale and observations.
“If they can talk, it has the potential to enable those systems to convey their knowledge in the same way as an experienced radiologist,” he said.
Tseng said he ultimately thinks ChatGPT can enhance medical practice in much the same way online medical information has both empowered patients and forced doctors to become better communicators, because they now have to provide insight around what patients read online.
ChatGPT won’t replace doctors. Tseng’s group will continue to test it to learn why it creates certain errors and what other ethical parameters need to be put in place before using it for real. But Tseng thinks it could make the medical profession more accessible. For example, a doctor could ask ChatGPT to simplify complicated medical jargon into language that someone with a seventh-grade education could understand.
“AI is here. The doors are open,” Tseng said. “My fundamental hope is, it will actually make me and make us as physicians and providers better.”