Feb 13 2023

ChatGPT Almost Passes Medical Licensure Exams

The emergence of several AI applications for public use, such as DALL-E 2, Midjourney, and ChatGPT, has made AI one of the biggest science news items of the past year. I have written about it here extensively myself, and have been using these applications to get a feel for what they can, and cannot, do. The capability of these systems, however, is a rapidly moving target.

Recently I wrote about the potential for a ChatGPT-like application as an expert system, specifically to aid in medical practice. Already there is an update worthy of a new post (also posted on SBM). For background, ChatGPT is a large language model, essentially a powerful chatbot able to produce coherent natural-language responses to user prompts. Ask it a question or give it a task and it will spit out a fairly decent response. It is trained on data from the internet up to 2021. The application has many teachers freaking out because it produces good essay responses, at least at a high school level. I don’t think this will ultimately be a problem, but it will force teachers to rethink essay-based assignments.
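To make that prompt-and-response loop concrete, here is a minimal sketch of asking a GPT-style model a question programmatically through OpenAI’s Python client. The model name, prompt, and API-key handling are illustrative placeholders, not anything specific to ChatGPT’s own interface.

```python
# Minimal sketch: send a prompt to a GPT-style model and print the reply.
# Assumes the openai Python package (v0.x) and an API key in the environment.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # illustrative model name
    messages=[
        {"role": "user", "content": "Explain, in two sentences, what a large language model is."},
    ],
)

print(response.choices[0].message.content)
```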

As a marker of the real-world potential of these AI apps, Microsoft has reportedly put billions of dollars into OpenAI, the company behind ChatGPT, and is incorporating the technology into their Bing search engine. Google has countered with their own application, Bard, which is off to a bumpy start, but give them time. The next version of ChatGPT, version 4, is coming out soon, and promises to be even more powerful and up-to-date. The bottom line – expect to see this software everywhere, incorporated into the background of our computing experience. In fact, ChatGPT will be writing that software.

It is always a question, however, how the public will interface with new technology and how they will feel about it. Once we get over the novelty and hype stage, will the general public incorporate the new tech into their daily lives? The smartphone is perhaps the best recent example of a new technology that rapidly changed the world. The Segway is the iconic counterexample. I think the answer for the new AI apps depends on how they are applied. One “killer app” and soon we won’t remember how we got by without this technology. My prediction is that ChatGPT-type AI applications will be excellent personal digital assistants.

What I discussed on SBM is the potential for ChatGPT-style AI software to be an excellent expert system for medical professionals. What these systems are great at is having a massive database of information at their digital fingertips. They can quickly comb through that information and provide a readable executive summary. The medical world is crying out for such an application, as we are increasingly buried in a non-stop avalanche of new research, practice standards, and treatment options. This can, and should, become an essential tool for any clinician.

I am apparently not the only person to have this (admittedly obvious) idea. Stanford University has created PubMedGPT – a version of ChatGPT trained exclusively on the medical literature. At the very least this can serve as an excellent search engine – “Show me all published studies in the last 2 years relating to treatment X of disease Y.” PubMed is an invaluable and necessary resource, but its search engine is somewhat clunky. I will often combine it with a Google search, which is simply better at searching. If nothing else, I would like to see PubMed incorporate ChatGPT technology into its search engine.
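In the meantime, the plain search-engine part of that wish is already scriptable, since PubMed exposes a public API (the NCBI E-utilities). A rough sketch of the “last 2 years on treatment X of disease Y” query might look like this – the drug and disease terms and the date window are just placeholders:

```python
# Rough sketch of a "studies from the last 2 years on treatment X of disease Y"
# query against PubMed's E-utilities esearch endpoint. Terms are placeholders.
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

params = {
    "db": "pubmed",
    "term": "metformin AND type 2 diabetes",  # stand-in for "treatment X of disease Y"
    "datetype": "pdat",   # restrict by publication date
    "reldate": 730,       # roughly the last two years, in days
    "retmax": 20,
    "retmode": "json",
}

resp = requests.get(ESEARCH_URL, params=params, timeout=30)
resp.raise_for_status()
pmids = resp.json()["esearchresult"]["idlist"]

print(f"{len(pmids)} recent PubMed IDs: " + ", ".join(pmids))
```

A ChatGPT-style layer on top of something like this would translate the plain-English question into the query and then summarize the abstracts that come back – which is essentially the expert-system use case described above.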

To test the model, PubMedGPT was given the three-part USMLE exam, which doctors have to pass in order to become licensed. Pass/fail is determined from a scaled score, but usually corresponds to around 60% of the questions (it’s a really hard test, so don’t think that’s a bad performance). PubMedGPT scored 50.8%, which is not passing but pretty good for a chatbot. Many of the questions are subtle and conceptually complicated, so that is an impressive performance.

However, ChatGPT (again, trained on the internet as of 2021) was also given the test. It performed between 52.4% and 75% on the three exams, with an average score just below the 60% threshold. To be clear, it would not have passed all three exams, but this is an impressive result. It’s also better than PubMedGPT, which is interesting. I wonder how a GPT app would do if it were first trained on the internet and then further trained on PubMed, giving priority to information from the latter.

We should think of this result the same way we think of the first time a computer program came close to beating a world chess champion. Before long those chess programs were so good no human player could come close to them. Similarly, I don’t think it will be long (if development on this specific application continues) before we have GPT medical expert systems scoring 80% correct, and then eventually >90% correct.

ChatGPT is also passing law school exams and MBA exams. Again, it is not outperforming the best students, or even average students, but give it time.

This is all good. It shows the potential of this application of AI technology. I look forward to the day when I have a MedicalGPT application on my clinic desktop, ready to provide up-to-date information to assist my clinical decision-making. Think of the healthcare savings. Microsoft is putting billions of dollars into getting an edge in the search engine wars. We can put billions of dollars into improving healthcare.
