Axios ran the sensational headline: ChatGPT plays doctor with 72% success.
Driving the news: A new study from Mass General Brigham researchers testing ChatGPT’s performance on textbook-drawn case studies found the AI bot achieved 72% accuracy in overall clinical decision making, ranging from identifying possible diagnoses to making final diagnoses and care decisions.
Let’s break this down a little.
Textbook Case Studies
ChatGPT depends on its training, which basically read the internet and an unknown number of books. This means it likely had access to these case studies, or something similar. In other words, it already knew the answer to the test question. That is not the same thing as a real diagnosis.
Accuracy Changes
ChatGPT’s models are incredibly opaque — even by AI standards.
First, ChatGPT’s accuracy has already been observed to change. And it isn’t just ChatGPT; it is an issue with AI models more broadly:
AI drift occurs when an AI system’s performance and behavior change over time, often due to the evolving nature of the data it interacts with and learns from. This can result in the Artificial intelligence system making predictions or decisions that deviate from its original design and intended purpose. In essence, AI model drift is a form of algorithmic bias that can lead to unintended consequences and potentially harmful outcomes.
Analytics Insight
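One practical consequence of drift: a benchmark score is a snapshot, not a property of the model. A minimal sketch of how you might check for drift, using a fixed benchmark re-run at two points in time (the benchmark data and threshold here are made up for illustration):

```python
# Hypothetical illustration of detecting model drift: re-run the same
# fixed benchmark at different times and compare accuracy.

def accuracy(answers, expected):
    """Fraction of benchmark questions answered correctly."""
    correct = sum(1 for a, e in zip(answers, expected) if a == e)
    return correct / len(expected)

def drift(baseline_acc, current_acc, tolerance=0.05):
    """Flag drift if accuracy moved more than `tolerance` from baseline."""
    return abs(current_acc - baseline_acc) > tolerance

# Expected answers for a fixed 10-question benchmark (made-up data).
expected = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]

answers_march = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "C"]  # 9/10 correct
answers_july  = ["A", "B", "C", "A", "D", "B", "C", "A", "A", "C"]  # 6/10 correct

acc_march = accuracy(answers_march, expected)  # 0.9
acc_july = accuracy(answers_july, expected)    # 0.6

print(f"March: {acc_march:.0%}, July: {acc_july:.0%}, drifted: {drift(acc_march, acc_july)}")
```

The point is not the arithmetic; it is that a 72% result reported once tells you nothing about the model you would query today unless the benchmark is re-run on an ongoing basis.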
Second, OpenAI puts out new versions of its models every few months. You don’t notice this as an end-user, but it can make a big difference in output, and in the ability to truly test accuracy.
Below is a partial list of OpenAI models and the dates they will be shut down: after that, no more access to them at all.
This means that any study done today on one of these models can’t be replicated in January. Every few months the models need to be re-tested for accuracy.
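At a minimum, a study should pin an exact model snapshot and record it alongside every result, so a reader can at least tell which version was tested even after it disappears. A minimal sketch of that record-keeping (the snapshot id shown is illustrative, not from the study):

```python
from dataclasses import dataclass, asdict
from datetime import date
import json

# Record the exact model snapshot with every result, so the version
# tested is documented even after the model is deprecated.
@dataclass
class EvalRecord:
    model_snapshot: str  # a dated snapshot id, not a floating alias like "gpt-4"
    run_date: str
    accuracy: float

record = EvalRecord(
    model_snapshot="gpt-4-0613",  # illustrative snapshot id
    run_date=date(2023, 8, 22).isoformat(),
    accuracy=0.72,
)

print(json.dumps(asdict(record)))
```

Pinning a dated snapshot instead of a floating alias does not make a closed model reproducible, but it does make clear exactly which version produced a given number.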
A lead AI researcher says:
“Any results on closed-source models are not reproducible and not verifiable, and therefore, from a scientific perspective, we are comparing raccoons and squirrels,” [Sasha Luccioni of Hugging Face] told Ars.
Ars Technica
Partial ChatGPT Deprecation Schedule
| SHUTDOWN DATE | MODEL | PRICE | RECOMMENDED REPLACEMENT |
|---|---|---|---|
| 2024-01-04 | text-ada-001 | $0.0004 / 1K tokens | gpt-3.5-turbo-instruct |
| 2024-01-04 | text-babbage-001 | $0.0005 / 1K tokens | gpt-3.5-turbo-instruct |
| 2024-01-04 | text-curie-001 | $0.0020 / 1K tokens | gpt-3.5-turbo-instruct |
| 2024-01-04 | text-davinci-001 | $0.0200 / 1K tokens | gpt-3.5-turbo-instruct |
| 2024-01-04 | text-davinci-002 | $0.0200 / 1K tokens | gpt-3.5-turbo-instruct |
| 2024-01-04 | text-davinci-003 | $0.0200 / 1K tokens | gpt-3.5-turbo-instruct |
It is an open question whether LLMs will be able to perform at this kind of level in the future, but we can’t count on them today.
This Distracts from Helpful Machine Learning
ML (a more specific form of AI) is already proven in other areas. For example, the Mayo Clinic uses ML in radiology.
“Radiology has had the lead, partly because AI is driven by data, and radiology has a lot of digital data already ready to be used by AI.”
Radiology has a narrow context and a well-defined learning task: we can specify what radiological images look like and whether or not they show areas of concern. This is different from an LLM like ChatGPT, where the learning scope is so broad we don’t understand it: it really is a black box. [1]
Look Beyond The Hype
You need to look beyond the hype to understand where AI is making gains today. You’ll usually find that information in less-mainstream publications, and in headlines that are non-sensational.
[1] You could argue that this is a matter of scale, and that radiology is a much smaller black box. You’d be right, but the number of variables is so vastly different that it’s more than apples and oranges. In addition, testing radiology outcomes is relatively straightforward, unlike broader medical diagnoses.

[2] You know where LLMs work well? Summaries: the SEO summary for this post was created by ChatGPT: “ChatGPT achieves 72% accuracy in clinical textbook case studies, but concerns arise over drift, model opaqueness, and frequent updates. Don’t go to Dr. ChatGPT.”
ChatGPT, Model GPT-4. August 3 version