Researchers have trained artificial intelligence (AI) systems to interpret the results of visual tests such as mammograms, MRIs, and tissue biopsies. As AI becomes more and more capable, some analysts have suggested that these models could replace humans in medical diagnostics.
But now, new research casts doubt on the ability of current AI models to deliver reliable results and highlights serious flaws that could hinder the use of AI in healthcare.
you may like
They call this phenomenon a “mirage,” and this is the first time this effect has been shown across multiple AI models used to interpret images across multiple disciplines.
“What we’re showing is that even if the AI is describing something very specific that you think, ‘Oh, there’s no way I could make that up,’ yes, they can make it up,” said Mohammad Asadi, the study’s lead author and a data scientist at Stanford University. “They can make up something very rare and very specific.”
When AI recognizes something that isn’t there
AI “hallucinations” are well-documented and include models embedding fabricated details, such as misquotes from real essays. This is often caused by the AI making inaccurate or illogical predictions based on the training data provided. Scientists called the phenomenon a “mirage” in the new study because the AI created its own explanations of the original images and created answers based on those non-existent images.
In the study, the researchers gave 12 models text input prompts such as “Please identify the type of tissue present in this histology slide.” Then either you provided an image for the slide, or you didn’t. If an image was not provided to the model, it would sometimes alert the human user that no image was provided. However, in most cases the model will instead describe the non-existent image and provide an answer to the original prompt.
The researchers observed this “mirage mode” across 20 fields, testing the model’s interpretation of images ranging from satellites to crowds to birds. The mirage effect was seen across all disciplines and all AI models at various levels. But it was especially noticeable in medical diagnosis.
When given text prompts for brain MRIs, chest X-rays, electrocardiograms, and pathology slides, and in the absence of actual images, the AI model’s answers tended to be biased toward diagnoses that required immediate clinical follow-up. Therefore, the researchers concluded that when AI is used in clinical decision-making, it may encourage more aggressive medical treatment than necessary.
Why does AI invent images?
So how does an AI model describe an image that doesn’t exist?
What to read next
Models trained using large amounts of textual and visual data aim to find answers to questions in as few steps as possible. And research shows they will take every shortcut possible to provide answers. Therefore, the model may end up relying only on this trained logic and not on the images provided.
Interestingly, the researchers found that the AI model also performed better in Mirage mode against benchmark tests typically used to assess accuracy. These standardized tests ask the model to complete a task, such as answering multiple-choice questions, and compare its performance to the expected output answer key.
Researchers can fine-tune benchmark tests to assess AI’s visual understanding of images, but this approach doesn’t take into account questions that can be answered based on mirages. Additionally, AI models are often trained on the same data that is used as a reference to create benchmark tests. Therefore, the model can answer questions based on its reference data, rather than actually interpreting the image.
According to Asadi, this is a problem because there is no way to tell whether the AI model actually analyzed the image or just made it up. If you are uploading a large number of images, but some images are corrupted or missing from your dataset, the model may not tell you that. And based on the mirage image, it may be possible to provide a very consistent, comprehensive, and convincing answer.
”[AI models] “They’re very good at interpreting images, but on the other hand, they’re also very, very good at convincing us of things and speaking to us in an authoritative way,” Asadi said.
Its authority is evidenced by the fact that many consumers contact AI chatbots for health guidance, with approximately one-third of U.S. adults reporting doing so. The authority of this conversation increases the risk that fabricated or overconfident output will be trusted by both the general public and medical professionals, the study authors said.
“We urgently need a new generation of assessment frameworks that rigorously measure true cross-modal integration, ensuring that AI truly ‘sees’ the pathology and not just ‘reads’ the clinical situation,” Hongye Zeng, a biomedical AI researcher in the UCLA Department of Radiology who was not involved in the study, told Live Science via email.
This study shows that while AI is becoming an increasingly useful tool in medical diagnostics, there are still aspects of its inner workings that are not understood. Adashi believes that AI models can discover things that medical professionals may have missed, but he also believes there should be limits to how much we trust them.
AI companies have tried to put higher guardrails to prevent their models from hallucinating or spreading misinformation, but Asadi warned that even these safeguards cannot completely prevent the mirage effect.
Source link
