Popular AI chatbots often fail to recognize false health claims presented in confident, medical-sounding language, leading them to repeat dubious advice that could endanger the general public, such as recommending inserting garlic into the rectum, according to a January study published in The Lancet Digital Health. Another study, published in Nature Medicine in February, found that chatbots were no better than regular internet searches at helping people make medical decisions.
The results add to a growing body of evidence suggesting that such chatbots are not reliable sources of health information, at least not for the general public, experts told Live Science.
“The crux of the issue is that LLMs don’t fail in the same way that doctors fail,” Dr. Mahmoud Omar, a research scientist at Mount Sinai Medical Center and co-author of the Lancet Digital Health study, told Live Science in an email. “Doctors who are unsure pause, hedge and order another test. An LLM provides the wrong answer with exactly the same confidence as the right answer.”
“Rectal Garlic Insertion for Immune Support”
LLMs are designed to respond to written input, such as medical questions, with natural-sounding text. General-purpose models such as ChatGPT and Gemini, along with medically focused LLMs such as Ada Health and ChatGPT Health, are trained on vast amounts of data, including large volumes of medical literature, and achieve near-perfect scores on medical licensing exams.
And people use them widely. Although most LLMs include a warning that they should not be relied on for medical advice, more than 40 million people ask ChatGPT medical questions every day.
But in the January study, researchers tested 20 models with more than 3.4 million prompts, drawn from public forums and social media conversations, from real hospital discharge notes edited to include one false recommendation, and from doctor-approved fabricated accounts, to assess how well the LLMs resist medical misinformation.
“About one in three times, the models encountered medical misinformation and just followed it,” Omar said. “The finding that caught us off guard wasn’t the overall susceptibility; it was the pattern.”
When a false medical claim was presented in casual, Reddit-style language, the models were much more skeptical, failing only about 9% of the time. But when the very same claims were repackaged in formal clinical language, such as discharge notes advising patients to “drink cold milk daily for esophageal bleeding” or to “insert garlic into the rectum for immune support,” the models failed 46% of the time.
The reason for this may be structural. Because LLMs are text-trained, they have learned that clinical language implies authority, but they do not test whether claims are true. “They evaluate whether it sounds like something a reliable source would say,” Omar said.
But when the misinformation was framed using logical fallacies (“a senior clinician with 20 years of experience supports this” or “everyone knows this works”), the models became even more skeptical. This is because LLMs have “learned that you can’t trust the rhetorical tricks of internet discussions, but not the language of clinical documents,” Omar added.
As a result, Omar believes that LLMs cannot be trusted to evaluate and communicate medical information.
No better than an internet search
In the Nature Medicine study, researchers examined how well chatbots help people make medical decisions, such as whether to see a doctor or go to the emergency room. They concluded that LLMs provided no better insight than traditional internet searches, because participants did not always ask the right questions and the answers they received often contained a mix of good and bad recommendations, making it difficult to know what to do.
That doesn’t mean all chatbot replies are garbage.
AI chatbots “can provide pretty good recommendations; [they’re] just not reliable, at least to some extent,” Marvin Kopka, an AI researcher at the Technical University of Berlin who was not involved in the study, told Live Science via email.
The problem, Kopka said, is that non-experts have “no way to tell whether the output is correct or not.”
For example, a chatbot could advise whether a bad headache after a night out is meningitis warranting a trip to the ER, or something more benign, according to the study. However, users have no way of knowing whether the advice is reliable, and a recommendation to wait and see can be dangerous. “While it is probably helpful in many situations, it can be actively harmful in others,” Kopka said.
This finding suggests that chatbots are not a good tool for the general public to use for health decision-making.
That doesn’t mean chatbots aren’t useful in healthcare, Omar said, “just that they’re not useful in the way people are using them today.”
Bean, A. M., Payne, R. E., Parsons, G., Kirk, H. R., Ciro, J., Mosquera-Gomez, R., M. S. H., Ekanayaka, A. S., Tarasenko, L., Roche, L., & Mahdi, A. (2026). Confidence in LLMs as medical assistants for the general public: A randomized, preregistered study. Nature Medicine, 32(2), 609–615. https://doi.org/10.1038/s41591-025-04074-y