Large language models (LLMs) oversimplify, and in some cases misrepresent, important scientific and medical findings, and they appear to be getting worse at this with each new version, new research finds.
In an analysis of 4,900 summaries of research papers, scientists found that versions of ChatGPT, Llama, and DeepSeek were five times more likely than human experts to oversimplify scientific findings.
When given a prompt for accuracy, the chatbots were twice as likely to overgeneralize findings than when asked for a simple summary. The testing also revealed an increase in overgeneralization in newer chatbot versions compared with previous generations.
The researchers published their findings April 30 in the journal Royal Society Open Science.
“One of the biggest challenges is that generalization can seem benign, or even helpful, until you realize it has changed the meaning of the original research,” Uwe Peters, a postdoctoral researcher at the University of Bonn in Germany, told Live Science in an email. “What we add here is a systematic method for detecting when models generalize beyond what is warranted in the original text.”
Like a photocopier with a faulty lens that makes each successive copy larger and bolder than the original, LLMs filter information through a series of computational layers. Along the way, some information can be lost or its meaning can shift in subtle ways. This is especially true of scientific research, because scientists must frequently include qualifications, context and limitations in their findings, which makes producing a summary that is both simple and accurate very difficult.
“Whereas earlier LLMs were more likely to avoid answering difficult questions, newer, larger and more capable models, instead of refusing to answer, often produced authoritative but flawed responses that were misleading,” the researchers wrote.
Related: AI is just as overconfident and biased as humans can be, study shows
In one example from the study, DeepSeek turned a neutral description into a medical recommendation in one summary by rendering the finding as “a safe and effective treatment option.”
Another test in the study showed that a summary broadened the scope of effectiveness of a drug for treating type 2 diabetes in young people by omitting information about the drug’s dosage, frequency and effects.
If published, that chatbot-generated summary could lead healthcare professionals to prescribe the drug outside of its effective parameters.
Unsafe treatment options
In the new study, researchers worked to answer three questions about the 10 most popular LLMs: four versions of ChatGPT, three versions of Claude, two versions of Llama, and one of DeepSeek.
They wanted to see whether, when presented with a human summary of an academic journal article and prompted to summarize it, an LLM would overgeneralize the summary and, if so, whether asking for a more accurate answer would yield a better result. The team also sought to find out whether LLMs overgeneralize more than humans do.
The findings revealed that the LLMs, with the exception of Claude, which performed well on all testing criteria, were twice as likely to produce overgeneralized results when given a prompt for accuracy. LLM summaries were also nearly five times more likely than human-generated summaries to render generalized conclusions.
The researchers also found that the most common form of overgeneralization, and the one most likely to lead to unsafe treatment options, was LLMs converting quantified data into generic statements.
These transitions and overgeneralizations can lead to biases, according to experts at the intersection of AI and healthcare.
“This study highlights that biases can also take more subtle forms, like the quiet inflation of a claim’s scope,” Max Rollwage, vice president of AI and research at Limbic, a clinical mental health AI technology company, told Live Science in an email. “In domains like medicine, LLM summarization is already a routine part of workflows. That makes it even more important to examine how these systems perform and whether their outputs can be trusted to faithfully represent the original evidence.”
Such findings should push developers to create workflow guardrails that identify oversimplifications and omissions of critical information before putting outputs into the hands of public or professional groups, Rollwage said.
Though comprehensive, the study had limitations. Future research would benefit from extending the testing to other scientific tasks and to non-English texts, as well as from testing which types of scientific claims are most prone to overgeneralization, said Patricia Thaine, co-founder and CEO of Private AI, an AI development company.
Rollwage also said that “a deeper prompt engineering analysis might have improved or clarified results,” while Peters sees larger risks on the horizon as reliance on chatbots grows.
“Tools like ChatGPT, Claude and DeepSeek are increasingly part of how people understand scientific findings,” he wrote. “As their use continues to grow, this poses a real risk of large-scale misinterpretation of science at a moment when public trust and scientific literacy are already under pressure.”
For other experts in the field, the challenge lies in the neglect of specialized knowledge and safeguards.
“Models are trained on simplified science journalism rather than, or in addition to, primary sources, so they inherit those overstatements,” Thaine wrote to Live Science.
“But, importantly, we are also applying general-purpose models to specialized domains without appropriate expert oversight, which is a fundamental misuse of the technology that often requires more task-specific training.”