As large language models (LLMs) such as ChatGPT, Claude, and Gemini gain public acceptance, scientists have explored how these artificial intelligence (AI) tools can enhance medical research.
Some argue that an LLM has the potential to dramatically increase a researcher’s efficiency in completing certain types of medical research. A study published in February in the journal Cell Reports Medicine exemplifies that vision for the technology.
This study used a large dataset of patient biomedical information to predict the risk of preterm birth in a given pregnancy. These types of predictions have been a powerful AI use case for many years, and they typically rely on traditional machine learning rather than LLMs. However, this study was notable in that LLMs enabled junior researchers (graduate students and high school students) to efficiently generate highly accurate code.
The code predicted the baby's gestational age at birth and the likelihood of premature birth. The AI's output matched, and in some cases exceeded, the analysis produced by a team of experts who processed the same data using human-written code.
“What I saw here with the young scientists and how effective they were really inspired and surprised me,” said study co-author Marina Sirota, interim director of the Baker Institute for Computational Health Sciences at the University of California, San Francisco.
One of the great promises of LLMs is to lower the barriers for researchers to write code and perform complex analyses, but this comes with risks. As AI advances rapidly, researchers must address countless questions: What guardrails need to be established to ensure AI accuracy? How will its output be measured? And as these systems gain traction, how will the role of human researchers evolve?
How AI predictions work
Sirota’s team drew on data from the Dialogue on Reverse Engineering Assessment and Methods (DREAM) Challenge, an international competition in which teams of scientists use shared datasets to tackle complex biomedical problems.
The open-source dataset includes blood transcriptomics data, which examine RNA, a molecule that reflects which genes are active in the body. It also includes epigenetic information from placental cells, which describes the chemical tags that sit ‘on top’ of the DNA and control which genes are turned on, as well as microbiome data, which describes the bacteria present in vaginal fluid samples.
These data points were labeled by sample type, such as blood, placental tissue, or vaginal fluid, and by the outcome of interest, such as gestational age or preterm birth. A machine learning algorithm can then be trained to identify links between a sample's features and its label. For example, it may turn out that microbiome samples containing a particular mixture of bacteria often come from people who give birth prematurely.
After training on a subset of the data, researchers can test the algorithm on unlabeled samples to see whether it predicts the correct labels. For example, samples containing bacterial mixtures similar to those associated with a high risk of preterm birth in the training data should be flagged as high-risk.
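The train-then-test workflow described above can be sketched in a few lines. This is a hypothetical toy example: the synthetic numbers stand in for real microbiome measurements, and a simple learned threshold stands in for a real machine learning model.

```python
import random

random.seed(0)

# Synthetic labeled samples: (marker abundance, label). In this toy rule,
# abundance above 0.5 means the pregnancy ended preterm (label 1).
samples = [(x, int(x > 0.5)) for x in (random.random() for _ in range(200))]

# Split into a labeled training subset and a held-out test subset.
train, test = samples[:150], samples[150:]

# "Training": learn a decision threshold from the labeled training samples.
preterm = [x for x, y in train if y == 1]
term = [x for x, y in train if y == 0]
threshold = (min(preterm) + max(term)) / 2

# "Testing": predict labels for held-out samples, then score accuracy.
predictions = [int(x > threshold) for x, _ in test]
correct = sum(p == y for p, (_, y) in zip(predictions, test))
print(f"held-out accuracy: {correct / len(test):.2f}")
```

A real pipeline would use a proper model (for example, a regularized classifier from a machine learning library) and cross-validation, but the shape of the workflow — train on labeled data, predict on held-out data, score the predictions — is the same.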
The final step is to evaluate and compare the accuracy of the models. “Accuracy” in the context of machine learning has a specific definition: the number of correct predictions divided by the total number of predictions.
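That definition of accuracy is simple enough to compute directly. Here is a minimal illustration with made-up predictions, not values from the study:

```python
# Accuracy = correct predictions / total predictions.
# Illustrative values only; 1 = preterm, 0 = full term.
predicted = [1, 0, 1, 1, 0, 1, 0, 0]
actual    = [1, 0, 0, 1, 0, 1, 1, 0]

correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(accuracy)  # 6 correct out of 8 -> 0.75
```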
Human-generated code vs. AI-generated code
The DREAM Challenge aimed to clarify the relationship between these medical indicators and the risk of preterm birth. Some risk factors are already well known, such as infections during pregnancy. But the challenge sought to find out what kinds of signals could be extracted from clinical samples such as blood.
This is the type of work that typically requires months of effort by a trained bioinformatician. But rather than writing the analysis code themselves, the junior researchers in the recent study gave each of the eight LLMs a single prompt that described the available data and the labeling task at hand: predicting gestational age or preterm birth.
This single prompt generated code that executed successfully for four of the eight models tested (DeepSeek-R1, Gemini, and ChatGPT o3-mini-high and 4o). OpenAI’s o3-mini-high, which performed best, was as accurate as the original human DREAM Challenge teams. In one task, estimating gestational age from epigenetic data, it was even more accurate than the humans.
Additionally, the junior researchers generated results in about three months and submitted a manuscript describing their results within six months, whereas the original DREAM Challenge team took years to complete the same process.
“We’ve had some good luck with the review process here, but six months to generate results and write a paper is pretty incredible, especially for a young scientist,” Sirota told Live Science.
Preterm birth, defined as birth before 37 weeks of gestation, affects approximately 11% of infants worldwide. Babies born too early are at higher risk of developing a variety of health problems than babies born full term, including problems affecting the brain, eyes, and digestive system. Being able to predict which pregnant patients are likely to give birth prematurely could enable closer monitoring and treatment to protect the baby and increase the chances of a full-term birth, experts say.
Beyond writing code
The data used in the Cell Reports Medicine paper started off “in good shape” with tables that were easily readable by AI, Sirota said. “But with generative AI, you can also speed up the data cleaning and normalization part,” she said.
Sirota’s team is currently considering other LLM applications, including a new tool they developed called ChatPTB (short for “preterm birth”). The ChatGPT-based tool draws on papers published by the March of Dimes research network, part of a nonprofit organization dedicated to improving maternal and child health. Instead of manually combing through this literature, researchers can now query ChatPTB and receive synthesized answers that include references. What used to take hours is now reduced to seconds.
But tools like Chat PTB and the coding approach in Sirota’s research are just the first wave. AI-powered medical research is moving toward “agent-based” AI. This means that the system performs multi-step research workflows with increased autonomy, rather than only responding to a single prompt.
Instead of responding only with text, agents can review and iterate on their work until they achieve their goals. Beyond simply writing code, they can also perform actions on a researcher’s behalf, such as searching the internet and running code.
The shift toward more AI autonomy and less human oversight offers both huge opportunities and serious risks. In a January study published in the journal Nature Biomedical Engineering, researchers evaluated LLMs on 293 coding tasks drawn from 39 published biomedical studies, initially allowing the LLMs to devise their own workflows. The models’ overall accuracy was less than 40%.
Their solution was to separate planning and execution. The AI created a step-by-step analysis plan, which was reviewed by human researchers before any code was written. This approach increased the accuracy to 74%.
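The plan-then-execute pattern the researchers describe can be sketched as follows. Every function name and stub implementation here is invented for illustration; the point is only the ordering, with human review sitting between planning and code generation:

```python
# Hypothetical sketch of separating planning from execution. The AI drafts
# a step-by-step plan, a human reviews it, and only then is code generated
# and run. All callables here are stand-ins, not a real agent framework.

def plan_then_execute(task, llm_plan, human_review, llm_code, run):
    plan = llm_plan(task)          # 1. AI drafts an analysis plan
    approved = human_review(plan)  # 2. Human vets the plan before any code exists
    code = llm_code(approved)      # 3. Code is generated from the vetted plan
    return run(code)               # 4. Execute and return results

# Toy usage with stub functions standing in for model calls and a reviewer.
result = plan_then_execute(
    task="estimate gestational age from methylation data",
    llm_plan=lambda t: ["load data", "fit regression", "report error"],
    human_review=lambda plan: plan,  # stand-in: approve the plan as-is
    llm_code=lambda plan: "; ".join(plan),
    run=lambda code: f"ran: {code}",
)
print(result)
```

In practice, `human_review` is the critical step: it is where a researcher can reject or edit a flawed plan before any code runs, which is what the study credits for the jump in accuracy.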
“The goal is not to ask researchers to blindly trust AI systems,” study co-author Zifeng Wang, who was a doctoral student at the University of Illinois at Urbana-Champaign at the time of the study, told Live Science via email.
Instead, the goal is to “design a framework where the inference, planning, and intermediate steps are sufficiently visible so that researchers can oversee and verify the process,” said Wang, co-founder of Keiji AI.
Why safeguards are important
Scientists caution that these risks do not mean researchers should shy away from AI, but that they should apply the same rigor to AI-generated research that they would apply to the work of other collaborators.
“The question is not whether LLMs will accelerate science or create ‘AI slop,’” Ian McCullough, a professor of computer science at the Johns Hopkins University Whiting School of Engineering, told Live Science in an email. “The question is how to leverage this powerful technology within the scientific method.”
But McCullough also warned against holding AI to impossible standards. People tend to fixate on AI’s fallibility while discounting human error, but the reality is that both humans and machines make mistakes, he said. He shared the case of a consulting client who lamented that AI had a 15% error rate on a particular task, unaware that human employees had a 25% error rate on the same work.
“The goal of AI is not to be perfect, but to do better than humans,” McCullough said.
That effort includes agreeing on how to measure AI success. Dr. Ethan Goh, a physician and researcher at Stanford University, pointed out that healthcare still lacks standardized benchmarks to evaluate AI performance. Goh recently published a randomized trial in JAMA Network Open that investigated how LLMs affect physicians’ diagnostic decisions.
Because LLMs are trained on vast amounts of data, “benchmarks are very expensive to create,” Goh told Live Science. What’s more, AI advances so quickly that commercial models soon outperform existing benchmarks, rapidly rendering them useless, he said. Amid these challenges, Goh’s team at Stanford University’s AI Research and Science Evaluation (ARISE) Healthcare Network is working to develop such a standard by the end of this year.
Despite all the uncertainty around standards and safety measures, the researchers who spoke with Live Science shared a common belief: AI belongs in the lab, but it shouldn’t go unsupervised.
“We have to be careful not to forget what we know about the scientific process,” Sirota said. “But I think the opportunity is huge.”
