Architectural limitations of today’s most popular artificial intelligence (AI) tools may prevent them from ever becoming more intelligent, new research suggests.
A study published Feb. 5 on the preprint server arXiv argues that modern large language models (LLMs) are inherently prone to breakdowns in problem-solving logic known as “reasoning failures.”
Based on LLMs’ performance on assessments such as Humanity’s Last Exam, some scientists say the underlying neural network architecture could one day yield models that reach human-level cognition. But while the transformer architecture makes LLMs extremely capable at tasks such as language generation, the researchers argue that it also hinders the reliable logical processing needed to achieve true human-level reasoning.
“LLMs have demonstrated remarkable reasoning abilities and achieved impressive results across a wide range of tasks,” the researchers wrote in the study. “Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios… These failures stem from a lack of overall planning and deep thinking.”
LLM limitations
LLMs are trained on vast amounts of text data and generate responses to user prompts by predicting a plausible answer word by word. They do this by stringing together units of text called “tokens” based on statistical patterns learned from the training data.
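To make the “statistical patterns” idea concrete, here is a deliberately toy sketch, not any real LLM: a bigram model that predicts the next token purely from how often word pairs co-occur in its training text. Real LLMs use neural networks over billions of parameters, but the core idea of choosing the statistically likeliest continuation is the same.

```python
from collections import Counter, defaultdict

# Toy training corpus; a real model would see trillions of tokens.
training_text = "the cat sat on the mat and the cat slept".split()

# Count which token follows which: a crude stand-in for learned statistics.
next_counts = defaultdict(Counter)
for current, nxt in zip(training_text, training_text[1:]):
    next_counts[current][nxt] += 1

def predict_next(token):
    """Return the statistically most likely next token, or None if unseen."""
    counts = next_counts.get(token)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" -- the most frequent continuation of "the"
```

The model has no notion of what a cat is; it only knows that “cat” most often followed “the” in its training data, which is the limitation the researchers highlight.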
Transformers also use a mechanism called “self-attention” to track relationships between words and concepts across long stretches of text. The combination of self-attention and massive training datasets makes modern chatbots very good at generating convincing answers to user prompts.
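The self-attention step can be sketched in a few lines of plain Python. This is a simplified illustration: in a real transformer, queries, keys and values come from learned projection matrices, whereas here (purely for clarity) each token’s embedding plays all three roles.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Scaled dot-product self-attention over a list of token embeddings.
    Simplified sketch: no learned query/key/value projections."""
    d = len(embeddings[0])
    output = []
    for query in embeddings:
        # Score this token against every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in embeddings]
        weights = softmax(scores)  # how strongly this token attends to each
        # Each output is a weighted blend of all token embeddings.
        output.append([sum(w * value[j] for w, value in zip(weights, embeddings))
                       for j in range(d)])
    return output
```

Because every output is a weighted mix of every input, each token can “see” the whole sequence at once, which is what lets transformers track long-range relationships between words.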
However, LLMs do not do any actual “thinking” in the traditional sense; their responses are generated by statistical prediction. On long tasks, especially those requiring sustained problem-solving across multiple steps, a transformer can lose track of important information and default to patterns learned from its training data. As a result, its reasoning fails.
It’s not real reasoning in the human sense. It’s still just next-token prediction disguised as a chain of thought.
Federico Nanni, Senior Research Data Scientist, Alan Turing Institute
“This fundamental weakness extends beyond basic tasks to multistep math problems, verifying multiple factual claims, and other tasks that are compositional in nature,” the researchers wrote in the study.
Reasoning failures are also why LLMs double down on the same answer even after a user points out that it is wrong, or give different answers when the same question is worded slightly differently, even when asked to explain their reasoning step by step.
Federico Nanni, a senior research data scientist at the U.K.’s Alan Turing Institute, argues that what LLMs typically present as reasoning is mostly window dressing.
“We found that models often got the correct answer when asked to ‘think step by step’ and write out their reasoning first, rather than answering directly,” Nanni told Live Science. “But it’s a trick. It’s not real reasoning in the human sense. It’s still just next-token prediction disguised as a chain of thought. When we say these models are ‘reasoning,’ what we really mean is that they write out an inference process, something that sounds like a chain of plausible inferences.”
Gaps in existing AI benchmarks
The researchers found that current methods of evaluating LLM performance fall short in three key areas. First, changing the wording of a prompt can change the results. Second, benchmarks degrade and become contaminated the more they are used. And finally, most benchmarks evaluate only the final answer, not the reasoning process the model used to reach it.
This means that current benchmarks may significantly exaggerate LLMs’ capabilities and underestimate how often they fail in real-world use.
“Our position is not that benchmarks are flawed, but rather that they need to evolve,” study co-author Peiyan Song, a computer science and robotics student at Caltech, told Live Science in an email. Benchmarks also tend to leak into LLM training data, Nanni said, meaning later LLMs can game the benchmarks rather than be tested by them.
“Plus, now that these models are deployed in production, usage itself becomes a kind of benchmark,” Nanni said. “You put the system in front of users and see what goes wrong. That’s the new test. So yes, we need better benchmarks, and we need to rely less on AI to check AI. But that’s genuinely difficult, because these tools are now embedded in how we work, and simply using them is very useful.”
A new architecture for AGI?
Unlike other recent studies, this new study does not claim that the neural network approach to AI is a dead end in our quest to achieve artificial general intelligence (AGI). Rather, the researchers liken this to the early days of computing and point out that understanding why LLMs fail is the key to improving them.
However, they argue that simply training models on more data, or scaling them up, is unlikely to solve the problem on its own. Developing AGI may therefore require a fundamentally different approach to how models are built.
“Neural networks, and LLMs in particular, are clearly part of the AGI picture, and the progress they’ve made is incredible,” Song said. “However, our research suggests that scaling alone is unlikely to resolve all reasoning failures… [meaning] reaching human-level reasoning will likely require architectural innovations, stronger world models, improved robustness training, and deeper integration of structured reasoning with embodied interaction.”
Nanni agreed. “From a philosophy of mind perspective, we can basically say that we have discovered the limits of transformers. Transformers are not the way to build a digital mind,” he said. “They model text so well that it’s almost impossible to tell whether the text was written by a human or a machine. But that’s all language models are… You can only push this architecture so far.”