Fyself News
Science

‘That’s not the way to build a digital mind’: How failures in reasoning are preventing AI models from achieving human-level intelligence

April 2, 2026

Architectural limitations in today’s most popular artificial intelligence (AI) tools may cap their ability to become more intelligent, new research suggests.

A study published February 5 on the preprint server arXiv argues that modern large language models (LLMs) are inherently prone to breakdowns in problem-solving logic known as “reasoning failures.”

Reasoning failures occur when an LLM loses sight of critical information needed to reliably solve a task, resulting in inaccurate answers to seemingly simple problems. Published as a review of existing research, the paper focused specifically on transformer models, a type of neural network architecture that powers popular AI chatbots such as ChatGPT, Claude, and Google Gemini.


Based on LLMs’ performance in assessments such as Humanity’s Last Exam, some scientists say the underlying neural network architecture could one day yield models that reach human-level cognition. While the transformer architecture makes LLMs extremely capable at tasks such as language generation, the researchers argue that it also hinders the reliable logical processing needed for true human-level reasoning.

“LLMs have demonstrated remarkable reasoning abilities and achieved impressive results across a wide range of tasks,” the researchers wrote in the study. “Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios… These failures stem from a lack of overall planning and deep thinking.”

LLM limitations

LLMs are trained on vast amounts of text data and generate responses to user prompts by predicting a plausible answer word by word. They do this by stringing together units of text called “tokens” according to statistical patterns learned from the training data.
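The word-by-word generation described above can be sketched as repeatedly picking the most probable next token from a learned distribution. Everything in this toy, the three-token vocabulary and its probabilities, is invented for illustration; a real LLM derives the distribution from billions of learned parameters rather than a lookup table.

```python
# Toy next-token distributions, keyed by the context seen so far.
# The contexts, tokens, and probabilities are invented for illustration.
NEXT_TOKEN_PROBS = {
    ("the",): {"cat": 0.5, "dog": 0.3, "idea": 0.2},
    ("the", "cat"): {"sat": 0.6, "ran": 0.4},
    ("the", "cat", "sat"): {"<end>": 1.0},
}

def generate(context):
    """Extend the context one token at a time until <end> is predicted."""
    tokens = list(context)
    while True:
        dist = NEXT_TOKEN_PROBS[tuple(tokens)]
        token = max(dist, key=dist.get)  # greedy: pick the most probable token
        if token == "<end>":
            return tokens
        tokens.append(token)

print(generate(["the"]))  # ['the', 'cat', 'sat']
```

Greedy selection is used here for determinism; real chatbots usually sample from the distribution, which is one reason the same prompt can yield different answers on different runs.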

Transformers also use a mechanism called “self-attention” to track relationships between words and concepts across long stretches of text. The combination of self-attention and huge training datasets makes modern chatbots very good at generating convincing answers to user prompts.
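Self-attention can be sketched in a few lines: each token scores itself against every other token in the sequence, and its output is the score-weighted mix of all tokens. This minimal version uses the raw embeddings directly as queries, keys, and values; a real transformer first applies learned linear projections to each, which this sketch omits.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of vectors.

    X is a list of token embeddings (each a list of floats). Queries,
    keys, and values are the embeddings themselves here; a real
    transformer applies learned projections first.
    """
    d = len(X[0])
    out = []
    for q in X:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # The output is the attention-weighted mix of all tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token embeddings
mixed = self_attention(seq)
```

Because the weights sum to 1, each output vector is a convex mix of the inputs, which is how information from distant tokens flows into the representation of the current one.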


However, LLMs do not do any actual “thinking” in the traditional sense; their responses are determined algorithmically. On long tasks, especially those that require serious problem-solving over multiple steps, a transformer can lose track of important information and default to patterns learned from its training data. The result is a reasoning failure.


“This fundamental weakness extends beyond basic tasks to multi-step math problems, checking multiple factual claims, and other tasks that are compositional in nature,” the researchers wrote in their study.

Reasoning failures are also why LLMs circle back to the same answer even after being told it is wrong, and why they can give different answers when the same question is worded slightly differently, even when asked to explain their reasoning step by step.


Federico Nanni, a senior research data scientist at Britain’s Alan Turing Institute, argues that what LLMs typically present as inference is mostly window dressing.

“We found that the model often got the correct answer when we asked it to ‘think step by step’ and write out its reasoning process first, rather than answer directly,” Nanni told Live Science. “But it’s a trick. It’s not real reasoning in the human sense. It’s still just next-token prediction disguised as a chain of thought. When we say these models ‘reason,’ what we really mean is that they write out an inference process, something that sounds like a chain of plausible inferences.”
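The “think step by step” trick Nanni describes is, mechanically, just prompt construction: the same question is wrapped in an instruction that makes the model emit intermediate tokens before the answer. The template wording below is a common convention, not a quote from the study.

```python
def direct_prompt(question):
    """Ask for the answer with no intermediate reasoning."""
    return f"Question: {question}\nAnswer:"

def chain_of_thought_prompt(question):
    """Ask the model to write out its reasoning before answering.

    The instruction wording is a common convention, not taken from the
    study; any phrasing that elicits intermediate steps plays this role.
    """
    return (
        f"Question: {question}\n"
        "Think step by step and write out your reasoning, "
        "then state the final answer on its own line.\nAnswer:"
    )

print(chain_of_thought_prompt("If Ann has 3 apples and eats 1, how many remain?"))
```

The point of the passage is that the extra tokens improve accuracy without changing what the model fundamentally does: every reasoning step it writes out is itself produced by next-token prediction.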

Gaps in existing AI benchmarks

Researchers found that current methods of evaluating LLM performance fall short in three key ways. First, small changes to a prompt’s wording can change the results. Second, benchmarks degrade and become contaminated the more they are used. And third, most benchmarks evaluate only the final answer, not the reasoning process the model used to reach it.
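The first gap, sensitivity to prompt wording, can be checked mechanically by re-scoring a model on paraphrases of each benchmark item and comparing the two accuracies. Everything below, the stub "model", the items, and the paraphrases, is invented for illustration; a real harness would call an actual LLM in place of the stub.

```python
def accuracy(model, items):
    """Fraction of (question, gold answer) pairs the model answers correctly."""
    return sum(model(q) == gold for q, gold in items) / len(items)

# Invented two-item benchmark: original wording and a paraphrase of each item.
ITEMS = [
    ("What is 2 + 2?", "4"),
    ("If Ann has 3 apples and eats 1, how many remain?", "2"),
]
PARAPHRASED = [
    ("Compute the sum of 2 and 2.", "4"),
    ("Ann holds 3 apples and eats one of them. How many are left?", "2"),
]

def brittle_model(question):
    """Stub standing in for an LLM that keys on surface wording."""
    memorized = {
        "What is 2 + 2?": "4",
        "If Ann has 3 apples and eats 1, how many remain?": "2",
    }
    return memorized.get(question, "?")

gap = accuracy(brittle_model, ITEMS) - accuracy(brittle_model, PARAPHRASED)
# A nonzero gap flags wording sensitivity that a single-wording benchmark hides.
```

The deliberately brittle stub scores perfectly on the original wordings and fails every paraphrase, which is exactly the failure mode a one-wording benchmark cannot detect.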

This means that current benchmarks may significantly exaggerate LLMs’ capabilities and underestimate how often they fail in real-world use.

LLMs’ limitations may mean they have limited real-world applications. (Image courtesy: da-kuk/Getty Images)

“Our position is not that benchmarks are flawed, but rather that they need to evolve,” study co-author Peiyan Song, a computer science and robotics student at Caltech, told Live Science in an email. Benchmarks also tend to leak into LLM training data, Nanni said, meaning later models can effectively memorize their answers and game them.

“Plus, now that the models are deployed in production, usage itself becomes a kind of benchmark,” Nanni said. “You put the system in front of users and see what goes wrong. That’s the new test. So, yes, we need better benchmarks, and we need to rely less on AI to check AI. But that’s actually very difficult, because these tools are now embedded in the way we work, and just using them is very useful.”

A new architecture for AGI?

Unlike other recent studies, this new study does not claim that the neural network approach to AI is a dead end in our quest to achieve artificial general intelligence (AGI). Rather, the researchers liken this to the early days of computing and point out that understanding why LLMs fail is the key to improving them.

However, they argue that simply training or scaling up a model with more data is unlikely to solve the problem on its own. This means that developing AGI may require a fundamentally different approach to how models are built.

“Neural networks, and LLMs in particular, are clearly part of the AGI picture, and the progress they’ve made is incredible,” said Song. “However, our research suggests that scaling alone is unlikely to resolve all reasoning failures… [meaning] reaching human-level reasoning will likely require architectural innovations, stronger world models, improved robustness training, and deeper integration of structured reasoning with embodied interaction.”

Nanni agreed. “From a philosophy of mind perspective, we can basically say that we have discovered the limits of transformers. Transformers are not the way to build a digital mind,” he said. “They model text so well that it’s almost impossible to tell whether the text was written by a human or a machine. But that’s what language models are… You can only push this architecture so far.”
