OpenAI’s new research paper asks why large language models like GPT-5 and chatbots like ChatGPT still hallucinate, and whether anything can be done to reduce those hallucinations.
In a blog post summarizing the paper, OpenAI defines hallucinations as “plausible but false statements generated by language models,” acknowledging that despite improvements, they “remain a fundamental challenge for all large language models.”
To illustrate the point, the researchers say that when they asked “a widely used chatbot” for the title of Adam Tauman Kalai’s PhD dissertation, they got three different answers, all of them wrong. (Kalai is one of the paper’s authors.) They then asked about his birthday and received three different dates. Again, all of them were wrong.
How can chatbots be so wrong? The researchers suggest that hallucinations arise, in part, because the pretraining process focuses on getting models to correctly predict the next word, without true or false labels attached to the training statements.
“Spelling and parentheses follow consistent patterns, so errors there disappear with scale,” they write. “But arbitrary low-frequency facts, like a pet’s birthday, cannot be predicted from patterns alone and hence lead to hallucinations.”
The paper’s proposed solution, however, focuses less on the initial pretraining process and more on how large language models are evaluated. Current evaluations don’t cause hallucinations themselves, the researchers argue, but they “set the wrong incentives.”
The researchers compare these evaluations to the kind of multiple-choice tests where random guessing makes sense, because “you might get lucky and be right,” while leaving the answer blank “guarantees a zero.”
“In the same way, when models are graded only on accuracy, the percentage of questions they get exactly right, they are encouraged to guess rather than say ‘I don’t know,’” they say.
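To make that incentive concrete, here is a minimal sketch in Python (an illustration, not code from the paper) of the expected score under accuracy-only grading; p_correct is a hypothetical probability that a blind guess happens to be right.

```python
# Minimal sketch (not from the paper): accuracy-only grading rewards guessing.
# "p_correct" is a hypothetical chance that a blind guess is right.

def expected_accuracy_score(p_correct: float, abstain: bool) -> float:
    """Accuracy-only grading: 1 point for a correct answer, 0 for anything else."""
    if abstain:
        return 0.0      # saying "I don't know" always scores zero
    return p_correct    # expected points from guessing


# Even a 1-in-100 guess beats abstaining in expectation.
print(expected_accuracy_score(0.01, abstain=False))  # 0.01
print(expected_accuracy_score(0.01, abstain=True))   # 0.0
```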
The proposed solution, then, is similar to tests (like the SAT) that include “negative [scoring] for wrong answers or partial credit for leaving questions blank to discourage blind guessing.” Similarly, OpenAI says model evaluations need to “penalize confident errors more than you penalize uncertainty, and give partial credit for appropriate expressions of uncertainty.”
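A rough sketch of what such a scoring rule could look like, using illustrative values (a 0.25-point penalty for wrong answers and 0.25 partial credit for abstaining) that are assumptions here, not numbers from OpenAI’s paper:

```python
# Illustrative only: a scoring rule that penalizes confident errors and gives
# partial credit for abstaining. The penalty and credit values are assumed,
# not taken from the paper.

def penalized_expected_score(p_correct: float, abstain: bool,
                             wrong_penalty: float = 0.25,
                             abstain_credit: float = 0.25) -> float:
    """Expected score when wrong answers cost points and abstention earns partial credit."""
    if abstain:
        return abstain_credit
    return p_correct - (1.0 - p_correct) * wrong_penalty


# A 1% guess now loses to "I don't know", while a confident 80% answer still wins.
print(penalized_expected_score(0.01, abstain=False))  # about -0.24
print(penalized_expected_score(0.01, abstain=True))   # 0.25
print(penalized_expected_score(0.80, abstain=False))  # 0.75
```

Under a rule like this, guessing only pays off when the model is genuinely confident, which is the behavior the researchers want evaluations to reward.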
And the researchers argue that it isn’t enough to introduce “a few new uncertainty-aware tests on the side.” Instead, “the widely used, accuracy-based evals need to be updated so that their scoring discourages guessing.”
“If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess,” the researchers say.