Researchers at the Center for AI Safety and Scale AI have published “Humanity’s Last Exam” (HLE), a benchmark designed to measure how close today’s most powerful artificial intelligence (AI) models come to matching or exceeding human-level knowledge across multiple domains.
The test began in January 2025, but scientists first outlined the framework and the thinking behind its design in a new study published in the journal Nature on January 28. It includes a corpus of 2,500 questions across more than 100 subjects, with input from more than 1,000 subject matter experts from 500 institutions in 50 countries.
At launch, researchers tested OpenAI’s GPT-4o and o1 models, Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet, and DeepSeek R1. OpenAI’s o1 system took the top spot with a score of just 8.3%.
Despite this poor initial performance, the researchers wrote at the time, “Given the rapid pace of AI development, the model’s HLE accuracy could exceed 50% by the end of 2025.”
As of February 12, 2026, the highest score ever achieved is 48.4%, recorded by Google’s Gemini 3 Deep Think. Human experts, on the other hand, score around 90% in their respective fields.
Testing the world’s smartest machines
Humanity’s Last Exam was intentionally designed to be extremely difficult for AI models. During the early stages of development, the researchers solicited submissions globally from subject matter experts across numerous disciplines.
The researchers applied strict submission criteria that required questions to be precise, unambiguous, solvable, and non-searchable. They didn’t want a model to cheat with a simple web search, or the question to already appear online, which would increase the likelihood that the answer was already in a model’s training data.
Each question submitted was fed to an AI model. The team automatically rejected questions that the model could answer correctly.
More than 70,000 questions were submitted, yielding approximately 13,000 that stumped the LLMs. These were then reviewed by a team of subject matter experts, approved by the research team, and presented to the scientific community for open feedback.
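The adversarial filtering step described above can be sketched in Python. Everything here is illustrative rather than the researchers’ actual pipeline: `ask_model` is a hypothetical stand-in for querying a frontier LLM, the submission format is assumed, and exact-match grading stands in for whatever answer-checking the team actually used (the real process queried several models and added human review).

```python
def filter_submissions(submissions, models, ask_model):
    """Keep only questions that every tested model answers incorrectly.

    submissions -- list of dicts with "question" and "answer" keys (assumed format)
    models      -- identifiers of the frontier models to test against
    ask_model   -- callable (model, question) -> model's answer string (hypothetical)
    """
    kept = []
    for q in submissions:
        answers = [ask_model(m, q["question"]) for m in models]
        # Reject the question if any model already gets it right;
        # only model-stumping questions advance to expert review.
        if all(a.strip().lower() != q["answer"].strip().lower() for a in answers):
            kept.append(q)
    return kept
```

In this sketch a single correct model answer is enough to discard a question, which mirrors the article’s description of automatically rejecting anything a model could already solve.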
Ultimately, the researchers narrowed the total submitted questions down to 2,500 questions, which generally fall within the scope of a doctoral-level exam.
An example trivia-style question from the exam asks: “In Greek mythology, who is Jason’s maternal great-grandfather?”
An example physics problem, on the other hand, asks about the relationships between forces in a scenario where a block rests on a frictionless horizontal rail and is attached to a stiff, massless rod of unknown length.
The breadth of questions and subject matter covered in Humanity’s Last Exam sets it apart from similar benchmarking tools, say its creators.
Common tests, such as the Massive Multitask Language Understanding (MMLU) dataset created with the participation of Center for AI Safety founder Dan Hendrycks, test only a small subset of expert-level domain knowledge, primarily focused on coding and math.
Even cutting-edge benchmarks like François Chollet’s ARC-AGI suite struggle to avoid the memorization and searchability issues that the creators of Humanity’s Last Exam say their new test addresses. For example, Gemini’s Deep Think achieved 84.6% on the ARC-AGI-2 benchmark just one week after failing to reach 50% on HLE.
The ultimate prize is general intelligence
While Humanity’s Last Exam is likely the most ambitious attempt yet to measure the range of capabilities of modern AI models against human experts, the study’s authors make clear that a high HLE score in no way signals the arrival of artificial general intelligence (AGI).
“HLE’s high accuracy demonstrates expert-level performance with respect to closed-ended testable questions and cutting-edge scientific knowledge, but by itself does not imply autonomous research capabilities or artificial general intelligence,” the scientists wrote in the study.
“Performing well on the HLE is a necessary but not sufficient criterion for a machine to reach true intelligence,” Manuel Schottdorf, a neuroscientist in the University of Delaware’s Department of Psychological and Brain Sciences, said in a recent statement. Schottdorf is one of many experts whose questions have been accepted into the HLE corpus.
“Machines need to be smart enough to answer these questions, but that alone doesn’t mean they’re truly intelligent.”