OpenAI released a new benchmark on Thursday that tests the performance of its AI models against human experts across a wide range of industries and occupations. The test, GDPval, is an early attempt to understand how close OpenAI's systems are to outperforming humans at economically valuable work.
OpenAI says it found that its GPT-5 model and Anthropic’s Claude Opus 4.1 are “already approaching the quality of work produced by industry experts.”
That doesn’t mean OpenAI’s models will soon start replacing humans at work. Despite predictions from some CEOs that AI will take over human jobs within just a few years, OpenAI acknowledges that GDPval covers only a limited set of the tasks people do in real jobs today. Still, it’s one of the latest attempts by companies to measure AI’s progress toward that milestone.
GDPval is built around the nine industries that contribute most to U.S. gross domestic product, including domains such as healthcare, finance, manufacturing, and government. The benchmark tests AI models’ performance across 44 occupations in those industries, ranging from software engineers to nurses to journalists.
For the first version of the test, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports against reports produced by other experts and select the better one. For example, an investment banker was asked to create a competitive landscape of the last-mile delivery industry, and that work was compared with AI-generated reports. OpenAI then averaged the AI models’ “win rate” against human reports across all 44 occupations.
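To make that scoring procedure concrete, here is a minimal sketch of how a win rate could be averaged across occupations. This is an illustration based only on the description above, not OpenAI’s actual code; the occupations, verdicts, and counts are invented for demonstration.

```python
# Illustrative sketch (not OpenAI's methodology verbatim): average a
# model's "win rate" against human expert reports across occupations.
from collections import defaultdict

# Each record: (occupation, grader_verdict). The verdict is "win" if
# the grader preferred the AI report, "tie", or "loss". Invented data.
grades = [
    ("investment banker", "win"),
    ("investment banker", "loss"),
    ("nurse", "tie"),
    ("nurse", "loss"),
    ("journalist", "win"),
    ("journalist", "tie"),
]

# Count favorable outcomes per occupation. The reported numbers count
# wins and ties with humans together, so both are treated as favorable.
totals = defaultdict(lambda: {"favorable": 0, "graded": 0})
for occupation, verdict in grades:
    totals[occupation]["graded"] += 1
    if verdict in ("win", "tie"):
        totals[occupation]["favorable"] += 1

# Per-occupation rate, then an unweighted average across occupations,
# mirroring the description of averaging across all 44 occupations.
per_occupation = {
    occ: t["favorable"] / t["graded"] for occ, t in totals.items()
}
overall = sum(per_occupation.values()) / len(per_occupation)

print(per_occupation)  # e.g. {'investment banker': 0.5, ...}
print(f"average win rate: {overall:.1%}")
```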
For GPT-5-high, a souped-up version of GPT-5 that runs with additional computing power, the company says the model’s work was rated on par with industry experts 40.6% of the time.
OpenAI also tested Anthropic’s Claude Opus 4.1 model, which was rated on par with industry experts on 49% of tasks. OpenAI says it believes Claude scored so highly because it tends to produce nice-looking graphics, rather than because of stronger underlying performance.
It’s worth noting that most working professionals do more than submit research reports to their boss, which is all the GDPval-v0 test measures. OpenAI acknowledges this and says it plans to build more robust versions of the test that account for more industries and interactive workflows.
Nevertheless, the company considers the progress shown on GDPval worth noting.
In an interview with TechCrunch, OpenAI’s chief economist, Dr. Aaron Chatterji, said the GDPval results suggest that people in these jobs can use AI models to free up time for more meaningful tasks.
“[Because] the models are good at some of these things,” Chatterji said, “people in these jobs can use the models as their capabilities improve, offloading some of their work and doing things that are potentially higher value.”
Tejal Patwardhan, who works on evaluations at OpenAI, told TechCrunch she was encouraged by GDPval’s rate of progress. OpenAI’s GPT-4o model, released roughly 15 months earlier, scored 13.7% (wins and ties with humans). GPT-5 now scores nearly three times as high (40.6% is roughly three times 13.7%).
Silicon Valley uses a wide range of benchmarks to measure the progress of AI models and to assess whether a particular model is cutting edge. Among the most popular are AIME 2025 (testing competitive math problems) and GPQA Diamond (testing PhD-level science questions). However, AI models are approaching saturation on some of these benchmarks, and many AI researchers have cited the need for better tests that measure AI proficiency on real-world tasks.
Benchmarks like GDPval could become increasingly important in that conversation, as OpenAI claims its AI models are valuable across a wide range of industries. However, OpenAI makes clear that more comprehensive versions of the test will be needed before it can claim its AI models outperform humans.