A new, challenging AGI test stumps most AI models

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.

So far, the new test, called ARC-AGI-2, has stumped most models.

According to the Arc Prize leaderboard, “reasoning” AI models such as OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

The ARC-AGI tests consist of puzzle-like problems in which an AI has to identify visual patterns in a collection of different-colored squares and generate the correct “answer” grid. The problems are designed to force an AI to adapt to new problems it has never seen before.
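To make the puzzle format concrete, here is a minimal Python sketch of a grid task and an exact-match check. Everything in it is an illustrative assumption rather than the foundation's actual data or scoring code: the toy task, the integer color encoding, the four candidate transformations, and the helper names `infer_rule` and `score` are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Grid = List[List[int]]  # assumed encoding: each integer stands for one color

# A made-up toy task (not from ARC-AGI-2). Hidden rule: mirror left to right.
task = {
    "train": [
        {"input": [[1, 0, 0], [2, 0, 0]], "output": [[0, 0, 1], [0, 0, 2]]},
        {"input": [[0, 3], [4, 0]], "output": [[3, 0], [0, 4]]},
    ],
    "test": [
        {"input": [[5, 0, 0, 0]], "output": [[0, 0, 0, 5]]},
    ],
}

def identity(g: Grid) -> Grid:
    return [row[:] for row in g]

def mirror(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def flip_vertical(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]

def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]

# Hypothetical, tiny hypothesis space; real tasks need far richer rules.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "mirror": mirror,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}

def infer_rule(train_pairs) -> Optional[str]:
    """Return the first candidate that reproduces every demonstration pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(p["input"]) == p["output"] for p in train_pairs):
            return name
    return None

def score(t) -> float:
    """Fraction of test grids reproduced exactly; partial grids earn nothing."""
    rule = infer_rule(t["train"])
    if rule is None:
        return 0.0
    fn = CANDIDATES[rule]
    hits = sum(fn(p["input"]) == p["output"] for p in t["test"])
    return hits / len(t["test"])

print(infer_rule(task["train"]), score(task))
```

A fixed list of four transformations solves this toy task trivially. The benchmark's stated goal is the opposite situation: problems whose rule is not in any list the model has seen, so it has to be worked out from the demonstration pairs alone.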

More than 400 people took ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questions right, much better than any of the models’ scores.

Sample questions from ARC-AGI-2 (credit: Arc Prize).

In a post on X, Chollet argued that ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation’s tests aim to assess whether an AI system can efficiently acquire new skills outside the data it was trained on.

Unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” (extensive computing power) to find solutions. Chollet previously acknowledged that this was a major flaw of ARC-AGI-1.

To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

ARC-AGI-1 went unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as noted at the time, o3’s performance gains on ARC-AGI-1 came with a hefty price tag.

The version of OpenAI’s o3 model that first reached new heights on ARC-AGI-1, o3 (low), scored 75.7% on that test but just 4% on ARC-AGI-2, using $200 worth of computing power per task.

Comparison of frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.
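The gap between that target and today's results can be put into numbers with a short Python sketch. It uses only figures reported in this article (85% accuracy, $0.42 per task, and o3 (low) at 4% for $200 per task). The `Result` class, the two helper functions, and the reading of the cost figure as an upper bound are assumptions made for illustration, not the contest's official rules.

```python
from dataclasses import dataclass

ACCURACY_TARGET = 0.85   # contest goal reported in the article
COST_TARGET_USD = 0.42   # per-task spend reported in the article

@dataclass
class Result:
    name: str
    accuracy: float           # fraction of tasks solved, 0.0 to 1.0
    cost_per_task_usd: float  # compute spend per task

def meets_targets(r: Result) -> bool:
    """Assumed reading: accuracy at or above target, cost at or below target."""
    return r.accuracy >= ACCURACY_TARGET and r.cost_per_task_usd <= COST_TARGET_USD

def cost_overrun(r: Result) -> float:
    """How many times over the per-task budget a result is."""
    return r.cost_per_task_usd / COST_TARGET_USD

# ARC-AGI-2 figures for o3 (low) as reported in the article.
o3_low = Result("o3 (low)", accuracy=0.04, cost_per_task_usd=200.0)

print(meets_targets(o3_low))
print(f"{cost_overrun(o3_low):.0f}x over the per-task budget")
```

On these reported figures, o3 (low) falls short on both axes at once, which is the point of adding efficiency as a metric: a high score bought with unlimited compute would not count as meeting the goal.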
