There are many ways to test the intelligence of an artificial intelligence: conversational fluency, reading comprehension, or mind-bending physics problems. But some of the tests most likely to stump an AI are ones that humans find relatively easy, even entertaining. AIs increasingly excel at tasks that require high levels of human expertise, but that does not mean they are approaching artificial general intelligence, or AGI. AGI requires that an AI take a very small amount of information and use it to generalize and adapt to highly novel situations. That ability, which is the foundation of human learning, remains challenging for AIs.
One test designed to assess an AI's ability to generalize is the Abstraction and Reasoning Corpus, or ARC: a collection of small colored-grid puzzles that ask a solver to infer a hidden rule and then apply it to a new grid. Developed in 2019 by AI researcher François Chollet, it became the basis of the ARC Prize Foundation, a nonprofit that administers the test, which is now an industry benchmark used by all major AI models. The organization also develops new tests and routinely uses two of them (ARC-AGI-1 and its more challenging successor, ARC-AGI-2). This week the foundation launched ARC-AGI-3, which is specifically designed for testing AI agents and is based on making them play video games.
Scientific American spoke to AI researcher and entrepreneur Greg Kamradt, president of the ARC Prize Foundation, to understand how these tests evaluate AIs, what they tell us about the potential for AGI, and why they tend to stump deep-learning models even though many people find them relatively easy. A link to try the tests is available at the end of the article.
[An edited transcript of the interview follows.]
What is the definition of intelligence measured by ARC-AGI-1?
Our definition of intelligence is the ability to learn new things. We already know that AI can win at chess. We know they can play Go. But those models can't generalize to new domains; they can't go off and learn English. So what François Chollet did was create a benchmark called ARC-AGI: it teaches you a mini skill in the question and then asks you to demonstrate that mini skill. We're basically teaching you something and then asking you to repeat the skill you just learned. So the test measures a model's ability to learn within a narrow domain. But our claim is that it does not measure AGI, because it's still a scoped domain [one in which learning applies to only a limited area]. It measures that an AI can generalize, but we do not claim that this is AGI.
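To make that format concrete: an ARC-style task shows a handful of demonstration grid pairs that encode a hidden transformation, then asks the solver to apply that transformation to a fresh input grid. The sketch below is only illustrative; the dictionary layout, the mirror-the-rows rule, and the helper names are assumptions for this example, not the foundation's actual data format or code.

```python
# Minimal sketch of an ARC-style task: a few demonstration pairs teach a
# hidden rule, and the solver must apply that rule to a new test grid.
# The specific rule below (mirror each row left-to-right) is invented
# purely for illustration.

task = {
    "train": [  # demonstration pairs the solver learns the rule from
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 4, 0], [0, 5, 0]], "output": [[0, 4, 3], [0, 5, 0]]},
    ],
    "test": {"input": [[7, 0, 0], [0, 8, 0]]},  # grid the rule must be applied to
}

def candidate_rule(grid):
    """A solver's hypothesis: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Check the hypothesis against every demonstration pair before trusting it.
if all(candidate_rule(pair["input"]) == pair["output"] for pair in task["train"]):
    prediction = candidate_rule(task["test"]["input"])
    print(prediction)  # [[0, 0, 7], [0, 8, 0]]
```

The point of the format is that the rule is never stated: the solver, human or AI, has to induce it from one or two examples and then apply it once, which is exactly the "learn something new from very little information" ability the benchmark targets.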
How do you define AGI here?
There are two ways I look at it. The first is more technical: can an artificial system match the learning efficiency of a human? What I mean by that is that after a human is born, we learn an enormous amount outside of our training data. In fact, we have essentially no training data other than a few evolutionary priors. We learn how to speak English, we learn how to drive a car, we learn how to ride a bike, all of it outside our training data. That's called generalization. When you can do things outside of what you've been trained on, we define that as intelligence. The other definition of AGI that we use is this: when we can no longer come up with problems that humans can do and AI cannot, then we have AGI. It's an observational definition. The flip side also holds: as long as the ARC Prize, or humanity more broadly, can still find problems that humans can do but AI cannot, we do not have AGI. One of the key factors of François Chollet's benchmark is that we test humans on it, and the average human can do these tasks and solve these problems, but AI still struggles mightily with them. That is so interesting because advanced AIs such as Grok can pass graduate-level exams and do all these amazing things, but that is spiky intelligence. It still doesn't have the generalization power of a human. And that's what this benchmark shows.
How do your benchmarks differ from those used by other organizations?
One thing that differentiates us is that we require that our benchmark be solvable by humans. That stands in opposition to other benchmarks, which go after "Ph.D.-plus-plus" problems. I don't need to be told that AI is smarter than me; I already know that OpenAI's o3 can do a lot of things better than I can, but it doesn't have a human's power to generalize. We have to test humans, because that is what we're measuring. In fact, we tested 400 people on ARC-AGI-2. We put them in a room, gave them computers, did demographic screening, and then gave them the test. The average person scored 66 percent on ARC-AGI-2. Collectively, however, the aggregated responses of five to ten people contain the correct answers to every question on ARC-AGI-2.
Why is this test difficult for AI and relatively easy for humans?
There are two things. Humans are incredibly sample-efficient learners, meaning they can look at a problem and, with perhaps one or two examples, pick up the mini skill or the transformation and then go apply it. The algorithm running inside a human's head is orders of magnitude better and more efficient than what we currently see with AI.
What is the difference between ARC-AGI-1 and ARC-AGI-2?
So ARC-AGI-1, François Chollet made that himself. It was about 1,000 tasks. That was in 2019. He basically made a minimum viable version in order to measure generalization, and it held for five years because deep learning couldn't touch it at all. It wasn't even getting close. Then the reasoning models that OpenAI released in 2024 started making progress on it, which showed a step-level change in what AI could do. Then, when we went to ARC-AGI-2, we went a little further down the rabbit hole of what humans can do and AI cannot. It requires a bit more planning for each task. So instead of being solved within five seconds, a human might need a minute or two. There are more complicated rules, and the grids are larger, so you have to be more precise with your answer, but it's the same concept, more or less.... We are now launching a developer preview for ARC-AGI-3, which departs from this format completely. The new format will actually be interactive. So think of it more as an agent benchmark.
How does ARC-AGI-3 test AI agents differently from the previous tests?
If you think about everyday life, it's rare that we make a decision that is stateless. When I say stateless, I mean just a question and an answer. Right now, almost all benchmarks are more or less stateless: you ask a language model a question, and it gives you a single answer. There is a lot you cannot test with a stateless benchmark. You cannot test planning. You cannot test exploration. You cannot test intuiting about your environment, or the goals that come with it. So we are making 100 novel video games that we will use to test humans, to make sure that humans can do them, because humans are the baseline for our benchmark. Then we will drop AIs into these video games and see whether they can understand an environment they have never seen before. To date, in our internal testing, we haven't seen an AI beat even a single level of one of the games.
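As a rough illustration of the stateless-versus-interactive distinction Kamradt describes, here is a minimal sketch in Python. Everything in it (the toy grid game, the evaluation helpers, and the function names) is invented for illustration and does not represent the ARC Prize Foundation's actual games or API.

```python
# Sketch: a stateless benchmark scores isolated question-answer pairs,
# while an interactive benchmark requires acting over many steps in an
# unfamiliar environment. All names here are illustrative placeholders.
import random

def stateless_eval(answer_fn, qa_pairs):
    """Stateless: one question in, one answer out, no memory between items."""
    correct = sum(answer_fn(q) == a for q, a in qa_pairs)
    return correct / len(qa_pairs)

class ToyGridGame:
    """A toy 1-D 'video game': reach the rightmost cell by moving left/right."""
    def __init__(self, size=5):
        self.size, self.goal = size, size - 1

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):  # action: -1 (left) or +1 (right)
        self.pos = max(0, min(self.size - 1, self.pos + action))
        return self.pos, self.pos == self.goal  # observation, done

def interactive_eval(choose_action, env, max_steps=50):
    """Stateful: success depends on planning and exploring across steps."""
    obs, done = env.reset(), False
    for _ in range(max_steps):
        obs, done = env.step(choose_action(obs))  # decision can use history
        if done:
            return True  # level cleared
    return False

# Usage: a trivial "model" for the stateless case, a random agent for the game.
print(stateless_eval(lambda q: q.upper(), [("hi", "HI"), ("no", "yes")]))  # 0.5
print(interactive_eval(lambda obs: random.choice([-1, 1]), ToyGridGame()))
```

The contrast is the point: in the first case each item is independent, while in the second the agent has to carry what it has learned about the environment from step to step in order to reach a goal.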
Can you explain what the video games are like?
Each "environment," or video game, is a two-dimensional, pixel-based puzzle. The games are structured as distinct levels, each designed to teach a specific mini skill to the player (human or AI). To complete a level, the player must demonstrate mastery of that skill by executing a planned sequence of actions.
How is using video games to test for AGI different from the ways video games have previously been used to test AI systems?
Video games have long been used as benchmarks in AI research, with Atari games being a popular example. But traditional video game benchmarks face several limitations. Popular games have extensive training data publicly available, they lack standardized performance-evaluation metrics, and they permit brute-force methods involving billions of simulations. Additionally, the developers building AI agents to play these games usually have prior knowledge of those games and unintentionally embed their own insights into the solutions.
Try ARC-AGI-1, ARC-AGI-2, and ARC-AGI-3.
This article was first published in Scientific American. © ScientificAmerican.com. Unauthorized reproduction is prohibited.