Artificial intelligence (AI) models are playing the popular tabletop role-playing game Dungeons & Dragons (D&D) to help researchers test the models' ability to develop long-term strategies and to collaborate with other AI systems and human players.
In the study, presented at the NeurIPS 2025 conference in San Diego from Dec. 2 to Dec. 7, researchers said D&D is an ideal test bed because of the game’s unique blend of creativity and strict rules.
In the experiment, a single model could take on the role of either a hero or the dungeon master (DM), the player who creates the story and controls the monsters (each scenario had one DM and four heroes). The framework built for the study, called D&D Agent, lets models play cooperatively with other large language models (LLMs) and lets human players take over some or all of the roles. For example, an LLM can act as the DM while two LLMs and two human players control the heroes.
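The article doesn't describe D&D Agent's actual interfaces, but a minimal sketch of how such a mixed LLM/human role assignment could be configured might look like the following. The Agent and Encounter names, fields, and model identifiers are all hypothetical, not taken from the framework itself:

```python
# Minimal sketch of mixed LLM/human role assignment, loosely modeled on
# the setup described above. All names (Agent, Encounter) and model
# identifiers are hypothetical, not taken from the D&D Agent framework.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    controller: str           # "llm" or "human"
    model: str | None = None  # LLM backend, used when controller == "llm"

@dataclass
class Encounter:
    dungeon_master: Agent     # one DM per scenario
    heroes: list[Agent] = field(default_factory=list)  # four heroes

# One LLM as DM; two LLMs and two humans share the four hero roles.
encounter = Encounter(
    dungeon_master=Agent("DM", controller="llm", model="claude-haiku-3.5"),
    heroes=[
        Agent("Paladin", controller="llm", model="gpt-4"),
        Agent("Druid", controller="llm", model="deepseek-v3"),
        Agent("Rogue", controller="human"),
        Agent("Wizard", controller="human"),
    ],
)
```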
“Dungeons & Dragons is a natural testing ground for evaluating multi-step planning, including adherence to rules and team strategy,” study lead author Raj Ammanabrolu, an assistant professor in the Department of Computer Science and Engineering at the University of California, San Diego, said in a statement. “Because play unfolds through interaction, D&D also opens the way for direct human-AI interaction. Agents can assist and collaborate with other people.”
This simulation does not recreate an entire D&D campaign. Instead, it focuses on combat encounters drawn from a pre-written adventure called “The Lost Mine of Phandelver.” To set the parameters for each test, the team selected one of three combat scenarios from the adventure, a party of four characters, and a character power level (low, medium, or high). Each episode lasted 10 turns, after which results were collected.
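Going only by the parameters listed above, a harness that sweeps the experimental grid might look something like this sketch; the scenario labels, the run_episode helper, and the result format are assumptions rather than the study's code:

```python
# Hedged sketch of the experimental grid: three combat scenarios, three
# power tiers, and a 10-turn cap per episode. Identifiers are illustrative.
import itertools

SCENARIOS = ["scenario_1", "scenario_2", "scenario_3"]  # placeholder labels
POWER_LEVELS = ["low", "medium", "high"]
MAX_TURNS = 10

def run_episode(scenario: str, power_level: str) -> dict:
    """Placeholder for one combat episode; a real turn would involve
    the DM narrating and each of the four heroes acting."""
    turns = []
    for turn in range(MAX_TURNS):
        turns.append({"turn": turn})  # per-turn results would go here
    return {"scenario": scenario, "power_level": power_level, "turns": turns}

# Collect results for every scenario/power-level combination.
all_results = [
    run_episode(s, p) for s, p in itertools.product(SCENARIOS, POWER_LEVELS)
]
```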
Strategy and decision framework
The researchers ran three different AI models (DeepSeek-V3, Claude Haiku 3.5, and GPT-4) through the simulations and used the game to assess how well the models demonstrated qualities such as long-term planning and tool use.
These qualities are important for real-world applications such as optimizing supply chains and building manufacturing lines. The team also tested how well the models could plan together, a capability relevant to scenarios such as disaster-response modeling and multi-agent search-and-rescue systems.
Overall, Claude Haiku 3.5 showed the best combat efficiency, especially in difficult scenarios. In simpler scenarios, resource savings were similar across all three models. In D&D, resources are things like the number of spells or abilities a character can use each day, or the number of healing potions on hand. Because these were standalone combat encounters, there was little incentive to conserve resources for later, as there would be in a complete adventure.
In harder encounters, Claude Haiku 3.5 was more willing to spend its allotted resources, which led to better results. GPT-4 was a close second, while DeepSeek-V3 struggled the most.
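To make the resource bookkeeping concrete, here is an illustrative ledger for tracking spell slots and potions and scoring how much of a character's budget survives an encounter. The structure and the fraction_remaining score are assumptions, not the study's actual metric:

```python
# Illustrative ledger for D&D-style resources: daily spell/ability uses
# and healing potions. This is an assumption about how such bookkeeping
# could work, not the study's actual code.
from dataclasses import dataclass

@dataclass
class ResourceLedger:
    spell_slots: int        # daily spell/ability uses available
    potions: int            # healing potions on hand
    spell_slots_used: int = 0
    potions_used: int = 0

    def cast_spell(self) -> None:
        if self.spell_slots_used < self.spell_slots:
            self.spell_slots_used += 1

    def drink_potion(self) -> None:
        if self.potions_used < self.potions:
            self.potions_used += 1

    def fraction_remaining(self) -> float:
        """Share of the total resource budget left after the encounter."""
        total = self.spell_slots + self.potions
        used = self.spell_slots_used + self.potions_used
        return 1.0 - used / total if total else 1.0

# Example: two of six resources spent leaves roughly 0.67 of the budget.
ledger = ResourceLedger(spell_slots=4, potions=2)
ledger.cast_spell()
ledger.drink_potion()
print(ledger.fraction_remaining())
```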
The researchers also assessed how well each model stayed in character throughout the simulation. They created a role-play quality metric that isolates a model's narrative speech (generated as a text response) and balances how consistently the model stays in character against how distinctive a voice it maintains during play.
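The article doesn't give the metric's formula; one simple way to balance the two judged components would be a weighted average, as in this purely illustrative sketch (the roleplay_quality name, the 0-to-1 scales, and the default weighting are all assumptions):

```python
# Illustrative role-play quality score combining two judged components:
# staying in character and maintaining a distinctive voice. The weighting
# and 0-1 scales are assumptions; the paper's exact formula isn't given.
def roleplay_quality(in_character: float, voice: float, alpha: float = 0.5) -> float:
    """Weighted average of two scores, each in [0, 1]."""
    if not (0.0 <= in_character <= 1.0 and 0.0 <= voice <= 1.0):
        raise ValueError("component scores must lie in [0, 1]")
    return alpha * in_character + (1.0 - alpha) * voice

# Example: a model that stays strongly in character but reuses stock lines.
print(roleplay_quality(in_character=0.9, voice=0.4))  # 0.65
```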
The researchers found that DeepSeek-V3 produced plenty of punchy first-person barks and exclamations (such as “leave me alone” and “catch me!”) but often reused the same lines. Claude Haiku 3.5, by contrast, tailored its phrases to the class or monster it was playing, whether a holy paladin or a nature-loving druid. GPT-4 fell somewhere in between, mixing in-character narration with meta-tactical commentary.
Some of the most distinctive battle cries came when the models played monsters. Creatures began to take on personalities of their own, and during one battle a goblin screeched, “Hey, shiny guy’s gonna bleed!”
The researchers said this kind of testing framework is important for assessing how well a model can perform without human input for long periods of time. It is a measure of an AI’s ability to act independently while maintaining consistency and reliability, and this ability requires memory and strategic thinking.
In the future, the team hopes to implement a complete D&D campaign that models all non-combat narrative and action, further emphasizing the AI’s creativity and ability to improvise in response to input from humans and other LLMs.
