This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.
One of the key challenges of deep reinforcement learning models—the kind of AI systems that have mastered Go, StarCraft 2, and other games—is their inability to generalize their capabilities beyond their training domain. This limit makes it very hard to apply these systems to real-world settings, where situations are much more complicated and unpredictable than the environments where AI models are trained.
But scientists at AI research lab DeepMind claim to have taken the “first steps to train an agent capable of playing many different games without needing human interaction data,” according to a blog post about their new “open-ended learning” initiative. Their new project includes a 3D environment with realistic dynamics and deep reinforcement learning agents that can learn to solve a wide range of challenges.
The new system, according to DeepMind’s AI researchers, is an “important step toward creating more general agents with the flexibility to adapt rapidly within constantly changing environments.”
The paper’s findings show some impressive advances in applying reinforcement learning to complicated problems. But they are also a reminder of how far current systems are from achieving the kind of general intelligence capabilities that the AI community has been coveting for decades.
The brittleness of deep reinforcement learning
The key advantage of reinforcement learning is its ability to develop behavior by taking actions and getting feedback, similar to the way humans and animals learn by interacting with their environment. Some scientists describe reinforcement learning as “the first computational theory of intelligence.”
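To make the “act, get feedback, adjust” loop concrete, here is a minimal sketch using tabular Q-learning, a classic textbook algorithm rather than the deep reinforcement learning method DeepMind used. The corridor environment, the `step`, `greedy`, and `train` helpers, and all hyperparameters are hypothetical choices for illustration.

```python
import random

# Toy environment (hypothetical): a corridor of cells 0..4; reaching cell 4 ends
# the episode with reward 1, every other move yields reward 0.
N_STATES = 5
ACTIONS = (-1, +1)  # index 0 moves left, index 1 moves right

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def greedy(q_row, rng):
    """Pick the highest-valued action, breaking ties at random."""
    best = max(q_row)
    return rng.choice([i for i, v in enumerate(q_row) if v == best])

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # q[s][a] estimates the discounted return of taking action a in state s.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: usually exploit current estimates, sometimes explore.
            a = rng.randrange(2) if rng.random() < epsilon else greedy(q[state], rng)
            next_state, reward, done = step(state, ACTIONS[a])
            # The environment's feedback nudges the estimate toward the observed target.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q

q = train()
# The learned action for every non-terminal cell; +1 means "move right".
policy = [ACTIONS[greedy(q[s], random.Random(1))] for s in range(N_STATES - 1)]
print(policy)
```

No one tells the agent that “right” is correct; it discovers this purely from the reward signal. Deep reinforcement learning replaces the lookup table `q` with a neural network so the same idea scales to inputs like raw pixels.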
The combination of reinforcement learning and deep neural networks, known as deep reinforcement learning, has been at the heart of many advances in AI, including DeepMind’s famous AlphaGo and AlphaStar models. In both cases, the AI systems were able to outmatch human world champions at their respective games.
But reinforcement learning systems are also notorious for their lack of flexibility. For example, a reinforcement learning model that can play StarCraft 2 at an expert level won’t be able to play a game with similar mechanics (e.g., Warcraft 3) at any level of competency. Even slight changes to the original game will considerably degrade the AI model’s performance.
“These agents are often constrained to play only the games they were trained for – whilst the exact instantiation of the game may vary (e.g. the layout, initial conditions, opponents) the goals the agents must satisfy remain the same between training and testing. Deviation from this can lead to catastrophic failure of the agent,” DeepMind’s researchers write in a paper that provides the full details of their open-ended learning project.
Humans, on the other hand, are very good at transferring knowledge across domains.
The XLand environment
The goal of DeepMind’s new project was to create “an artificial agent whose behaviour generalises beyond the set of games it was trained on.”
To this end, the team created XLand, an engine that can generate 3D environments composed of static topology and moveable objects. The game engine simulates rigid-body physics and allows players to use the objects in various ways (e.g., create ramps, block paths, etc.).
XLand is a rich environment in which you can train agents on a virtually unlimited number of tasks. One of the main advantages of XLand is the capability to use programmatic rules to automatically generate a vast array of environments and challenges to train AI agents. This addresses one of the key challenges of machine learning systems, which often require vast amounts of manually curated training data.
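The sketch below illustrates how a handful of programmatic rules can produce a combinatorial number of tasks. The vocabulary (colors, shapes, and the `hold`/`near`/`see` relations), the grid layout, and the `sample_task` helper are hypothetical assumptions for illustration; XLand’s actual world and goal generation is far richer.

```python
import itertools
import random

# Hypothetical vocabulary; XLand's real object set and goal language are richer.
COLORS = ("black", "purple", "yellow")
SHAPES = ("cube", "sphere", "pyramid")
RELATIONS = ("hold", "near", "see")
OBJECTS = [f"{c} {s}" for c, s in itertools.product(COLORS, SHAPES)]

def atomic_goals():
    """Every (relation, object) pair the rules can express."""
    return [(r, o) for r in RELATIONS for o in OBJECTS]

def sample_task(rng, max_conjuncts=2, n_objects=4, grid=10):
    """Sample a world layout plus a goal (a conjunction of distinct atomic goals)."""
    placed = rng.sample(OBJECTS, n_objects)
    # A "world" here is just object positions on a grid x grid floor.
    world = {obj: (rng.randrange(grid), rng.randrange(grid)) for obj in placed}
    # Only ask for relations over objects that actually exist in this world.
    candidates = [(r, o) for r in RELATIONS for o in placed]
    goal = rng.sample(candidates, rng.randint(1, max_conjuncts))
    return goal, world

rng = random.Random(42)
tasks = [sample_task(rng) for _ in range(3)]
n_atoms = len(atomic_goals())
# Distinct goals with one or two conjuncts: C(27, 1) + C(27, 2).
n_goals = n_atoms + n_atoms * (n_atoms - 1) // 2
print(n_atoms, n_goals)
```

Even this tiny rule set yields hundreds of distinct goals before world layouts are considered; adding more objects, relations, conjuncts, and players is how the count can climb into the billions.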
According to the blog post, the researchers created “billions of tasks in XLand, across varied games, worlds, and players.” The games range from very simple goals, such as finding objects, to more complex settings in which the AI agents must weigh the benefits and tradeoffs of different rewards. Some of the games include cooperation or competition elements involving multiple agents.
Deep reinforcement learning
DeepMind uses deep reinforcement learning and a few clever tricks to create AI agents that can thrive in the XLand environment.
The reinforcement learning model of each agent receives a first-person view of the world, the agent’s physical state (e.g., whether it is holding an object), and its current goal. Each agent fine-tunes the parameters of its policy neural network to maximize its rewards on the current task. The neural network architecture contains an attention mechanism that helps the agent balance optimization across the subgoals required to accomplish the main goal.
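The article does not describe the attention module in detail, so the following is only a generic sketch of scaled dot-product attention, the standard mechanism by which a network can weight several inputs (here, subgoals) by relevance. The embeddings and the `attend` helper are hypothetical and not taken from DeepMind’s architecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention over a list of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output coordinate at a time.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return weights, out

# Hypothetical 2-D embeddings: the agent's current state queries two subgoals.
state = [1.0, 0.0]
subgoal_keys = [[2.0, 0.0], [0.0, 2.0]]   # subgoal 0 matches the state better
subgoal_values = [[1.0, 0.0], [0.0, 1.0]]
weights, summary = attend(state, subgoal_keys, subgoal_values)
print([round(w, 3) for w in weights])
```

The weights always sum to one, so the network can shift its focus between subgoals as the situation changes rather than committing to a fixed priority.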
Once the agent masters its current challenge, the computational task generator creates a new challenge for the agent. Each new task is generated according to the agent’s training history and in a way that helps spread the agent’s skills across a vast range of challenges.
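One common way to build such a history-aware curriculum is to keep only tasks the agent solves sometimes but not always, so training time is not wasted on tasks that are already mastered or still hopeless. The sketch below assumes that filtering rule; the success thresholds, task names, and helper functions are hypothetical and are not DeepMind’s exact criteria.

```python
import random

def estimate_success(history, task, default=0.5):
    """Fraction of past attempts at `task` that succeeded; `default` if unseen."""
    attempts = history.get(task, [])
    return sum(attempts) / len(attempts) if attempts else default

def propose_tasks(candidates, history, low=0.1, high=0.9, n=3, rng=None):
    """Keep candidate tasks the agent sometimes, but not always, solves."""
    if rng is None:
        rng = random.Random(0)
    useful = [t for t in candidates if low <= estimate_success(history, t) <= high]
    rng.shuffle(useful)
    return useful[:n]

history = {
    "find cube": [1, 1, 1, 1, 1],       # mastered: too easy
    "hold sphere": [1, 0, 1, 0],        # partially solved: useful
    "stack pyramids": [0, 0, 0, 0, 0],  # never solved: too hard for now
}
candidates = ["find cube", "hold sphere", "stack pyramids", "hide from opponent"]
print(propose_tasks(candidates, history))
```

Unseen tasks get a neutral prior so they stay eligible; as the history grows, the pool of proposed tasks moves along with the agent’s skills.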
DeepMind also used its vast computational resources (courtesy of its owner Alphabet Inc.) to train a large population of agents in parallel and transfer learned parameters across different agents to improve the general capabilities of the reinforcement learning systems.
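One well-known way to share learned parameters across a population of agents is population-based training (PBT): periodically, weaker members copy the parameters of stronger ones (exploit) and then perturb them (explore). The minimal sketch below shows only that exchange step; the toy `fitness` function, the quartile cut, and the perturbation factor are assumptions, and this is not presented as DeepMind’s exact procedure.

```python
import random

def pbt_round(population, fitness, rng, perturb=0.2):
    """One exploit/explore round of population-based training (PBT).

    The bottom quarter of the population copies the parameters of a randomly
    chosen top-quarter member, then perturbs the copy.
    """
    if len(population) < 2:
        return population
    ranked = sorted(population, key=fitness, reverse=True)
    cut = max(1, len(ranked) // 4)
    top, bottom = ranked[:cut], ranked[-cut:]
    for loser in bottom:
        winner = dict(rng.choice(top))
        # Exploit: transfer the stronger member's parameters wholesale.
        loser.clear()
        loser.update(winner)
        # Explore: jitter each copied value so the population stays diverse.
        for key in loser:
            loser[key] *= rng.choice((1 - perturb, 1 + perturb))
    return population

def fitness(member):
    """Hypothetical toy objective: each member has one parameter "x"; best is 3."""
    return -abs(member["x"] - 3.0)

population = [{"x": v} for v in (0.5, 1.0, 2.0, 4.0, 5.0, 8.0, 9.0, 10.0)]
best_before = max(map(fitness, population))
rng = random.Random(0)
for _ in range(20):
    # A real PBT loop would train every member between rounds; omitted here.
    pbt_round(population, fitness, rng)
best_after = max(map(fitness, population))
print(best_before, best_after)
```

Because the top members are never overwritten, the best score in the population cannot get worse from one round to the next, while the copied-and-perturbed members keep searching nearby.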
The performance of the reinforcement learning agents was evaluated based on their general ability to accomplish a wide range of tasks they had not been trained on. Some of the test tasks include well-known challenges such as “capture the flag” and “hide and seek.”
According to DeepMind, each agent played around 700,000 unique games in 4,000 unique worlds within XLand and went through 200 billion training steps across 3.4 million unique tasks (in the paper, the researchers write that 100 million steps are equivalent to approximately 30 minutes of training).
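Taking both figures at face value, a quick back-of-the-envelope conversion shows what 200 billion steps means in wall-clock terms. The constants come from the numbers quoted above; the 30-minute equivalence is the paper’s approximation and presumably depends on the compute the researchers used.

```python
# Figures quoted above; the 30-minute equivalence is approximate.
TOTAL_STEPS = 200_000_000_000     # 200 billion training steps per agent
STEPS_PER_CHUNK = 100_000_000     # 100 million steps ...
MINUTES_PER_CHUNK = 30            # ... correspond to roughly 30 minutes

chunks = TOTAL_STEPS // STEPS_PER_CHUNK   # 2,000 chunks of 100 million steps
minutes = chunks * MINUTES_PER_CHUNK      # 60,000 minutes
hours = minutes / 60                      # 1,000 hours
days = hours / 24                         # roughly 41.7 days
print(f"{hours:,.0f} hours, about {days:.1f} days")
```

In other words, each agent’s training corresponds to roughly six weeks of continuous training on the researchers’ setup.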
“At this time, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human,” the AI researchers wrote. “And the results we’re seeing clearly exhibit general, zero-shot behaviour across the task space.”
Zero-shot machine learning models can solve problems that were not present in their training dataset. In a complicated space such as XLand, zero-shot learning might imply that the agents have obtained fundamental knowledge about their environment as opposed to memorizing sequences of image frames in specific tasks and environments.
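Zero-shot evaluation hinges on keeping evaluation tasks strictly out of the training set. The sketch below shows that bookkeeping plus one simple score, the fraction of held-out tasks on which the frozen agent earns any reward at all, which echoes the “participate” wording quoted above. The task names, the reward values, and this exact definition of the score are hypothetical assumptions, not figures or metrics taken from the paper.

```python
import random

def split_tasks(tasks, held_out_fraction=0.2, seed=0):
    """Partition tasks so evaluation tasks never appear in training."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * held_out_fraction)
    return shuffled[cut:], shuffled[:cut]  # (train, held_out)

def participation(rewards):
    """Fraction of tasks on which the agent earned any reward at all."""
    return sum(1 for r in rewards.values() if r > 0) / len(rewards)

tasks = [f"task-{i}" for i in range(10)]
train, held_out = split_tasks(tasks)
# Hypothetical zero-shot rewards: the frozen agent runs on held-out tasks only,
# with no further training on them.
rewards = {t: (1.0 if i % 2 == 0 else 0.0) for i, t in enumerate(held_out)}
print(len(train), len(held_out), participation(rewards))
```

The disjoint split is what distinguishes genuine generalization from memorization: any reward earned on `held_out` comes from knowledge the agent carried over, not from having seen those exact tasks.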
The reinforcement learning agents further manifested signs of generalized learning when the researchers tried to adjust them for new tasks. According to their findings, 30 minutes of fine-tuning on new tasks was enough to produce an impressive improvement in a reinforcement learning agent trained with the new method. In contrast, an agent trained from scratch for the same amount of time would have near-zero performance on most tasks.