This article is part of our coverage of the latest in AI research.
Generative AI has made remarkable progress in producing realistic images and videos. However, interactive simulation—the ability to generate dynamic content that responds to user actions in real time—remains a significant challenge.
A new study by Google Research demonstrates that neural networks can simulate complex video games in real time. GameNGen, Google’s new model, can simulate DOOM without a traditional game engine, generating gameplay by predicting each frame from the previous frames and the player’s actions.
Simulating DOOM with diffusion models
Current generative models, such as Stable Diffusion and DALL-E, are good at producing images and videos from text descriptions or images. However, creating interactive simulations that respond in real time to a stream of user input poses unique challenges for existing architectures.
Previous efforts to simulate games through neural networks have been limited in terms of the complexity of the game, simulation speed, stability over long periods, or visual quality.
Google’s GameNGen (pronounced “game engine”) is a diffusion model that learns to simulate the classic first-person shooter game DOOM. Games like DOOM have an engine that represents the state of the game, player, map, characters, objects, etc. At each timestep, the game engine updates all components and renders the next frame based on where the player is looking.
In contrast, GameNGen learns to simulate the game by predicting the contents of the next frame based on previous frames and user actions. GameNGen is based on Stable Diffusion 1.4, a popular text-to-image diffusion model. However, instead of text prompts, GameNGen is conditioned on user actions (e.g., key presses) and previous game frames.
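The conditioning swap described above can be sketched in a few lines. Everything here is illustrative (toy resolution, a stand-in for the denoising network, hypothetical names); the point is the interface: the model consumes a window of past frames plus an action, and each prediction is fed back in autoregressively.

```python
import numpy as np

# Toy stand-in for an action-conditioned next-frame predictor.
# A real model like GameNGen would replace predict_next_frame with a
# denoising diffusion network; here we just average the context frames
# and shift by an action embedding to illustrate the interface.

N_CONTEXT = 4            # how many past frames condition the model
FRAME_SHAPE = (8, 8, 3)  # toy resolution
N_ACTIONS = 5            # e.g., forward, back, left, right, fire

rng = np.random.default_rng(0)
action_embeddings = rng.normal(size=(N_ACTIONS, 1))  # toy per-action embedding

def predict_next_frame(context_frames, action_id):
    """Stand-in for one sampling pass of the diffusion model."""
    base = np.mean(context_frames, axis=0)             # condition on past frames
    return base + 0.01 * action_embeddings[action_id]  # condition on the action

# Autoregressive rollout: each predicted frame joins the context for the next.
frames = [np.zeros(FRAME_SHAPE) for _ in range(N_CONTEXT)]
for action in [0, 3, 3, 4]:
    next_frame = predict_next_frame(frames[-N_CONTEXT:], action)
    frames.append(next_frame)

print(len(frames))  # 8: four seed frames plus four generated ones
```

The key structural difference from text-to-image use of Stable Diffusion is that loop at the bottom: the model’s own outputs become its inputs, which is also why errors can accumulate over long rollouts.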
Training the diffusion model required a large dataset of gameplay trajectories. To gather this data, the researchers first trained a reinforcement learning (RL) agent to interact with the game environment and learn to play DOOM. The RL agent was trained on 10 million environment steps, and its training trajectories, which include different skill levels of play, were recorded to create the dataset.
“Unlike a typical RL setup which attempts to maximize game score, our goal is to generate training data which resembles human play, or at least contains enough diverse examples, in a variety of scenarios, to maximize training data efficiency,” the researchers write.
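The data-collection setup amounts to recording the agent’s (frame, action) pairs as it plays. A minimal sketch of such a loop, with a hypothetical environment and policy API standing in for the researchers’ actual setup:

```python
import random

# Hypothetical sketch of recording (observation, action) trajectories from an
# agent for later supervised training of the frame predictor. The env and
# policy APIs here are illustrative, not the researchers' actual code.

class ToyEnv:
    """Fixed-length toy episode; a real env would return rendered frames."""
    def reset(self):
        self.t = 0
        return 0  # observation (a game frame, in the real setting)

    def step(self, action):
        self.t += 1
        return self.t, self.t >= 10  # (next observation, done flag)

def collect_trajectories(env, policy, n_episodes):
    dataset = []
    for _ in range(n_episodes):
        obs, done, traj = env.reset(), False, []
        while not done:
            action = policy(obs)
            next_obs, done = env.step(action)
            traj.append((obs, action))  # frame + action pair for training
            obs = next_obs
        dataset.append(traj)
    return dataset

data = collect_trajectories(ToyEnv(), lambda obs: random.randrange(4), n_episodes=3)
print(sum(len(t) for t in data))  # 30 recorded (frame, action) pairs
```

As the quote above notes, the policy here does not need to be an expert player; it needs to cover enough of the game’s state space that the diffusion model sees diverse situations.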
The diffusion model was then trained on this dataset of roughly 900 million frames, using 128 TPU v5e accelerators for 700,000 training steps.
To achieve real-time performance during inference, the researchers reduced the number of sampling and denoising steps in the diffusion model. They found that this had a negligible impact on the visual quality of the generated frames, likely due to the constrained image space of the game and the strong conditioning provided by the previous frames. GameNGen can run the game interactively at over 20 frames per second on a single TPU v5, achieving a visual quality comparable to the original game.
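The real-time constraint is easy to quantify: at 20 frames per second, each frame must be produced within 50 ms, so the per-step cost of denoising caps how many sampling steps fit in the budget. A back-of-the-envelope check (the per-step timing is illustrative, not a measured number):

```python
# Budget arithmetic for real-time diffusion sampling.
TARGET_FPS = 20
frame_budget_ms = 1000 / TARGET_FPS  # 50 ms available per frame

# Illustrative per-denoising-step cost; the real figure depends on
# hardware and model size, and is not taken from the paper.
step_cost_ms = 10.0
max_steps = int(frame_budget_ms // step_cost_ms)

print(frame_budget_ms, max_steps)  # 50 ms budget allows at most 5 steps here
```

This is why cutting the number of denoising steps, rather than shrinking the model, was the lever for real-time inference: the budget scales linearly with step count.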
Testing GameNGen
To evaluate the quality of the generated simulations, the researchers conducted a human evaluation study. They presented participants with short clips of the game generated by GameNGen and the original game engine and asked them to distinguish between the two.
For short trajectories (1.6 to 3 seconds), human raters were only slightly better than random chance at identifying the simulated clips.
“GameNGen is a proof-of-concept for one part of a new paradigm where games are weights of a neural model, not lines of code,” the researchers write.
Despite its impressive results, GameNGen has limitations. The model has a limited memory of only three seconds, which restricts its ability to capture long-term dependencies in the game. The researchers’ attempts to extend the context window did not improve performance, suggesting the need for architectural changes to address this limitation.
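At the reported 20 frames per second, a three-second memory corresponds to roughly 60 frames of context; anything older simply falls out of the model’s conditioning window. A minimal sketch of such a rolling window:

```python
from collections import deque

FPS = 20
MEMORY_SECONDS = 3

# Rolling conditioning window: only the last 60 frames survive.
context = deque(maxlen=FPS * MEMORY_SECONDS)

for frame_id in range(200):
    context.append(frame_id)  # older frames silently fall out

print(len(context), context[0])  # 60 frames kept; oldest survivor is frame 140
```

Anything the player did more than three seconds ago (a door opened, an enemy killed) is invisible to the model unless its effects are still on screen, which is why simply enlarging this window was tried and, per the paper, did not help.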
The human evaluation studies were also limited to very short clips. Longer examples posted on the project’s website clearly show generative model artifacts when the simulations run for a few seconds or more.
Another limitation is the model’s reliance on the RL agent’s training data. The assumption is that human players will act within the boundaries of the behavior learned by the RL agent. If a player’s actions deviate significantly from the training distribution, the model will behave unpredictably, which is not the experience you would want as a gamer.
One major challenge not explicitly addressed in the paper is the limited field of view in games. What the player sees on the screen is only a small part of the game world. DOOM is a crude and limited game, but in modern games, a lot of the action happens off-screen. GameNGen was trained on the visible frames, which means it might not have a complete representation of the game state. It is unclear how this approach would scale to more complex games with richer environments and more intricate game mechanics.
Finally, the computational cost of training GameNGen was substantial. While the researchers achieved real-time performance during inference, the training process required a large fleet of TPUs and millions of dollars’ worth of compute, and this for a game that can run on toasters and toothbrushes and uses rotating billboards instead of real 3D objects. Adding new game levels would also require retraining the model at very high cost.
In its current state, this approach is clearly not viable for developing commercial games in the near future, especially given that classic game engines remain more robust, stable, and predictable. Nonetheless, GameNGen demonstrates the potential of neural networks to learn and simulate complex environments in real time. As we have often seen, the technologies we create end up solving problems that are different from the ones they were intended for.