On Tuesday, researchers from Google and Tel Aviv University unveiled GameNGen, a new AI model that can interactively simulate the classic 1993 first-person shooter game Doom in real time using AI image generation techniques borrowed from Stable Diffusion. It’s a neural network system that can function as a limited game engine, potentially opening new possibilities for real-time video game synthesis in the future.
For example, instead of drawing graphical video frames using traditional techniques, future games could potentially use an AI engine to “imagine” or hallucinate graphics in real time as a prediction task.
“The potential here is absurd,” wrote app developer Nick Dobos in reaction to the news. “Why write complex rules for software by hand when the AI can just think every pixel for you?”
GameNGen can reportedly generate new frames of Doom gameplay at over 20 frames per second using a single tensor processing unit (TPU), a type of specialized processor similar to a GPU that is optimized for machine learning tasks.
In tests, the researchers say that ten human raters sometimes failed to distinguish between short clips (1.6 seconds and 3.2 seconds) of actual Doom game footage and outputs generated by GameNGen, identifying the true gameplay footage 58 percent or 60 percent of the time.
Real-time video game synthesis using what might be called “neural rendering” is not a completely novel idea. Nvidia CEO Jensen Huang predicted during an interview in March, perhaps somewhat boldly, that most video game graphics could be generated by AI in real time within five to 10 years.
GameNGen also builds on previous work in the field, cited in the GameNGen paper, that includes World Models in 2018, GameGAN in 2020, and Google’s own Genie in March. And a group of university researchers trained an AI model (called “DIAMOND“) to simulate vintage Atari video games using a diffusion model earlier this year.
Also, ongoing research into “world models” or “world simulators,” commonly associated with AI video synthesis models like Runway’s Gen-3 Alpha and OpenAI’s Sora, is leaning toward a similar direction. For example, during the debut of Sora, OpenAI showed demo videos of the AI generator simulating Minecraft.
Diffusion is key
In a preprint research paper titled “Diffusion Models Are Real-Time Game Engines,” authors Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter explain how GameNGen works. Their system uses a modified version of Stable Diffusion 1.4, an image synthesis diffusion model released in 2022 that people use to produce AI-generated images.
“Turns out the answer to ‘can it run DOOM?’ is yes for diffusion models,” wrote Stability AI Research Director Tanishq Mathew Abraham, who was not involved with the research project.
While being directed by player input, the diffusion model predicts the next gaming state from previous ones after having been trained on extensive footage of Doom in action.
The development of GameNGen involved a two-phase training process. Initially, the researchers trained a reinforcement learning agent to play Doom, with its gameplay sessions recorded to create an automatically generated training dataset—that footage we mentioned. They then used that data to train the custom Stable Diffusion model.
However, using Stable Diffusion introduces some graphical glitches, as the researchers note in their abstract: “The pre-trained auto-encoder of Stable Diffusion v1.4, which compresses 8×8 pixel patches into 4 latent channels, results in meaningful artifacts when predicting game frames, which affect small details and particularly the bottom bar HUD.”
And that’s not the only challenge. Keeping the images visually clear and consistent over time (often called “temporal coherency” in the AI video space) can be a challenge. GameNGen researchers say that “interactive world simulation is more than just very fast video generation,” as they write in their paper. “The requirement to condition on a stream of input actions that is only available throughout the generation breaks some assumptions of existing diffusion model architectures,” including repeatedly generating new frames based on previous ones (called “autoregression”), which can lead to instability and a rapid decline in the quality of the generated world over time.