Google has discovered its first real-world vulnerability using an artificial intelligence (AI) agent that company researchers are designing expressly for this purpose. The discovery of a memory-safety flaw in a production version of a popular open source database by the company’s Big Sleep large language model (LLM) project is the first of its kind, and it has “tremendous defensive potential” for organizations, the Big Sleep team wrote in a recent Project Zero blog.
Big Sleep — a collaboration between the company’s Project Zero and DeepMind teams — discovered an exploitable stack buffer underflow in SQLite, a widely used open source database engine.
Specifically, Big Sleep found a pattern in the code of a publicly released version of SQLite that creates an edge case that every piece of code using the affected field must handle. One function failed to handle that edge case correctly, “resulting in a write into a stack buffer with a negative index when handling a query with a constraint on the ‘rowid’ column,” thus creating an exploitable flaw, according to the post.
Google reported the bug to SQLite developers in early October. They fixed it on the same day and before it appeared in an official release of the database, so users were not affected.
Inspired by AI Bug-Hunting Peers
“We believe this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software,” the Big Sleep team wrote in the post. While this may be true, it’s not the first time an LLM-based reasoning system autonomously found a flaw in the SQLite database engine, Google acknowledged.
An LLM-based system called Atlantis, from a group of AI experts known as Team Atlanta, discovered six zero-day flaws in SQLite3 and even autonomously identified and patched one of them during the AI Cyber Challenge organized by ARPA-H, DARPA, and the White House, the team revealed in an August blog post.
In fact, the Big Sleep team used one of the Team Atlanta discoveries — of “a null-pointer dereference” flaw in SQLite — to inspire them to use AI “to see if we could find a more serious vulnerability,” according to the post.
Software Review Goes Beyond Fuzzing
Google and other software development teams already use a process called fuzz-testing, colloquially known as “fuzzing,” to find flaws in applications before release. Fuzzing bombards the software with deliberately malformed inputs to see if it crashes, so developers can investigate and fix the cause.
In fact, Google earlier this year released an AI-boosted fuzzing framework as an open source resource to help developers and researchers improve how they find software vulnerabilities. The framework automates manual aspects of fuzz-testing and uses LLMs to write project-specific code to boost code coverage.
While fuzzing “has helped significantly” to reduce the number of flaws in production software, developers need a more powerful approach “to find the bugs that are difficult (or impossible) to find” in this way, such as variants for previously found and patched vulnerabilities, the Big Sleep team wrote.
“As this trend continues, it’s clear that fuzzing is not succeeding at catching such variants, and that for attackers, manual variant analysis is a cost-effective approach,” the team wrote in the post.
Moreover, variant analysis is a better fit for current LLMs because it provides them with a starting point for a search, such as the details of a previously fixed flaw, and thus removes much of the ambiguity from AI-based vulnerability testing, according to Google. In fact, at this point in the evolution of LLMs, the lack of such a starting point can cause confusion, the team noted.
“We’re hopeful that AI can narrow this gap,” the Big Sleep team wrote. “We think that this is a promising path towards finally turning the tables and achieving an asymmetric advantage for defenders.”
Glimpse Into the Future
Google Big Sleep is still in its research phase, and using AI-based automation to identify software flaws overall is a new discipline. However, there already are tools available that developers can use to get a jump on finding vulnerabilities in software code before public release.
Late last month, researchers at Protect AI released Vulnhuntr, a free, open source static code analyzer that can find zero-day vulnerabilities in Python codebases using Anthropic’s Claude AI model.
Indeed, Google’s discovery shows promising progress for the future of using AI to help developers troubleshoot software before letting flaws seep into production versions.
“Finding vulnerabilities in software before it’s even released means that there’s no scope for attackers to compete: the vulnerabilities are fixed before attackers even have a chance to use them,” Google’s Big Sleep team wrote.