A trove of documents that appear to describe how Google ranks search results has surfaced online, likely as the result of accidental publication by an in-house bot.
The leaked documentation describes an old version of Google’s Content Warehouse API and provides a glimpse of Google Search’s inner workings.
The material appears to have been inadvertently committed to a publicly accessible Google-owned repository on GitHub around March 13 by the web giant’s own automated tooling. That automation tacked an Apache 2.0 open source license on the commit, as is standard for Google’s public documentation. A follow-up commit on May 7 attempted to undo the leak.
The material was nonetheless spotted by Erfan Azimi, CEO of search engine optimization (SEO) biz EA Digital Eagle, and was then disclosed on Sunday by fellow SEO operatives Rand Fishkin, CEO of SparkToro, and Michael King, CEO of iPullRank.
These documents do not contain code; instead they describe how to use Google’s Content Warehouse API, which is likely intended for internal use only – the leaked documentation includes numerous references to internal systems and projects. While there is a similarly named Google Cloud API that is already public, what ended up on GitHub appears to go well beyond it.
The files are noteworthy for what they reveal about the things Google considers important when ranking web pages for relevance, a matter of enduring interest to anyone in the SEO business, or anyone operating a website and hoping Google will help it win traffic.
Among the 2,500-plus pages of documentation, assembled for easy perusal here, there are details on more than 14,000 attributes accessible through or associated with the API, though little information about which of these signals are actually used or how much they matter. It is therefore hard to discern the weight Google applies to each attribute in its search ranking algorithm.
But SEO consultants believe the documents contain noteworthy details because they differ from public statements made by Google representatives.
“Many of [Azimi’s] claims [in an email describing the leak] directly contradict public statements made by Googlers over the years, in particular the company’s repeated denial that click-centric user signals are employed, denial that subdomains are considered separately in rankings, denials of a sandbox for newer websites, denials that a domain’s age is collected or considered, and more,” explained SparkToro’s Fishkin in a report.
iPullRank’s King, in his post on the documents, pointed to a statement made by Google search advocate John Mueller, who said in a video that “we don’t have anything like a website authority score” – a measure of whether Google considers a site authoritative and therefore worthy of higher rankings for search results.
But King notes the documents reveal that a “siteAuthority” score can be calculated as part of the Compressed Quality Signals Google stores for documents.
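To make the structure concrete, here is a minimal sketch in Python of what a per-document quality-signal record of that kind could look like. Only the siteAuthority name comes from the leaked material; every other field, the class name CompressedQualitySignals as rendered here, and the threshold check are hypothetical placeholders, not Google’s actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CompressedQualitySignals:
    """Hypothetical sketch of a per-document quality-signal record.

    Only site_authority mirrors a name (siteAuthority) seen in the leaked
    docs; the other fields are illustrative placeholders.
    """
    site_authority: Optional[int] = None  # site-level authority score
    page_quality: Optional[int] = None    # placeholder: per-page quality score
    low_quality: Optional[int] = None     # placeholder: demotion-style signal


def is_authoritative(signals: CompressedQualitySignals, threshold: int = 500) -> bool:
    # Purely illustrative: treat a document as authoritative when its
    # site-level score clears an arbitrary threshold.
    return signals.site_authority is not None and signals.site_authority >= threshold
```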
Several other revelations are cited in the two posts.
One is how important clicks – and different types of clicks (good, bad, long, and so on) – are in determining how a webpage ranks. During the US v. Google antitrust trial, Google acknowledged [PDF] that it considers click metrics a ranking factor in web search.
Another is that Google uses views of websites in Chrome as a quality signal, seen in the API as the parameter ChromeInTotal. “One of the modules related to page quality scores features a site-level measure of views from Chrome,” according to King.
Additionally, the documents indicate that Google considers other factors like content freshness, authorship, whether a page is related to a site’s central focus, alignment between page title and content, and “the average weighted font size of a term in the doc body.”
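For illustration, the sketch below shows how several attributes of this sort could feed a single quality estimate. The chrome_in_total field mirrors the ChromeInTotal name cited from the leak; the remaining field names and the scoring arithmetic are invented for this example and say nothing about how Google actually weights these signals.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Illustrative grab-bag of the kinds of attributes the leak describes."""
    chrome_in_total: float = 0.0          # mirrors ChromeInTotal: Chrome view measure
    days_since_update: int = 0            # placeholder: content freshness
    title_matches_content: bool = True    # placeholder: title/content alignment
    avg_weighted_font_size: float = 0.0   # weighted font size of a term in the body


def toy_quality_score(p: PageSignals) -> float:
    # Not Google's algorithm: a toy combination showing how several
    # independent attributes could be folded into one quality estimate.
    score = 0.001 * p.chrome_in_total
    score += 1.0 if p.title_matches_content else 0.0
    score -= 0.01 * p.days_since_update
    score += 0.05 * p.avg_weighted_font_size
    return score
```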
Google did not respond to a request for comment. ®