Sunday, December 22, 2024

Google Data Leak Clarification

Over the United States holidays, some posts were shared about an alleged leak of Google ranking-related data. The first posts about the leak focused on “confirming” beliefs long held by Rand Fishkin, but little attention was paid to the context of the information and what it really means.

Context Matters: Document AI Warehouse

The leaked document appears to be related to a public Google Cloud platform called Document AI Warehouse, which is used for analyzing, organizing, searching, and storing data. The public documentation is titled Document AI Warehouse overview. A post on Facebook states that the “leaked” data is the “internal version” of the publicly visible Document AI Warehouse documentation. That’s the context of this data.

Screenshot: Document AI Warehouse

@DavidGQuaid tweeted:

“I think its clear its an external facing API for building a document warehouse as the name suggests”

That seems to throw cold water on the idea that the “leaked” data represents internal Google Search information.

As far as we know at this time, the “leaked” data is similar to what’s in the public Document AI Warehouse page.

Leak Of Internal Search Data?

The original post on SparkToro does not say that the data originates from Google Search. It says that the person who sent the data to Rand Fishkin is the one who made that claim.

One of the things I admire about Rand Fishkin is that he is meticulously precise in his writing, especially when it comes to caveats. Rand precisely notes that it’s the person who provided the data who makes the claim that the data originates from Google Search. There is no proof, only a claim.

He writes:

“I received an email from a person claiming to have access to a massive leak of API documentation from inside Google’s Search division.”

Fishkin himself does not affirm that the data was confirmed by ex-Googlers to have originated from Google Search. He writes that the person who emailed the data made that claim.

“The email further claimed that these leaked documents were confirmed as authentic by ex-Google employees, and that those ex-employees and others had shared additional, private information about Google’s search operations.”

Fishkin writes about a subsequent video meeting where the leaker revealed that his contact with ex-Googlers was in the context of meeting them at a search industry event. Again, we’ll have to take the leaker’s word for it about the ex-Googlers, and that what they said came after carefully reviewing the data and was not an informal comment.

Fishkin writes that he contacted three ex-Googlers about it. What’s notable is that those ex-Googlers did not explicitly confirm that the data is internal to Google Search. They only confirmed that the data resembles internal Google information, not that it originated from Google Search.

Fishkin writes what the ex-Googlers told him:

  • “I didn’t have access to this code when I worked there. But this certainly looks legit.”
  • “It has all the hallmarks of an internal Google API.”
  • “It’s a Java-based API. And someone spent a lot of time adhering to Google’s own internal standards for documentation and naming.”
  • “I’d need more time to be sure, but this matches internal documentation I’m familiar with.”
  • “Nothing I saw in a brief review suggests this is anything but legit.”

Saying something originates from Google Search and saying that it originates from Google are two different things.

Keep An Open Mind

It’s important to keep an open mind about the data because there is a lot about it that is unconfirmed. For example, it is not known if this is an internal Search Team document. Because of that it is probably not a good idea to take anything from this data as actionable SEO advice.

Also, it’s not advisable to analyze the data to specifically confirm long-held beliefs. That’s how one becomes ensnared in Confirmation Bias.

A definition of Confirmation Bias:

“Confirmation bias is the tendency to search for, interpret, favor, and recall information in a way that confirms or supports one’s prior beliefs or values.”

Confirmation Bias can lead a person to deny things that are empirically true. For example, there is the decades-old idea that Google automatically keeps a new site from ranking, a theory called the Sandbox. Yet people report every day that their new sites and new pages rank in the top ten of Google search almost immediately.

But if you are a hardened believer in the Sandbox then actual observable experience like that will be waved away, no matter how many people observe the opposite experience.

Brenda Malone, Freelance Senior SEO Technical Strategist and Web Developer (LinkedIn profile), messaged me about claims about the Sandbox:

“I personally know, from actual experience, that the Sandbox theory is wrong. I just indexed in two days a personal blog with two posts. There is no way a little two post site should have been indexed according to the the Sandbox theory.”

The takeaway here is that if the documentation turns out to originate from Google Search, the incorrect way to analyze the data is to go hunting for confirmation of long-held beliefs.

What Is The Google Data Leak About?

There are five things to consider about the leaked data:

  1. The context of the leaked information is unknown. Is it Google Search related? Is it for other purposes?
  2. The purpose of the data. Was the information used for actual search results? Or was it used for data management or manipulation internally?
  3. Ex-Googlers did not confirm that the data is specific to Google Search. They only confirmed that it appears to come from Google.
  4. Keep an open mind. If you go hunting for vindication of long-held beliefs, guess what? You will find them, everywhere. This is called confirmation bias.
  5. Evidence suggests that the data is related to an external-facing API for building a document warehouse.

What Others Say About “Leaked” Documents

Ryan Jones, who has not only deep SEO experience but also a formidable understanding of computer science, shared some reasonable observations about the so-called data leak.

Ryan tweeted:

“We don’t know if this is for production or for testing. My guess is it’s mostly for testing potential changes.

We don’t know what’s used for web or for other verticals. Some things might only be used for a Google home or news etc.

We don’t know what’s an input to a ML algo and what’s used to train against. My guess is clicks aren’t a direct input but used to train a model how to predict clickability. (Outside of trending boosts)

I’m also guessing that some of these fields only apply to training data sets and not all sites.

Am I saying Google didn’t lie? Not at all. But let’s examine this leak objectionably and not with any preconceived bias.”

@DavidGQuaid tweeted:

“We also don’t know if this is for Google search or Google cloud document retrieval

APIs seem pick & choose – that’s not how I expect the algorithm to be run – what if an engineer wants to skip all those quality checks – this looks like I want to build a content warehouse app for my enterprise knowledge base”

Is The “Leaked” Data Related To Google Search?

At this point in time there is no hard evidence that this “leaked” data is actually from Google Search. There is an overwhelming amount of ambiguity about what the purpose of the data is. Notable is that there are hints that this data is just “an external facing API for building a document warehouse as the name suggests” and not related in any way to how websites are ranked in Google Search.

The conclusion that this data did not originate from Google Search is not definitive at this time but it’s the direction that the wind of evidence appears to be blowing.

Featured Image by Shutterstock/Jaaak
