Saturday, November 2, 2024

Google On How It Manages Disclosure Of Search Incidents

Google’s latest Search Off The Record podcast discussed examples of disruptive incidents that can affect crawling and indexing, as well as the criteria for deciding whether or not to disclose the details of what happened.

Complicating the decision to make a statement is that there are times when SEOs and publishers report that Search is broken when, from Google’s point of view, it’s working the way it’s supposed to.

Google Search Has A High Uptime

The interesting part of the podcast began with the observation that Google Search (the home page with the search box) itself has an “extremely” high uptime and rarely goes down or becomes unreachable. Most of the reported issues were due to network routing problems on the Internet itself rather than a failure within Google’s infrastructure.

Gary Illyes commented:

“Yeah. The service that hosts the homepage is the same thing that hosts the status dashboard, the Google Search Status Dashboard, and it has like an insane uptime number. …the number is like 99.999 whatever.”

John Mueller jokingly responded with the word “nein” (pronounced like the number nine), which means “no” in German:

“Nein. It’s never down. Nein.”

The Googlers admit that the rest of Google Search on the backend does experience outages and they explain how that’s dealt with.

Crawling & Indexing Incidents At Google

Google’s ability to crawl and index web pages is critical for SEO and earnings. Disruption can lead to catastrophic consequences, particularly for time-sensitive content like announcements, news and sales events (to name a few).

Gary Illyes explained that there’s a team within Google called Site Reliability Engineering (SRE) that’s responsible for making sure that the public-facing systems are running smoothly. There’s an entire Google subdomain devoted to site reliability, where they explain that they approach the task of keeping systems operational in much the same way they approach software problems. They watch over services like Google Search, Ads, Gmail, and YouTube.

The SRE page explains the complexity of their mission as ranging from the very granular (fixing individual things) to fixing larger-scale problems that affect “continental-level service capacity” for users that measure in the billions.

Gary Illyes explains (at the 3:18 mark):

“Site Reliability Engineering org publishes their playbook on how they manage incidents. And a lot of the incidents are caught by incidents being issues with whatever systems. They catch them with automated processes, meaning that there are probers, for example, or there are certain rules that are set on monitoring software that looks at numbers.

And then, if the number exceeds whatever value, then it triggers an alert that is then captured by a software like an incident management software.”
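The rule-based alerting Gary describes can be illustrated with a minimal Python sketch. It is not Google’s tooling: the metric names, limits, and the `Rule`, `Incident`, and `evaluate` names are all hypothetical, chosen only to show a monitored number crossing a set value and opening an incident record.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str       # hypothetical metric name
    max_value: float  # alert when the observed value exceeds this limit


@dataclass
class Incident:
    metric: str
    observed: float
    limit: float


def evaluate(rules, readings):
    """Return one Incident for every rule whose metric exceeds its limit."""
    incidents = []
    for rule in rules:
        observed = readings.get(rule.metric)
        if observed is not None and observed > rule.max_value:
            incidents.append(Incident(rule.metric, observed, rule.max_value))
    return incidents


# Hypothetical rules and readings, for illustration only.
rules = [Rule("indexing_delay_minutes", 30.0), Rule("crawl_error_rate", 0.05)]
readings = {"indexing_delay_minutes": 95.0, "crawl_error_rate": 0.01}

opened = evaluate(rules, readings)
print(opened)
```

In this sketch only the indexing delay exceeds its limit, so a single incident is opened. A human still has to decide whether it is a real problem or a false positive.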

February 2024 Indexing Problem

Gary next explains how the February 2024 indexing problem is an example of how Google monitors and responds to incidents that could impact users in search. Part of the response is figuring out if it’s an actual problem or a false positive.

He explains:

“That’s what happened on February 1st as well. Basically some number went haywire, and then that opened an incident automatically internally. Then we have to decide whether that’s a false positive or it’s something that we need to actually look into, as in like we, the SRE folk.

And, in this case, they decided that, yeah, this is a valid thing. And then they raised the priority of the incident to one step higher from whatever it was.

I think it was a minor incident initially and then they raised it to medium. And then, when it becomes medium, then it ends up in our inbox. So we have a threshold for medium or higher. Yeah.”
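The escalation step and the “medium or higher” inbox threshold can be sketched in a few lines of Python. This is an assumption-laden illustration, not Google’s system: the podcast only names the minor and medium levels, so the `MAJOR` level, the `Priority` enum, and both function names are hypothetical.

```python
from enum import IntEnum


class Priority(IntEnum):
    MINOR = 1
    MEDIUM = 2
    MAJOR = 3  # hypothetical top level; the podcast names only minor and medium


# Per the quote, incidents at medium or higher reach the Search Relations inbox.
NOTIFY_THRESHOLD = Priority.MEDIUM


def escalate(priority):
    """Raise the priority by one step, capped at the highest level."""
    return Priority(min(priority + 1, Priority.MAJOR))


def reaches_inbox(priority):
    """True when the incident is at or above the notification threshold."""
    return priority >= NOTIFY_THRESHOLD


priority = Priority.MINOR
before = reaches_inbox(priority)   # a minor incident stays with SRE
priority = escalate(priority)      # SRE confirms it is valid and raises it
after = reaches_inbox(priority)    # medium now crosses the threshold
print(before, priority.name, after)
```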

Minor Incidents Aren’t Publicly Announced

Gary Illyes next explained that they don’t communicate every little incident that happens because most of the time users won’t even notice it. The most important consideration is whether the incident affects users; if it does, it automatically receives an upgraded priority level.

An interesting fact about how Google decides what’s important is that problems that affect users are automatically boosted to a higher priority level. Gary said he doesn’t work in SRE, so he was unable to comment on the exact number of users that need to be affected before Google decides to make a public announcement.

Gary explained:

“SRE would investigate everything. If they get a prober alert, for example, or an alert based on whatever numbers, they will look into it and will try to explain that to themselves.

And, if it’s something that is affecting users, then it almost automatically means that they need to raise the priority because users are actually affected.”

Incident With Images Disappearing

Gary shared another example of an incident, this time involving images that weren’t showing up for users. It was decided that although the user experience was degraded, it was not degraded to the point of keeping users from finding what they were searching for or of making Google unusable. Thus, it’s not just whether users are affected by an incident that causes an escalation in priority but also how badly the user experience is affected.

The case of the images not displaying was a situation in which they decided not to make a public statement because users could still find the information they needed. Although Gary didn’t mention it, it does sound like an issue that recipe bloggers have encountered in the past where images stopped showing.

He explained:

“Like, for example, recently there was an incident where some images were missing. If I remember correctly, then I stepped in and I said like, “This is stupid, and we should not externalize it because the user impact is actually not bad,” right? Users will literally just not get the images. It’s not like something is broken. They will just not see certain images on the search result pages.

And, to me, that’s just, well, back to 1990 or back to 2008 or something. It’s like it’s still usable and still everything is dandy except some images.”

Are Publishers & SEOs Considered?

Google’s John Mueller asked Gary whether the threshold for making a public announcement was a degraded user experience, or whether the experience of publishers and SEOs was also considered.

Gary answered (at about the 8 minute mark):

“So it’s Search Relations, not Site Owners Relations, from Search perspective.

But by extension, like the site owners, they would also care about their users. So, if we care about their users, it’s the same group of people, right? Or is that too positive?”

Gary apparently sees the role of Search Relations primarily in terms of Google’s users in a general sense. That may come as a surprise to many in the SEO community because Google’s own documentation for their Search Off The Record podcast explains the role of the Search Relations team differently:

“As the Search Relations team at Google, we’re here to help site owners be successful with their websites in Google Search.”

Listening to the entire podcast, it’s clear that Googlers John Mueller and Lizzi Sassman are strongly focused on engaging with the search community. So maybe there’s a language issue that’s causing his remark to be interpreted differently than he intended?

What Does Search Relations Mean?

Google explained that they have a process for deciding what to disclose about disruptions in search and it is a 100% sensible approach. But something to consider is that the definition of “relations” is that it’s about a connection between two or more people.

Search is a relation(ship). It is an ecosystem with two partners: the creators (SEOs and site owners), who create content, and Google, which makes it available to its users.

Featured Image by Shutterstock/Khosro
