Monday, December 23, 2024

Google Revamps Entire Crawler Documentation

Must read

Google has launched a major revamp of its Crawler documentation, shrinking the main overview page and splitting content into three new, more focused pages.  Although the changelog downplays the changes there is an entirely new section and basically a rewrite of the entire crawler overview page. The additional pages allows Google to increase the information density of all the crawler pages and improves topical coverage.

What Changed?

Google’s documentation changelog notes two changes but there is actually a lot more.

Here are some of the changes:

  • Added an updated user agent string for the GoogleProducer crawler
  • Added content encoding information
  • Added a new section about technical properties

The technical properties section contains entirely new information that didn’t previously exist. There are no changes to the crawler behavior, but by creating three topically specific pages Google is able to add more information to the crawler overview page while simultaneously making it smaller.

This is the new information about content encoding (compression):

“Google’s crawlers and fetchers support the following content encodings (compressions): gzip, deflate, and Brotli (br). The content encodings supported by each Google user agent is advertised in the Accept-Encoding header of each request they make. For example, Accept-Encoding: gzip, deflate, br.”

There is additional information about crawling over HTTP/1.1 and HTTP/2, plus a statement about their goal being to crawl as many pages as possible without impacting the website server.

What Is The Goal Of The Revamp?

The change to the documentation was due to the fact that the overview page had become large. Additional crawler information would make the overview page even larger. A decision was made to break the page into three subtopics so that the specific crawler content could continue to grow and making room for more general information on the overviews page. Spinning off subtopics into their own pages is a brilliant solution to the problem of how best to serve users.

This is how the documentation changelog explains the change:

“The documentation grew very long which limited our ability to extend the content about our crawlers and user-triggered fetchers.

…Reorganized the documentation for Google’s crawlers and user-triggered fetchers. We also added explicit notes about what product each crawler affects, and added a robots.txt snippet for each crawler to demonstrate how to use the user agent tokens. There were no meaningful changes to the content otherwise.”

The changelog downplays the changes by describing them as a reorganization because the crawler overview is substantially rewritten, in addition to the creation of three brand new pages.

While the content remains substantially the same, the division of it into sub-topics makes it easier for Google to add more content to the new pages without continuing to grow the original page. The original page, called Overview of Google crawlers and fetchers (user agents), is now truly an overview with more granular content moved to standalone pages.

Google published three new pages:

  1. Common crawlers
  2. Special-case crawlers
  3. User-triggered fetchers

1. Common Crawlers

As it says on the title, these are common crawlers, some of which are associated with GoogleBot, including the Google-InspectionTool, which uses the GoogleBot user agent. All of the bots listed on this page obey the robots.txt rules.

These are the documented Google crawlers:

  • Googlebot
  • Googlebot Image
  • Googlebot Video
  • Googlebot News
  • Google StoreBot
  • Google-InspectionTool
  • GoogleOther
  • GoogleOther-Image
  • GoogleOther-Video
  • Google-CloudVertexBot
  • Google-Extended

3. Special-Case Crawlers

These are crawlers that are associated with specific products and are crawled by agreement with users of those products and operate from IP addresses that are distinct from the GoogleBot crawler IP addresses.

List of Special-Case Crawlers:

  • AdSense
    User Agent for Robots.txt: Mediapartners-Google
  • AdsBot
    User Agent for Robots.txt: AdsBot-Google
  • AdsBot Mobile Web
    User Agent for Robots.txt: AdsBot-Google-Mobile
  • APIs-Google
    User Agent for Robots.txt: APIs-Google
  • Google-Safety
    User Agent for Robots.txt: Google-Safety

3. User-Triggered Fetchers

The User-triggered Fetchers page covers bots that are activated by user request, explained like this:

“User-triggered fetchers are initiated by users to perform a fetching function within a Google product. For example, Google Site Verifier acts on a user’s request, or a site hosted on Google Cloud (GCP) has a feature that allows the site’s users to retrieve an external RSS feed. Because the fetch was requested by a user, these fetchers generally ignore robots.txt rules. The general technical properties of Google’s crawlers also apply to the user-triggered fetchers.”

The documentation covers the following bots:

  • Feedfetcher
  • Google Publisher Center
  • Google Read Aloud
  • Google Site Verifier

Takeaway:

Google’s crawler overview page became overly comprehensive and possibly less useful because people don’t always need a comprehensive page, they’re just interested in specific information. The overview page is less specific but also easier to understand. It now serves as an entry point where users can drill down to more specific subtopics related to the three kinds of crawlers.

This change offers insights into how to freshen up a page that might be underperforming because it has become too comprehensive. Breaking out a comprehensive page into standalone pages allows the subtopics to address specific users needs and possibly make them more useful should they rank in the search results.

I would not say that the change reflects anything in Google’s algorithm, it only reflects how Google updated their documentation to make it more useful and set it up for adding even more information.

Read Google’s New Documentation

Overview of Google crawlers and fetchers (user agents)

List of Google’s common crawlers

List of Google’s special-case crawlers

List of Google user-triggered fetchers

Featured Image by Shutterstock/Cast Of Thousands

Latest article