Continuing today’s coverage of clustering and canonicalization in Google Search, Allan Scott from Google explained how Google Search’s clustering technology works with its localization technology. The answer is, wait for it, “it depends,” but let’s dig in.
This came up in the excellent Search Off The Record interview of Allan Scott from the Google Search team, who works specifically on duplication within Google Search. Martin Splitt and John Mueller from Google interviewed Allan.
Allan said, “You’re asking about how clustering works with localization. Well, the answer is it depends.” This all started at the 8:35 mark into the interview where Allan said, “This would be the localization iceberg that we’re now encountering. You can see the tiny sliver above the water line, and then there’s this giant mass underneath.”
Localization and hreflang, as Google has said before, may be the most complex area of SEO. “If this topic seems confusing externally, it’s also confusing internally,” Allan said. “We have been trying to make localization work in a reasonable way for a very long time because it’s a very challenging subject,” he added.
Google has two categories of localization. Allan explained, “Internally, there’s essentially two categories of localization types. There are the localization types where it’s just a boilerplate translation, which is something very common, especially with big social media sites. They don’t translate the content, whereas there are also translations that are full translations where you will see the actual content of the page fully change.”
Google is not too concerned with the boilerplate translations. Allan said, “That is not something that we’re really concerned with and doling out for people.” The other type, the full translation, well, that is important.
Allan said:
But the full translation pages should not cluster because they have different tokens they’re going to retrieve for different queries, so we don’t want them in the same cluster. We want to have all those pages available for retrieval. The boilerplate translations, we want to put into the same cluster. That means that they’ll consolidate signals, but it also means that we don’t have to crawl every single localization variant because to be honest, you know, we’re wasting your bandwidth, and we’re wasting our space by doing that. That’s why it depends. There’s two different ways we want to handle these things, and which one you’re doing matters. And then you get the really complicated ones, like what you said, where they just changed the price. Those ones become more complicated because it’s basically the same content, but for one token. But that one token really matters. And then that one token case, we still want to have them in different clusters. That’s a more challenging problem in theory than, you know, not putting two language variants into the same cluster. But, you know, that’s why localization is a hard space.
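To make the distinction concrete, here is a toy sketch of the idea Allan describes. It is not Google's algorithm. It assumes the boilerplate (navigation, buttons, footers) has already been stripped, and it groups pages only when their remaining main content produces exactly the same set of tokens. The function names `content_key` and `cluster_pages` and the example URLs are made up for illustration.

```python
from collections import defaultdict


def content_key(main_content: str) -> frozenset:
    """Reduce a page's main content to the set of lowercase tokens it contains."""
    return frozenset(main_content.lower().split())


def cluster_pages(pages):
    """Group URLs whose main content yields exactly the same token set.

    `pages` is a list of (url, main_content) pairs. Boilerplate is assumed
    to have been removed already (an assumption of this sketch).
    """
    clusters = defaultdict(list)
    for url, main_content in pages:
        clusters[content_key(main_content)].append(url)
    return list(clusters.values())


# Hypothetical pages: /en and /fr differ only in (already stripped) boilerplate,
# /de is a full translation, and /uk changes a single token, the price.
pages = [
    ("https://example.com/en/widget", "Blue widget for sale 10 USD"),
    ("https://example.com/fr/widget", "Blue widget for sale 10 USD"),
    ("https://example.com/de/widget", "Blaues Widget zu verkaufen 10 USD"),
    ("https://example.com/uk/widget", "Blue widget for sale 8 GBP"),
]

for cluster in cluster_pages(pages):
    print(cluster)
```

In this sketch the two boilerplate-only variants land in one cluster, while the full translation and the price variant each get their own, because even one differing token changes the key. A real system would need something far more tolerant than exact token-set matching, which is part of why Allan calls the one-token case the harder problem.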
Then, on top of all the clustering for duplication, there is also hreflang. Allan said:
sitting on top of all of this talk about clustering, which is the dups system on its own, there’s hreflang, which is basically a separate system where, if you put in the annotations, we will try to substitute them. John knows that there is a project right now which may or may not be live by the end of the year that is attempting to increase the reach of that specifically. We want to serve more hreflang variants. We want to utilize that more, but we need to put in place mechanisms that will determine basically how much we can trust it on a given site. We’re doing some crawl and verification, basically, to determine, you know, is this site serving its map correctly, and if so, then we’re going to try to serve that more often without necessarily having to verify it as much as we currently do.
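Allan does not say how Google checks whether a site is “serving its map correctly,” so the following is only an illustration of one check site owners can run themselves. Google’s hreflang documentation asks for annotations to be reciprocal: if page A lists page B as an alternate, page B should list page A back. The sketch assumes you have already collected each page’s hreflang annotations into a dictionary; the function name `find_missing_return_links` and the URLs are hypothetical.

```python
def find_missing_return_links(hreflang_map):
    """Return (page, alternate) pairs where the alternate does not link back.

    `hreflang_map` maps each page URL to a dict of {language_code: alternate_url},
    as collected from that page's hreflang annotations.
    """
    missing = []
    for page, alternates in hreflang_map.items():
        for alt_url in alternates.values():
            if alt_url == page:
                continue  # a self-reference needs no return link
            # Annotations found on the alternate page (empty if it was never collected).
            back_links = hreflang_map.get(alt_url, {})
            if page not in back_links.values():
                missing.append((page, alt_url))
    return missing


# Hypothetical site: the French page forgets to point back to the English one.
site = {
    "https://example.com/en/": {
        "en": "https://example.com/en/",
        "fr": "https://example.com/fr/",
    },
    "https://example.com/fr/": {
        "fr": "https://example.com/fr/",
    },
}

for page, alternate in find_missing_return_links(site):
    print(f"{alternate} does not link back to {page}")
```

A map that passes this kind of check is internally consistent, which is at least in the spirit of what Allan describes, though whatever trust signals Google actually computes are not spelled out in the interview.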
This quote stood out to me a lot: “John knows that there is a project right now which may or may not be live by the end of the year that is attempting to increase the reach of that specifically.” Google wants to “serve more hreflang variants.” Hmmm….
Anyway, the discussion goes on from there in the full Search Off The Record episode.
Forum discussion at X.