Google’s John Mueller explained the difference between clustering and canonicalization within Google Search. He said, “Clustering is basically taking the pages that we think are the same. And then canonicalization is, from those pages, which one is the best one.” John said this at the 3:03 minute mark of the interview.
This came up in the excellent Search Off The Record interview of Allan Scott from the Google Search team, who works specifically on duplication within Google Search. Martin Splitt and John Mueller from Google interviewed Allan.
Allan explained at the beginning of the video, “when people think canonicalization, they sort of imagine this one black box that does all the magic things together. And it’s very difficult to handle requests from people that are like, ‘Well, why is canonicalization wrong?’ And so I tend to push people to think of it as, well, canonicalization is one step. It’s I have a bunch of URLs, and I want to know which of them is the canonical, but there are other steps that are as, if not more important here, like the first one being clustering.”
Allan continued to explain, “Usually, when people come to us and complain about canonicalization, the immediate thing we say is, ‘Oh, that’s a clustering problem, because these two pages shouldn’t be in the same cluster, let alone cases of canonical selection.’ Like, if you want to bring a canonicalization problem to me, what that is, is these two pages are in the same cluster, but they aren’t actually, like we picked the wrong one. The most dire case being a hijacking, we see those and we act really fast because those are just disasters.”
That is when John Mueller stepped in to summarize the answer: “Clustering is basically taking the pages that we think are the same. And then canonicalization is, from those pages, which one is the best one? Is that about right?” To which Allan responded, “Exactly. Yes.”
Then Allan gave this example, “So, for example, rel="canonical" is a bit of a magic factor that crosses both these lines. rel="canonical" will actually first try to put two pages in the same cluster. It may or may not succeed, but if two pages are in the same cluster and there is a rel="canonical" between them, then it’s also a canonical selection signal.”
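To make the two steps easier to picture, here is a toy Python sketch. It is not how Google does this; it only illustrates the split Allan describes. Everything in it is an assumption made for illustration: the `Page` class, the word-overlap `similarity` function, the two thresholds, and the vote-then-shortest-URL tie-break are all hypothetical. The one idea taken from the interview is that rel="canonical" shows up twice: first as a hint that may or may not pull two pages into the same cluster, and then as a signal when picking the canonical within a cluster.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    url: str
    content: str
    rel_canonical: Optional[str] = None  # URL the page declares as canonical, if any


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for real duplicate detection."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cluster_pages(pages: List[Page],
                  dup_threshold: float = 0.9,
                  hint_threshold: float = 0.5) -> List[List[Page]]:
    """Step 1 (clustering): group pages that look like the same page.

    A rel="canonical" link between a page and a cluster lowers the bar for
    joining (hypothetical thresholds), but it does not force the merge.
    """
    clusters: List[List[Page]] = []
    for page in pages:
        for cluster in clusters:
            cluster_urls = {p.url for p in cluster}
            hinted = (page.rel_canonical in cluster_urls
                      or any(p.rel_canonical == page.url for p in cluster))
            threshold = hint_threshold if hinted else dup_threshold
            # Compare against the first page in the cluster as its representative.
            if similarity(page.content, cluster[0].content) >= threshold:
                cluster.append(page)
                break
        else:
            clusters.append([page])
    return clusters


def select_canonical(cluster: List[Page]) -> str:
    """Step 2 (canonicalization): pick one URL from a cluster.

    rel="canonical" links that point inside the cluster count as votes;
    ties go to the shorter URL, then alphabetical order (an assumed rule).
    """
    urls = {p.url for p in cluster}
    votes = {url: 0 for url in urls}
    for p in cluster:
        if p.rel_canonical in urls:
            votes[p.rel_canonical] += 1
    return min(urls, key=lambda u: (-votes[u], len(u), u))


shoes = Page("https://example.com/shoes", "red running shoes size ten in stock")
shoes_ad = Page("https://example.com/shoes?ref=ad",
                "red running shoes size ten in stock",
                rel_canonical="https://example.com/shoes")
hats = Page("https://example.com/hats", "blue wool hat for winter",
            rel_canonical="https://example.com/shoes")

for cluster in cluster_pages([shoes, shoes_ad, hats]):
    print(select_canonical(cluster), "<-", [p.url for p in cluster])
```

In this sketch, the hats page points its rel="canonical" at the shoes page, but the content is too different, so the clustering step keeps it in its own cluster. That is the "may or may not succeed" part. The ad-tagged shoes URL does land in the same cluster as the clean shoes URL, and its rel="canonical" then counts as a vote when the canonical is selected.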
This exchange starts near the beginning of the video, if you want to listen to it.
Forum discussion at X.