At its I/O conference, Google blew the lid off its sixth-generation tensor processing unit (TPU), codenamed Trillium, which is designed to support a new generation of bigger, more capable large language and recommender models.
The TPUs were initially built to accelerate Google's internal machine learning workloads, like those powering Gmail, Google Maps, and YouTube, before the search giant began making the matrix math accelerators available to cloud customers in 2018.
Six generations later, Google’s TPUs are central to the development of the Gemini large language models behind its growing portfolio of generative AI apps and services.
According to Google, Trillium boasts a 4.7x increase in peak compute performance and twice the high-bandwidth memory (HBM) capacity and bandwidth of its earlier TPU v5e design, which we looked at last summer. Google has also doubled the inter-chip interconnect bandwidth.
Going by the v5e's spec sheet, Google's claim of a 4.7x boost suggests the new chip is capable of roughly 926 teraFLOPS at BF16 and 1,847 teraOPS at INT8. That's assuming Google isn't relying on lower-precision INT4 or FP4 datatypes to achieve that score, as Nvidia is with its Blackwell chips.
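For those keeping score at home, the arithmetic is simple enough. Here's a back-of-the-envelope sketch, assuming the 4.7x figure applies to the v5e's published dense BF16 peak of roughly 197 teraFLOPS:

```python
# Back-of-the-envelope estimate: scale the v5e's published dense BF16 peak
# (~197 teraFLOPS) by Google's claimed 4.7x uplift. INT8 throughput on these
# parts is roughly double the BF16 figure.
V5E_PEAK_BF16_TFLOPS = 197
CLAIMED_UPLIFT = 4.7

trillium_bf16 = V5E_PEAK_BF16_TFLOPS * CLAIMED_UPLIFT
print(f"Estimated BF16 peak: {trillium_bf16:.0f} teraFLOPS")    # ~926
print(f"Estimated INT8 peak: {trillium_bf16 * 2:.0f} teraOPS")  # ~1,850, give or take rounding
```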
This would make Trillium about twice as fast as Google's TPU v5p accelerators, which were announced less than six months ago.
According to Google, these performance gains were achieved by increasing the size of the TPU's matrix multiply units (MXUs), the heart of the chip, and boosting the clock speed.
Alongside the MXU improvements, the chip also boasts Google’s third-gen SparseCore, a specialized accelerator designed to process large embeddings commonly found in ranking and recommender systems.
Meanwhile, a doubling of bandwidth and capacity means that we’re looking at 32GB of HBM operating at 1.6TB/s and a chip-to-chip interconnect capable of 3.2 Tbps.
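Those figures fall straight out of doubling the v5e's published spec sheet. A quick sketch, assuming the v5e's rated 16GB of HBM at 819GB/s and 1,600 Gbps of inter-chip interconnect per chip:

```python
# Doubling the v5e's published memory and interconnect figures, per Google's
# claims: 16GB of HBM at 819GB/s, plus a 1,600 Gbps inter-chip interconnect.
V5E_HBM_CAPACITY_GB = 16
V5E_HBM_BANDWIDTH_GBS = 819  # gigabytes per second
V5E_ICI_GBITS = 1_600        # gigabits per second

print(f"HBM capacity:  {V5E_HBM_CAPACITY_GB * 2} GB")                  # 32GB
print(f"HBM bandwidth: ~{V5E_HBM_BANDWIDTH_GBS * 2 / 1000:.1f} TB/s")  # ~1.6TB/s
print(f"Interconnect:  {V5E_ICI_GBITS * 2 / 1000:.1f} Tbps")           # 3.2Tbps
```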
Google claims the higher memory capacity will enable the chip to support bigger models containing more weights and larger key-value caches — the latter being important for handling large numbers of concurrent users.
As a general rule, you need about 1GB of memory for every billion parameters when training or inferencing a model at 8-bit precision. So a 32GB TPUv6 would be able to support models up to about 30 billion parameters — double that when using models quantized to 4-bit precision.
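For a rough sense of where that rule of thumb comes from, here's a minimal sketch that counts weights only; KV cache, activations, and runtime overhead are why the practical ceiling lands nearer 30 billion than 32 billion parameters:

```python
# Rule of thumb: memory needed is roughly parameter count x bytes per parameter.
# Weights only -- KV cache, activations, and runtime overhead all eat into the
# budget, which is why a 32GB part tops out nearer 30 billion params at 8-bit.
def max_params_billions(memory_gb: float, bits_per_param: int) -> float:
    bytes_per_param = bits_per_param / 8
    return memory_gb / bytes_per_param  # 1GB per billion params at 8 bits

print(f"8-bit: ~{max_params_billions(32, 8):.0f} billion parameters")  # ~32B before overhead
print(f"4-bit: ~{max_params_billions(32, 4):.0f} billion parameters")  # ~64B, roughly double
```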
The higher chip-to-chip interconnect bandwidth, meanwhile, means that multiple TPUs can be strung together much more efficiently to support inferencing or training on much larger models.
In terms of scalability, Trillium looks quite similar to the v5e instances it replaces in that it supports pods of up to 256 chips. Multiple pods can then be networked together using Google's Multislice tech and Titanium infrastructure processing units to support training workloads scaling to "tens of thousands of chips."
Despite the boost in performance, Google claims its latest TPU is over 67 percent more energy efficient than the previous generation.
Google adds that several customers, including Nuro, Deep Genomics, and Deloitte, along with its own DeepMind team, will be among the first to put Trillium through its paces training and running their respective models.
However, it remains to be seen when the rest of us will be able to get our hands on Google’s shiny new TPUs. For the moment, Google is soliciting interest via an online form which you can find here. ®