Friday, November 22, 2024

UALink: Google, Microsoft, Meta and More to Develop AI Chip Components

AMD, Broadcom, Cisco, Google, Hewlett Packard Enterprise (HPE), Intel, Meta and Microsoft are combining their expertise to create an open industry standard for an AI chip technology called Ultra Accelerator Link. The standard is intended to enable high-speed, low-latency communication between AI accelerator chips in data centres.

An open standard will advance artificial intelligence/machine learning cluster performance across the industry, meaning that no single firm will disproportionately capitalise on the demand for the latest and greatest AI/ML, high-performance computing and cloud applications.

Notably absent from the so-called UALink Promoter Group are NVIDIA and Amazon Web Services. Indeed, the Promoter Group likely intends for its new interconnect standard to topple the two companies’ dominance in AI hardware and the cloud market, respectively.

The UALink Promoter Group expects to establish a consortium of companies in Q3 of 2024 to manage the ongoing development of the UALink standard; consortium members will be given access to UALink 1.0 at around the same time. A higher-bandwidth version is slated for release in Q4 2024.

SEE: Gartner Predicts Worldwide Chip Revenue Will Gain 33% in 2024

What is the UALink and who will it benefit?

The Ultra Accelerator Link, or UALink, is a defined way of connecting AI accelerator chips in servers to enable faster and more efficient communication between them.

AI accelerator chips, like GPUs, TPUs and other specialised AI processors, are the core of all AI technologies. Each one can perform huge numbers of complex operations simultaneously; however, to handle the heavy workloads necessary for training, running and optimising AI models, they need to be connected. The faster the data transfer between accelerator chips, the faster they can access and process the necessary data and the more efficiently they can share workloads.

The first standard due to be released by the UALink Promoter Group, UALink 1.0, will see up to 1,024 GPU AI accelerators, distributed across one or more server racks, connected to a single Ultra Accelerator Switch. According to the UALink Promoter Group, this will “allow for direct loads and stores between the memory attached to AI accelerators, and generally boost speed while lowering data transfer latency compared to existing interconnect specs.” It will also make it simpler to scale up workloads as demands increase.
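
As a rough illustration of the topology described above, the following Python sketch models a single pod: one switch with up to 1,024 attached accelerators, each of whose memory can be written to and read from through the switch. Every class, method and constant name here is hypothetical and chosen purely for illustration; none of it comes from the UALink specification, whose details have not been published.

```python
MAX_ACCELERATORS_PER_POD = 1024  # limit cited for UALink 1.0


class Accelerator:
    """Toy accelerator; a dictionary stands in for its attached memory."""

    def __init__(self, accel_id):
        self.accel_id = accel_id
        self.memory = {}


class Switch:
    """Toy stand-in for the single switch connecting a pod's accelerators."""

    def __init__(self):
        self.accelerators = {}

    def attach(self, accel):
        # Refuse to grow the pod beyond the stated accelerator limit.
        if len(self.accelerators) >= MAX_ACCELERATORS_PER_POD:
            raise RuntimeError("pod is full")
        self.accelerators[accel.accel_id] = accel

    def store(self, target_id, address, value):
        # Write directly into the memory attached to another accelerator.
        self.accelerators[target_id].memory[address] = value

    def load(self, target_id, address):
        # Read directly from the memory attached to another accelerator.
        return self.accelerators[target_id].memory[address]


switch = Switch()
for i in range(MAX_ACCELERATORS_PER_POD):
    switch.attach(Accelerator(i))

# Any accelerator in the pod could issue this store and load via the switch.
switch.store(target_id=7, address=0x10, value=3.14)
result = switch.load(target_id=7, address=0x10)
```

The sketch captures only the shape of the design (many accelerators, one switch, memory-style access to peers); it says nothing about real bandwidth, latency or protocol behaviour.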

While specifics about UALink have yet to be released, group members said in a briefing on Wednesday that UALink 1.0 would involve AMD’s Infinity Fabric architecture, while the Ultra Ethernet Consortium will cover connecting multiple “pods,” or switches. The standard’s publication will benefit system OEMs, IT professionals and system integrators looking to set up their data centres in a way that will support high speeds, low latency and scalability.

Which companies joined the UALink Promoter Group?

  • AMD.
  • Broadcom.
  • Cisco.
  • Google.
  • HPE.
  • Intel.
  • Meta.
  • Microsoft.

Microsoft, Meta and Google have all spent billions of dollars on NVIDIA GPUs for their respective AI and cloud technologies, including Meta’s Llama models, Google Cloud and Microsoft Azure. However, supporting NVIDIA’s continued hardware dominance does not bode well for their respective futures in the space, so it is wise to eye up an exit strategy.

A standardised UALink switch will allow providers other than NVIDIA to offer compatible accelerators, giving AI companies a range of alternative hardware options upon which to build their systems and avoid vendor lock-in.

This benefits many of the companies in the group that have developed or are developing their own chips. Google has a custom TPU and the Axion processor; Intel has Gaudi; Microsoft has the Maia accelerator and the Cobalt CPU; and Meta has MTIA. These could all be connected using the UALink, which is likely to be provided by Broadcom.

SEE: Intel Vision 2024 Offers New Look at Gaudi 3 AI Chip

Which companies notably have not joined the UALink Promoter Group?

NVIDIA

NVIDIA likely hasn’t joined the group for two main reasons: its market dominance in AI-related hardware and the considerable power that stems from its high valuation.

The firm currently holds an estimated 80% share of the GPU market, but it is also a large player in interconnect technology with NVLink, InfiniBand and Ethernet. NVLink specifically is a GPU-to-GPU interconnect technology, which can connect accelerators within one or multiple servers, just like UALink. It is, therefore, not surprising that NVIDIA does not wish to share that innovation with its closest rivals.

Furthermore, according to its latest financial results, NVIDIA is close to overtaking Apple and becoming the world’s second most valuable company, with its value doubling to more than $2 trillion in just nine months.

The company does not stand to gain much from the standardisation of AI technology, and its current position is already favourable. Time will tell whether NVIDIA’s offering becomes so integral to data centre operations that the first UALink products fail to dethrone it.

SEE: Supercomputing ‘23: NVIDIA High-Performance Chips Power AI Workloads

Amazon Web Services

AWS is the only one of the major public cloud providers not to have joined the UALink Promoter Group. As with NVIDIA, this could be related to its influence as the current cloud market leader and the fact that it is working on its own accelerator chip families, like Trainium and Inferentia. Plus, with a partnership with NVIDIA spanning more than 12 years, AWS might also be content to shelter behind NVIDIA in this arena.

Why are open standards necessary in AI?

Open standards help to prevent disproportionate industry dominance by one firm that happened to be in the right place at the right time. The UALink Promoter Group will allow multiple companies to collaborate on the hardware essential to AI data centres so that no single organisation can take over it all.

This is not the first instance of this kind of revolt in AI; in December 2023, more than 50 organisations partnered to form the global AI Alliance to promote responsible, open-source AI and help prevent closed model developers from gaining too much power.

The sharing of knowledge also works to accelerate advancements in AI performance at an industry-wide scale. The demand for AI compute is continuously growing, and for tech firms to keep up, they require the very best in scale-up capabilities. The UALink standard will provide a “robust, low-latency and efficient scale-up network that can easily add computing resources to a single instance,” according to the group.

Forrest Norrod, executive vice president and general manager of the Data Center Solutions Group at AMD, said in a press release: “The work being done by the companies in UALink to create an open, high performance and scalable accelerator fabric is critical for the future of AI.

“Together, we bring extensive experience in creating large scale AI and high-performance computing solutions that are based on open-standards, efficiency and robust ecosystem support. AMD is committed to contributing our expertise, technologies and capabilities to the group as well as other open industry efforts to advance all aspects of AI technology and solidify an open AI ecosystem.”