AWS offers a glimpse of its AI networking infrastructure

For example, AWS recently delivered a new network optimized for generative AI workloads, and it did so in seven months.

“Our first generation UltraCluster network, built in 2020, supported 4,000 graphics processing units, or GPUs, with a latency of eight microseconds between servers. The new network, UltraCluster 2.0, supports more than 20,000 GPUs with 25% latency reduction. It was built in just seven months, and this speed would not have been possible without the long-term investment in our own custom network devices and software,” Kalyanaraman wrote.

UltraCluster 2.0, introduced in 2023 and known internally as the “10p10u” network, delivers tens of petabits per second of throughput with a round-trip time of less than 10 microseconds. “The new network results in at least 15% reduction in time to train a model,” Kalyanaraman wrote.
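As a rough back-of-the-envelope check, the quoted figures can be turned into concrete numbers. The sketch below assumes the 25% latency reduction applies to the eight-microsecond server-to-server figure cited for the first-generation network, which the quote implies but does not state outright. The 100-hour training run is a purely hypothetical baseline chosen for illustration, and the function names are made up for this example.

```python
# Figures quoted in the article.
FIRST_GEN_LATENCY_US = 8.0       # UltraCluster (2020): latency between servers
LATENCY_REDUCTION = 0.25         # UltraCluster 2.0: "25% latency reduction"
FIRST_GEN_GPUS = 4_000           # GPUs supported by the first-generation network
SECOND_GEN_GPUS = 20_000         # "more than 20,000", so treat as a lower bound
TRAINING_TIME_REDUCTION = 0.15   # "at least 15%", so treat as a lower bound


def apply_reduction(baseline: float, reduction: float) -> float:
    """Return the value left after cutting `baseline` by the fraction `reduction`."""
    return baseline * (1.0 - reduction)


# Assumption: the 25% reduction is measured against the 8-microsecond figure.
implied_latency_us = apply_reduction(FIRST_GEN_LATENCY_US, LATENCY_REDUCTION)

# Minimum growth in supported GPUs between the two generations.
gpu_scale_factor = SECOND_GEN_GPUS / FIRST_GEN_GPUS

# Hypothetical baseline, not from the article: a 100-hour training run.
HYPOTHETICAL_TRAINING_HOURS = 100.0
reduced_training_hours = apply_reduction(
    HYPOTHETICAL_TRAINING_HOURS, TRAINING_TIME_REDUCTION
)

print(f"Implied UltraCluster 2.0 latency: {implied_latency_us:.1f} microseconds")
print(f"GPU capacity growth: at least {gpu_scale_factor:.0f}x")
print(f"Hypothetical 100-hour run: at most {reduced_training_hours:.0f} hours")
```

Under those assumptions, the new network's server-to-server latency works out to about six microseconds, and the supported GPU count grows at least fivefold. The sub-10-microsecond figure in the article is a round-trip time, so it is a different measurement and is not directly comparable to the one-way number computed here.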

Cooling tactics, chip designs aim for energy efficiency

Another infrastructure priority at AWS is to continuously improve the energy efficiency of its data centers. Training and running AI models can be extremely energy-intensive.

“AI chips perform mathematical calculations at high speed, making them critical for ML models. They also generate much more heat than other types of chips, so new AI servers that require more than 1,000 watts of power per chip will need to be liquid-cooled. However, some AWS services utilize network and storage infrastructure that does not require liquid cooling, and therefore, cooling this infrastructure with liquid would be an inefficient use of energy,” Kalyanaraman wrote. “AWS’s latest data center design seamlessly integrates optimized air-cooling solutions alongside liquid cooling capabilities for the most powerful AI chipsets, like the NVIDIA Grace Blackwell Superchips. This flexible, multimodal cooling design allows us to extract maximum performance and efficiency whether running traditional workloads or AI/ML models.”

For the last several years, AWS has been designing its own chips, including AWS Trainium and AWS Inferentia, with a goal of making it more energy-efficient to train and run generative AI models. “AWS Trainium is designed to speed up and lower the cost of training ML models by up to 50 percent over other comparable training-optimized Amazon EC2 instances, and AWS Inferentia enables models to generate inferences more quickly and at lower cost, with up to 40% better price performance than other comparable inference-optimized Amazon EC2 instances,” Kalyanaraman wrote.
