Sunday, April 14, 2024

NVIDIA’s supercomputer


HALF EOS’D: H100s are in such high demand that even Nvidia’s own supercomputer feels the squeeze.

UPDATED: Right now, obtaining an Nvidia “Hopper” H100 GPU may be the hardest thing in the world of computing. It appears that half of the Eos supercomputer that Nvidia was showcasing last November, when it ran the MLPerf benchmarks, has been repurposed for another machine, leaving the Eos machine that Nvidia is boasting about today back at its original configuration, less than half of its November peak. Even the company that makes the GPUs has to carefully parcel them out for its own internal uses.

These days, things are strange in the AI datacenter.

Recently, Nvidia released a blog post and a video explaining the Eos machine in depth for the mainstream media. But we kind of envision it as one of those kid’s Cracker Barrel menus: a coloring book with black, green, and yellow crayons.

The Eos machine debuted alongside the Hopper GPU accelerator launch in March 2022 and was deployed later that year. When a High Performance LINPACK run was certified for the November 2023 list, the machine took the ninth spot on the Top 500 supercomputer rankings.

At the same event, Nvidia also released results for the most recent version of the MLPerf machine learning benchmark for datacenter training and inference. It boasted about the Eos system, a machine named after the Greek goddess of dawn, which has 10,752 H100 GPUs connected by 400 Gb/sec Quantum-2 NDR InfiniBand interconnects.

We also quote Nvidia itself: “Nvidia Eos, an AI supercomputer powered by a staggering 10,752 Nvidia H100 Tensor Core GPUs and Nvidia Quantum-2 InfiniBand networking, stands out among many new records and milestones: in just 3.9 minutes, it completed a training benchmark based on a GPT-3 model with 175 billion parameters trained on one billion tokens.”

- Advertisement -

This is the situation: the original Eos design called for 4,608 H100 GPUs, and it appears this is the system Nvidia is referring to today and the one it used to run the LINPACK test for the Top 500 ranking. So what happened to the additional 6,144 H100 accelerators that were reportedly part of the Eos system last fall?

Furthermore, it should be noted that although the machine tested on LINPACK posted a peak FP64 oomph of 188.65 petaflops, only about 3,160 of those 4,608 H100s were likely used to drive the LINPACK benchmark. The top theoretical performance of the initial Eos design, which was unveiled in March 2022, is 275 petaflops at FP64 double precision.

Why all 4,608 GPUs, or indeed the 10,752 GPUs used in the MLPerf tests, were not employed for LINPACK is a bit of a mystery. With a fuller representation of what was dubbed Eos, it might have hit a peak of around 642 petaflops and sustained LINPACK performance of maybe 400 petaflops, which would have secured it the number five spot on the November Top 500 rankings.
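A back-of-the-envelope check of these figures is straightforward. The per-GPU FP64 rate below is an inference from the two numbers stated above (275 petaflops across 4,608 H100s), not a value from an Nvidia spec sheet:

```python
# Sketch of the Eos LINPACK arithmetic, using only figures from the article.
peak_pf_original = 275.0        # petaflops, 4,608-GPU Eos design (March 2022)
gpus_original = 4_608
pf_per_gpu = peak_pf_original / gpus_original   # ~0.0597 PF of FP64 per H100

# The Top 500 run lists a peak of 188.65 petaflops. At the same per-GPU
# rate, how many GPUs does that imply were driving LINPACK?
implied_gpus = 188.65 / pf_per_gpu
print(round(implied_gpus))      # ~3,161 of the 4,608 GPUs

# Scale the original peak up to the 10,752-GPU MLPerf configuration.
peak_pf_full = pf_per_gpu * 10_752
print(round(peak_pf_full))      # ~642 petaflops peak
```

The ~400 petaflop sustained figure then follows from applying the same roughly 64 percent LINPACK efficiency that the smaller run demonstrated.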

It’s funny, isn’t it?

As best as we could tell at the time, the Eos machine was initially designed as follows:

According to the March 2022 announcement, this Eos machine grouped 32 DGX H100 systems, each with eight H100 GPUs, into a SuperPOD, with an NVSwitch memory fabric giving the 256 GPUs a shared memory space. Eighteen of these DGX H100 SuperPODs were then linked through a huge Quantum-2 InfiniBand switch complex to reach the peak 275 petaflops at FP64 double precision and peak 18 exaflops at FP8 quarter precision.
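The GPU arithmetic implied by that description can be sketched as follows, using only the counts from the paragraph above:

```python
# Eos topology counts from the March 2022 announcement.
gpus_per_dgx = 8          # H100 GPUs per DGX H100 system
dgx_per_superpod = 32     # DGX H100 systems per SuperPOD
superpods = 18            # SuperPODs on the InfiniBand switch complex

gpus_per_superpod = gpus_per_dgx * dgx_per_superpod
print(gpus_per_superpod)  # 256 GPUs sharing one NVSwitch memory space

total_gpus = gpus_per_superpod * superpods
print(total_gpus)         # 4608 H100 GPUs in the original Eos design
```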

Based on our calculations, the DGX servers included 2,304 NVSwitch 3 ASICs, while the 18 SuperPODs’ 360 NVSwitch leaf/spine switches contained another 720 NVSwitch 3 ASICs. The two-tier InfiniBand network with 500 InfiniBand switches added 500 more switch ASICs. All told, 3,524 switch ASICs were needed to connect 4,608 H100 GPUs.
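The switch ASIC tally above adds up like so (all counts from the paragraph itself):

```python
# Switch ASIC tally for the original 4,608-GPU Eos design.
nvswitch_in_dgx = 2_304     # NVSwitch 3 ASICs inside the DGX servers
nvswitch_in_fabric = 720    # NVSwitch 3 ASICs in the 360 leaf/spine switches
infiniband_asics = 500      # ASICs in the two-tier InfiniBand network

total_switch_asics = nvswitch_in_dgx + nvswitch_in_fabric + infiniband_asics
print(total_switch_asics)   # 3524 switch ASICs to link 4,608 GPUs
```

That works out to roughly three switch ASICs for every four GPUs, which is why we call the design network-heavy.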

(In terms of raw FP64 flops, the 1,152 Xeon SP host CPUs in the DGX nodes of the Eos machine were not truly statistically relevant.) This is an extremely network-heavy arrangement, as we stated at the time.

It is certainly not the kind of arrangement that cloud builders and hyperscalers like. None of the cloud builders or hyperscalers that we are aware of have used NVSwitch fabrics with SuperPODs. The performance is superior, but there is probably a fairly hefty premium associated with it.

We prefer to dive into the finer points, so we contacted Nvidia to request a reference architecture for the Eos machine. We do not know whether it was built with H100s carrying 80 GB or 96 GB of HBM memory, nor why the machine shrank 57.1 percent from its MLPerf configuration of last November.
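For the record, the 57.1 percent figure is just the shortfall between the two published GPU counts:

```python
# Shrinkage of Eos from its November MLPerf configuration to today's.
mlperf_gpus = 10_752    # GPUs in the November 2023 MLPerf run
current_gpus = 4_608    # GPUs in the original (and current) Eos design

reduction = 1 - current_gpus / mlperf_gpus
print(f"{reduction:.1%}")   # 57.1%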

However, here is one conceivable explanation. As of right now, an H100 costs $30,000 and weighs around 3 pounds, or 48 ounces. That works out to about $625 an ounce. Gold, as we go to press, is selling for nearly $1,990 an ounce, or a little over three times as much. And the H100 has a far wider range of applications than gold does. Nvidia might simply not be able to keep the extra 6,144 H100 GPUs on hand given the high demand from customers.
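The gold comparison, using the article’s own rough figures (and avoirdupois ounces throughout, as the comparison is only meant to be indicative):

```python
# H100 price per ounce versus gold, using the article's figures.
h100_price_usd = 30_000
h100_weight_oz = 3 * 16            # 3 pounds = 48 ounces
usd_per_oz_h100 = h100_price_usd / h100_weight_oz
print(round(usd_per_oz_h100))      # 625 dollars per ounce

gold_usd_per_oz = 1_990            # spot price quoted in the article
print(round(gold_usd_per_oz / usd_per_oz_h100, 1))   # gold costs ~3.2x more
```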

UPDATE: Here is the situation. Contrary to what we took away from the March 2022 keynote, the Eos machine was built using a three-tier InfiniBand network to connect its 576 nodes, rather than an NVSwitch fabric spanning the SuperPODs. The DGX SuperPOD reference architecture is available for viewing here.

Nvidia describes the Eos compute fabric as 448 Quantum-2 InfiniBand switches in a three-tier, fully connected, non-blocking topology that is rail-optimized, allowing for the lowest hop counts. Deep learning workloads see the lowest latency as a consequence. The DGX H100 nodes use NVSwitch internally, but not for internode communication.

Therefore, Eos does not use the DGX H100 256 SuperPOD, the huge shared memory NVSwitch interconnect. This is strange in light of the obvious performance and scalability benefits it provides, which we covered in our initial coverage in March 2022 and which many of us assumed were part of Eos, given the way the keynote moved from the node, to the NVSwitch interconnect, and finally to the Eos machine.

It’s also interesting to note that, as we reported in November of last year, Amazon Web Services is connecting 32 Grace-Hopper GH200 superchips via NVSwitch fabrics into a rackscale shared memory system known as the GH200 NVL32. The company is then connecting the racks via its own EFA2 Ethernet fabric to form massive hybrid CPU-GPU systems known as “Ceiba,” which are designed to handle a variety of HPC and AI workloads.

For some reason, it seems that not even Nvidia is directly interlinking the GPU memories across nodes using the NVSwitch memory fabric. It is a little disappointing that Nvidia was not able to show off this NVSwitch memory fabric with Eos, given our expectations and those of others.

