The most powerful NVIDIA datacenter GPUs and Superchips
In addition to its popular consumer graphics cards sold under the GeForce brand, NVIDIA makes energy- and space-efficient GPUs intended to run in datacenters and power the training and inference of generative AI models. Space efficiency means that many GPUs are meant to be installed in a single server; to make that possible, datacenter GPUs do not have fans and rely on the server's fans to cool them.
Datacenter GPUs boast a lot of memory and can be connected using NVLink, a technology that lets GPUs exchange data at very high speed when many of them are used to train a single large model. Not every GPU can be plugged into the motherboard of a PC or a regular server. While some datacenter GPUs have a PCIe interface, usually only a pair of such GPUs can be bridged with NVLink. For its most powerful GPUs, NVIDIA introduced the SXM (Server PCI Express Module) interface, which allows many GPUs to be connected over NVLink.
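If you are renting GPUs, you can verify how they are wired before committing to a long training run. Below is a minimal PyTorch sketch (assuming at least two visible CUDA devices) that checks whether the GPUs can read each other's memory directly; note that peer access can also exist over plain PCIe, so `nvidia-smi topo -m` remains the definitive way to confirm that a given pair is actually linked by NVLink.

```python
import torch

# Check whether each pair of visible GPUs can read each other's memory
# directly (peer-to-peer); on NVLink-connected GPUs this should be True,
# and tensor transfers then bypass host memory entirely.
n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src != dst:
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access {'yes' if ok else 'no'}")
```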
Furthermore, NVIDIA GPUs with the SXM interface do not require an additional power cable: they draw all the power they need, up to 700 W, through the interface itself. Horizontal mounting of SXM GPUs also allows for easier cooling. GPUs with this interface are intended for training large models, while GPUs with the PCIe interface are aimed more at inference workloads but can be used to train smaller models as well.
Understanding the interface type is important whether you rent datacenter GPUs or build your own workstation. If you need more than two GPUs connected by a fast link, your only option is to rent GPUs with the SXM interface. If you would like to install an older datacenter GPU that offers a good price per gigabyte of memory into a regular server or PC, you should look for a GPU with a PCIe interface.
The GPUs in the list below are sorted by architecture, from older to newer. To select a GPU properly, you need to understand your memory requirements: too little memory won't let your model fit, and too much leads to inefficient use of the GPU and of your money. Newer GPUs have more, faster CUDA cores, higher memory bandwidth, and faster computation at smaller precisions such as INT8 or FP8. All GPUs in the list support FP64, also known as double precision, but that is used for scientific computing, whereas AI and deep learning models rely on FP32 (single precision) and smaller types.
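As a back-of-the-envelope check, the memory needed to hold a model is roughly the parameter count multiplied by the bytes per parameter at the chosen precision, plus working overhead. The sketch below illustrates the arithmetic; the 1.2 overhead factor is an assumption for illustration, not a fixed rule.

```python
# Back-of-the-envelope GPU memory estimate for holding a model:
# parameter count times bytes per parameter, plus working overhead
# (the 1.2 factor below is an illustrative assumption, not a rule).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "fp8": 1}

def estimate_vram_gb(num_params: float, dtype: str, overhead: float = 1.2) -> float:
    return num_params * BYTES_PER_PARAM[dtype] * overhead / 1024**3

# A 7B-parameter model: roughly 31 GB in FP32 but about 16 GB in FP16,
# which is why precision support matters when matching a model to a GPU.
for dtype in ("fp32", "fp16", "int8"):
    print(f"{dtype}: {estimate_vram_gb(7e9, dtype):.1f} GB")
```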
NVIDIA Datacenter GPUs
Tesla P100 PCIe 16 GB
This powerhouse GPU, built on the Pascal architecture (released in 2016), is designed for high-performance workloads. It features HBM2 memory with a blazing 732.2 GB/s bandwidth and a massive 4096-bit interface, ensuring ultra-fast data transfer for demanding tasks.
Packed with 3584 CUDA cores, P100 PCIe 16 GB excels in parallel computing but does not include tensor cores, making it focused on traditional workloads rather than AI-specific acceleration. It supports FP16 and FP32 data types, perfect for mixed-precision calculations.
The GPU connects via PCI-Express 3.0 x16 and draws 250W of power, balancing performance with efficiency. It’s an ideal choice for high-compute environments needing a reliable workhorse.
Tesla P100 SXM2 16 GB
This Pascal-architecture GPU, launched in 2016, is built for serious performance in data centers and HPC environments. Equipped with cutting-edge HBM2 memory, it delivers a staggering 732.2 GB/s of memory bandwidth across a 4096-bit interface, ensuring ultra-smooth data throughput for the most demanding workloads.
With 3584 CUDA cores, P100 SXM2 16 GB excels in parallel processing, handling complex computations with ease. While it lacks tensor cores, it still shines for workloads requiring precision, supporting both FP16 and FP32 data types.
Designed for high-efficiency deployment, P100 SXM2 16 GB connects via SXM2, delivering maximum performance at 300W power consumption. This GPU is a robust choice for industries needing dependable, high-performance compute power.
Tesla V100 PCIe 32 GB
This Volta-architecture GPU, released in 2018, is a powerhouse designed for cutting-edge AI, machine learning, and high-performance computing. With HBM2 memory delivering a blazing 897 GB/s of bandwidth across a 4096-bit interface, it ensures lightning-fast data access for demanding workloads.
Boasting 5120 CUDA cores and 640 tensor cores, this GPU excels in both traditional computations and AI acceleration. The first-generation tensor cores enable FP16 support, making V100 PCIe 32 GB ideal for mixed-precision training, alongside native support for INT32 and FP32 data types.
Built for flexible deployment, it uses a PCI-Express 3.0 x16 interface and consumes a modest 250W, striking a balance between performance and power efficiency. Additionally, a 16 GB memory variant of V100 PCIe is available, providing ample capacity for large datasets and complex models.
This GPU is a game-changer for data centers, delivering exceptional versatility and performance.
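The first-generation tensor cores are what make FP16 mixed-precision training practical on the V100. As an illustration, here is a minimal PyTorch training step using automatic mixed precision; the model, data, and hyperparameters are placeholders.

```python
import torch
from torch import nn

# Minimal FP16 mixed-precision training step: matmuls inside autocast run
# on the tensor cores, while the gradient scaler guards against underflow.
model = nn.Linear(1024, 1024).cuda()                  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

with torch.cuda.amp.autocast():                       # ops autocast to FP16
    loss = nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()   # scale loss to keep FP16 gradients representable
scaler.step(optimizer)          # unscale gradients, then apply the update
scaler.update()
optimizer.zero_grad()
```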
Tesla V100 SXM2 32 GB
Built on the advanced Volta architecture and launched in 2018, this GPU is engineered for next-level AI, deep learning, and high-performance computing. Featuring HBM2 memory, it delivers an impressive 898 GB/s bandwidth over a 4096-bit interface, ensuring seamless data transfer for intensive workloads.
With 5120 CUDA cores and 640 tensor cores, V100 SXM2 32 GB is optimized for both traditional computations and AI-specific tasks. Its first-generation tensor cores enable support for FP16, alongside FP32 and INT32 precision, making it ideal for mixed-precision training and inference.
This GPU is designed for maximum performance, using the SXM2 interface and operating at 300W power consumption. A 16 GB memory variant is also available, offering expanded capacity for handling large datasets and complex models.
Whether for AI research, simulation, or enterprise-level HPC, V100 SXM2 32 GB sets a new benchmark for versatility and power.
Tesla V100S PCIe 32 GB
Powered by the advanced Volta architecture, this GPU, released in 2019, is designed for high-performance computing and AI workloads. With HBM2 memory delivering an exceptional 1,134 GB/s bandwidth over a 4096-bit interface, it ensures unparalleled data throughput for even the most demanding tasks.
Packed with 5120 CUDA cores and 640 tensor cores, V100S PCIe 32 GB excels in parallel computing and AI acceleration. The first-generation tensor cores support FP16 precision, while the GPU also handles FP32 and INT32 data types, making it a versatile solution for diverse computational needs.
Built for efficiency, V100S PCIe 32 GB connects via PCI-Express 3.0 x16 and operates at 250W power consumption, offering a balance of performance and energy efficiency.
This Volta GPU is a game-changing tool for data centers, researchers, and AI developers aiming to push computational limits.
NVIDIA A100 40 GB PCIe
Built on the groundbreaking Ampere architecture and launched in 2020, this GPU is designed for exceptional performance across AI, data analytics, and HPC workloads. With HBM2 memory delivering a staggering 1,555 GB/s bandwidth over a 5120-bit interface, it handles data-intensive tasks with ease.
Featuring 6912 CUDA cores and 432 third-generation tensor cores, this GPU brings unmatched parallel processing and AI acceleration. The advanced tensor cores support a wide range of precisions, including TF32, FP16, BF16, INT8, and INT4, alongside FP32, making A100 40 GB PCIe versatile for everything from training deep learning models to running inferencing at scale.
With a PCI Express 4.0 x16 interface and 250W power consumption, it strikes the perfect balance between cutting-edge performance and energy efficiency.
Ideal for modern data centers and research environments, A100 40 GB PCIe sets a new standard for speed and flexibility in computing.
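TF32 is especially convenient on Ampere because, unlike FP16, it typically needs no model changes, only a toggle. A minimal PyTorch sketch of how it is commonly enabled:

```python
import torch

# On Ampere and newer GPUs, let FP32 matmuls and convolutions run on the
# tensor cores in TF32 (FP32 range, reduced mantissa) for a large speedup;
# the surrounding FP32 code needs no other changes.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # executed via TF32 tensor cores on an A100
```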
NVIDIA A100 40 GB SXM4
Designed on the state-of-the-art Ampere architecture and released in 2020, this GPU delivers unmatched power for AI, HPC, and advanced analytics. With HBM2 memory achieving a lightning-fast 1,555 GB/s bandwidth, it ensures seamless performance for data-intensive operations.
Equipped with 6912 CUDA cores and 432 third-generation tensor cores, A100 40 GB SXM4 offers unparalleled computational capability. The tensor cores support a wide spectrum of precisions, including TF32, FP16, BF16, INT8, and INT4, alongside FP32, making it an optimal choice for both training and inferencing across diverse AI workloads.
Built for maximum performance, this GPU leverages the SXM4 interface and operates at 400W, delivering peak throughput for demanding data center deployments.
A100 40 GB SXM4 is the ultimate tool for modern computing, providing the power and flexibility to handle the most challenging tasks with efficiency.
NVIDIA A100 80 GB PCIe
Powered by the cutting-edge Ampere architecture and introduced in 2021, this GPU is a true powerhouse for AI, high-performance computing, and complex analytics. Featuring advanced HBM2e memory, it achieves an astounding 1,935 GB/s bandwidth over a 5120-bit interface, ensuring seamless performance for the most data-intensive tasks.
With 6912 CUDA cores and 432 third-generation tensor cores, A100 80 GB PCIe excels in parallel processing and AI acceleration. The tensor cores support a broad range of precisions, including TF32, FP16, BF16, INT8, and INT4, alongside FP32, making it perfect for both deep learning training and inferencing at scale.
Utilizing a PCI Express 4.0 x16 interface and operating at 300W, it strikes a balance between peak performance and energy efficiency.
A100 80 GB PCIe is ideal for modern data centers and researchers looking to push the boundaries of what’s possible in AI and advanced computing.
NVIDIA A100 80 GB SXM4
Built on the advanced Ampere architecture and launched in 2020, this GPU is designed for top-tier performance in AI, scientific research, and HPC workloads. It features next-generation HBM2e memory, delivering an incredible 2,039 GB/s bandwidth across a 5120-bit interface, making it ideal for processing massive datasets at lightning speed.
With 6912 CUDA cores and 432 third-generation tensor cores, A100 80 GB SXM4 is a computational powerhouse. The tensor cores support a wide range of precisions, including TF32, FP16, BF16, INT8, and INT4, along with FP32, providing unmatched flexibility for diverse AI and deep learning applications.
Designed for maximum performance in high-demand environments, A100 80 GB SXM4 uses the SXM4 interface and operates at 400W, ensuring peak efficiency and reliability in data center deployments.
This GPU sets a new standard for high-performance computing, delivering the power and versatility needed for the most advanced workloads.
NVIDIA H100 PCIe
Meet the cutting-edge Hopper architecture GPU, released in 2023, designed to redefine data center performance and efficiency. Equipped with a massive 80 GB of HBM2e memory, it delivers an astonishing 2,039 GB/s bandwidth over a 5120-bit interface, making it ideal for the most demanding AI, HPC, and large-scale simulation workloads.
Boasting an incredible 14,592 CUDA cores and 456 fourth-generation Tensor Cores, H100 PCIe is optimized for both traditional computations and next-level AI tasks. The Tensor Cores support a versatile range of data types, including FP32, FP16, FP8, INT8, and BF16, as well as TF32, enabling unmatched precision and performance for diverse workloads.
Built for the future, this GPU features a PCI Express Gen5 x16 interface, runs at 350W, and supports advanced cooling solutions to maintain peak performance under heavy workloads. For even greater computational power, two GPUs can be connected via NVLink, creating a powerhouse configuration with 160 GB of memory, perfect for tackling the largest AI models and datasets.
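To illustrate what such a pooled configuration enables, the sketch below splits a model across two GPUs in PyTorch; the layer sizes are placeholders. With NVLink in place, the activation hop between the devices is a direct device-to-device copy rather than a round trip through host memory.

```python
import torch
from torch import nn

# Naive model parallelism: put half of the layers on each GPU so the
# combined weights can exceed what a single card's memory would hold.
class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(8192, 8192), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(8192, 8192), nn.ReLU()).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activation hop between GPUs: over NVLink this is a direct
        # device-to-device copy rather than a trip through the host.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(32, 8192))
print(out.shape)
```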
The Hopper GPU sets a new benchmark for innovation, empowering researchers and enterprises to achieve unprecedented breakthroughs.
NVIDIA H100 NVL
The Hopper architecture GPU, launched in 2023, is a technological marvel, pushing the boundaries of AI, HPC, and advanced data analytics. With an impressive 94 GB of HBM3 memory, it delivers a jaw-dropping 3.9 TB/s bandwidth over a 6,016-bit interface, making it a top-tier choice for tackling the most complex and data-intensive workloads.
Equipped with fourth-generation Tensor Cores, H100 NVL is built to excel in cutting-edge AI applications. It supports an extensive range of data types, including FP32, FP16, FP8, INT8, BF16, and TF32, offering unmatched versatility and precision for training and inference tasks.
The GPU operates with a PCI Express Gen5 x16 interface and a power envelope of 400W, ensuring maximum throughput and efficiency for modern data centers. Advanced cooling solutions ensure optimal performance, even under heavy computational loads.
Designed for the future, the Hopper GPU sets a new standard for high-performance computing, enabling enterprises and researchers to unlock groundbreaking innovations with unparalleled speed and accuracy.
Two H100 NVL GPUs can be connected via NVLink, offering a more powerful configuration with 188 GB of combined memory, optimized for LLM inference.
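One reason the extra memory helps LLM inference is the KV cache, which grows with context length. Here is a rough sketch of the standard estimate (2 tensors x layers x KV heads x head dimension x sequence length x bytes per element); the model shape below is illustrative, loosely resembling a 70B-class transformer.

```python
# Rough per-sequence KV-cache size for a decoder-only transformer:
# 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes per element.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1024**3

# Illustrative 70B-class shape: 80 layers, 8 KV heads, head dimension 128.
# At a 32k context in FP16 the cache alone is ~10 GB per sequence,
# on top of the weights themselves.
print(f"{kv_cache_gb(80, 8, 128, 32_768):.1f} GB")
```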
NVIDIA H100 SXM5
Introducing the powerhouse Hopper architecture GPU, unveiled in 2023, engineered to deliver unparalleled performance for AI, HPC, and advanced computational workloads. With a massive 80 GB of HBM3 memory, it achieves an extraordinary 3,352 GB/s bandwidth over a 5120-bit interface, making it an exceptional choice for tackling the most demanding data challenges.
Packed with 16,896 CUDA cores and 528 fourth-generation Tensor Cores, H100 SXM5 excels in both traditional and AI-driven computations. Its Tensor Cores support a versatile range of data types, including FP32, FP16, FP8, INT8, BF16, and TF32, offering unmatched flexibility for precision-demanding workloads.
Built for maximum performance, this GPU uses the high-bandwidth SXM5 interface and operates at 700W, delivering incredible throughput. Advanced cooling solutions ensure H100 SXM5 performs optimally under the heaviest workloads. For even greater power, multiple GPUs can be seamlessly connected using NVLink, enabling scalable configurations for large-scale projects.
The Hopper GPU redefines what’s possible in modern data centers, empowering innovators to push the limits of AI, simulation, and scientific research.
NVIDIA H200 NVL
Unveiled in 2024, the latest Hopper architecture GPU raises the bar for AI, HPC, and data-intensive workloads. With a colossal 141 GB of HBM3e memory, it delivers an industry-leading 4.8 TB/s bandwidth, setting a new standard for speed and efficiency in high-performance computing.
While details on CUDA cores and Tensor Cores remain under wraps, H200 NVL is built to support advanced AI and deep learning tasks. It handles a versatile range of data types, including FP32, FP16, FP8, INT8, BF16, and TF32, ensuring precision and adaptability for cutting-edge workloads.
Utilizing the high-speed PCI Express Gen5 x16 interface and operating at 600W, this GPU is optimized for peak performance. Advanced cooling solutions are designed to maintain stability under heavy computational loads, making it a reliable choice for next-gen data centers.
The Hopper GPU represents the forefront of innovation, delivering unmatched power and scalability to meet the demands of tomorrow’s most ambitious projects.
NVIDIA H200 SXM5
Debuting in 2024, the latest Hopper architecture GPU redefines the limits of high-performance computing and AI. With an enormous 141 GB of HBM3e memory and a record-breaking 4.8 TB/s bandwidth, this GPU is designed to handle the most demanding workloads with unparalleled speed and efficiency.
While details about CUDA cores and Tensor Cores are yet to be announced, the GPU is built to support advanced AI and HPC applications. Its data type support includes FP32, FP16, FP8, INT8, BF16, and TF32, providing unmatched flexibility and precision for diverse computational tasks.
Harnessing the high-bandwidth SXM5 interface and operating at 700W, this GPU is engineered for top-tier performance. Advanced cooling solutions ensure reliability and stability even under the heaviest workloads, making it an ideal choice for next-generation data center deployments.
The NVIDIA H200 SXM5 GPU sets a new standard in computational power and scalability, paving the way for groundbreaking innovation across AI, research, and enterprise applications.
NVIDIA Superchips
In 2023, in addition to its GPUs, NVIDIA introduced the high-performance Grace CPU. The CPU is used as part of NVIDIA Superchips and offers higher performance per watt than conventional x86-64 CPUs. The Superchips fall into two categories: those that contain a pair of Grace CPUs, and those that combine a Grace CPU with one or two NVIDIA datacenter GPUs. The first Superchips were introduced along with the Hopper GPU architecture and consist of one H100 or H200 GPU and one Grace CPU. The Blackwell architecture introduces Superchips consisting of one Grace CPU and two Blackwell GPUs such as the B200.
The NVIDIA Grace Superchip consists of two NVIDIA Grace CPUs connected via NVLink-C2C. In total, the Superchip boasts 144 Arm Neoverse V2 cores and up to 960 GB of LPDDR5X memory with bandwidth that can reach 1 TB/s, and it draws 500 W of power. The Grace Superchip is intended for High Performance Computing (HPC) applications such as fluid dynamics, numerical weather prediction, and DNA sequencing.
The NVIDIA Grace Hopper GH200 Superchip features up to 480 GB of LPDDR5X CPU memory and up to 72 Arm Neoverse V2 cores. The Superchip variant with HBM3 has 96 GB of GPU memory with a bandwidth of 4 TB/s. Another variant, with HBM3e, has 144 GB of GPU memory with 4.9 TB/s of bandwidth. The CPU and GPU are connected via NVLink-C2C inside the Superchip.
The idea behind a CPU-GPU Superchip such as Grace Hopper is to remove communication, computation, and memory bottlenecks. For example, LLM inference requires a lot of memory to store the model's weights and biases along with intermediate results and data batches. One approach to meeting the memory requirements is to use multiple GPUs. Another is to offload some of the weights to CPU memory. Because the CPU and GPU in the Grace Hopper are connected via NVLink-C2C rather than a PCIe interface, a potential bottleneck for offloading layer data is removed.
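As an illustration of the offloading idea, the PyTorch sketch below keeps a layer's weights in CPU memory and copies them to the GPU only for its forward pass; on a conventional server this copy crosses PCIe, while on Grace Hopper it travels over the much faster NVLink-C2C.

```python
import torch
from torch import nn

# Sketch of weight offloading: the layer lives in CPU memory and its
# weights are copied to the GPU only for the duration of its forward
# pass, freeing GPU memory for activations and other layers.
layer = nn.Linear(8192, 8192)  # stays on the CPU between uses

def offloaded_forward(layer: nn.Module, x: torch.Tensor) -> torch.Tensor:
    layer.to("cuda")   # weights travel host -> GPU (NVLink-C2C on Grace Hopper)
    y = layer(x)
    layer.to("cpu")    # move weights back, releasing GPU memory
    return y

out = offloaded_forward(layer, torch.randn(16, 8192, device="cuda"))
print(out.shape)
```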
If you are interested in building your own AI Deep Learning workstation, I shared my experience in the article below.
Listen to part 1 and part 2 of the podcast generated from this article by NotebookLM.
Resources:
- Tesla P100 Data Center Accelerator (PCIe)
- Tesla P100 Data Center Accelerator (SXM2)
- Pascal Architecture Whitepaper
- NVIDIA V100 Tensor Core GPU
- NVIDIA A100 Tensor Core GPU
- NVIDIA A100 Tensor Core GPU Architecture
- NVIDIA H100 Tensor Core GPU Architecture
- NVIDIA Hopper Architecture In-Depth
- NVIDIA H100 Tensor Core GPU
- NVIDIA H100 NVL GPU
- NVIDIA H200 Tensor Core GPU
- NVIDIA Grace CPU Superchip Whitepaper
- NVIDIA Grace CPU Superchip Datasheet
- NVIDIA Grace Hopper Superchip Architecture Whitepaper
- NVIDIA Grace Hopper Superchip Data Sheet
- NVIDIA Blackwell Architecture
- NVIDIA Blackwell Architecture Technical Brief
- Compare NVIDIA GPUs for sale