Ampere (microarchitecture)

Ampere is the codename for a graphics processing unit (GPU) microarchitecture developed by Nvidia as the successor to both the Volta and Turing architectures, officially announced on May 14, 2020. It is named after the French mathematician and physicist André-Marie Ampère.[1][2] Nvidia announced the next-generation GeForce 30 series of consumer GPUs at a GeForce Special Event on September 1, 2020,[3][4] with more RTX products revealed on January 12, 2021.[5] Nvidia announced the A100 80GB GPU at SC20 on November 16, 2020.[6]


Details

Architectural improvements of the Ampere architecture include the following:

  • CUDA Compute Capability 8.0 for A100 and 8.6 for the GeForce 30 series[7]
  • TSMC's 7 nm FinFET process for A100
  • Custom version of Samsung's 8 nm process (8N) for the GeForce 30 series[8]
  • Third-generation Tensor Cores with FP16, bfloat16, TensorFloat-32 (TF32) and FP64 support and sparsity acceleration[9]
  • Second-generation ray tracing cores; concurrent ray tracing, shading, and compute for the GeForce 30 series
  • High Bandwidth Memory 2 (HBM2) on A100 40GB & A100 80GB
  • GDDR6X memory for GeForce RTX 3090 and 3080
  • Double FP32 cores per SM on GA10x GPUs
  • NVLink 3.0 with a 50Gbit/s per pair throughput[9]
  • PCI Express 4.0 with SR-IOV support (SR-IOV is reserved only for A100)
  • Multi-Instance GPU (MIG) virtualization and GPU partitioning feature in A100 supporting up to seven instances
  • PureVideo feature set K hardware video decoding with AV1 hardware decoding[10] for the GeForce 30 series and feature set J for A100
  • Five NVDEC units for A100
  • New hardware-based five-core JPEG decoder (NVJPG) supporting YUV420, YUV422, YUV444, YUV400 and RGBA; not to be confused with Nvidia NVJPEG, the GPU-accelerated library for JPEG encoding and decoding
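TensorFloat-32 (TF32), listed above among the third-generation Tensor Core formats, keeps FP32's 8-bit exponent (and thus its dynamic range) but reduces the 23-bit mantissa to 10 bits. A minimal Python sketch of that mantissa truncation (illustrative only; the hardware rounds rather than simply truncating, and `to_tf32` is a hypothetical helper name):

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 storage by truncation: keep FP32's 8-bit exponent
    but only the top 10 of its 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= ~((1 << 13) - 1)  # zero the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(to_tf32(3.14159265))  # slightly coarser than float32, same exponent range
```

Because the exponent field is untouched, values representable in FP32 stay representable in TF32; only the precision of the mantissa drops.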

Chips

  • GA100
  • GA102
  • GA104
  • GA106
  • GA107

Comparison of Compute Capability: GP100 vs GV100 vs GA100[11]

GPU Features NVIDIA Tesla P100 NVIDIA Tesla V100 NVIDIA A100
GPU Codename GP100 GV100 GA100
GPU Architecture NVIDIA Pascal NVIDIA Volta NVIDIA Ampere
Compute Capability 6.0 7.0 8.0
Threads / Warp 32 32 32
Max Warps / SM 64 64 64
Max Threads / SM 2048 2048 2048
Max Thread Blocks / SM 32 32 32
Max 32-bit Registers / SM 65536 65536 65536
Max Registers / Block 65536 65536 65536
Max Registers / Thread 255 255 255
Max Thread Block Size 1024 1024 1024
FP32 Cores / SM 64 64 64
Ratio of SM Registers to FP32 Cores 1024 1024 1024
Shared Memory Size / SM 64 KB Configurable up to 96 KB Configurable up to 164 KB
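The per-SM limits in the table interact: a kernel that uses many registers per thread cannot keep the full 2048 threads resident on an SM. A rough Python sketch of that bound (simplified — real allocation is granular per warp and block, and `max_resident_threads` is a hypothetical helper):

```python
# Per-SM resource limits from the table above (same for GP100/GV100/GA100).
MAX_REGISTERS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048
MAX_WARPS_PER_SM = 64
THREADS_PER_WARP = 32

# Consistency check: the warp and thread limits agree.
assert MAX_WARPS_PER_SM * THREADS_PER_WARP == MAX_THREADS_PER_SM

def max_resident_threads(regs_per_thread: int) -> int:
    """Upper bound on resident threads when register usage is the limiter."""
    by_registers = MAX_REGISTERS_PER_SM // regs_per_thread
    return min(by_registers, MAX_THREADS_PER_SM)

print(max_resident_threads(32))   # 2048: the thread limit binds
print(max_resident_threads(255))  # 257: register pressure dominates
```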

Comparison of Precision Support Matrix[12][13]

                 Supported CUDA Core Precisions           Supported Tensor Core Precisions
                 FP16 FP32 FP64 INT1 INT4 INT8 TF32 BF16  FP16 FP32 FP64 INT1 INT4 INT8 TF32 BF16
NVIDIA Tesla P4  No   Yes  Yes  No   No   Yes  No   No    No   No   No   No   No   No   No   No
NVIDIA P100      Yes  Yes  Yes  No   No   No   No   No    No   No   No   No   No   No   No   No
NVIDIA Volta     Yes  Yes  Yes  No   No   Yes  No   No    Yes  No   No   No   No   No   No   No
NVIDIA Turing    Yes  Yes  Yes  No   No   Yes  No   No    Yes  No   No   Yes  Yes  Yes  No   No
NVIDIA A100      Yes  Yes  Yes  No   No   Yes  No   Yes   Yes  No   Yes  Yes  Yes  Yes  Yes  Yes

INT1 = binary; BF16 = bfloat16.
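bfloat16, newly supported across A100's CUDA and Tensor Cores, is simply the top 16 bits of an FP32 value (8-bit exponent, 7-bit mantissa): it keeps FP32's dynamic range while giving up precision, unlike FP16, which overflows above roughly 65504. A Python sketch of the truncation (illustrative; `to_bfloat16` is a hypothetical helper name):

```python
import struct

def to_bfloat16(x: float) -> float:
    """Truncate an FP32 value to bfloat16 by keeping its top 16 bits
    (8-bit exponent, 7-bit mantissa); illustrative truncation, no rounding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# bfloat16 keeps FP32's range (1e38 stays finite) but offers only
# ~2-3 decimal digits of precision.
print(to_bfloat16(1e38))    # still finite
print(to_bfloat16(1.2345))  # coarser than float32
```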

Comparison of Decode Performance

Concurrent streams  H.264 decode (1080p30)  H.265 (HEVC) decode (1080p30)  VP9 decode (1080p30)
V100                16                      22                             22
A100                75                      157                            108

A100 accelerator and DGX A100

The Ampere-based A100 accelerator was announced and released on May 14, 2020.[9] The A100 features 19.5 teraflops of FP32 performance, 6912 CUDA cores, 40GB of graphics memory, and 1.6TB/s of graphics memory bandwidth.[14] The A100 accelerator was initially available only in the third generation of the DGX server, which includes eight A100s.[9] The DGX A100 also includes 15TB of PCIe gen 4 NVMe storage,[14] two 64-core AMD Rome 7742 CPUs, 1 TB of RAM, and Mellanox-powered HDR InfiniBand interconnect. The initial price for the DGX A100 was $199,000.[9]
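The quoted 19.5 teraflops of FP32 follows directly from the core count and the 1410 MHz boost clock, counting one fused multiply-add per core per cycle as two floating-point operations. A quick check in Python:

```python
cuda_cores = 6912
boost_clock_hz = 1410e6       # 1410 MHz boost clock
flops_per_core_per_cycle = 2  # one fused multiply-add = two FLOPs

fp32_tflops = cuda_cores * boost_clock_hz * flops_per_core_per_cycle / 1e12
print(round(fp32_tflops, 1))  # 19.5
```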

Comparison of accelerators used in DGX:[9][15]

Accelerator                   A100 80GB            A100                 V100                 P100
Architecture                  Ampere               Ampere               Volta                Pascal
FP32 CUDA Cores               6912                 6912                 5120                 3584
FP64 Cores (excl. Tensor)     3456                 3456                 2560                 1792
INT32 Cores                   6912                 6912                 5120                 N/A
Boost Clock                   1410 MHz             1410 MHz             1530 MHz             1480 MHz
Memory Clock                  3.2 Gbit/s HBM2      2.4 Gbit/s HBM2      1.75 Gbit/s HBM2     1.4 Gbit/s HBM2
Memory Bus Width              5120-bit             5120-bit             4096-bit             4096-bit
Memory Bandwidth              2039 GB/s            1555 GB/s            900 GB/s             720 GB/s
VRAM                          80 GB                40 GB                16/32 GB             16 GB
Single Precision (FP32)       19.5 TFLOPS          19.5 TFLOPS          15.7 TFLOPS          10.6 TFLOPS
Double Precision (FP64)       9.7 TFLOPS           9.7 TFLOPS           7.8 TFLOPS           5.3 TFLOPS
INT8 (non-Tensor)             N/A                  N/A                  62 TOPS              N/A
INT8 Tensor                   624 TOPS             624 TOPS             N/A                  N/A
INT32                         19.5 TOPS            19.5 TOPS            15.7 TOPS            N/A
FP16                          78 TFLOPS            78 TFLOPS            31.4 TFLOPS          21.2 TFLOPS
FP16 Tensor                   312 TFLOPS           312 TFLOPS           125 TFLOPS           N/A
bfloat16 Tensor               312 TFLOPS           312 TFLOPS           N/A                  N/A
TensorFloat-32 (TF32) Tensor  156 TFLOPS           156 TFLOPS           N/A                  N/A
FP64 Tensor                   19.5 TFLOPS          19.5 TFLOPS          N/A                  N/A
Interconnect                  600 GB/s             600 GB/s             300 GB/s             160 GB/s
GPU                           GA100                GA100                GV100                GP100
L1 Cache Size                 20736 KB (192 KB × 108)  20736 KB (192 KB × 108)  10240 KB (128 KB × 80)  1344 KB (24 KB × 56)
L2 Cache Size                 40960 KB             40960 KB             6144 KB              4096 KB
GPU Die Size                  826 mm²              826 mm²              815 mm²              610 mm²
Transistor Count              54.2B                54.2B                21.1B                15.3B
TDP                           400 W                400 W                300 W / 350 W        300 W
Manufacturing Process         TSMC 7 nm N7         TSMC 7 nm N7         TSMC 12 nm FFN       TSMC 16 nm FinFET+
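The memory-bandwidth row is approximately the per-pin data rate times the bus width, converted from bits to bytes; the small gap versus the listed figures reflects rounding of the effective HBM2 data rate. A quick check in Python (`hbm2_bandwidth_gb_s` is a hypothetical helper name):

```python
def hbm2_bandwidth_gb_s(data_rate_gbit_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth = per-pin data rate x bus width, converted to bytes."""
    return data_rate_gbit_s * bus_width_bits / 8

print(hbm2_bandwidth_gb_s(3.2, 5120))   # 2048, close to the 2039 GB/s listed for A100 80GB
print(hbm2_bandwidth_gb_s(1.75, 4096))  # 896, close to the 900 GB/s listed for V100
```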


Products using Ampere

References

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.