Contents

Introduction:

CUDA, which stands for Compute Unified Device Architecture, is a parallel computing platform and programming model developed by NVIDIA. It allows developers to use NVIDIA GPUs (Graphics Processing Units) for general-purpose parallel computing. CUDA enables efficient parallel processing for a wide range of applications beyond traditional graphics rendering, such as scientific simulations, deep learning, and numerical computations.
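
Below is a minimal sketch of what CUDA code looks like: a vector-addition kernel launched from the host. The kernel name, problem size, and launch configuration are illustrative choices, not part of any particular application.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                  // 1M elements
    size_t bytes = n * sizeof(float);

    // Host buffers
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device buffers
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("hc[0] = %f\n", hc[0]);          // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```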

Motivation

There are two main reasons to prefer a GPU over a CPU:

  1. Very high computational throughput: computational throughput = number of cores x clock speed x instructions per cycle.
  2. High memory bandwidth/throughput: memory bandwidth = (memory bus width in bits / 8) x effective memory data rate (a worked example follows below).
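
As a rough worked example of the bandwidth formula, using the Tesla P100 figures quoted later in these notes (a 4096-bit memory interface and an effective HBM2 data rate of about 1430 MT/s):

\[
\text{Memory bandwidth} \approx \frac{4096\ \text{bits}}{8\ \text{bits/byte}} \times 1.43 \times 10^{9}\ \text{transfers/s} \approx 732\ \text{GB/s}
\]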

GPU Architectures

There are many NVIDIA GPU architectures, each with a different compute capability.

| Architecture | Compute Capability |
|--------------|--------------------|
| Tesla        | 1.x                |
| Fermi        | 2.x                |
| Kepler       | 3.x                |
| Maxwell      | 5.x                |
| Pascal       | 6.x                |
| Volta        | 7.x                |
| Turing       | 7.5                |
| Ampere       | 8.x                |

Compute Capability:

Compute Capability is a numerical value assigned to each NVIDIA GPU architecture, and it represents the level of features and capabilities supported by that architecture. It is an important parameter for developers, especially when writing CUDA (Compute Unified Device Architecture) code for GPU programming. Each compute capability introduces new features, instructions, and improvements over previous versions.

Some key aspects related to Compute Capability include:

Numbering Scheme:

Compute Capability is represented in the form of major and minor version numbers (e.g., 6.1 for Pascal, 7.0 for Volta).
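
The major/minor pair can also be read at runtime through the CUDA runtime API; a minimal sketch:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // query device 0
    // On a Tesla P100 this prints "Compute capability: 6.0"
    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```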

Feature Support:

Higher compute capability generally implies support for newer features, instructions, and optimizations.

CUDA Toolkit Compatibility:

Some CUDA Toolkit versions may be designed to support specific compute capabilities. It's important for developers to ensure compatibility between their code and the targeted GPU architecture.
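
For example, code intended for a compute capability 6.0 device is typically compiled with an architecture flag such as nvcc -arch=sm_60, and architecture-dependent code paths can be guarded with the __CUDA_ARCH__ macro. A minimal sketch (exact flag spellings vary across toolkit versions):

```
// Compile for Pascal (compute capability 6.0) with, e.g.:
//   nvcc -arch=sm_60 example.cu
__global__ void add(double *x, double val) {
#if __CUDA_ARCH__ >= 600
    // Native double-precision atomicAdd is available from compute capability 6.0 onwards.
    atomicAdd(x, val);
#else
    // On older architectures a software fallback (e.g. an atomicCAS loop) would be needed.
#endif
}
```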

Performance and Efficiency:

Newer compute capabilities often bring improvements in performance and energy efficiency, allowing developers to take advantage of advanced features and optimizations.

Tensor Cores:

In more recent architectures like Volta, Turing, and Ampere, special hardware units called Tensor Cores were introduced to accelerate deep learning tasks.

Pascal Architecture (chep@iisc cluster)

We use 10 GPU nodes.
The NVIDIA Tesla P100 is a high-performance GPU designed for data center and high-performance computing (HPC) applications. The Tesla P100 PCIe variant has a compute capability of 6.0, a numerical label that indicates the features and capabilities supported by the GPU architecture.

For the Tesla P100 PCIe 16GB:

Compute Capability: 6.0
The compute capability is represented in the form of major and minor version numbers, where the major version indicates significant architectural changes and the minor version represents incremental improvements within that architecture.

Here's a breakdown of the compute capability version 6.0:

Major Version (6): Indicates the Pascal architecture.
Minor Version (0): Specific incremental improvements and features within the Pascal architecture.

Developers often use compute capability information when programming GPUs with frameworks like CUDA. The compute capability helps ensure compatibility and optimization of code for specific GPU architectures.

The Pascal architecture is a GPU (Graphics Processing Unit) architecture developed by NVIDIA and was introduced in 2016. It succeeded the Maxwell architecture and brought several improvements in terms of performance, energy efficiency, and new features. Pascal GPUs were designed for a variety of applications, including gaming, professional graphics, and high-performance computing.

Number of cores (Pascal)

The NVIDIA Tesla P100 GPU features a number of streaming multiprocessors (SMs), and each SM contains a certain number of CUDA cores. To determine the number of CUDA cores on a Tesla P100 GPU, you can refer to the specifications provided by NVIDIA.

Here are the key details for the Tesla P100 PCIe 16GB:

CUDA Cores per SM: 64
Number of SMs: 56

Now, you can calculate the total number of CUDA cores by multiplying the number of CUDA cores per SM by the number of SMs:

Total CUDA Cores = CUDA Cores per SM x Number of SMs

For the Tesla P100:

Total CUDA Cores = 64 x 56 = 3584 CUDA Cores

So, the Tesla P100 PCIe 16GB has a total of 3,584 CUDA cores. Keep in mind that this information is specific to the Tesla P100 GPU architecture, and different GPU models may have different configurations. Always refer to the official specifications provided by the GPU manufacturer for accurate and up-to-date information.

Use nvidia-smi on the node for details (nvidia-smi -h lists the available options).
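
The same information can also be queried programmatically. A minimal sketch using the CUDA runtime API; note that the API reports the number of SMs but not the cores per SM, so the 64 cores/SM figure for compute capability 6.0 quoted above is hard-coded here as an assumption:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // CUDA cores per SM is not exposed by the API; 64 is the value for
    // compute capability 6.0 (Pascal P100) quoted in the text above.
    const int coresPerSM = 64;
    printf("SMs: %d\n", prop.multiProcessorCount);                        // 56 on a P100
    printf("Total CUDA cores: %d\n", prop.multiProcessorCount * coresPerSM); // 3584
    return 0;
}
```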

Here are key features and characteristics of the Pascal architecture:

16nm FinFET Process Technology:

Pascal GPUs were manufactured using a 16nm FinFET process technology, a significant advancement over the 28nm process used in the Maxwell architecture. This transition to a smaller process node contributed to improved energy efficiency and performance.

[Figure: Pascal 16nm FinFET technology]

CUDA Cores and Streaming Multiprocessors (SMs):

Pascal GPUs featured an increase in the number of CUDA cores compared to Maxwell. CUDA cores are the processing units responsible for executing parallel tasks. The architecture maintained the Streaming Multiprocessor (SM) design, with each SM containing multiple CUDA cores. (For chep@iisc, we have Pascal with compute capability 6.0, which has 56 SMs per node, each with 64 cores.)

Simultaneous Multi-Projection (SMP):

Pascal introduced Simultaneous Multi-Projection, a feature designed to enhance virtual reality (VR) and multi-display experiences. SMP enables more efficient rendering for multiple projections, improving the visual quality and realism in VR applications.

High Bandwidth Memory 2 (HBM2):

Some Pascal GPUs, particularly those in the professional and data center-oriented Tesla and Quadro series, adopted High Bandwidth Memory 2 (HBM2). HBM2 provides higher memory bandwidth compared to traditional GDDR5 memory, which is beneficial for memory-intensive applications.

Unified Memory and Page Migration Engine:

Pascal continued to support Unified Memory, allowing developers to access both CPU and GPU memory seamlessly. The introduction of the Page Migration Engine enhanced the efficiency of data transfers between CPU and GPU memory.
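
A minimal sketch of Unified Memory in CUDA C++: the same pointer is used on the host and inside the kernel, and on Pascal the Page Migration Engine moves the pages on demand. The kernel and sizes are illustrative choices.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));    // one allocation, visible to CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;     // written by the CPU

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f); // read/written by the GPU
    cudaDeviceSynchronize();                     // wait before the CPU touches x again

    printf("x[0] = %f\n", x[0]);                 // expect 2.0
    cudaFree(x);
    return 0;
}
```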

Performance Considerations:

Pascal GPUs demonstrated improvements in performance across various applications, including gaming, content creation, and parallel computing. The architecture was designed to provide a balance of computational power, memory bandwidth, and energy efficiency.

Enhanced Deep Learning Performance:

Pascal GPUs showed notable improvements in deep learning performance. While not as specialized as later architectures like Volta and Turing, Pascal GPUs were still capable of handling deep learning workloads efficiently.

NVLink:

Pascal introduced NVLink, a high-speed interconnect technology primarily used in data center and high-performance computing (HPC) environments. NVLink enables faster communication between multiple GPUs, supporting scalable and efficient parallel processing.
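
A sketch of how multi-GPU peer access is enabled with the CUDA runtime. Whether the traffic actually travels over NVLink depends on how the GPUs in the node are connected; on PCIe-only systems the same calls fall back to PCIe.

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices < 2) { printf("Need at least two GPUs.\n"); return 0; }

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 address device 1?
    printf("Peer access 0 -> 1: %s\n", canAccess ? "yes" : "no");

    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);        // second argument is a reserved flag, must be 0
        // cudaMemcpyPeer and direct loads/stores between the GPUs now use the
        // fastest available link (NVLink where present, otherwise PCIe).
    }
    return 0;
}
```
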
It's important to note that the Pascal architecture was succeeded by the Volta architecture and later by Turing and Ampere. Each subsequent architecture introduced new features and optimizations, especially in the context of AI and deep learning, but Pascal GPUs remain relevant for various applications.

Pascal (mathematician / physicist)

The pascal is also a unit of pressure, named after Blaise Pascal: 1 atm is approximately \(10^5\) pascal, where one pascal is 1 \(\frac{N}{m^2}\).

CPU and GPU Differences

| Feature          | CPU                                  | GPU                               |
|------------------|--------------------------------------|-----------------------------------|
| Function         | General-purpose processing           | Specialized for parallel tasks    |
| Architecture     | Few cores                            | Many cores                        |
| Parallelism      | Few threads                          | Many thousands of threads         |
| Memory hierarchy | Complex, low latency                 | High-throughput parallel access   |
| Flexibility      | Versatile, handles various workloads | Specialized for specific tasks    |
| Execution type   | Sequential                           | Parallel                          |

Please note that this table provides a high-level overview, and the specifics can vary based on the architecture and design of individual CPUs and GPUs. Additionally, advancements in technology may lead to changes in the characteristics of future CPU and GPU architectures.

CPU Beats GPU

| Task                        | CPU Advantage                                      |
|-----------------------------|----------------------------------------------------|
| Single-threaded work        | Higher clock speeds for individual tasks           |
| Sequential algorithms       | Better for step-by-step execution                  |
| Task switching              | Efficient handling of diverse concurrent tasks     |
| General-purpose computing   | Versatility for various computing tasks            |
| Operating system operations | Integral role in managing system operations        |
| Data serialization          | Sequential processing for dependent data           |
| Low-latency access          | Complex memory hierarchy for quick data retrieval  |

It's important to note that the superiority of CPUs in these scenarios does not imply that GPUs cannot perform these tasks. Rather, CPUs are typically more optimized for these specific types of workloads. The choice between CPU and GPU often depends on the nature of the task and the specific requirements of the application.

Current CPU Architecture

I am using Rocky Linux 9.2 in a Podman container, which by default runs me as the root user (superuser). The OS is built for x86_64, which means the CPU cores (20 cores on the chep cluster) are x86_64 processors.

Introduction

The term "x86_64" refers to a particular computer architecture that is widely used in modern CPUs. It is an extension of the original x86 architecture. Here's a breakdown of what "x86_64" means:

x86 Architecture:

The term "x86" originally referred to a family of instruction set architectures for certain processors, particularly those developed by Intel. The name "x86" is derived from the 86 in "8086," which was one of the early processors in this family.
64-bit Extension:

The "64" in "x86_64" indicates that this architecture is an extension to the original x86 architecture to support 64-bit computing. In the context of CPUs, a 64-bit architecture allows the processor to handle larger amounts of memory and perform computations involving 64-bit integers.

Compatibility:

x86_64 CPUs are designed to be backward compatible with the earlier 32-bit x86 architecture. This means they can run software that was written for 32-bit x86 processors. However, they also introduce additional features and capabilities associated with 64-bit computing.

Memory Addressing:

One of the significant advantages of the x86_64 architecture is its ability to address larger amounts of memory. While 32-bit architectures are limited to addressing 4 GB of RAM directly, x86_64 allows for a much larger address space, theoretically supporting up to 18.4 million TB (terabytes) of RAM.

Instruction Set:

The x86_64 architecture retains the x86 instruction set but introduces new 64-bit instructions. This allows the CPU to handle larger data sizes and perform more complex computations.

Widespread Adoption:

x86_64 has become the dominant architecture for personal computers and servers. Most modern desktop and laptop CPUs, as well as many server CPUs, use the x86_64 architecture.

Commonly Referred to as AMD64:

While initially introduced by AMD (Advanced Micro Devices), the x86_64 architecture is now commonly referred to as AMD64. Intel later adopted the same architecture and used the term "Intel 64" to refer to their implementation.

In summary, x86_64 is an extension of the x86 architecture that brings 64-bit computing capabilities to CPUs. It has become the standard architecture for most desktop, laptop, and server processors due to its backward compatibility and support for larger memory addressing.

64bit computing

64-bit computing refers to the use of processors that have a data bus, address bus, or both that are 64 bits wide. This specification refers to the amount of data that a processor can handle in a single instruction or the size of memory addresses. Here are the key aspects of 64-bit computing:

Data Bus Width:

In a 64-bit processor, the data bus is 64 bits wide. This means that the processor can handle 64 bits of data in a single instruction. The wider data bus allows for more data to be processed simultaneously, potentially leading to improved performance for certain types of computations.
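
A quick way to see the 64-bit data model from a program is a small C++ check (this compiles fine with g++ or nvcc; on Linux x86_64 the LP64 model makes pointers, size_t, and long 8 bytes wide):

```
#include <cstdio>

int main() {
    // On an x86_64 build, pointers and size_t are 8 bytes (64 bits) wide,
    // versus 4 bytes on a 32-bit x86 build.
    printf("sizeof(void*)  = %zu bytes\n", sizeof(void*));
    printf("sizeof(size_t) = %zu bytes\n", sizeof(size_t));
    printf("sizeof(long)   = %zu bytes\n", sizeof(long));   // 8 on Linux x86_64 (LP64)
    return 0;
}
```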

Memory Addressing:

One of the significant advantages of 64-bit computing is the ability to address larger amounts of memory. With a 64-bit address bus, a processor can theoretically address up to 2^64 individual memory locations. This translates to an enormous address space, allowing systems to support much larger amounts of RAM compared to 32-bit systems.

Increased Memory Capacity:

64-bit processors enable systems to access and use more RAM. While 32-bit systems are limited to addressing 4 GB of RAM directly, 64-bit systems can address several terabytes of RAM. This is particularly beneficial for memory-intensive applications and tasks.

The maximum amount of RAM that an x86_64 (64-bit) Intel processor can manage depends on various factors, including the specific processor architecture, the motherboard, and the operating system. However, in theory, the x86_64 architecture supports a vast amount of addressable memory.

Performance Improvements:

Certain types of applications, especially those dealing with large datasets or complex calculations, may experience performance improvements when running on a 64-bit system. The wider data bus allows for more efficient processing of larger chunks of data in each instruction.

Backward Compatibility:

64-bit processors are designed to be backward compatible with 32-bit software. This means that they can run both 32-bit and 64-bit applications. However, to fully utilize the benefits of 64-bit computing, applications need to be compiled as 64-bit versions.

Compatibility with 32-bit Systems:

While 64-bit processors can run 32-bit software, the reverse is not true. A 32-bit processor cannot directly run 64-bit software. Therefore, systems with 64-bit processors can accommodate both 32-bit and 64-bit applications.

Security Enhancements:

64-bit computing allows for certain security enhancements, including the ability to implement more advanced security features at the hardware level. This can contribute to improved system security.

In summary, 64-bit computing provides advantages in terms of larger memory addressing, increased RAM capacity, and potential performance improvements for certain types of applications. It has become the standard for modern desktops, laptops, and servers, allowing for more powerful and capable computing systems.

Here are the theoretical limits for the x86_64 architecture:

Theoretical Maximum Addressable Memory:

An x86_64 processor can theoretically address \(2^{64}\) memory locations (bytes).
This translates to 18.4 million terabytes (TB) of addressable memory.
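
For completeness, the arithmetic behind that figure (assuming byte-addressable memory):

\[
2^{64}\ \text{bytes} \approx 1.84 \times 10^{19}\ \text{bytes} = 1.84 \times 10^{7}\ \text{TB} \approx 18.4\ \text{million TB}
\]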

Practical Limits:

While the theoretical limit is extremely high, the practical limit is determined by other factors, such as the limitations of current memory technology, motherboard design, and the operating system.

Current Real-World Limits:

Common x86_64 desktop and server systems currently support up to several terabytes of RAM.
High-end servers and workstations may support memory configurations ranging from 256 GB to multiple terabytes.

Operating System and Hardware Considerations:

The maximum amount of supported RAM can also depend on the operating system. For example, Windows and Linux have different limitations.
The motherboard architecture, chipset, and BIOS/UEFI firmware play crucial roles in determining the maximum supported memory.

Intel Specifics:

Different Intel processors may have varying levels of support for memory. Some high-end Intel processors designed for servers or workstations may support larger amounts of RAM.

ECC (Error-Correcting Code) Memory:

ECC memory support can also affect the maximum amount of RAM a system can handle. ECC memory is commonly used in server environments for enhanced reliability but may have specific limitations.

Future Developments:

As technology evolves, new processor architectures and memory technologies may increase the practical limits of addressable memory.

Properties of Our GPU (Tesla P100)

| Property                            | Description                                                |
|-------------------------------------|------------------------------------------------------------|
| GPU Architecture                    | Pascal                                                     |
| CUDA Cores                          | 3,584                                                      |
| Memory Size                         | 16 GB HBM2                                                 |
| Memory Interface                    | 4096-bit                                                   |
| Memory Bandwidth                    | 732 GB/s                                                   |
| Base Clock                          | Approximately 1328 MHz                                     |
| Boost Clock                         | Varies (boosts higher under certain conditions)            |
| Compute Capability                  | 6.0                                                        |
| Streaming Multiprocessors (SMs)     | 56                                                         |
| Maximum Concurrent Warps per SM     | 64 (64 warps x 32 threads = 2,048 resident threads per SM) |
| Total Threads per GPU               | 114,688                                                    |
| GPUDirect (RDMA) Support            | Yes (for certain configurations)                           |
| Memory ECC (Error-Correcting Code)  | Yes                                                        |
| L2 Cache Size                       | 4 MB (shared by all SMs)                                   |
| Transistors                         | Approximately 15.3 billion                                 |
| Manufacturing Process               | 16 nm                                                      |
| Form Factor                         | PCIe (available in various form factors)                   |
| TDP (Thermal Design Power)          | 250 W (PCIe variant); up to 300 W (SXM2)                   |
| NVIDIA GPU Boost                    | Yes                                                        |
| GPU Virtualization (vGPU)           | Yes (with NVIDIA GRID)                                     |
| NVIDIA NVLink                       | Yes (for certain configurations)                           |
| NVIDIA GPUDirect RDMA               | Yes                                                        |

Please note that the total number of threads per GPU quoted above is the maximum number of threads resident at once, based on the number of SMs, the warps per SM, and the 32 threads per warp.

Number of threads

The Tesla P100, a Pascal-based GPU with compute capability 6.0, has 56 SMs. Each SM has 64 CUDA cores and can keep up to 64 warps (32 threads each) resident at a time. So for a given GPU node we have 56 x 64 x 32 = 114,688 threads resident in parallel per node (see the launch sketch below).

Threads per warp = 32 (1 warp)
Resident threads per SM = 64 x 32 = 2,048 (64 warps)
Resident threads per node = 56 x 2,048 = 114,688 threads
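
As a sketch, the launch below asks for exactly that many threads; the hardware then distributes the blocks across the 56 SMs. The block size is an illustrative choice, not a requirement.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;   // every thread writes its own global index
}

int main() {
    // 56 SMs x 2048 resident threads per SM = 114,688 threads in flight.
    const int threadsPerBlock = 256;                             // 8 warps per block
    const int totalThreads    = 56 * 2048;                       // 114,688
    const int blocks          = totalThreads / threadsPerBlock;  // 448 blocks

    int *out;
    cudaMalloc(&out, totalThreads * sizeof(int));
    busyKernel<<<blocks, threadsPerBlock>>>(out);
    cudaDeviceSynchronize();

    printf("Launched %d blocks x %d threads = %d threads\n",
           blocks, threadsPerBlock, blocks * threadsPerBlock);
    cudaFree(out);
    return 0;
}
```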

The NVIDIA Tesla P100 has a base clock of 1190 MHz and a boost clock of 1329 MHz. It also has a memory clock of 715 MHz (1430 Mbps effective).
The Tesla P100 is a dual-slot card that draws power from a single 8-pin power connector. It has a maximum power draw of 250 W.
The Tesla P100 was released on April 5, 2016. It uses the Pascal architecture and has 3,584 CUDA cores.

Number of instructions per second

Given the boost clock of 1329 MHz, in one second the node can execute up to \(1329 \times 10^{6} \times 114688 \approx 1.524 \times 10^{14}\) instructions per second (IPS), assuming each resident thread retires one instruction per clock cycle. Since FLOPS = IPS / (number of instructions per floating-point operation), this corresponds to roughly \(1.524 \times 10^{13}\) FLOPS if we assume 10 instructions per floating-point operation.

FLOPS: Floating Point Operations Per Second
NFLO: Number of Floating Point Operations
IPS : Instructions Per Second
