Contents

Introduction:

CUDA, which stands for Compute Unified Device Architecture, is a parallel computing platform and programming model developed by NVIDIA. It allows developers to use NVIDIA GPUs (Graphics Processing Units) for general-purpose parallel computing. CUDA enables efficient parallel processing for a wide range of applications beyond traditional graphics rendering, such as scientific simulations, deep learning, and numerical computations.
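
Below is a minimal sketch of what CUDA code looks like: a vector-addition kernel launched from the host. The kernel name, problem size, and launch configuration are illustrative choices, not part of any particular application.

```
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                  // 1M elements
    size_t bytes = n * sizeof(float);

    // Host buffers
    float *ha = new float[n], *hb = new float[n], *hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device buffers
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("hc[0] = %f\n", hc[0]);          // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```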

Motivation

There are two main reasons to prefer a GPU over a CPU:

  1. Very high computational throughput: computational throughput = number of cores x clock speed x instructions per cycle.
  2. High memory bandwidth/throughput: memory bandwidth = (memory bus width in bits / 8) x effective memory data rate (a worked example follows below).
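
As a rough worked example of the bandwidth formula, using the Tesla P100 figures quoted later in these notes (a 4096-bit memory interface and an effective HBM2 data rate of about 1430 MT/s):

\[
\text{Memory bandwidth} \approx \frac{4096\ \text{bits}}{8\ \text{bits/byte}} \times 1.43 \times 10^{9}\ \text{transfers/s} \approx 732\ \text{GB/s}
\]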

GPU Architectures

There are many NVIDIA GPU architectures, each with a different compute capability.

| Architecture | Compute Capability |
|--------------|--------------------|
| Tesla        | 1.x                |
| Fermi        | 2.x                |
| Kepler       | 3.x                |
| Maxwell      | 5.x                |
| Pascal       | 6.x                |
| Volta        | 7.x                |
| Turing       | 7.5                |
| Ampere       | 8.x                |

Compute Capability:

Compute Capability is a numerical value assigned to each NVIDIA GPU architecture, and it represents the level of features and capabilities supported by that architecture. It is an important parameter for developers, especially when writing CUDA (Compute Unified Device Architecture) code for GPU programming. Each compute capability introduces new features, instructions, and improvements over previous versions.

Some key aspects related to Compute Capability include:

Numbering Scheme:

Compute Capability is represented in the form of major and minor version numbers (e.g., 6.1 for Pascal, 7.0 for Volta).
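
The major/minor pair can also be read at runtime through the CUDA runtime API; a minimal sketch:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // query device 0
    // On a Tesla P100 this prints "Compute capability: 6.0"
    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}
```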

Feature Support:

Higher compute capability generally implies support for newer features, instructions, and optimizations.

CUDA Toolkit Compatibility:

Some CUDA Toolkit versions may be designed to support specific compute capabilities. It's important for developers to ensure compatibility between their code and the targeted GPU architecture.
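
For example, code intended for a compute capability 6.0 device is typically compiled with an architecture flag such as nvcc -arch=sm_60, and architecture-dependent code paths can be guarded with the __CUDA_ARCH__ macro. A minimal sketch (exact flag spellings vary across toolkit versions):

```
// Compile for Pascal (compute capability 6.0) with, e.g.:
//   nvcc -arch=sm_60 example.cu
__global__ void add(double *x, double val) {
#if __CUDA_ARCH__ >= 600
    // Native double-precision atomicAdd is available from compute capability 6.0 onwards.
    atomicAdd(x, val);
#else
    // On older architectures a software fallback (e.g. an atomicCAS loop) would be needed.
#endif
}
```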

Performance and Efficiency:

Newer compute capabilities often bring improvements in performance and energy efficiency, allowing developers to take advantage of advanced features and optimizations.

Tensor Cores:

In more recent architectures like Volta, Turing, and Ampere, special hardware units called Tensor Cores were introduced to accelerate deep learning tasks.

Pascal Architecture (chep@iisc cluster)

We use 10 GPU nodes.
The NVIDIA Tesla P100 is a high-performance GPU designed for data center and high-performance computing (HPC) applications. The Tesla P100 PCIe variant has a compute capability of 6.0, a numerical label that indicates the features and capabilities supported by the GPU architecture.

For the Tesla P100 PCIe 16GB:

Compute Capability: 6.0
The compute capability is represented in the form of major and minor version numbers, where the major version indicates significant architectural changes and the minor version represents incremental improvements within that architecture.

Here's a breakdown of the compute capability version 6.0:

Major Version (6): Indicates the Pascal architecture.
Minor Version (0): Specific incremental improvements and features within the Pascal architecture.

Developers often use compute capability information when programming GPUs with frameworks like CUDA. The compute capability helps ensure compatibility and optimization of code for specific GPU architectures.

The Pascal architecture is a GPU (Graphics Processing Unit) architecture developed by NVIDIA and was introduced in 2016. It succeeded the Maxwell architecture and brought several improvements in terms of performance, energy efficiency, and new features. Pascal GPUs were designed for a variety of applications, including gaming, professional graphics, and high-performance computing.

Number of cores (Pascal)

The NVIDIA Tesla P100 GPU features a number of streaming multiprocessors (SMs), and each SM contains a certain number of CUDA cores. To determine the number of CUDA cores on a Tesla P100 GPU, you can refer to the specifications provided by NVIDIA.

Here are the key details for the Tesla P100 PCIe 16GB:

CUDA Cores per SM: 64
Number of SMs: 56

Now, you can calculate the total number of CUDA cores by multiplying the number of CUDA cores per SM by the number of SMs:

Total CUDA Cores = CUDA Cores per SM x Number of SMs

For the Tesla P100:

Total CUDA Cores = 64 x 56 = 3584 CUDA Cores

So, the Tesla P100 PCIe 16GB has a total of 3,584 CUDA cores. Keep in mind that this information is specific to the Tesla P100 GPU architecture, and different GPU models may have different configurations. Always refer to the official specifications provided by the GPU manufacturer for accurate and up-to-date information.

Use nvidia-smi on the node for details (nvidia-smi -h lists the available options).
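
The same information can also be queried programmatically. A minimal sketch using the CUDA runtime API; note that the API reports the number of SMs but not the cores per SM, so the 64 cores/SM figure for compute capability 6.0 quoted above is hard-coded here as an assumption:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // CUDA cores per SM is not exposed by the API; 64 is the value for
    // compute capability 6.0 (Pascal P100) quoted in the text above.
    const int coresPerSM = 64;
    printf("SMs: %d\n", prop.multiProcessorCount);                        // 56 on a P100
    printf("Total CUDA cores: %d\n", prop.multiProcessorCount * coresPerSM); // 3584
    return 0;
}
```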

Here are key features and characteristics of the Pascal architecture:

16nm FinFET Process Technology:

Pascal GPUs were manufactured using a 16nm FinFET process technology, a significant advancement over the 28nm process used in the Maxwell architecture. This transition to a smaller process node contributed to improved energy efficiency and performance.

[Figure: Pascal 16nm FinFET technology]

CUDA Cores and Streaming Multiprocessors (SMs):

Pascal GPUs featured an increase in the number of CUDA cores compared to Maxwell. CUDA cores are the processing units responsible for executing parallel tasks. The architecture maintained the Streaming Multiprocessor (SM) design, with each SM containing multiple CUDA cores. (For chep@iisc, we have Pascal with compute capability 6.0, which has 56 SMs per node, each with 64 cores.)

Simultaneous Multi-Projection (SMP):

Pascal introduced Simultaneous Multi-Projection, a feature designed to enhance virtual reality (VR) and multi-display experiences. SMP enables more efficient rendering for multiple projections, improving the visual quality and realism in VR applications.

High Bandwidth Memory 2 (HBM2):

Some Pascal GPUs, particularly those in the professional and data center-oriented Tesla and Quadro series, adopted High Bandwidth Memory 2 (HBM2). HBM2 provides higher memory bandwidth compared to traditional GDDR5 memory, which is beneficial for memory-intensive applications.

Unified Memory and Page Migration Engine:

Pascal continued to support Unified Memory, allowing developers to access both CPU and GPU memory seamlessly. The introduction of the Page Migration Engine enhanced the efficiency of data transfers between CPU and GPU memory.
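
A minimal sketch of Unified Memory in CUDA C++: the same pointer is used on the host and inside the kernel, and on Pascal the Page Migration Engine moves the pages on demand. The kernel and sizes are illustrative choices.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));    // one allocation, visible to CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;     // written by the CPU

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f); // read/written by the GPU
    cudaDeviceSynchronize();                     // wait before the CPU touches x again

    printf("x[0] = %f\n", x[0]);                 // expect 2.0
    cudaFree(x);
    return 0;
}
```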

Performance Considerations:

Pascal GPUs demonstrated improvements in performance across various applications, including gaming, content creation, and parallel computing. The architecture was designed to provide a balance of computational power, memory bandwidth, and energy efficiency.

Enhanced Deep Learning Performance:

Pascal GPUs showed notable improvements in deep learning performance. While not as specialized as later architectures like Volta and Turing, Pascal GPUs were still capable of handling deep learning workloads efficiently.

NVLink:

Pascal introduced NVLink, a high-speed interconnect technology primarily used in data center and high-performance computing (HPC) environments. NVLink enables faster communication between multiple GPUs, supporting scalable and efficient parallel processing.
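
A sketch of how multi-GPU peer access is enabled with the CUDA runtime. Whether the traffic actually travels over NVLink depends on how the GPUs in the node are connected; on PCIe-only systems the same calls fall back to PCIe.

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    if (nDevices < 2) { printf("Need at least two GPUs.\n"); return 0; }

    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 address device 1?
    printf("Peer access 0 -> 1: %s\n", canAccess ? "yes" : "no");

    if (canAccess) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);        // second argument is a reserved flag, must be 0
        // cudaMemcpyPeer and direct loads/stores between the GPUs now use the
        // fastest available link (NVLink where present, otherwise PCIe).
    }
    return 0;
}
```
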
It's important to note that the Pascal architecture was succeeded by the Volta architecture and later by Turing and Ampere. Each subsequent architecture introduced new features and optimizations, especially in the context of AI and deep learning, but Pascal GPUs remain relevant for various applications.

Pascal (mathematician / physicist)

The pascal is also a unit of pressure, named after Blaise Pascal: 1 atm is approximately \(10^5\) pascal, where one pascal is 1 \(\frac{N}{m^2}\).

CPU and GPU Differences

| Feature          | CPU                                  | GPU                               |
|------------------|--------------------------------------|-----------------------------------|
| Function         | General-purpose processing           | Specialized for parallel tasks    |
| Architecture     | Few cores                            | Many cores                        |
| Parallelism      | Few threads                          | Many thousands of threads         |
| Memory hierarchy | Complex, low latency                 | High-throughput parallel access   |
| Flexibility      | Versatile, handles various workloads | Specialized for specific tasks    |
| Execution type   | Sequential                           | Parallel                          |

Please note that this table provides a high-level overview, and the specifics can vary based on the architecture and design of individual CPUs and GPUs. Additionally, advancements in technology may lead to changes in the characteristics of future CPU and GPU architectures.

CPU Beats GPU

| Task                        | CPU Advantage                                      |
|-----------------------------|----------------------------------------------------|
| Single-threaded work        | Higher clock speeds for individual tasks           |
| Sequential algorithms       | Better for step-by-step execution                  |
| Task switching              | Efficient handling of diverse concurrent tasks     |
| General-purpose computing   | Versatility for various computing tasks            |
| Operating system operations | Integral role in managing system operations        |
| Data serialization          | Sequential processing for dependent data           |
| Low-latency access          | Complex memory hierarchy for quick data retrieval  |

It's important to note that the superiority of CPUs in these scenarios does not imply that GPUs cannot perform these tasks. Rather, CPUs are typically more optimized for these specific types of workloads. The choice between CPU and GPU often depends on the nature of the task and the specific requirements of the application.

Current CPU Architecture

I am using Rocky Linux 9.2 in a Podman container, which by default runs me as the root user (superuser). The OS is built for x86_64, which means the CPU cores (20 cores on the chep cluster) are x86_64 processors.

Introduction

The term "x86_64" refers to a particular computer architecture that is widely used in modern CPUs. It is an extension of the original x86 architecture. Here's a breakdown of what "x86_64" means:

x86 Architecture:

The term "x86" originally referred to a family of instruction set architectures for certain processors, particularly those developed by Intel. The name "x86" is derived from the 86 in "8086," which was one of the early processors in this family.
64-bit Extension:

The "64" in "x86_64" indicates that this architecture is an extension to the original x86 architecture to support 64-bit computing. In the context of CPUs, a 64-bit architecture allows the processor to handle larger amounts of memory and perform computations involving 64-bit integers.

Compatibility:

x86_64 CPUs are designed to be backward compatible with the earlier 32-bit x86 architecture. This means they can run software that was written for 32-bit x86 processors. However, they also introduce additional features and capabilities associated with 64-bit computing.

Memory Addressing:

One of the significant advantages of the x86_64 architecture is its ability to address larger amounts of memory. While 32-bit architectures are limited to addressing 4 GB of RAM directly, x86_64 allows for a much larger address space, theoretically supporting up to 18.4 million TB (terabytes) of RAM.

Instruction Set:

The x86_64 architecture retains the x86 instruction set but introduces new 64-bit instructions. This allows the CPU to handle larger data sizes and perform more complex computations.

Widespread Adoption:

x86_64 has become the dominant architecture for personal computers and servers. Most modern desktop and laptop CPUs, as well as many server CPUs, use the x86_64 architecture.

Commonly Referred to as AMD64:

While initially introduced by AMD (Advanced Micro Devices), the x86_64 architecture is now commonly referred to as AMD64. Intel later adopted the same architecture and used the term "Intel 64" to refer to their implementation.

In summary, x86_64 is an extension of the x86 architecture that brings 64-bit computing capabilities to CPUs. It has become the standard architecture for most desktop, laptop, and server processors due to its backward compatibility and support for larger memory addressing.

64bit computing

64-bit computing refers to the use of processors that have a data bus, address bus, or both that are 64 bits wide. This specification refers to the amount of data that a processor can handle in a single instruction or the size of memory addresses. Here are the key aspects of 64-bit computing:

Data Bus Width:

In a 64-bit processor, the data bus is 64 bits wide. This means that the processor can handle 64 bits of data in a single instruction. The wider data bus allows for more data to be processed simultaneously, potentially leading to improved performance for certain types of computations.
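
A quick way to see the 64-bit data model from a program is a small C++ check (this compiles fine with g++ or nvcc; on Linux x86_64 the LP64 model makes pointers, size_t, and long 8 bytes wide):

```
#include <cstdio>

int main() {
    // On an x86_64 build, pointers and size_t are 8 bytes (64 bits) wide,
    // versus 4 bytes on a 32-bit x86 build.
    printf("sizeof(void*)  = %zu bytes\n", sizeof(void*));
    printf("sizeof(size_t) = %zu bytes\n", sizeof(size_t));
    printf("sizeof(long)   = %zu bytes\n", sizeof(long));   // 8 on Linux x86_64 (LP64)
    return 0;
}
```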

Memory Addressing:

One of the significant advantages of 64-bit computing is the ability to address larger amounts of memory. With a 64-bit address bus, a processor can theoretically address up to 2^64 individual memory locations. This translates to an enormous address space, allowing systems to support much larger amounts of RAM compared to 32-bit systems.

Increased Memory Capacity:

64-bit processors enable systems to access and use more RAM. While 32-bit systems are limited to addressing 4 GB of RAM directly, 64-bit systems can address several terabytes of RAM. This is particularly beneficial for memory-intensive applications and tasks.

The maximum amount of RAM that an x86_64 (64-bit) Intel processor can manage depends on various factors, including the specific processor architecture, the motherboard, and the operating system. However, in theory, the x86_64 architecture supports a vast amount of addressable memory.

Performance Improvements:

Certain types of applications, especially those dealing with large datasets or complex calculations, may experience performance improvements when running on a 64-bit system. The wider data bus allows for more efficient processing of larger chunks of data in each instruction.

Backward Compatibility:

64-bit processors are designed to be backward compatible with 32-bit software. This means that they can run both 32-bit and 64-bit applications. However, to fully utilize the benefits of 64-bit computing, applications need to be compiled as 64-bit versions.

Compatibility with 32-bit Systems:

While 64-bit processors can run 32-bit software, the reverse is not true. A 32-bit processor cannot directly run 64-bit software. Therefore, systems with 64-bit processors can accommodate both 32-bit and 64-bit applications.

Security Enhancements:

64-bit computing allows for certain security enhancements, including the ability to implement more advanced security features at the hardware level. This can contribute to improved system security.

In summary, 64-bit computing provides advantages in terms of larger memory addressing, increased RAM capacity, and potential performance improvements for certain types of applications. It has become the standard for modern desktops, laptops, and servers, allowing for more powerful and capable computing systems.

Here are the theoretical limits for the x86_64 architecture:

Theoretical Maximum Addressable Memory:

An x86_64 processor can theoretically address \(2^{64}\) memory locations (bytes).
This translates to 18.4 million terabytes (TB) of addressable memory.
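
For completeness, the arithmetic behind that figure (assuming byte-addressable memory):

\[
2^{64}\ \text{bytes} \approx 1.84 \times 10^{19}\ \text{bytes} = 1.84 \times 10^{7}\ \text{TB} \approx 18.4\ \text{million TB}
\]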

Practical Limits:

While the theoretical limit is extremely high, the practical limit is determined by other factors, such as the limitations of current memory technology, motherboard design, and the operating system.

Current Real-World Limits:

Common x86_64 desktop and server systems currently support up to several terabytes of RAM.
High-end servers and workstations may support memory configurations ranging from 256 GB to multiple terabytes.

Operating System and Hardware Considerations:

The maximum amount of supported RAM can also depend on the operating system. For example, Windows and Linux have different limitations.
The motherboard architecture, chipset, and BIOS/UEFI firmware play crucial roles in determining the maximum supported memory.

Intel Specifics:

Different Intel processors may have varying levels of support for memory. Some high-end Intel processors designed for servers or workstations may support larger amounts of RAM.

ECC (Error-Correcting Code) Memory:

ECC memory support can also affect the maximum amount of RAM a system can handle. ECC memory is commonly used in server environments for enhanced reliability but may have specific limitations.

Future Developments:

As technology evolves, new processor architectures and memory technologies may increase the practical limits of addressable memory.

Properties of Our GPU (Tesla P100)

| Property                            | Description                                                |
|-------------------------------------|------------------------------------------------------------|
| GPU Architecture                    | Pascal                                                     |
| CUDA Cores                          | 3,584                                                      |
| Memory Size                         | 16 GB HBM2                                                 |
| Memory Interface                    | 4096-bit                                                   |
| Memory Bandwidth                    | 732 GB/s                                                   |
| Base Clock                          | Approximately 1328 MHz                                     |
| Boost Clock                         | Varies (boosts higher under certain conditions)            |
| Compute Capability                  | 6.0                                                        |
| Streaming Multiprocessors (SMs)     | 56                                                         |
| Maximum Concurrent Warps per SM     | 64 (64 warps x 32 threads = 2,048 resident threads per SM) |
| Total Threads per GPU               | 114,688                                                    |
| GPUDirect (RDMA) Support            | Yes (for certain configurations)                           |
| Memory ECC (Error-Correcting Code)  | Yes                                                        |
| L2 Cache Size                       | 4 MB (shared by all SMs)                                   |
| Transistors                         | Approximately 15.3 billion                                 |
| Manufacturing Process               | 16 nm                                                      |
| Form Factor                         | PCIe (available in various form factors)                   |
| TDP (Thermal Design Power)          | 250 W (PCIe variant); up to 300 W (SXM2)                   |
| NVIDIA GPU Boost                    | Yes                                                        |
| GPU Virtualization (vGPU)           | Yes (with NVIDIA GRID)                                     |
| NVIDIA NVLink                       | Yes (for certain configurations)                           |
| NVIDIA GPUDirect RDMA               | Yes                                                        |

Please note that the total number of threads per GPU quoted above is the maximum number of threads resident at once, based on the number of SMs, the warps per SM, and the 32 threads per warp.

Number of threads

The Tesla P100, a Pascal-based GPU with compute capability 6.0, has 56 SMs. Each SM has 64 CUDA cores and can keep up to 64 warps (32 threads each) resident at a time. So for a given GPU node we have 56 x 64 x 32 = 114,688 threads resident in parallel per node (see the launch sketch below).

Threads per warp = 32 (1 warp)
Resident threads per SM = 64 x 32 = 2,048 (64 warps)
Resident threads per node = 56 x 2,048 = 114,688 threads
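
As a sketch, the launch below asks for exactly that many threads; the hardware then distributes the blocks across the 56 SMs. The block size is an illustrative choice, not a requirement.

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(int *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;   // every thread writes its own global index
}

int main() {
    // 56 SMs x 2048 resident threads per SM = 114,688 threads in flight.
    const int threadsPerBlock = 256;                             // 8 warps per block
    const int totalThreads    = 56 * 2048;                       // 114,688
    const int blocks          = totalThreads / threadsPerBlock;  // 448 blocks

    int *out;
    cudaMalloc(&out, totalThreads * sizeof(int));
    busyKernel<<<blocks, threadsPerBlock>>>(out);
    cudaDeviceSynchronize();

    printf("Launched %d blocks x %d threads = %d threads\n",
           blocks, threadsPerBlock, blocks * threadsPerBlock);
    cudaFree(out);
    return 0;
}
```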

The NVIDIA Tesla P100 has a base clock of 1190 MHz and a boost clock of 1329 MHz. It also has a memory clock of 715 MHz (1430 Mbps effective).
The Tesla P100 is a dual-slot card that draws power from a single 8-pin power connector. It has a maximum power draw of 250 W.
The Tesla P100 was released on April 5, 2016. It uses the Pascal architecture and has 3,584 CUDA cores.

Number of instructions per second

Given the boost clock of 1329 MHz, in one second the node can execute up to \(1329 \times 10^{6} \times 114688 \approx 1.524 \times 10^{14}\) instructions per second (IPS), assuming each resident thread retires one instruction per clock cycle. Since FLOPS = IPS / (number of instructions per floating-point operation), this corresponds to roughly \(1.524 \times 10^{13}\) FLOPS if we assume 10 instructions per floating-point operation.

FLOPS: Floating Point Operations Per Second
NFLO: Number of Floating Point Operations
IPS : Instructions Per Second
