The demonstration

It was decided that we meet at 3:30PM in the adjacent sitting space to discuss the ideas for the demonstrations. After a while during the discussions it occurred to me that maybe the parallel programming can be of some fun to general audience. The idea got some encouragement and my quest to write the magical code begun.

On the day itself, I knew what to do, I came back from the office and started writing down the code which can show how parallel programming and serial programming wins over one another.

The idea was simple, but not that easy to write it an hour or a day. But I was willing to finish the code within few hours, because

The problem was simple, add two objects of type :

\[ \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \\ \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \\ \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \\ \end{bmatrix} \]

Simple, right. Let's do it then. We will start with the largest array (such column vectors/matrices are called as arrays) possible and write a C++ code to find the resulting addition.

The idea is to first do this task on CPU with a C++ program, and then via use of parallel programming on GPU with some programming Languages (like CUDA), which support use of GPU.

Beginning of the puzzles

So all I had to do is to write 2 programs to do the same task and produce the advantages of parallel computing (programming on GPU) over serial programming (CPU) and vice versa.

So what's a program, it's just a set of instructions written Sequentially in a language which either be read directly by computer (if written in machine language (100000 010101 010101 001010)) or read by some software (compiler; which can translate it for the computer into assembly language and from assembly language to machine language (1's and 0's by help of assembler)).
Any content we write, see and hear via the computer is written in form of 0's and 1's. If we write a program, then every word of set of instructions has to be converted into these 1's and 0's. Such form is often termed as binary language (or machine language) or the number system with base 2. It turns out that it is the way to store the information or instructions using on(or 1) and off(or 0) state of very tiny switches (also known as transistors) which our computer is mostly made up of.

Since we can't write each instruction in form of 1's and 0's, we have developed a way to convert the instructions written in familiar language (English or C or C++ or Python) and the compiler for that language plus assembler convert it to required machine language.

It turns out that C or C++ is the fastest language available. I chose C++ and started writing the desired code.

I might go bit technical here, will provide links to unknown keywords.

Serial Programming

Algorithm (Plan)

The Algorithm or the way to write the code was simple, include some header files(to support some functionality to read and write, and write the main program (called as int main() or void main() which are main heart of the program)).

The program or the set of instructions can be reproduced from these steps:

  1. Define a size(number of elements) of each array(both array has to be of equal size to add each respective \(n^{th}\) element).
  2. Store the size of array on some part of memory(size of array is just a number).
  3. Allocate space or memory of N( or size) numbers. Do it for another N such numbers too.
  4. Now fill up the memory space with some numbers.
  5. Write a piece of instruction to add these numbers \(a_{i}+b_{i}=c_{i}\) for all \(i\le N\).
  6. Store the respective addition of these N numbers into N memory spaces of first or second vector there itself. Therefore the first or second vector memory locations will have the answer of addition.
  7. Print the addition result for first or last few elements to confirm.

Let the fun begin

I started to finish the first step of plan(Algorithm).

So the program started with inclusion of header files, and defining main function. I knew that in order to show power of GPUs over CPUs, I have to start with arrays of largest possible size.
The largest possible size? Now the computer have two such memory locations:

  1. RAM (Random Access Memory)
  2. Hard disk storage or Solid state drives

I found that the size available was 256GB(1GB=1024x1024x1024x8 bits) of RAM, and 10TB(1TB=1024x1024x1024x1024x8) of Hard Disk space. I thought to use the maximum of these, to amaze the general audience.

Since the number of elements belongs to integers, I started with integer data type(All numbers and characters belongs to family of data types.

#include<iostream>
int main()
{
int N=10000000000;  //not allowed the maximum it can store is 2^{31}-1
}

I kept pressing on key '0' to make the size (N) as large as possible, it wasn't long to realise that I just can't store any such large number in variable 'N'. Since 'N' is declared of data type (int) or integer, which compiler can only assign 4 bytes of memory with each byte having 8 bits of space, where place of a bit can be filled by either 0 or 1. With 4 such bytes I am left with 32 bits. So with these bits, the maximum number I can store was just \(2^{31} -1\) rather than \(2^{32}-1\), where did the 1 bit go, it turns out that the compiler also takes care of negative integers by taking one bit reserved for negative numbers.

With \(2^{31}-1\) as maximum size I can allocate with integer data type, I was bit off. Sooner I found that there is another data type which can store only positive integers(unsigned integers(no need for 1 bit reservation for negative numbers now)) making the total number of elements double. I was happy and so I tried and again this large number can't be stored in this variable..

unsigned int N= 10000000000; //again this large number can't be stored in this variable..

The maximum number (\(2^{32}-1\)) can be stored using 'unsigned int' and binary shift operator (<< or >>).
but again error was on the way.....

unsigned int N = 1<<32 - 1 ; //here 1<<32 is just (100000000000000000000000000000000=2^32(in decimal number system(with base 10, our number system with 10 digits))) but we can't add to objects of where one piece is bigger than can be stored in unsigned int data type.

We can't store 1<<32 or \(2^{32}\) in unsigned integer data type before we reduce 1 from it.

I then tried to look for other data types, turns out that I can use 'long int(8 bytes)' or 'long long int(16 bytes)' or just 'long long(16 bytes)' to declare the sizes of the array, here the keywords like 'long' or 'long long' are termed as data modifiers. These modifiers allocate the more number of space to store larger numbers into memory.

But if I choose a single number so large which can occupy the whole of available storage, then where will I store the N elements of both the arrays? It is better to look for the largest possible data type to store largest possible integer on the computer. It was 'unsigned long long int', which allocates 16 bytes for each variable of its type. For 16 bytes(16x8=128 bits), the maximum number I can store is \(2^{128} - 1\).

This number is so huge that even my text editor (neovim which is used for writing this blog) takes minute to save the file.

For a moment even if we allocate memory for \(2^{127}\) elements(int *array1=new int[N];) and entered integers values (4 bytes) at all these \(2^{127}\) places for both column matrices. Even for entering the N+N values the CPU with (3400MHz clock frequency), will take 6 instructions per loop. In total it will take \(6\times2^{127}\) Instructions. And CPU can take care of \(3400.10^{6}\) instructions per second.
So the time CPU will take to assign the N+N values will be : \(\frac{6.2^{127}}{34. 10^{8}} = 3.022\times10^{29}\) sec. or \(3.022\times10^{22}\) years. We don't have that much time obviously.
Also storing 4 bytes at \(2^{127}\) places of memory is not possible as it requires huge amount of memory \(2^{102}\) GB of storage.

unsigned long long int N=1<<127;
//define both the column matrices and name these array1 and array2
for (unsigned long long int i=0;i<N;i++) //2 instructions to enter into loop each time(i++ and i<N)
{
    array1[i]=7;  //2 instruction or,  *(array+i)=7;
    array2[i]=3;  //2 instruction or, *(array+i)=3;
}

Later, I found that the number of instructions I was guessing were not entirely correct (correct number of instructions are little more then anticipated), it is better to generate assembly code using g++ -S programName.cpp to generate the assembly file and look for number of instructions for the above 'for' loop.

Starting from basic (modifying single array)

So it can't be the largest possible number a data type(unsigned long long int) can store, and with two such array the amount of required memory becomes double. I then thought to just store a single array with some numbers, and write a code to modify these numbers and store the result at same memory locations.

I thought to find the largest possible array which can be stored on 250GB of RAM. With 4 bytes at each locations(aiming to store integers or float/real numbers), I can store \(\frac{250.(1024.1024.1024)}{4}=67108864000\) numbers on the RAM. This number (67108864000) first needed to be registered in memory and we can't use integer data type(since int N=67108864000; can't be stored with 4 bytes or 32 bits of 1 and 0) , so we have to use unsigned long int data type which can store this number at some memory location (since unsigned long int reserves 8 bytes for data, we can store maximum of \(2^{64} - 1\) = huge number.

Good news, we now have a number which can be stored using data type unsigned long int and we can reserve size of 4 bytes (for integers upto some large number), for each of the N=\(2^{64}-1\) numbers. For the moment we can start with just \(2^{63}\) numbers.

I tried with following:

#include<iostream>
int main()
{
unsigned long int N=1<<63; //wrong as we need to explicitly tell that 1 is Long literal
}

output: Error: shift count > = Width of type, which can be get rid with use of

#include<iostream>
int main()
{
unsigned long int N=1L<<63; //telling 1 belongs to Long int rather than 'int' data type
}

I am now planning to store integers on these N memory locations with the following code:

#include<iostream>
int main()
{
unsigned long int N=1L<<63;
int arr[N];
for (int i=0;i<N;i++)
{
    arr[i]=i; //storing i at each place of the array.
}
return 0;
}

For most of the people the code will make sense, unless we really observe. The first error is in declaring an array of size N using int arr[N];, because this method allocate memory in stack, and since the stack memory is limited from 1MB to 8MB, we can't store 250GB sized array on these locations.
One has to look at dynamic memory allocation of the array, which is done by int *dynamic_array= new int[N]; or by using std::vector<int> arr(N);. This allocates memory in RAM region of space, so I thought that this is the best the compiler can do (Later while writing about this task, I found that it is possible to write on Hard disk and read from there and modify the vectors, but this story is saved for the end part.)

The second error is just a warning as the loop will not continue N times, since we have declared that variable 'i' is an integer, which will only run from 0 to \(2^{31}-1\) rather then 0 to \(2^{63}-1\). We have to use correct data types for variable 'i'.

I tried with 'std::vector<int> arr(N);'

#include<iostream>
int main()
{
unsigned long int N=1L<<63;
std::vector<int> arr(N); //takes a lot of instructions to assign memory ( so many seconds just to execute this line of code )
for (unsigned long int i=0;i<N;i++)
{
    arr[i]=i; //storing i at each place of the array.
}
return 0;
}

Turns out that there is some faster way of allocating the memory, it was the int *arr=new int[N];, but one has to delete the assigned memory locations explicitly in the end of the program.

#include<iostream>
int main()
{
unsigned long int N=1L<<63;
int *arr= new int[N]; //allocates memory in no time.
for (unsigned long int i=0;i<N;i++)
{
    arr[i]=i; //storing i at each place of the array. Still wrong, as it won't store 'i' at a[i] locations for i>2^{31}-1
}

delete [] arr; //we need to delete the assigned memory locations explicitly.
return 0;
}

This method returns the first address of the allocated memory for 'N' integer variables. To store an address in variable 'arr' we have to declare it with some symbol which makes the variable 'arr' an pointer variable.

Now it seems that we are fine, with the above code, but there is still something wrong, we have declared that the array locations can only store integer (integers store from \(-2^{31}\) to \(2^{31}-1\) and not upto \(2^{63}-1\), which is full range of 'i'), but in the loop I am instructing compiler to store value of 'i' which could be greater than what an integer data type can store. If I increase data type of array to be 'unsigned long int', then each of N elements require double the memory (8 bytes instead of 4 bytes per memory location); i.e. if I use long int * arr=new long int[N];, this will try to allocate 500GB of space, which I don't have.

#include<iostream>
int main()
{
unsigned long int N=1L<<63;
unsigned long int *arr= new unsigned long int[N]; //I don't have 500GB of RAM
for (unsigned long int i=0;i<N;i++)
{
    arr[i]=i; //storing i at each place of the array.
}
return 0;
}

If I am stubborn to insert the integers at each memory location with value of indices (i), then I am constrained to use the largest possible array to be of type unsigned integer.

Later it turns out that there are similar data types ('uint64_t, uint32_t, uint16_t, uint8_t') to allocate memory of 64bits, 32bits, 16bits and 8bit respectively for the declared variable.

Moving back to int

With all the high hopes in mind, I was constrained to chose the unsigned integers data types (which only store positive numbers).

I gave my first try with the following code:


//This program is intended to demonstrate simple serial programming where a vectors element gets modified serially. 
//
#include <iostream>
#include <vector>
#include <cmath>

using std::cout;
using std::endl;
using std::vector;

void increase_magnitude(float *starting_address,unsigned int size_vec, float mag_multiplyer)
{
  //the variable *starting_address is a pointer which will point to (or contain address of), the first element of the array.
  for (int i=0;i<size_vec;i++)
  {
    //let us multiply the whole vector by 2
    //since we have address of first element of vector, we can take its value by use of *
    *starting_address=*starting_address * 2;
    //now we shall increase its address by 1.
    starting_address+=1;
  }
  //in this way at the same memory location we will have modified the vector.
}
float magnitude_finder(float *starting_address_vec,unsigned int size)
{
  //This function will return the magnitude of the vector. 
  float sum=0.0;
  for (int i=0;i<size;i++)
  {
    sum+=pow((*starting_address_vec),2);
  }
 return pow(sum,0.5); 
}
int main()
{
  unsigned int N=1<<31; //the number implies 31 zeroes in front of 1. So it is 2^31.
  vector<float> vec(N); //whenever we declare an vector, it has undefined size(if we don't provide the size). Unless we declare it with some size. 
  //declaring vector of finite size; vector<int> v(100);
  //declaring vector without size; vector<int> vec;

  //let us fill the vector with N natural numbers(1,2,.......N).
  for (int i=0;i<N;i++)
  {
      vec[i]=i;
  }
  //let us define a pointer to integer vector
  float *ptr_to_vec=&vec[0]; //giving address of first element. Or we can just write: 

  //we shall modify the vector by multiply it with some real number. 
  //Real numbers are stored in float data types(require 4byte per real number) and double data types(8bytes).
  float multiplier=4.0;
  //Let us find the magnitude before changing the vector, 
  float mag_before=magnitude_finder(ptr_to_vec,N);
  increase_magnitude(ptr_to_vec,N,multiplier);
  float mag_after=magnitude_finder(ptr_to_vec,N);
  float ratio_of_magnitudes=mag_after/mag_before;
  cout<<"the final vector has magnitude "<<ratio_of_magnitudes<<" times the earlier one"<<endl<<"which should be equal to "<<pow(multiplier,0.5)<<endl;
  return 0;
}

There are still some mistakes, I hope you can find these out.

Think what are these mistakes, think really hard.

Let me show my tryouts:

Try outs

  1. code1_demo
  2. code2_demo
  3. code3_demo
  4. code4 demo
  5. code5 demo
  6. code6_demo
  7. code7_demo

Returning back to sum of two

Serial programming

After going through all the mistakes which I later wrote above while writing this blog, I am sharing the first try:

The first try out:

/******************************************************************************
* File:             code7.cpp
*
* Author:           Vishal  
* Created:          01/30/24 
* Description:      This code is supposed to do vector addition for cpu for very large size array of size 2^32
*****************************************************************************/
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>
#include <cmath>
#include <chrono>

using std::cout;
using std::endl;
void AddVectorsOnCPU(double *array1, double *array2, size_t size)
{
  for (size_t i=0;i<size;i++)
  {
    //array1[i]+=array2[i];  //or *(array+i)+=*(array2+i);
    *array1+=*array2;
    array1+=1;
    array2+=1;
  }
}
int main(int argc, char *argv[])
{
  auto time_0 = std::chrono::high_resolution_clock::now();
  size_t size1=1L<<32; //1L denotes long data type of 1.
  size_t size2=size1; 
  auto time_1 = std::chrono::high_resolution_clock::now();
  double *vec1=new double[size1]; //dynamics_array allocation does not take any time comparable to stack memory 
  double *vec2=new double[size2]; //dynamics_array allocation does not take any time comparable to stack memory 
  //let us fill the vector with N natural numbers(1,2,.......N).
  auto time_2 = std::chrono::high_resolution_clock::now();
  auto elapsed_time_1 = std::chrono::duration_cast<std::chrono::microseconds>(time_2 - time_1).count() / 1e6;
  auto elapsed_time_storing_N = std::chrono::duration_cast<std::chrono::microseconds>(time_1 - time_1).count() / 1e6;
  for (size_t i=0;i<size1;i++)
  {
    vec1[i]=i+1;
    vec2[i]=i+2;
    //*(vec1+i)=i+1;
    //*(vec2+i)=i+2;
  }
  double sum_f=vec1[size1-1]+vec2[size2-1]; //
  printf("The last element of the array after sum shall be: %f \n",sum_f);
  auto time_3 = std::chrono::high_resolution_clock::now();
  auto elapsed_time_assigning_values = std::chrono::duration_cast<std::chrono::microseconds>(time_3 - time_2).count() / 1e6;
  auto time_4 = std::chrono::high_resolution_clock::now();
  AddVectorsOnCPU(vec1,vec2,size1);
  auto time_5 = std::chrono::high_resolution_clock::now();
  auto elapsed_time_modify_mag = std::chrono::duration_cast<std::chrono::microseconds>(time_5 - time_4).count() / 1e6;
  //let us print last 2 elements of new vector
  printf("The last 3 elements of the result are: \n")
  for(int i=0;i<3;i++)
  {
    printf("%f   ,",vec1[size1-(2-i)-1]);
  }
  cout<<"\nthe time to assign N: "<<elapsed_time_storing_N<<" seconds"<<endl;
  cout<<"the time to declare vector of N with doubles data type : "<<elapsed_time_1<<" seconds"<<endl;
  cout<<"the time to assign vectors: "<<elapsed_time_assigning_values<<" seconds"<<endl;
  cout<<"the time to modify the vector_in: "<<elapsed_time_modify_mag<<" seconds"<<endl;
  cout<<" The total time shall be :"<<elapsed_time_1+elapsed_time_assigning_values+elapsed_time_modify_mag<<" seconds"<<endl;
  delete[] vec1;
  delete[] vec2;
  return 0;
}

Output

The last element of the array after sum shall be: 8589934593.000000
The last 3 elements of the result are:
8589934589.000000 ,8589934591.000000 ,8589934593.000000 ,
the time to assign N: 0 seconds
the time to declare vector of N with doubles data type : 1.3e-05 seconds
the time to assign vectors: 26.8489 seconds
the time to modify the vector_in: 10.8359 seconds
The total time shall be :37.6848 seconds

real 0m37.837s
user 0m23.566s
sys 0m14.189s

Finally, found the code for serial programming, let us do that for parallel programming

Parallel Programming

The parallel programming is much more efficient in doing this task, since parallel programming requires use of multiple cores at the same time. This task is performed on NVIDIA GPUs, since these GPU has many cores/ALU. I am using GPU cluster with 56 SM (Streaming Multiprocessors) where each of these Streaming Multiprocessors has 64 cores, and each cores can run 32 threads concurrently. Which provides us \(56\times64\times32=1,14,688\) threads. This means that instead of adding each \(a_{i}+b_{i}\), one after one (in serial programming done above), we can utilize the parallel addition by maximum of 114688 threads at single clock cycle of GPU.
The GPU has 750MHz of clock speed. Which means that this GPU has 5 times slower clock than our CPU with 3500MHz clock speed, and with this speed GPU can handle 750,000,000 instructions per second.

Since I am using NVIDIA GPUs, we are constrained to use CUDA.

NOTE: We use the term host for CPU and the term device for GPU.

We have to start with little different Algorithm:

Algorithm

  1. We will declare the size of array.
  2. We have to allocate memory for two array on CPU
  3. We then also have to allocate the memory of two array on GPU
  4. We assign some value to both the array on CPU
  5. We copy both the array on memory locations of GPU
  6. We declare grid dimensions(number of blocks in x,y,z) and block dimensions (number of threads in x, y,z).

We can't just define any grid size and block size. Run this program

  1. Maximum blocks in x direction is \(2^{31}-1\).
  2. Maximum blocks in y direction is \(2^{16}-1\).
  3. Maximum blocks in z direction is \(2^{16}-1\).
  4. Maximum threads per block is 1024 or \(2^{10}\) or 1<<10;
  5. We then launch the kernel(we just name the function to be executed on device to be kernel).
  6. CPU shall be told to wait untill kernel finishes the task.
  7. The result will be saved on device memory locations, so we copy the data back.
  8. Print the results and time it took to perform the task.

Husttle

I started writing code for just taking an array of 100 elements and multiplying each element by 2. The resulting vector will be stored at respective memory locations.

#include <cstdint>
#include <iostream>
__global__ void modify(int *p, int N)
{
    int index=threadIdx.x+blockDim.x*blockIdx.x;
    if(index<N)
    {
        p[index]*=2;
    }
}
int main(int argc, char *argv[])
{
    int N=100;
    int *d_p;
    int *h_p=new int[N]; 
    for (int i=0;i<N;i++)
    {
        h_p[i]=i;
    }
    //need to allocate memory on device (GPU)
    cudaMalloc((void**)&d_p,N*sizeof(int));
    //copy the vector h_p to respective locations of d_p
    cudaMemcpy(d_p, h_p,N*sizeof(int) ,cudaMemcpyHostToDevice);
    //declaring grid_size (number of blocks)
    dim3 grid_size(1); //using 1 block in x direction only
    dim3 block_size(100); //using 100 threads in this block
    //calling kernel to add each element by 2
    modify<<<grid_size,block_size>>>(d_p,N);
    cudaDeviceSynchronize(); // to tell the cpu to wait for the GPU to finish the task
    //copying the data back
    cudaMemcpy(h_p,d_p,N*sizeof(int),cudaMemcpyDeviceToHost);
    //printing the first 10 resulting numbers
    for (int i=0;i<10;i++)
    {
        std::cout<<h_p[i]<<std::endl;
    }
    //deleting the memory locations and free the locations
    cudaFree(d_p);
    delete [] h_p; // this is how we free the array declared on host (CPU)
    return 0;
}

Output:

0
2
4
6
8
10
12
14
16
18

real 0m0.454s
user 0m0.037s
sys 0m0.400s

Adding Two vectors (Parallel Programming)

The major problem is that the GPU Nodes can maximum store 16384 MiB. So I can't store anything which require 16GB, I am then restricted to use 8 GB of total space, 4 GB for each vector. To demonstrate the efficiency I am restricted to use maximum size of the array, so all I had to do is to reduce the size of data type it can store.

Returning back to basic data type which store 8 bits. uint8_t, this can store numbers from 0 to \(2^8 -1=255\).
Now I can't store the \(a[i]=i\), so I had to store some numbers like 1, 2, 3...such that the sum of these don't exceed the number 255.

The maximum number of elements both array can hold is \(2^{32}\) with each element occupying 1 byte (8 bits). In this way each array will occupy \(2^{30}\times2^{4}bytes\) =4 GB (1GB = \(2^{30}\) bytes) of RAM of GPU.

Algorithm

  1. Declare the size of both the array size_t size1=1L<<32;
  2. Define and allocate the memory of host arrays.uint8_t *host_pointer_array1 = new uint8_t[size1];
  3. Fill the array elements with numbers 1 and 2 in array1 and array2 respectively.
  4. Allocate space for each of two vectors on GPU using cudaMemcpy(device_pointer_array1,host_pointer_array1,size1*sizeof(uint8_t),cudaMemcpyHostToDevice);
  5. Define grid size dim3 grid_size(1<<22) and block sizes dim3 block_size(1<<10).
  6. Define a function which runs on device to add two copied array on device and save the output on first device array.
  7. wait for the function(kernel) to finish. cudaDeviceSynchronize();
  8. Do it 50 times (in a loop).
  9. Copy the results back to host_pointer_array1. cudaMemcpy(host_pointer_array1,device_pointer_array1,size1*sizeof( uint8_t),cudaMemcpyDeviceToHost);
  10. Print the result.

Code (parallel programming)

/******************************************************************************
* File:             example_cuda.cpp
*
* Author:           Vishal Rao  
* Created:          01/30/24 
* Description:      This program will calculate the sum of vectors on gpu
*****************************************************************************/

#include <cstdint>
#include <chrono>
#include <iostream>
#include <sys/types.h>
using std::cout;
using std::endl;

__global__ void AddVectorsOnGPU( uint8_t *deviceptr1, uint8_t *deviceptr2, size_t size)
{
  size_t index=threadIdx.x + blockIdx.x * blockDim.x;
  //if(index==0) printf("Executing first thread \n");
 // if(index==size-1) printf("Executing last thread \n");
  if (index<size)
  {
    deviceptr1[index]+=deviceptr2[index];
  }
}
int main(int argc, char *argv[])
{
  size_t size1=1L<<32;   //this allocates 8 GB(because of uint8_t data type of array) of RAM , so this is max we can do 
   uint8_t *host_pointer_array1 = new uint8_t[size1];
   uint8_t *host_pointer_array2 = new uint8_t[size1];

//let us fill first array elements with 1 and another array elements with 99. 
  auto time_0 = std::chrono::high_resolution_clock::now();
   for(size_t i=0; i<size1;i++)
   {
     host_pointer_array1[i]=1;
     host_pointer_array2[i]=2;
   }
  auto time_1 = std::chrono::high_resolution_clock::now();
//let us allocate the memory for two vectors on gpu
   uint8_t *device_pointer_array1;
   uint8_t *device_pointer_array2;
  cudaMalloc((void**)&device_pointer_array1,size1*sizeof( uint8_t));
  cudaMalloc((void**)&device_pointer_array2,size1*sizeof( uint8_t));

//let us copy the arrays
  auto time_2 = std::chrono::high_resolution_clock::now();
  cudaMemcpy(device_pointer_array1,host_pointer_array1,size1*sizeof(uint8_t),cudaMemcpyHostToDevice);
  cudaMemcpy(device_pointer_array2,host_pointer_array2,size1*sizeof(uint8_t),cudaMemcpyHostToDevice);
  auto time_3 = std::chrono::high_resolution_clock::now();

  dim3 grid_size(1<<22);  //2^22 blocks, 
  dim3 block_size(1<<10); //2^10 threds per block

  auto time_4 = std::chrono::high_resolution_clock::now();
  for(int i=0;i<=50;i++)
  {
  AddVectorsOnGPU<<<grid_size,block_size>>>(device_pointer_array1,device_pointer_array2,size1);
  cudaDeviceSynchronize();
}
  //we will add the vectors and store that in address of device_pointer_array1
  //we shall copy the data back to host
  auto time_5 = std::chrono::high_resolution_clock::now();
  cudaMemcpy(host_pointer_array1,device_pointer_array1,size1*sizeof( uint8_t),cudaMemcpyDeviceToHost);
  auto time_6 = std::chrono::high_resolution_clock::now();



  auto elapsed_time_1 = std::chrono::duration_cast<std::chrono::microseconds>(time_1 - time_0).count() / 1e6;
  auto elapsed_time_2 = std::chrono::duration_cast<std::chrono::microseconds>(time_2 - time_1).count() / 1e6;
  auto elapsed_time_3 = std::chrono::duration_cast<std::chrono::microseconds>(time_3 - time_2).count() / 1e6;
  auto elapsed_time_4 = std::chrono::duration_cast<std::chrono::microseconds>(time_4 - time_3).count() / 1e6;
  auto elapsed_time_5 = std::chrono::duration_cast<std::chrono::microseconds>(time_5 - time_4).count() / 1e6;
  auto elapsed_time_6 = std::chrono::duration_cast<std::chrono::microseconds>(time_6 - time_5).count() / 1e6;

  //once copied we can print the array.
  printf("last 3 elements are (%d , %d, %d)   \n ",host_pointer_array1[size1-3],host_pointer_array1[size1-2],host_pointer_array1[size1-1]);

  //let us print the times
  cout<<"Time to perform the vector addition of "<<size1<<" elements done "<<50<<" times one after another is "<<elapsed_time_5<<" sec"<<endl;
  cout<<"Time to assign the vectors was "<<elapsed_time_1<<" seconds"<<endl;
  cout<<"Time to copy the arrays to GPU "<<elapsed_time_3<<" seconds"<<endl;
  cout<<"Time to copy the array back to CPU "<<elapsed_time_6<<" seconds"<<endl;
  cudaFree(device_pointer_array1);
  cudaFree(device_pointer_array2);
  delete [] host_pointer_array1;
  delete [] host_pointer_array2;
  return 0;
}

Output:

last 3 elements are (103 , 103, 103)
Time to perform the vector addition of 4294967296 elements done 50 times one after another is 2.37934 sec
Time to assign the vectors was 10.8694 seconds
Time to copy the arrays to GPU 0.936566 seconds
Time to copy the array back to CPU 0.558665 seconds

real 0m15.330s
user 0m12.975s
sys 0m2.303s

Comparing the same with serial programming

/******************************************************************************
* File:             code7.cpp
*
* Author:           Vishal  
* Created:          01/30/24 
* Description:      This code is supposed to do vector addition for cpu for very large size array of size 2^32
*****************************************************************************/
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>
#include <cmath>
#include <chrono>
using std::cout;
using std::endl;
void AddVectorsOnCPU(uint8_t *array1, uint8_t *array2, size_t size)
{
  for (size_t i=0;i<size;i++)
  {
    //if(i==0) printf("adding first element \n");
   // if(i==size-1) printf("adding last element \n");
    array1[i]+=array2[i];
    /* *array1+=*array2; */
    /* array1+=1; */
    /* array2+=1; */
  }
}
int main(int argc, char *argv[])
{
  size_t size1=1L<<32; //1L denotes long data type of 1.
  size_t size2=size1; 
  uint8_t *vec1=new uint8_t[size1]; //dynamics_array allocation does not take any time comparable to stack memory 
  uint8_t *vec2=new uint8_t[size2]; //dynamics_array allocation does not take any time comparable to stack memory 

  //we can print the starting address of these array:
  cout<<"Printing the starting address of both array \n";
  cout<<static_cast<void*>(vec1)<<" "<<static_cast<void*>(vec2)<<endl;
  cout<<"Printing the last address of both array \n";
  cout<<static_cast<void*>(vec1+size1-1)<<" "<<static_cast<void*>(vec2+size1-1)<<endl;

  //let us fill the vector with 1 and 99 respectively.
  auto time_0 = std::chrono::high_resolution_clock::now();
  for (size_t i=0;i<size1;i++)
  {
    vec1[i]=1;
    vec2[i]=2;
  }
  auto time_1 = std::chrono::high_resolution_clock::now();
  for(int i=0;i<=50;i++)
  {
    AddVectorsOnCPU(vec1,vec2,size1);
  }
  auto time_2 = std::chrono::high_resolution_clock::now();

//finding elapsed_times
  auto elapsed_time_assigning_values = std::chrono::duration_cast<std::chrono::microseconds>(time_1 - time_0).count() / 1e6;
  auto elapsed_time_modify_mag = std::chrono::duration_cast<std::chrono::microseconds>(time_2 - time_1).count() / 1e6;

 printf(" Last 3 elements are ( %d, %d, %d ) \n",vec1[size1-3],vec1[size1-2],vec1[size1-1]);;
  cout<<"the time to assign vectors: "<<elapsed_time_assigning_values<<" seconds"<<endl;
  cout<<"the time to add the vectors : "<<elapsed_time_modify_mag<<" seconds"<<endl;
  cout<<" The total time shall be :"<<elapsed_time_assigning_values+elapsed_time_modify_mag<<" seconds"<<endl;
  delete[] vec1;
  delete[] vec2;
  return 0;
}

Output:

Printing the starting address of both array
0x7ff8d42b0010 0x7ff7d42af010
Printing the last address of both array
0x7ff9d42b000f 0x7ff8d42af00f
Last 3 elements are ( 103, 103, 103 )
the time to assign vectors: 10.954 seconds
the time to add the vectors : 555.668 seconds
The total time shall be :566.622 seconds

real 9m26.644s
user 9m23.875s
sys 0m1.867s

Converting the code to Assembly language and then to machine language

  1. Assembly_code_for_serial_programm_above
  2. machine_language_for_serial_program_above

Additional resources

https://aws.amazon.com/compare/the-difference-between-gpus-cpus/#:~:text=GPUs%20excel%20in%20parallel%20processing,them%20through%20at%20high%20speed.

Thankyou! for reading...