

#### Introduction to GPU Computing

Martin Merck, IT Lunch, February 26<sup>th</sup>, 2010



# What is GPU computing?

- 3D graphics cards need to perform lots of matrix operations to render images
  - To achieve the required throughput, graphics cards implement massively parallel computing
  - Basically classic vector processing at a much bigger scale
  - CHEAP! (Gaming hardware is driving the development and has a mass market)



#### Computing power evolution





# Difference between a classical CPU and the GPU architecture






# Detailed architecture of the NVIDIA graphics cards



- Basic building block is a Streaming Multiprocessor (SM) (30 on GTX 295)
  - Executes the same program on all processors
  - Branching is handled by executing all branches one after the other and turning off the processors whose threads did not take the current branch
  - Each multiprocessor consists of 8 Scalar Processors (SP), 2 special function units and shared (fast) memory
  - Runs 32 threads concurrently as one group (called a warp)
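The branching behaviour described above can be made concrete with a small sketch. The names `divergentKernel` and `runDivergent` are hypothetical and not part of the talk. The kernel lets even and odd threads of one warp take different branches, which is the case where the hardware has to run both branches in turn.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: even and odd threads take different branches.
// Within one warp the two branches are executed one after the other,
// with the threads that did not take the current branch switched off.
__global__ void divergentKernel(int* out)
{
    int i = threadIdx.x;
    if (i % 2 == 0) {
        out[i] = 2 * i;   // even threads active, odd threads idle
    } else {
        out[i] = -i;      // odd threads active, even threads idle
    }
}

// Hypothetical host helper: runs a single warp (32 threads) and returns the results.
std::vector<int> runDivergent()
{
    const int n = 32;
    std::vector<int> host(n, 0);
    int* dev = nullptr;
    cudaMalloc((void**)&dev, n * sizeof(int));
    divergentKernel<<<1, n>>>(dev);
    cudaMemcpy(host.data(), dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

The result is the same as on a CPU. Only the execution time is affected, because a warp with diverging threads pays for both branches.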



#### The CUDA software architecture

- Basic programming unit is a "Kernel"
  - Represents a function which is executed on a CUDA device in a huge number of parallel threads (~1,000 to 10,000)
  - Individual threads are combined into "blocks". A block can have up to 3 dimensions to map easily to vectors, matrices and fields
  - Each block is executed on one multiprocessor
  - Blocks are grouped at a higher level into one- or two-dimensional "grids". This allows the same kernel to be executed on several of the multiprocessors
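As a rough illustration of the kernel/block/grid hierarchy, the sketch below launches a three-dimensional block inside a two-dimensional grid and counts how many threads actually ran. The names `countThreads` and `launchAndCount` and the chosen dimensions are assumptions made for this example only.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread adds 1 to a shared counter in global memory.
__global__ void countThreads(unsigned int* counter)
{
    atomicAdd(counter, 1u);
}

// Hypothetical host helper: launches 8 blocks of 64 threads and returns the count.
unsigned int launchAndCount()
{
    dim3 block(8, 4, 2);   // 8 * 4 * 2 = 64 threads per block (3 dimensions)
    dim3 grid(4, 2);       // 4 * 2 = 8 blocks in the grid (2 dimensions)

    unsigned int host = 0;
    unsigned int* dev = nullptr;
    cudaMalloc((void**)&dev, sizeof(unsigned int));
    cudaMemset(dev, 0, sizeof(unsigned int));

    countThreads<<<grid, block>>>(dev);

    cudaMemcpy(&host, dev, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;           // 8 blocks * 64 threads
}
```

The two arguments inside `<<< >>>` are the grid size and the block size. The hardware distributes the 8 blocks over the available multiprocessors.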



#### Software



Every thread runs the same code. To select the data it should process, each thread reads the built-in variables blockIdx / gridDim and threadIdx / blockDim. These are 3-dimensional indices.
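A minimal sketch of how these index variables are typically combined, assuming a one-dimensional vector addition. The names `vecAdd` and `addOnDevice` and the block size of 256 are illustrative choices, not taken from the talk.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread computes its global index and handles one element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (i < n) {                                    // the last block may be partly unused
        c[i] = a[i] + b[i];
    }
}

// Hypothetical host helper: adds two equally sized, non-empty vectors on the device.
std::vector<float> addOnDevice(const std::vector<float>& a, const std::vector<float>& b)
{
    int n = (int)a.size();
    size_t bytes = n * sizeof(float);
    std::vector<float> c(n, 0.0f);

    float *a_d, *b_d, *c_d;
    cudaMalloc((void**)&a_d, bytes);
    cudaMalloc((void**)&b_d, bytes);
    cudaMalloc((void**)&c_d, bytes);
    cudaMemcpy(a_d, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                              // threads per block (assumed)
    int blocks = (n + threads - 1) / threads;       // enough blocks to cover n elements
    vecAdd<<<blocks, threads>>>(a_d, b_d, c_d, n);

    cudaMemcpy(c.data(), c_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    return c;
}
```

The bounds check matters because the number of launched threads is rounded up to a multiple of the block size.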






# A simple code example: Matrix Multiplication 1

```
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Matrix multiplication kernel: per-thread code.
// Each thread computes one element of C = A * B (square, row-major matrices).
__global__ void MatrixMulKernel(float* A, float* B, float* C, int Width)
{
    float Celement = 0;
    for (int k = 0; k < Width; ++k) {
        float Aelement = A[ threadIdx.y * Width + k ];
        float Belement = B[ k * Width + threadIdx.x ];
        Celement += Aelement * Belement;
    }
    C[ threadIdx.y * Width + threadIdx.x ] = Celement;
}
```

# A simple code example: Matrix Multiplication 1 (continued)

```
int main(void) {
    float *a_h, *b_h, *c_h;          // pointers to host memory
    float *a_d, *b_d, *c_d;          // pointers to device memory
    int Width = 10;
    int N = Width * Width;           // number of matrix elements
    size_t size = N * sizeof(float);

    // allocate arrays on host
    a_h = (float *) malloc(size);
    b_h = (float *) malloc(size);
    c_h = (float *) malloc(size);

    // Initialize arrays (example values; the slide does not show this step)
    for (int i = 0; i < N; ++i) { a_h[i] = 1.0f; b_h[i] = 2.0f; }

    cudaMalloc((void **) &a_d, size);   // allocate array on device
    cudaMalloc((void **) &b_d, size);   // allocate array on device
    cudaMalloc((void **) &c_d, size);   // allocate array on device
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);   // copy data from h2d
    cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

    // do calculation on device: one block of Width x Width threads
    dim3 dimBlock(Width, Width);
    MatrixMulKernel<<< 1, dimBlock >>> (a_d, b_d, c_d, Width);

    cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);   // copy result from d2h
    printf("C[0] = %f\n", c_h[0]);

    free(a_h); free(b_h); free(c_h);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    return 0;
}
```
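The kernel above indexes only with threadIdx, so it is limited to matrices that fit into a single block. A possible multi-block variant is sketched below. `MatrixMulGridKernel`, `matMulOnDevice` and the 16 x 16 block size are assumptions for illustration, not the version shown in the talk.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical multi-block variant: row and column are derived from block and
// thread indices, so the matrix may be larger than one block.
__global__ void MatrixMulGridKernel(const float* A, const float* B, float* C, int Width)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= Width || col >= Width) return;   // threads outside the matrix do nothing

    float Celement = 0.0f;
    for (int k = 0; k < Width; ++k) {
        Celement += A[row * Width + k] * B[k * Width + col];
    }
    C[row * Width + col] = Celement;
}

// Hypothetical host helper: multiplies two Width x Width row-major matrices.
std::vector<float> matMulOnDevice(const std::vector<float>& A,
                                  const std::vector<float>& B, int Width)
{
    size_t size = (size_t)Width * Width * sizeof(float);
    std::vector<float> C((size_t)Width * Width, 0.0f);

    float *a_d, *b_d, *c_d;
    cudaMalloc((void**)&a_d, size);
    cudaMalloc((void**)&b_d, size);
    cudaMalloc((void**)&c_d, size);
    cudaMemcpy(a_d, A.data(), size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, B.data(), size, cudaMemcpyHostToDevice);

    const int TILE = 16;                        // 16 x 16 = 256 threads per block (assumed)
    dim3 dimBlock(TILE, TILE);
    dim3 dimGrid((Width + TILE - 1) / TILE, (Width + TILE - 1) / TILE);
    MatrixMulGridKernel<<<dimGrid, dimBlock>>>(a_d, b_d, c_d, Width);

    cudaMemcpy(C.data(), c_d, size, cudaMemcpyDeviceToHost);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    return C;
}
```

This sketch still reads every operand from global memory. It only removes the single-block size limit and makes no attempt at the hardware-specific tuning mentioned in the conclusions.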



#### nvcc: the CUDA C compiler

- Code is C-like
  - No C library functions can be used in device code, but the CUDA library replicates all math functions and basic malloc, memcpy etc.
  - No recursion
  - No variable-length parameter lists
  - No static variables
  - Just basic operators and flow control
- The compiler separates device and host code and compiles the device code.
- It does all register allocation etc.
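The split between host and device code is marked with function qualifiers. The sketch below is only an illustration with hypothetical names (`squared`, `hypotKernel`, `hypotOnDevice`): `__global__` marks a kernel, `__host__ __device__` asks the compiler to build a function for both sides, and unqualified functions are ordinary host code.

```cuda
#include <cuda_runtime.h>

// Compiled for both host and device: the compiler emits two versions.
__host__ __device__ float squared(float x)
{
    return x * x;
}

// Device code: sqrtf here is the CUDA math library version of the C function.
__global__ void hypotKernel(float a, float b, float* out)
{
    *out = sqrtf(squared(a) + squared(b));
}

// Host code: this part is passed on to the regular host compiler.
float hypotOnDevice(float a, float b)
{
    float result = 0.0f;
    float* out_d = nullptr;
    cudaMalloc((void**)&out_d, sizeof(float));
    hypotKernel<<<1, 1>>>(a, b, out_d);
    cudaMemcpy(&result, out_d, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(out_d);
    return result;
}
```

Assuming the code is saved as `example.cu` (a hypothetical file name) and given a `main`, it can be built with `nvcc example.cu -o example`.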



#### Memory models in CUDA



- Each thread has a set of registers and local variables.
- A thread block can share fast memory between all threads. Threads in a block can be synchronized.
- Threads in different blocks and grids can only share global device memory.
- The host process can only access the global memory area.
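The memory levels listed above can be seen together in a block-wise sum, sketched here under the assumption of a fixed block size of 256 threads. `blockSum`, `sumOnDevice` and `BLOCK_SIZE` are hypothetical names. Each thread keeps its indices in registers, the threads of a block cooperate through `__shared__` memory with `__syncthreads()`, and blocks hand their results back through global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256   // threads per block (assumed; must be a power of two here)

// Hypothetical kernel: each block sums its slice of the input in shared memory.
__global__ void blockSum(const float* in, float* blockSums, int n)
{
    __shared__ float cache[BLOCK_SIZE];           // fast memory shared by the block
    int tid = threadIdx.x;                        // lives in a register
    int i = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;          // load from global memory
    __syncthreads();                              // wait until the whole block has loaded

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            cache[tid] += cache[tid + stride];
        }
        __syncthreads();                          // finish this step before the next
    }
    if (tid == 0) {
        blockSums[blockIdx.x] = cache[0];         // blocks communicate via global memory
    }
}

// Hypothetical host helper: sums a non-empty vector, adding the per-block results on the host.
float sumOnDevice(const std::vector<float>& data)
{
    int n = (int)data.size();
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    std::vector<float> partial(blocks, 0.0f);

    float *in_d, *sums_d;
    cudaMalloc((void**)&in_d, n * sizeof(float));
    cudaMalloc((void**)&sums_d, blocks * sizeof(float));
    cudaMemcpy(in_d, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK_SIZE>>>(in_d, sums_d, n);

    cudaMemcpy(partial.data(), sums_d, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(in_d); cudaFree(sums_d);

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += partial[b];
    return total;
}
```

The final addition is done on the host because threads in different blocks cannot synchronize with each other inside the kernel.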



# Streams: parallel execution of kernels

- Kernel launches are asynchronous, and the memory copy functions have asynchronous variants (e.g. cudaMemcpyAsync).
- Streams can be used to group host/device memory IO and kernel execution into parallel flows.
- This overlaps IO with kernel execution, so the device is used while other IO is being performed.
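A minimal sketch of the stream idea, assuming two streams that each copy one half of an array to the device, process it and copy it back. The names `scale` and `runTwoStreams` and the sizes are hypothetical. The sketch uses current runtime API names such as `cudaDeviceSynchronize`, and page-locked host memory from `cudaMallocHost`, which asynchronous copies need in order to overlap with other work.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element by a factor.
__global__ void scale(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical host helper: returns true if every element was doubled.
bool runTwoStreams()
{
    const int nStreams = 2;
    const int chunk = 1 << 16;                 // elements handled per stream
    const int n = nStreams * chunk;

    float* host = nullptr;
    float* dev = nullptr;
    cudaMallocHost((void**)&host, n * sizeof(float));   // page-locked host memory
    cudaMalloc((void**)&dev, n * sizeof(float));
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Work inside one stream runs in order; different streams may overlap.
    for (int s = 0; s < nStreams; ++s) {
        int offset = s * chunk;
        cudaMemcpyAsync(dev + offset, host + offset, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dev + offset, chunk, 2.0f);
        cudaMemcpyAsync(host + offset, dev + offset, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();                   // wait for all streams to finish

    bool ok = true;
    for (int i = 0; i < n; ++i) {
        if (host[i] != 2.0f) ok = false;
    }

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

Whether the copies and the kernel really overlap depends on the device. The code is correct either way, since each stream keeps its own three steps in order.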



#### Conclusions

- Easy to learn (basics)
- Cheap to build test/play systems
- Works great for highly parallel, compute-intensive applications (image processing, video en-/decoding, neural nets, SVM)
- Algorithms need lots of optimization to fit hardware limitations
- Lots of applications are already available or have optimized plugins (Mathematica, MATLAB, Adobe Flash, neural network tools, etc.)



#### Showcase

[Screenshot: CUDA showcase listing ("Showing 1 - 15 of 1002"). Legible entries include gpuocelot, Multiple Back-Propagation source code, Fast Disk Encryption through GPGPU Acceleration, Cuda-Renderer 2009, and RankBoost Acceleration on both NVIDIA CUDA and ATI; some entries carry factors such as 50 x, 179 x and 600 x.]



# More info

- <u>http://www.nvidia.com/object/cuda_home_new.html</u>
- <u>http://developer.nvidia.com/object/gpucomputing.html</u>
- <u>http://www.drdobbs.com/architect/207200659</u>
- <u>http://courses.ece.illinois.edu/ece498/al/Syllabus.html</u>