# **CUDA-C**

## (Compute Unified Device Architecture-C): Un linguaggio di programmazione per schede GPGPU

William Spataro spataro@unical.it

### History

- 2001/2002 researchers see GPU as data-parallel coprocessor
  - The GPGPU field is born
- 2007 NVIDIA releases CUDA 1.0
  - CUDA Compute Uniform Device Architecture
  - GPGPU shifts to GPU Computing
- 2008 Khronos releases OpenCL specification
- 2014 Cuda 6.0

### History

- Nvidia creates CUDA to facilitate the development of parallel programs on GPUs (2007)
- The CUDA language is ANSI C extended with very few keywords for labeling data-parallel functions (kernels) and their associated data
- Nvidia technology benefits from massive economies of scale in the gaming market, CUDA-enabled cards are very inexpensive for the performance they provide



#### CUDA

- Scalable parallel programming model
- Minimal extensions to familiar C/C++ environment
- Heterogeneous serial-parallel computing

## Modello di esecuzione

- Un codice CUDA alterna porzioni di codice seriale, eseguito dalla CPU,e di codice parallelo, eseguito dalla GPU.
- Il codice parallelo viene lanciato, ad opera della CPU, sulla GPU come kernel.
   ≻La GPU esegue un solo kernel alla volta.
- Un kernel è organizzato in grids di blocks.
   ➢Ogni block contiene lo stesso numero di threads.
- Ogni block viene eseguito da un solo multiprocessore: non può essere spezzato su più SM, mentre più blocks possono risiedere ed essere eseguiti in parallelo dallo stesso multiprocessore.



# Gerarchia dei thread



Host = CPU Device = GPU

**Thread**: codice concorrente, eseguibile in parallelo ad altri threads su un device CUDA.

**Warp**: un gruppo di threads che possono essere eseguiti <u>fisicamente</u> in parallelo.

**Half-warp**: una delle 2 metà di un warp spesso eseguiti sullo stesso multiprocessore.

**Block**: un insieme di threads eseguiti sullo stesso Multiprocessore, e che quindi possono condividere memoria (stessa shared memory).

**Grid**: un insieme di thread blocks che eseguono un singolo kernel CUDA, in parallelismo logico, su una singola GPU.

**Kernel**: il codice CUDA che viene lanciato dalla CPU su una o più GPU.

## Esecuzione del codice



In ogni "Menipitden sod el del de la carte de la contra co

# Multidimensionalità degli IDs

Il codice parallelo viene lanciato, dalla CPU, sulla GPU, questa esegue un solo kernel alla volta.

La dimensione della griglia si misura in blocchi questi possono essere:

Block: 1-D o 2-D (3D da comp. capability 2.0 in poi)

La dimensione dei blocchi si misura in thread

Thread 1-D,2-D,3-D



## Organizzazione gerarchica della memoria

- **Register file**: area di memoria privata di ciascun thread (var. locali).
- Shared memory: accessibile a tutti i threads dello stesso block. Può essere usata sia come spazio privato che come spazio condiviso.
- Tutti i threads accedono alla medesima **global memory** (off-chip DRAM).
- Memorie read-only accessibili da tutti i threads: constant e texture memory.
   > dotate di cache locale in ogni SM
- Global, constant e texture memory sono memorie persistenti tra differenti lanci di kernel della stessa applicazione.
- Global memory bandwidth: 2 ordini di grandezza superiori della shared memory!



## G80 Implementation of CUDA Memories

### • Each thread can:

- Read/write per-thread registers
- Read/write per-thread local memory
- Read/write per-block shared memory
- Read/write per-grid global memory
- Read/only per-grid constant memory



# Extended C

- Declspecs
  - global, device, shared, local, constant
- Keywords
  - threadIdx, blockIdx
- Intrinsics
  - \_\_\_\_\_syncthreads
- Runtime API
  - Memory, symbol, execution management
- Function launch

```
device float filter[N];
global void convolve (float *image)
                                        {
  shared float region[M];
  . . .
 region[threadIdx] = image[i];
  syncthreads()
  image[j] = result;
}
// Allocate GPU memory
void *myimage = cudaMalloc(bytes)
```

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

# **Application Programming Interface**

- The API is an extension to the C programming language
- It consists of:
  - Language extensions
    - To target portions of the code for execution on the device
  - A runtime library split into:
    - A common component providing built-in vector types and a subset of the C runtime library in both host and device codes
    - A host component to control and access one or more devices from the host
    - A device component providing device-specific functions

Language Extensions: Built-in Variables

- dim3 gridDim;
  - Dimensions of the grid in blocks
- dim3 blockDim;
  - Dimensions of the block in threads
- dim3 blockIdx;
  - Block index within the grid
- dim3 threadIdx;
  - Thread index within the block

# Common Runtime Component: Mathematical Functions

- pow, sqrt, cbrt, hypot
- exp, exp2, expm1
- log, log2, log10, log1p
- sin, cos, tan, asin, acos, atan, atan2
- sinh, cosh, tanh, asinh, acosh, atanh
- ceil, floor, trunc, round
- Etc.
  - When executed on the host, a given function uses the C runtime implementation if available
  - These functions are only supported for scalar types, not vector types

# Device Runtime Component: Mathematical Functions

 Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. sin(x))



# **CUDA** Function Declarations

|                                 | Executed on the: | Only callable from the: |
|---------------------------------|------------------|-------------------------|
|                                 | device           | device                  |
|                                 | device           | host                    |
| <pre>hostfloat HostFunc()</pre> | host             | host                    |

- \_\_\_\_\_\_\_ defines a kernel function
  - Must return void
- <u>device</u> and <u>host</u> can be used together

## Calling a Kernel Function – Thread Creation

• A kernel function must be called with an execution configuration:

```
__global__ void KernelFunc(...);
```

- dim3 DimGrid(100, 50); // 5000 thread blocks
- dim3 DimBlock(4, 8, 8); // 256 threads per block

```
size_t SharedMemBytes = 64; // 64 bytes of shared
   memory
```

KernelFunc<<< DimGrid, DimBlock, SharedMemBytes
>>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.0 on, explicit synch needed for blocking

# Device Runtime Component: Synchronization Function

- void \_\_syncthreads();
- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block

# Confronto codice seriale e parallelo

#### Programma CPU

```
void add_matrix
( float* a, float* b, float* c, int N ) {
    int index,
    for ( int i = 0; i < N; ++i )
        for ( int j = 0; j < N; ++j ) {
            mdcm=i+j*N:
            c[index] = a[index] + b[index];
    }
}
int main() {
    add_matrix( a, b, c, N );
}</pre>
```

#### Programma CUDA

```
__global__ add_matrix
(float* a, float* b, float* c, int N) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int index = i + j*N;
if ( i < N && j < N )
c[index] = a[index] + b[index];
```

int main() {
 dim3 dimBlock( blocksize, blocksize );
 dim5 dimGrid( N/dimBlock.x, N/dimBlock.y );
 add\_matrix<<<dimGrid, dimBlock>>>( a, b, c, N );

Il ciclo for è sostituito da una griglia implicita



### NVIDIA GPU Architecture Fermi GF100





#### SM Multiprocessor

- 32 CUDA Cores per SM (512 total)
- Direct load/store to memory
  - Usual linear sequence of bytes
  - High bandwidth (Hundreds GB/sec)
- 64KB of fast, on-chip RAM
  - Software or hardware-managed
  - Shared amongst CUDA cores
  - Enables thread communication

**SIMT** (Single Instruction Multiple Thread) **execution** 

| Instruction Cache      |                       |          |      |  |  |
|------------------------|-----------------------|----------|------|--|--|
| Scheduler              |                       |          |      |  |  |
| Dispatch               |                       | Dispatch |      |  |  |
| Register File          |                       |          |      |  |  |
| Core                   | Core                  | Core     | Core |  |  |
| Core                   | Core                  | Core     | Core |  |  |
| Core                   | Core                  | Core     | Core |  |  |
| Core                   | Core                  | Core     | Core |  |  |
| Core                   | Core                  | Core     | Core |  |  |
| Core                   | Core                  | Core     | Core |  |  |
| Core                   | Core                  | Core     | Core |  |  |
| Core                   | Core                  | Core     | Core |  |  |
| Load                   | Load/Store Units x 16 |          |      |  |  |
| Special Func Units x 4 |                       |          |      |  |  |
| Interconnect Network   |                       |          |      |  |  |
| 64K Configurable       |                       |          |      |  |  |
| Cache/Shared Mem       |                       |          |      |  |  |
| Uniform Cache          |                       |          |      |  |  |
|                        |                       |          |      |  |  |

#### **Compute Capability**



#### The compute capability of a device describes its architecture, e.g.

- Number of SMs and registers
- Sizes of memories
- Features & capabilities

| Compute<br>Capability | Selected Features<br>(see CUDA C Programming Guide for complete list)                         | Tesla models |
|-----------------------|-----------------------------------------------------------------------------------------------|--------------|
| 1.0                   | Fundamental CUDA support                                                                      | 870          |
| 1.3                   | Double precision, improved memory accesses, atomics                                           | 10-series    |
| 2.0                   | Caches, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series    |
| 3.0                   | Dynamic parallelism                                                                           | K-series     |

#### We will concentrate on Fermi devices

Compute Capability >= 2.0



### **Heterogeneous computing**





### **Heterogeneous computing**



#### **Simple Processing Flow**





#### **Simple Processing Flow**





DRAM

#### **Simple Processing Flow**







```
int main(void) {
    printf("Hello World!\n");
    return 0;
}
```

```
Standard C that runs on the host
```

NVIDIA compiler (nvcc) can be used to compile programs with no *device* code

```
$ nvcc
hello_world.
cu
$ a.out
Hello World!
$
```

**Output:** 



```
__global___void mykernel(void) {
}
int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
```

Two new syntactic elements...



global void mykernel(void) {
}

- CUDA C/C++ keyword \_\_global\_\_ indicates a function that:
  - Runs on the device
  - Is called from host code

nvcc separates source code into host and device components

- Device functions (e.g. mykernel()) processed by NVIDIA compiler
- Host functions (e.g. main()) processed by standard host compiler
  - gcc, cl.exe



mykernel<<<1,1>>>>();

- Triple angle brackets mark a call from host code to device code
  - Also called a "kernel launch"
- That's all that is required to execute a function on the GPU!



```
_global__ void mykernel(void) {
}
int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

    Cutput:
    S nvcc
    hello.cu
    $ a.out
    Hello World!
```

\$

mykernel() does nothing....

### Parallel Programming in CUDA C/C++



- GPU computing is about massive parallelism
- We need a more interesting example...
- We'll start by adding two integers and build up to vector addition



#### **Addition on the Device**



A simple kernel to add two integers

As before \_\_global\_\_\_ is a CUDA C/C++ keyword meaning

- add() will execute on the device
- add() will be called from the host

#### **Addition on the Device**



Note that we use pointers for the variables

```
__global___void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
```

- add() runs on the device, so a, b and c must point to device memory
- We need to allocate memory on the GPU

### Memory Management

#### Host and device memory are separate entities

- Device pointers point to GPU memory
   May be passed to/from host code
   May not be dereferenced in host code
- Host pointers point to CPU memory
   May be passed to/from device code
   May not be dereferenced in device code





#### Simple CUDA API for handling device memory

- cudaMalloc(), cudaFree(), cudaMemcpy()
- Similar to the C equivalents malloc(), free(), memcpy()



#### Addition on the Device: add()



Returning to our add() kernel

```
__global___void add(int *a, int *b, int *c) {
     *c = *a + *b;
}
```

Let's take a look at main()...

#### Addition on the Device: main()



// Allocate space for device copies of a, b, c
cudaMalloc((void \*\*)&d\_a, size);
cudaMalloc((void \*\*)&d\_b, size);
cudaMalloc((void \*\*)&d\_c, size);

// Setup input values
a = 2;
b = 7;

#### Addition on the Device: main()



// Copy inputs to device

cudaMemcpy(d\_a, &a, size, cudaMemcpyHostToDevice); cudaMemcpy(d b, &b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU
add<<<1,1>>>(d\_a, d\_b, d\_c);

// Copy result back to host
cudaMemcpy(&c, d c, size, cudaMemcpyDeviceToHost);

// Cleanup
cudaFree(d\_a); cudaFree(d\_b); cudaFree(d\_c);
return 0;

}

#### Moving to Parallel



#### GPU computing is about massive parallelism

So how do we run code in parallel on the device?

add<<< 1, 1 >>>();

add<<< N, 1 >>>();

Instead of executing add() once, execute N times in parallel

### **CUDA logical architecture**



- A kernel is launched as a grid of blocks of threads
  - blockIdx and threadIdx can represent up to 3 dimensions
- Built-in variables:
  - threadIdx
  - blockIdx
  - blockDim
  - gridDim



#### **Vector Addition on the Device**



- With add() running in parallel we can do vector addition
- Terminology: each parallel invocation of add() is referred to as a block
  - The set of blocks is referred to as a grid
  - Each invocation can refer to its block index using blockIdx.x

```
__global___void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
```

By using blockIdx.x to index into the array, each block handles a different index

#### **Vector Addition on the Device**



```
__global___void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
```

On the device, each block can execute in parallel:





Returning to our parallelized add() kernel

```
__global___void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
```

Let's take a look at main()...

#### Vector Addition on the Device: main()



// Alloc space for device copies of a, b, c
cudaMalloc((void \*\*)&d\_a, size);
cudaMalloc((void \*\*)&d\_b, size);
cudaMalloc((void \*\*)&d c, size);

// Alloc space for host copies of a, b, c and setup input values
a = (int \*)malloc(size); random\_ints(a, N);
b = (int \*)malloc(size); random\_ints(b, N);
c = (int \*)malloc(size);

#### Vector Addition on the Device: main()



// Copy inputs to device

cudaMemcpy(d\_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d\_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N blocks
add<<<N,1>>>(d\_a, d\_b, d\_c);

// Copy result back to host
cudaMemcpy(c, d\_c, size, cudaMemcpyDeviceToHost);

```
// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
```

}

### **CUDA logical architecture**



- A kernel is launched as a grid of blocks of threads
  - blockIdx and threadIdx can represent up to 3 dimensions
- Built-in variables:
  - threadIdx
  - blockIdx
  - blockDim
  - gridDim

#### Device

Thread

(0,2,0)

Thread

(1,2,0)



Thread

(2,2,0)

Thread

(3,2,0)

Thread

(4, 2, 0)

## **Threads Hierarchy**



all threads execute the same sequential program

#### Thread block is a group of threads that can:

- Synchronize their execution
- Communicate via shared memory



Thread t





- Terminology: a block can be split into parallel threads
- Let's change add() to use parallel threads instead of parallel blocks

```
__global___void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
```

- We use threadIdx.x instead of blockIdx.x
- Need to make one change in main()...

### Vector Addition Using Threads: main()



```
#define N 512
int main(void) {
    int *a, *b, *c;
    int *d_a, *d_b, *d_c;
    int size = N * sizeof(int);
```

```
// host copies of a, b, c
// device copies of a, b, c
```

```
// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
```

```
// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
```

### Vector Addition Using Threads: main()



// Copy inputs to device

cudaMemcpy(d\_a, a, size, cudaMemcpyHostToDevice); cudaMemcpy(d b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d\_a, d\_b, d\_c);

// Copy result back to host
cudaMemcpy(c, d c, size, cudaMemcpyDeviceToHost);

#### // Cleanup

```
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
```

}

### Combining Blocks <u>and</u> Threads



We've seen parallel vector addition using:

- Many blocks with one thread each
- One block with many threads
- Adapting vector addition to use both *blocks* and *threads*
- Data indexing...

# Indexing Arrays with Blocks and Threads



• No longer as simple as using blockIdx.x and threadIdx.x

Consider indexing an array with one element per thread (8 threads/block)



With M threads/block a unique index for each thread is given by: int index = blockIdx.x \* M + threadIdx.x;

### Indexing Arrays: Example



Which thread will operate on the red element?





int index = threadIdx.x + blockIdx.x \* M; = 5 + 2 \* 8; = 21;



Use the built-in variable blockDim.x for threads per block int index = threadIdx.x + blockIdx.x \* blockDim.x;

What changes need to be made in main()?

### Addition with Blocks and Threads: main()



```
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
    int *a, *b, *c; /
    int *d_a, *d_b, *d_c; /
    int size = N * sizeof(int);
```

```
// host copies of a, b, c
// device copies of a, b, c
```

```
// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);
```

```
// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);
```

### Addition with Blocks and Threads: main()



// Copy inputs to device
cudaMemcpy(d\_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU
add<<<N/THREADS\_PER\_BLOCK, THREADS\_PER\_BLOCK>>>(d\_a, d\_b, d\_c);

// Copy result back to host

cudaMemcpy(c, d\_c, size, cudaMemcpyDeviceToHost);

#### // Cleanup

```
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
```

}

### Handling Arbitrary Vector Sizes



- Typical problems are not friendly multiples of blockDim.x
- Avoid accessing beyond the end of the arrays:

```
__global___void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
    }
Update the kernel launch:</pre>
```

```
add<<<(N + M-1) / M, M >>>(d_a, d_b, d_c, N);
```

### Kernel with 2D Indexing



```
__global___void kernel( int *a, int dimx, int dimy )
{
    int ix = blockldx.x*blockDim.x + threadIdx.x;
    int iy = blockldx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;
    a[idx] = a[idx]+1;
}
```



### Kernel with 2D Indexing

```
global___void kernel( int *a, int dimx, int dimy )
{
    int ix = blockldx.x*blockDim.x + threadIdx.x;
    int iy = blockldx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;
    a[idx] = a[idx]+1;
```

#### int main()

```
int dimx = 16;
int dimy = 16;
int num_bytes = dimx*dimy*sizeof(int);
```

int \*d\_a=0, \*h\_a=0; // device and host pointers

```
h_a = (int*)malloc(num_bytes);
cudaMalloc( (void**)&d_a, num_bytes );
```

```
dim3 grid, block;
block.x = 4;
block.y = 4;
grid.x = dimx / block.x;
grid.y = dimy / block.y;
```

kernel<<<grid, block>>>( d\_a, dimx, dimy );

```
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
```

```
•••••
```





#### **Coordinating Host & Device**



#### Kernel launches are asynchronous

Control returns to the CPU immediately

#### CPU needs to synchronize before consuming the results

| cudaMemcpy()                       | Blocks the CPU until the copy is complete                    |  |
|------------------------------------|--------------------------------------------------------------|--|
|                                    | Copy begins when all preceding CUDA calls have completed     |  |
| cudaMemcpyAsync()                  | ) Asynchronous, does not block the CPU                       |  |
| <pre>cudaDeviceSynchronize()</pre> | Blocks the CPU until all preceding CUDA calls have completed |  |

### **Reporting Errors**



All CUDA API calls return an error code (cudaError\_t)

- Error in the API call itself
   OR
- Error in an earlier asynchronous operation (e.g. kernel)
- Get the error code for the last error:

cudaError\_t cudaGetLastError(void)

Get a string to describe the error:

char **\*cudaGetErrorString**(cudaError\_t)

printf("%s\n", cudaGetErrorString(cudaGetLastError()));

#### **Device Management**



#### Application can query and select GPUs

cudaGetDeviceCount(int \*count)
cudaSetDevice(int device)
cudaGetDevice(int \*device)
cudaGetDeviceProperties(cudaDeviceProp \*prop, int device)

#### Multiple threads can share a device

#### A single thread can manage multiple devices

cudaSetDevice(i) to select current device
cudaMemcpy(...) for peer-to-peer copies<sup>†</sup>

<sup>+</sup> requires OS and device support



# A Simple Running Example Matrix Multiplication

- Let's now see a simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs
  - Leave shared memory usage until later
  - Local, register usage
  - Thread ID usage
  - Memory data transfer API between host and device
  - Assume square matrix for simplicity

# Programming Model: Square Matrix Multiplication Example

- P = M \* N of size WIDTH x WIDTH
- Without tiling:
  - One thread calculates one element of P
  - Needed parts of M and N are loaded
     WIDTH times from global memory

М



# Memory Layout of a Matrix in C

| M <sub>0,0</sub> | M <sub>1,0</sub> | M <sub>2,0</sub> | М <sub>3,0</sub> |
|------------------|------------------|------------------|------------------|
| M <sub>0,1</sub> | M <sub>1,1</sub> | M <sub>2,1</sub> | M <sub>3,1</sub> |
| M <sub>0,2</sub> | M <sub>1,2</sub> | M <sub>2,2</sub> | M <sub>3,2</sub> |
| M <sub>0,3</sub> | M <sub>1,3</sub> | M <sub>2,3</sub> | M <sub>3,3</sub> |

Μ



```
Step 2: Input Matrix Data Transfer
(Host-side Code)
```

void MatrixMulOnDevice(float\* M, float\* N, float\* P, int Width)

```
int size = Width * Width * sizeof(float);
float* Md, Nd, Pd;
```

 // Allocate and Load M, N to device memory cudaMalloc(&Md, size); cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

cudaMalloc(&Nd, size); cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

```
// Allocate P on the device
cudaMalloc(&Pd, size);
```

## Step 3: Output Matrix Data Transfer (Host-side Code)

2. // Kernel invocation code – to be shown later

 // Read P from the device cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

// Free device matrices
cudaFree(Md); cudaFree(Nd); cudaFree (Pd);
}

## Step 4: Kernel Function

// Matrix multiplication kernel – per thread code

\_global\_\_ void MatrixMulKernel(float\* Md, float\* Nd, float\* Pd, int Width)

// Pvalue is used to store the element of the matrix
// that is computed by the thread
float Pvalue = 0;

#### Step 4: Kernel Function (cont.)

```
for (int k = 0; k < Width; ++k) {
  float Melement = Md[threadIdx.y*Width+k];
  float Nelement = Nd[k*Width+threadIdx.x];
  Pvalue += Melement * Nelement;</pre>
```

Pd[threadIdx.y\*Width+threadIdx.x] = Pvalue;

Md



k

Step 5: Kernel Invocation (Host-side Code)

// Setup the execution configuration dim3 dimGrid(1, 1); // one block only! dim3 dimBlock(Width, Width);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

## Only One Thread Block Used !

- One Block of threads compute matrix Pd
  - Each thread computes one element of Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Perform one multiply and addition for each pair of Md and Nd elements
  - Compute to global memory access ratio\* close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block ! (e.g. 1024 only!)



\*CGMA (Computation to Global Memory Access) index: around 20/30:1 to be REALLY good!

#### Step 7: Handling Arbitrary Sized Square Matrices

- Have each 2D thread block to compute a (TILE\_WIDTH)<sup>2</sup> sub-matrix (tile) of the result matrix
  - Each has (TILE\_WIDTH)<sup>2</sup> threads
- Generate a 2D Grid of (WIDTH/TILE\_WIDTH)<sup>2</sup> blocks



### Matrix Multiplication Using Multiple Blocks

TILE WIDTH-

- Break-up Pd into tiles
- Each block calculates one tile
  - Each thread calculates one element

tv

- Block size equal tile size

0

2

by 1



#### A Small Example



#### A Small Example: Multiplication



## Revised Matrix Multiplication Kernel using Multiple Blocks

\_\_global\_\_\_void MatrixMulKernel(float\* Md, float\* Nd, float\* Pd, int Width)

```
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column idenx of Pd and N
int Col = blockIdx.x*TILE WIDTH + threadIdx.x;
```

```
float Pvalue = 0;
// each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];</pre>
```

```
Pd[Row*Width+Col] = Pvalue;
```

Revised Step 5: Kernel Invocation (Host-side Code)

// Setup the execution configuration dim3 dimGrid(Width/TILE\_WIDTH, Width/TILE\_WIDTH); dim3 dimBlock(TILE\_WIDTH, TILE\_WIDTH);

// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

#### Review: CUDA Thread Block

- All threads in a block execute the same kernel program (SPMD)
- Programmer declares block:
  - Block size 1 to 1024 concurrent threads
  - Block shape 1D, 2D, or 3D
  - Block dimensions in threads
- Threads have thread id numbers within block
  - Thread program uses thread id to select work and address shared data
- Threads in the same block share data and <u>synchronize</u> while doing their share of the work
- Threads in different blocks cannot cooperate:
  - Each block can execute in any order relative to other blocks!

#### **CUDA Thread Block**



## **Transparent Scalability**

 Since threads in different blocks <u>cannot</u> perform barrier synchronization with each other, the runtime system is free to assigns blocks to any processor at any time, depending on hardware

A kernel scales across any number of parallel processors



### G80 CUDA mode – A Review

- Processors execute computing threads
- New operating mode/HW interface for computing



ECE498AL, University of Illinois, Urbana-Champaign

#### G80 Example: Executing Thread Blocks



Threads are assigned to Streaming

Blocks

Multiprocessors in block granularity

- Up to 8 (physical) blocks to each SM as resource allows
- SM in G80 can take up to 768 (physical) threads
  - Could be 256 (threads/block) \* 3 blocks
  - Or 128 (threads/block) \* 6 blocks, etc.
- Threads run concurrently
  - SM maintains thread/block id #s
  - SM manages/schedules thread execution

## G80 Example: Warps and Thread Scheduling

- Each Block is executed as 32-thread Warps
  - An implementation decision, not part of the CUDA programming model
  - Warps are scheduling units in SM
  - <u>SIMD</u> !
- If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?
  - Each Block is divided into 256/32 = 8 Warps
  - There are 8 \* 3 = 24 Warps





# G80 Example: Thread Scheduling (Cont.)

- Each SM implements zero-overhead warp scheduling:
  - Warps whose next instruction has its operands ready for consumption are eligible for execution (i.e. Cuda runtime system maintains a list of warp blocks...)
  - (Latency tollerance) Eligible Warps are selected for execution on a prioritized scheduling policy (ex: Post office queue)
  - All threads in a warp execute the same instruction when selected (i.e. <u>SIMD fashion</u>)

## **G80 Block Granularity Considerations** (max 8 blocks – 768 threads, which ever comes first!)

- For Matrix Multiplication using multiple blocks, should I use 8X8, 16X16 or 32X32 blocks?
  - For 8X8, we have 64 threads per Block. Since <u>each SM</u> can take up to 768\* threads, there are 12 Blocks. However, each SM can only take up to 8 Blocks, only 8\*64=512 threads will go into each SM!



For 16X16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve <u>full</u> <u>capacity</u> unless other resource considerations overrule.

 For 32X32, we have 1024 threads per Block. Not even one can fit into an SM! (max 768!)

\* Physical threads for G80 ! 1024 virtual threads per block from Compute capability 2.0 on...



#### 1.Cuda occupancy calculator!

#### 2.cudaGetDeviceProperties()

## **CUDA Variable Type Qualifiers**

| Memory   | Scope                     | Lifetime                         |
|----------|---------------------------|----------------------------------|
| local    | thread                    | thread                           |
| shared   | block                     | block                            |
| global   | grid                      | application                      |
| constant | grid                      | application                      |
| -        | local<br>shared<br>global | localthreadsharedblockglobalgrid |

- <u>device</u> is optional when used with
   <u>local</u>, <u>shared</u>, or <u>constant</u>
- Automatic variables without any qualifier reside in a register
  - Except arrays that reside in local memory



## Global memory access efficiency

- Although having many threads available for execution can theoretically tolerate long memory access latency, one can easily run into a situation where traffic congestion (ie. <u>too</u> <u>much global memory accesses</u>) prevents all but few threads from making progress, thus rendering some SM idle!
- A common strategy for reducing global memory traffic (i.e. increasing the number of floating-point operations performed for each access to the global memory) is to partition the data into subsets called tiles such that each tile fits into the shared memory and the kernel computations on these tiles can be done independently of each other.
- In the simplest form, the tile dimensions equal those of the block.

## A Common Programming Strategy

- Global memory resides in device memory (DRAM) much slower access than shared memory (up to 2 order magnitude!)
- So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
  - Partition data into subsets that fit into shared memory
  - Handle each data subset with one thread block by:
    - Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism (i.e. <u>cooperation</u>)
    - Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    - Copying results from shared memory to global memory

## A Common Programming Strategy (Cont.)

- Constant memory also resides in device memory (DRAM) - much slower access than shared memory
  - But... cached!
  - Highly efficient access for read-only data
- Carefully divide data according to access patterns
  - R/Only → constant memory (very fast if in cache)
  - R/W shared within Block  $\rightarrow$  shared memory (very fast)
  - R/W within each thread  $\rightarrow$  registers (very fast)
  - R/W inputs/results  $\rightarrow$  global memory (very slow)

### Matrix Multiplication using Shared Memory

## Review: Matrix Multiplication Kernel using Multiple Blocks

\_global\_\_\_void MatrixMulKernel(float\* Md, float\* Nd, float\* Pd, int Width)

```
// Calculate the row index of the Pd element and M
int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
// Calculate the column idenx of Pd and N
int Col = blockIdx.x*TILE WIDTH + threadIdx.x;
```

```
float Pvalue = 0;
// each thread computes one element of the block sub-
matrix
for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];</pre>
```

```
Pd[Row*Width+Col] = Pvalue;
```

## Idea: Use Shared Memory to reuse global memory data

- Each input element is read by Width threads.
- Load each element into Shared Memory and have <u>several</u> threads use the local version to reduce the memory bandwidth
  - Tiled algorithms



## **Tiled Multiply**

 Break up the execution of the kernel into phases so that the data accesses in each phase is focused on one subset (tile) of Md and Nd

0

2

by 1

tv

TILE WIDTH

TILE WIDTH



#### A Small Example



#### Every Md and Nd Element is used <u>exactly</u> <u>twice</u> in generating a 2X2 tile of P

|                 | P <sub>0,0</sub>                    | P <sub>1,0</sub>                    | P <sub>0,1</sub>                    | P <sub>1,1</sub>                    |
|-----------------|-------------------------------------|-------------------------------------|-------------------------------------|-------------------------------------|
|                 | thread <sub>0,0</sub>               | thread <sub>1,0</sub>               | thread <sub>0,1</sub>               | thread <sub>1,1</sub>               |
|                 | M <sub>0,0</sub> * N <sub>0,0</sub> | $M_{0,0} * N_1$                     | M <sub>0,1</sub> * N <sub>0,0</sub> | $M_{0,1} * N_{1}$                   |
| Access<br>order | _                                   |                                     |                                     | M <sub>1,1</sub> * N <sub>1,1</sub> |
|                 | M <sub>2,0</sub> * N <sub>0,2</sub> | M <sub>2,0</sub> * N <sub>1,2</sub> | M <sub>2,1</sub> * N <sub>0,2</sub> | M <sub>2,1</sub> * N <sub>1,2</sub> |
|                 | M <sub>3,0</sub> * N <sub>0,3</sub> | M <sub>3,0</sub> * N <sub>1,3</sub> | M <sub>3,1</sub> * N <sub>0,3</sub> | M <sub>3,1</sub> * N <sub>1,3</sub> |

## Breaking Md and Nd into Tiles

- Break up the inner product loop of each thread into phases
- At the beginning of each phase, load the Md and Nd elements that <u>everyone needs during</u> <u>the phase</u> into shared memory
- Everyone access the Md and Nd elements from the shared memory during the phase



## Each phase of a Thread Block uses one tile from Md and one from Nd

|                  | Phase 1                                      |                                              |                                                                                                                 | Phase 2                                      |                                              |                                                                                                         |
|------------------|----------------------------------------------|----------------------------------------------|-----------------------------------------------------------------------------------------------------------------|----------------------------------------------|----------------------------------------------|---------------------------------------------------------------------------------------------------------|
| Τ <sub>0,0</sub> | Md <sub>0,0</sub>                            | Nd <sub>0,0</sub>                            | PValue <sub>0,0</sub> +=                                                                                        | Md <sub>2,0</sub>                            | Nd <sub>0,2</sub>                            | $PValue_{0,0} +=$                                                                                       |
|                  | ↓                                            | ↓                                            | Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +                                                                        | ↓                                            | ↓                                            | Mds <sub>0,0</sub> *Nds <sub>0,0</sub> +                                                                |
|                  | Mds <sub>0,0</sub>                           | Nds <sub>0,0</sub>                           | Mds <sub>1,0</sub> *Nds <sub>0,1</sub>                                                                          | Mds <sub>0,0</sub>                           | Nds <sub>0,0</sub>                           | Mds <sub>1,0</sub> *Nds <sub>0,1</sub>                                                                  |
| Τ <sub>1,0</sub> | Md <sub>1,0</sub>                            | Nd <sub>1,0</sub>                            | PValue <sub>1,0</sub> +=                                                                                        | Md <sub>3,0</sub>                            | Nd <sub>1,2</sub>                            | PValue <sub>1,0</sub> +=                                                                                |
|                  | ↓                                            | ↓                                            | Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +                                                                        | ↓                                            | ↓                                            | Mds <sub>0,0</sub> *Nds <sub>1,0</sub> +                                                                |
|                  | Mds <sub>1,0</sub>                           | Nds <sub>1,0</sub>                           | Mds <sub>1,0</sub> *Nds <sub>1,1</sub>                                                                          | Mds <sub>1,0</sub>                           | Nds <sub>1,0</sub>                           | Mds <sub>1,0</sub> *Nds <sub>1,1</sub>                                                                  |
| Т <sub>0,1</sub> | Md <sub>0,1</sub><br>↓<br>Mds <sub>0,1</sub> | Nd <sub>0,1</sub><br>↓<br>Nds <sub>0,1</sub> | PdValue <sub>0,1</sub> +=<br>MdS <sub>0,1</sub> *NdS <sub>0,0</sub> +<br>MdS <sub>1,1</sub> *NdS <sub>0,1</sub> | Md <sub>2,1</sub><br>↓<br>Mds <sub>0,1</sub> | Nd <sub>0,3</sub><br>↓<br>Nds <sub>0,1</sub> | $\begin{array}{l} PdValue_{0,1} \ += \\ Mds_{0,1}^* Nds_{0,0} \ + \\ Mds_{1,1}^* Nds_{0,1} \end{array}$ |
| T <sub>1,1</sub> | Md <sub>1,1</sub>                            | Nd <sub>1,1</sub>                            | PdValue <sub>1,1</sub> +=                                                                                       | Md <sub>3,1</sub>                            | Nd <sub>1,3</sub>                            | PdValue <sub>1,1</sub> +=                                                                               |
|                  | ↓                                            | ↓                                            | Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +                                                                        | ↓                                            | ↓                                            | Mds <sub>0,1</sub> *Nds <sub>1,0</sub> +                                                                |
|                  | Mds <sub>1,1</sub>                           | Nds <sub>1,1</sub>                           | Mds <sub>1,1</sub> *Nds <sub>1,1</sub>                                                                          | Mds <sub>1,1</sub>                           | Nds <sub>1,1</sub>                           | Mds <sub>1,1</sub> *Nds <sub>1,1</sub>                                                                  |

```
Tiled Matrix Multiplication Kernel
 _global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
   shared float Mds[TILE WIDTH][TILE WIDTH];
    shared float Nds[TILE WIDTH][TILE WIDTH];
3.
   int bx = blockIdx.x; int by = blockIdx.y;
   int tx = threadIdx.x; int ty = threadIdx.y;
4.
// Identify the row and column of the Pd element to work on
5. int Row = by * TILE WIDTH + ty;
   int Col = bx * TILE WIDTH + tx;
6.
   float Pvalue = 0;
7.
// Loop over the Md and Nd tiles required to compute the Pd element (#phases)
8. for (int m = 0; m < Width/TILE WIDTH; ++m) {
  Collaborative loading of Md and Nd tiles into shared memory
Mds[ty][tx] = Md[Row*Width + (m*TILE WIDTH + tx)];
9.
10.
   Nds[ty][tx] = Nd[(m*TILE WIDTH + ty)*Width + Col];
       syncthreads();
11.
12.
      for (int k = 0; k < TILE WIDTH; ++k)
13.
       Pvalue += Mds[ty][k] * Nds[k][tx];
14.
       syncthreads();
15. Pd[Row*Width + Col] = Pvalue;
}
```

## CUDA Code – Kernel Execution Configuration

// Setup the execution configuration

dim3 dimBlock(TILE WIDTH, TILE WIDTH);

dim3 dimGrid(Width / TILE\_WIDTH,

Width / TILE\_WIDTH);

#### First-order Size Considerations in G80

- Each thread block should have many threads
  - Global memory accesses reduced by N = #tile width !
  - TILE\_WIDTH of 16 gives 16\*16 = 256 threads
- There should be many thread blocks
  - A 1024\*1024 Pd gives 64\*64 = 4096 Thread Blocks
  - TILE\_WIDTH of 16 gives each SM 3 blocks, 768 threads (full capacity)
- Each thread block performs 2\*256 = 512 float loads from global memory for 256 \* (2\*16) = 8,192 mul/add operations.

inner product

- Memory bandwidth no longer a limiting factor

## **Tiled Multiply**

by

k.

TILE WIDTH

m

TILE WIDTH

- Each block computes one square sub-matrix Pd<sub>sub</sub> of size TILE\_WIDTH
- Each thread computes one element of Pd<sub>sub</sub>

0

2

by 1

tv



# Memory as a limiting factor to parallelism

- The limited amount of CUDA shared memory (e.g. 16KB) limits the number of threads that can simultaneously reside in the SM!
- For the matrix multiplication example, shared memory <u>can become</u> a limiting factor:
- TILE\_WIDTH = 16, so each block requires 16x16x4 = 1kB of storage for Mds + 1kB for Nds
  - 2kB of shared memory per block
- The 16-kB shared memory allows 8 blocks to simultaneously reside in an SM. Ok!
- But the maximum number of threads per SM is 1024 (for Tesla T10)
  - For 1024\*1024 matrix only 1024/256 = 4 blocks are allowed in each SM
  - only  $4 \times 2kB = 8kB$  of the shared memory will be used.

#### Hint: Use occupancy calculator

```
Tiled Matrix Multiplication Kernel
 _global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
   shared float Mds[TILE WIDTH][TILE WIDTH];
    shared float Nds[TILE WIDTH][TILE WIDTH];
3.
   int bx = blockIdx.x; int by = blockIdx.y;
   int tx = threadIdx.x; int ty = threadIdx.y;
4.
// Identify the row and column of the Pd element to work on
5. int Row = by * TILE WIDTH + ty;
   int Col = bx * TILE WIDTH + tx;
6.
   float Pvalue = 0;
7.
// Loop over the Md and Nd tiles required to compute the Pd element
8. for (int m = 0; m < Width/TILE WIDTH; ++m) {
  Collaborative loading of Md and Nd tiles into shared memory
Mds[ty][tx] = Md[Row*Width + (m*TILE WIDTH + tx)];
9.
10.
   Nds[ty][tx] = Nd[(m*TILE WIDTH + ty)*Width + Col];
       syncthreads();
11.
12.
      for (int k = 0; k < TILE WIDTH; ++k)
      Pvalue += Mds[ty][k] * Nds[k][tx];
13.
14.
       syncthreads();
15. Pd[Row*Width + Col] = Pvalue;
}
```

### **Tiling Size Effects**



## Summary- Typical Structure of a CUDA Program

- Global variables declaration
  - \_\_host\_\_
  - \_\_device\_\_... \_global\_\_, \_\_constant\_\_, \_\_texture\_\_
- Function prototypes
  - \_\_global\_\_ void kernelOne(...)
  - float handyFunction(...)
- Main ()
  - allocate memory space on the device cudaMalloc(&d\_GlbIVarPtr, bytes)
  - transfer data from host to device cudaMemCpy(d\_GlblVarPtr, h\_Gl...)

repeat

needed

as

- execution configuration setup
- kernel call kernelOne<<<execution configuration>>>( args... );
- transfer results from device to host cudaMemCpy(h\_GlblVarPtr,...)
- optional: compare against golden (host computed) solution
- Kernel void kernelOne(type args,...)
  - variables declaration \_\_local\_\_, \_\_shared\_\_
    - automatic variables transparently assigned to registers or local memory
  - syncthreads()...
- Other functions
  - float handyFunction(int inVar...);