# GPU ARCHITECTURES AND NEW PROGRAMMING MODEL FEATURES

Nikolay Sakharnykh, 8/1/2016



### HOW GPU ACCELERATION WORKS



### **HETEROGENEOUS ARCHITECTURES**

Memory hierarchy





### **GPU ARCHITECTURE**



### **GPU SM ARCHITECTURE**

#### Kepler SM





### **GPU SM ARCHITECTURE**

#### Maxwell SM





### **GPU SM ARCHITECTURE**

#### Pascal SM



56 SMs on Tesla P100



# LOW LATENCY OF HIGH THROUGHPUT?

CPU architecture must minimize latency within each thread

GPU architecture hides latency with computation from other threads (warps)



### **ACCELERATOR FUNDAMENTALS**

Must expose enough parallelism to saturate the GPU

Accelerator threads are slower than CPU threads

Accelerators have orders of magnitude more threads

| t0  | t1  | t2  | t3  |
|-----|-----|-----|-----|
| t4  | t5  | t6  | t7  |
| t8  | t9  | t10 | t11 |
| t12 | t13 | t14 | t15 |

| <b>Fine-grained</b> | paralle | lism i | s good |
|---------------------|---------|--------|--------|
|---------------------|---------|--------|--------|

| t0 | t0 | t0 | t0 |
|----|----|----|----|
| t1 | t1 | t1 | t1 |
| t2 | t2 | t2 | t2 |
| t3 | t3 | t3 | t3 |

#### Coarse-grained parallelism is bad

### BEST PRACTICES Optimize data locality for CPU and GPU

Minimize data transfers between CPU and GPU, and between peer GPUs



# **BEST PRACTICES**

Optimize data locality for SM

Minimize redundant accesses to L2 and DRAM

Store intermediate results in registers instead of DRAM

Use shared memory for data frequently used within SM

Use constant and read-only caches on SM



### BEST PRACTICES Coalesce memory requests

If addresses from a *warp* lie within the same cache line, that line is fetched once

Best case: addresses lie in a single cache line (128B), 4x32B transactions



Worst case: fully scattered access, 32 allocated cache lines, **32**x32B transactions



# **BEST PRACTICES**

Avoid warp divergence

if( threadIdx.x < 12 ) {</pre>

}

}
else {

\*\*\*\*

\*\*\*\*

Instructions are issued per warp

Different execution paths within a warp are serialized

Different warps can execute different code with no impact on performance

Avoid branching on thread index

# **BEST PRACTICES**

Other common optimizations

Minimize thread block synchronization

Expose instruction-level parallelism

Use 64-bit and 128-bit vector loads

Control occupancy with compiler hints

Tile computation for better cache reuse

Use mixed or reduced precision

### **PROGRAMMING GPUS**

### **3 WAYS TO PROGRAM GPUS**



OpenACC hands-on session today 7:30pm - 9:30pm

### **NVIDIA DEVELOPER LIBRARIES**



https://developer.nvidia.com/gpu-accelerated-libraries

# NVGRAPH

#### **Accelerated Graph Analytics**

# Process graphs with up to 2.5 Billion edges on a single GPU (24GB M40)

#### Accelerate a wide range of applications:

| PageRank                  | Single Source<br>Shortest Path       | Single Source<br>Widest Path |
|---------------------------|--------------------------------------|------------------------------|
| Search                    | Robotic Path Planning                | IP Routing                   |
| Recommendation<br>Engines | Power Network<br>Planning            | Chip Design / EDA            |
| Social Ad Placement       | Logistics & Supply<br>Chain Planning | Traffic sensitive routing    |

#### nvGRAPH: 4x Speedup



PageRank on Wikipedia 84 M link dataset

https://developer.nvidia.com/nvgraph

### **UNIFIED MEMORY**

### UNIFIED MEMORY

#### Dramatically Lower Developer Effort



### SIMPLIFIED MEMORY MANAGEMENT CODE

#### Single pointer for CPU and GPU

#### CPU code

```
void sortfile(FILE *fp, int N) {
   char *data;
   data = (char *)malloc(N);
```

```
fread(data, 1, N, fp);
```

```
qsort(data, N, 1, compare);
```

```
use_data(data);
```

```
free(data);
```

#### GPU code with Unified Memory

```
void sortfile(FILE *fp, int N) {
   char *data;
   cudaMallocManaged(&data, N);
```

```
fread(data, 1, N, fp);
```

```
qsort<<<...>>>(data,N,1,compare);
cudaDeviceSynchronize();
```

```
use_data(data);
```

```
cudaFree(data);
```

### **UNIFIED MEMORY ON PRE-PASCAL**

#### Code example explained

GPU always has address translation during the kernel execution Pages allocated **before** they are used - **cannot oversubscribe GPU** Pages migrate to GPU only on kernel launch - **cannot migrate on-demand** 

### **UNIFIED MEMORY ON PRE-PASCAL**

Kernel launch triggers bulk page migrations



### **CUDA 8: UNIFIED MEMORY**

Large datasets, simple programming, high performance



### UNIFIED MEMORY ON PASCAL

#### Now supports GPU page faults

If GPU does not have a VA translation, it issues an interrupt to CPU Unified Memory driver could decide to map or migrate depending on heuristics Pages populated and data migrated **on first touch** 

### **UNIFIED MEMORY ON PASCAL**

#### True on-demand page migrations



### UNIFIED SYSTEM ALLOCATOR

#### Any memory will be available for GPU\*

#### CPU code

```
void sortfile(FILE *fp, int N) {
   char *data;
   data = (char *)malloc(N);
```

```
fread(data, 1, N, fp);
```

```
qsort(data, N, 1, compare);
```

```
use_data(data);
```

```
free(data);
```

#### GPU code with Unified Memory

```
void sortfile(FILE *fp, int N) {
   char *data;
   data = (char *)malloc(N);
```

```
fread(data, 1, N, fp);
```

```
qsort<<<...>>>(data,N,1,compare);
cudaDeviceSynchronize();
```

```
use_data(data);
```

```
free(data);
```

\*on supported operating systems





### **SUMMIT** 2017 OLCF Leadership System



Vendor: IBM (Prime) / NVIDIA<sup>™</sup> / Mellanox Technologies®

Approximately 3400 nodes, each with:

IBM POWER9 CPUs + NVIDIA Volta GPUs

CPUs and GPUs connected with high speed NVLink

Large coherent memory: over 512 GB (HBM + DDR4)

Over 40 TF peak performance

Dual-rail Mellanox® EDR-IB full, non-blocking fat-tree interconnect

### SUMMIT

### How does Summit compare to Titan

| Feature                                           | Summit                      | Titan                              |
|---------------------------------------------------|-----------------------------|------------------------------------|
| Application Performance                           | 5-10x Titan                 | Baseline                           |
| Number of Nodes                                   | ~3,400                      | 18,688                             |
| Node performance                                  | > 40 TF                     | 1.4 TF                             |
| Memory per Node                                   | >512 GB (HBM + DDR4)        | 38GB (GDDR5+DDR3)                  |
| NVRAM per Node                                    | 800 GB                      | 0                                  |
| Node Interconnect                                 | NVLink (5-12x PCIe 3)       | PCIe 2                             |
| System Interconnect<br>(node injection bandwidth) | Dual Rail EDR-IB (23 GB/s)  | Gemini (6.4 GB/s)                  |
| Interconnect Topology                             | Non-blocking Fat Tree       | 3D Torus                           |
| Processors                                        | IBM POWER9<br>NVIDIA Volta™ | AMD Opteron™<br>NVIDIA Kepler™     |
| File System                                       | 120 PB, 1 TB/s, GPFS™       | 32 PB, 1 TB/s, Lustre <sup>®</sup> |
| Peak power consumption                            | 10 MW                       | 9 MW                               |

# SUMMIT

#### Titan & Summit Application Differences

#### Fewer but much more powerful nodes

1/6<sup>th</sup> the number of nodes, but 25x more powerful

#### Must exploit more node-level parallelism

Multiple CPUs and GPU to keep busy

Likely requires OpenMP or OpenACC programming model

Very large memory

Summit has ~15x more memory per node than Titan

Interconnect is only ~3x the bandwidth of Titan

Need to exploit data locality within nodes to minimize message passing traffic

### **RESOURCES** Learn more about GPUs

CUDA resource center: <u>http://docs.nvidia.com/cuda</u>

GTC on-demand: <u>http://on-demand-gtc.gputechconf.com</u>

Parallel Forall blog: <a href="http://devblogs.nvidia.com/parallelforall">http://devblogs.nvidia.com/parallelforall</a>

Self-paced labs: <a href="http://nvidia.qwiklab.com">http://nvidia.qwiklab.com</a>



### **COOPERATIVE GROUPS**

### **COOPERATIVE GROUPS**

#### A Programming Model for Coordinating Groups of Threads

Support clean composition across software boundaries (e.g. Libraries)

Optimize for hardware fast-path using safe, flexible synchronization

A programming model that can scale from Kepler to future platforms



### **COOPERATIVE GROUPS SUMMARY**

#### Flexible, Explicit Synchronization

Thread groups are explicit objects in the program

thread\_group group = this\_thread\_block();

Collectives, such as barriers, operate on thread groups

sync(group);

New groups are constructed by partitioning existing groups

thread\_group tiled\_partition(thread\_group base, int size);



### **MOTIVATING EXAMPLE**

**Optimizing for Warp Size** 

```
device
int warp_reduce(int val) {
  extern _____shared____int smem[];
  const int tid = threadIdx.x;
 #pragma unroll
  for (int i = warpSize/2; i > 0; i /= 2) {
    smem[tid] = val; _____syncthreads(); ______syncthreads() is too expensive
   val += smem[tid ^ i]; syncthreads();
                                                   when sharing is only within warps
  }
  return val;
```

### **MOTIVATING EXAMPLE**

Implicit Warp-Synchronous Programming is Tempting...

```
device
int warp_reduce(int val) {
 extern shared int smem[];
  const int tid = threadIdx.x;
 #pragma unroll
 for (int i = warpSize/2; i > 0; i /= 2) {
    smem[tid] = val;
                                                  Barriers separating steps removed.
   val += smem[tid ^ i];
                                                             UNSAFE!
  }
  return val;
```

### **MOTIVATING EXAMPLE**

#### Safe, Explicit Programming for Performance

Safe and Fast!

39

Approximately equal performance to unsafe warp programming

```
__device__
int warp_reduce(int val) {
  extern __shared__ int smem[];
  const int tid = threadIdx.x;
  #pragma unroll
  for (int i = warpSize/2; i > 0; i /= 2) {
    smem[tid] = val; sync(this_warp());
    val += smem[tid ^ i]; sync(this_warp());
    }
    return val;
}
```

# PASCAL: MULTI-BLOCK COOPERATIVE GROUPS

Provide a new launch mechanism for multi-block groups

Cooperative Groups collective operations like sync(group)work across all threads in the group

Save bandwidth and latency compared to multi-kernel approach required on Kepler GPUs



