



# Aurora Exascale Architecture



## Servesh Muralidharan

Computer Scientist, Performance Engineering Team, Argonne Leadership Computing Facility



# PATH TO EXASCALE



# Elements of a supercomputer

- Processor
  - architecturally optimized to balance complexity, cost, performance, and power
- Memory
  - generally commodity DDR, amount limited by cost
- Node
  - may contain multiple processors, memory, and network interface
- Network
  - optimized for latency, bandwidth, and cost
- IO System
  - complex array of disks, servers, and network
- Software Stack
  - compilers, libraries, tools, debuggers, ...
- Control System
  - job launcher, system management



https://www.top500.org/statistics/treemaps/

# **Exascale Computing Project**



Department of Energy (DOE) Roadmap to Exascale Systems

https://www.top500.org/statistics/perfdevel/


https://science.osti.gov/-/media/bes/besac/pdf/201907/1330_Diachin_ECP_Overview_BESAC_201907.pdf



# Path to Exascale Computing

- Era of data parallel computing
  - Dominated by GPUs
  - Exploit SIMT/SIMD Parallelism
- Architectural Challenges
  - Multichip Packaging
  - Next generation technologies



Accelerator/Co-Processor - Performance Share

Intel's HPC GM Trish Damkroger Keynote ISC 2021

https://www.youtube.com/watch?v=PuEcCRJLrvs

https://download.intel.com/newsroom/2021/data-center/Intel-ISC2021-keynote-presentation.pdf



https://www.top500.org/statistics/overtime/







## Aurora High-level System Overview

**AURORA SYSTEM**

- 166 Compute racks
- 10,624 Nodes
- GPU: 8.16 PB HBM
- CPU: 1.36 PB HBM, 10.9 PB DDR5
- DAOS: 64 racks, 1024 nodes, 230 PB (usable), 31 TB/s

System Service Nodes (SSNs), User Access Nodes (UANs), DAOS Nodes (DNs), Gateway Nodes (GNs); IOF service, scalable library loading; DAOS <-> Lustre data mover

### **COMPUTE BLADE**

- 2x Intel Xeon Max Series w HBM
- 6x Intel Data Center GPU Max Series
- GPU: 768 GB HBM
- CPU: 128 GB HBM, 1024 GB DDR5

### **COMPUTE RACK**

- 64 Compute blades
- 32 Switch blades
- GPU: 49.1 TB HBM
- CPU: 8.2 TB HBM, 64 TB DDR5

### **SWITCH BLADE**

- 1 Slingshot switch
- 64 ports
- Dragonfly topology
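The blade, rack, and system figures above can be cross-checked with a short roll-up. The sketch below is only an illustration: the per-blade quantities and the rack and blade counts come from the slides, while the variable names and the use of decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) are assumptions.

```python
# Roll up per-blade resources to rack and system level.
# Counts and capacities are taken from the slides; decimal units
# (1 TB = 1000 GB, 1 PB = 1000 TB) are an assumption.

BLADES_PER_RACK = 64
COMPUTE_RACKS = 166

# Resources on one compute blade (one node).
BLADE = {
    "cpus": 2,
    "gpus": 6,
    "gpu_hbm_gb": 768,
    "cpu_hbm_gb": 128,
    "ddr5_gb": 1024,
}


def scale(resources, factor):
    """Multiply every per-blade quantity by a number of blades."""
    return {name: value * factor for name, value in resources.items()}


rack = scale(BLADE, BLADES_PER_RACK)
nodes = BLADES_PER_RACK * COMPUTE_RACKS
system = scale(BLADE, nodes)

print(f"nodes: {nodes:,}")
print(f"CPUs: {system['cpus']:,}  GPUs: {system['gpus']:,}")
print(f"rack GPU HBM:   {rack['gpu_hbm_gb']:,} GB")
print(f"system GPU HBM: {system['gpu_hbm_gb'] / 1e6:.2f} PB")
print(f"system CPU HBM: {system['cpu_hbm_gb'] / 1e6:.2f} PB")
print(f"system DDR5:    {system['ddr5_gb'] / 1e6:.1f} PB")
```

Under these assumptions the roll-up gives 10,624 nodes, 21,248 CPUs, and 63,744 GPUs, and a rack holds 49,152 GB of GPU HBM, which the slide quotes as 49.1 TB.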



# Aurora Exascale Compute Blade

## **NODE CHARACTERISTICS**

| Characteristic | Value |
|----------------|-------|
| GPU: Intel Data Center GPU Max Series (#) | 6 |
| CPU: Intel Xeon CPU Max Series (#) | 2 |
| GPU HBM Memory (GB) | 768 |
| Peak GPU HBM BW (TB/s) | 19.66 |
| CPU HBM Memory (GB) | 128 |
| Peak CPU HBM BW (TB/s) | 2.87 |
| CPU DDR5 Memory (GB) | 1024 |
| Peak CPU DDR5 BW (TB/s) | 0.56 |
| Peak Node DP FLOPS (TF) | ≥ 130 |
| Max Fabric Injection (GB/s) | 200 |
| NICs (#) | 8 |
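The node characteristics above are node totals. Dividing them by the device counts gives a rough per-device view. This is a sketch that assumes the totals are split evenly across identical devices; the dictionary keys are illustrative names, not vendor terminology.

```python
# Split node-level totals evenly across devices.
# Node totals come from the slide; the even split is an assumption.

NODE = {
    "gpus": 6,
    "cpus": 2,
    "nics": 8,
    "gpu_hbm_gb": 768,
    "gpu_hbm_bw_tbs": 19.66,
    "cpu_hbm_gb": 128,
    "cpu_hbm_bw_tbs": 2.87,
    "fabric_injection_gbs": 200,
}

per_gpu_hbm_gb = NODE["gpu_hbm_gb"] / NODE["gpus"]
per_gpu_hbm_bw_tbs = NODE["gpu_hbm_bw_tbs"] / NODE["gpus"]
per_cpu_hbm_gb = NODE["cpu_hbm_gb"] / NODE["cpus"]
per_cpu_hbm_bw_tbs = NODE["cpu_hbm_bw_tbs"] / NODE["cpus"]
per_nic_gbs = NODE["fabric_injection_gbs"] / NODE["nics"]

print(f"per GPU: {per_gpu_hbm_gb:.0f} GB HBM, {per_gpu_hbm_bw_tbs:.2f} TB/s")
print(f"per CPU: {per_cpu_hbm_gb:.0f} GB HBM, {per_cpu_hbm_bw_tbs:.3f} TB/s")
print(f"per NIC: {per_nic_gbs:.0f} GB/s injection")
```

The even split yields 128 GB of HBM per GPU, 64 GB of HBM per CPU socket, and 25 GB/s per NIC.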







# Aurora Exascale Compute Blade - Components


- Intel Xeon Max Series CPU w HBM
  - DDR5 and HBM
  - PCIe Gen5
- Intel Data Center Max Series GPU
  - Multi Tile architecture
    - Compute Tile
      - Xe Cores
      - L1 Cache
    - Base Tile
      - PCIe Gen5
      - HBM2e Main Memory
      - MDFI
      - EMIB
- GPU-to-GPU Interconnect
  - Xe Link





# Intel Xeon Max Series CPU w HBM

- Dual socket
- 52 cores
- First Level Cache: 32 KB Instruction Cache, 48 KB Data Cache
- Mid-Level Cache: 2 MB private per core
- Last Level Cache: 1.875 MB per core
- 8 channels DDR5 @ 4400 MT/s
- 1 TB DDR5 Memory
- 64 GB HBM2e per socket
- 80 PCIe lanes with PCIe Gen 5.0 support
  - PCIe bifurcation support: x16, x8, x4, x2 (Gen4)
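The DDR5 channel count and transfer rate listed here are enough to estimate the theoretical peak DDR5 bandwidth of a node. The sketch below assumes an 8-byte (64-bit) data path per channel, which the slides do not state; the constant names are illustrative.

```python
# Theoretical peak DDR5 bandwidth for one dual-socket node.
# Channel count and transfer rate come from the slide; the 8-byte
# (64-bit) data width per channel is an assumption.

CHANNELS_PER_SOCKET = 8
TRANSFERS_PER_SECOND = 4400e6  # 4400 MT/s
BYTES_PER_TRANSFER = 8         # assumed 64-bit data path per channel
SOCKETS = 2

per_channel_gbs = TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER / 1e9
per_socket_gbs = per_channel_gbs * CHANNELS_PER_SOCKET
per_node_tbs = per_socket_gbs * SOCKETS / 1e3

print(f"per channel: {per_channel_gbs:.1f} GB/s")
print(f"per socket:  {per_socket_gbs:.1f} GB/s")
print(f"per node:    {per_node_tbs:.2f} TB/s")
```

With that assumption the estimate is about 0.56 TB/s per node, in line with the peak CPU DDR5 bandwidth quoted in the node characteristics.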

Figure: core microarchitecture block diagram. Front end: I-TLB and I-Cache, branch predict, decode, µop cache, µop queue. Allocate / Rename / Move Elimination / Zero Idiom. Execution ports 00 to 11 with ALU, LEA, shift, multiply, divide, jump, FMA, AGU, and store-data units. 48 KB data cache and 2 MB mid-level cache.

https://www.hc33.hotchips.org/assets/program/conference/day1/HC2021.C1.4%20Intel%20Arijit.pdf



https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html



# Intel Data Center GPU Max Series Architectural Components

- Xe Cores
  - Vector Engine
    - Traditional compute pipeline
  - Matrix Engine
    - Low precision systolic pipeline
  - L1 Data Cache
    - Shared Local Memory
  - Instruction Cache
- Xe Slice
  - Hardware Context
  - Offload Units
- Xe Stack
  - LLC
  - HBM2e controllers
  - Xe link
  - Cache Memory Fabric
  - PCIe Endpoint
  - Hardware specific engines
  - Stack to Stack Interconnect
- Xe links
  - Multi GPU Interconnect



https://hc33.hotchips.org/assets/program/conference/day2/hc2021\_pvc\_final.pdf



# **GPU Compute Execution**

- XVE: Xe Vector Engine
- GRF: General Register File
- SLM: Shared Local Memory
- RW L1: Read/Write L1
- HBM: High Bandwidth Memory
- iCACHE: Instruction Cache
- CCS: Compute Command Streamer
- SIMD: Single Instruction Multiple Data



- Execution on the GPU starts with memory allocation and the scheduling of the compute kernel on the GPU
- The GPU threads are spawned and scheduled through the CCS
- Execution stops when the kernel hits the "end of thread" instruction
- Shared and device allocations have different data access latencies
- GPU threads can switch when any stall condition occurs
  - However, threads cannot be interrupted during execution
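Switching threads on a stall is the usual latency-hiding argument: while one hardware thread waits, another can issue. The toy model below is not a description of the actual XVE scheduler. It assumes every thread alternates a fixed number of busy cycles with a fixed stall, that switching is free, and all cycle counts and function names are made up for illustration.

```python
def issue_utilization(threads, busy_cycles, stall_cycles):
    """Fraction of cycles in which some thread can issue.

    Toy model: each thread repeats `busy_cycles` of work followed by a
    `stall_cycles` wait, and switching between ready threads is free.
    """
    if threads < 1 or busy_cycles < 1 or stall_cycles < 0:
        raise ValueError("need threads >= 1, busy_cycles >= 1, stall_cycles >= 0")
    period = busy_cycles + stall_cycles
    return min(1.0, threads * busy_cycles / period)


def threads_to_hide(busy_cycles, stall_cycles):
    """Smallest thread count that keeps the pipeline busy in this model."""
    period = busy_cycles + stall_cycles
    return -(-period // busy_cycles)  # ceiling division


# Hypothetical numbers: a longer stall needs more resident threads.
for stall in (30, 70):
    need = threads_to_hide(busy_cycles=10, stall_cycles=stall)
    used = issue_utilization(4, 10, stall)
    print(f"stall={stall} cycles: {need} threads needed, "
          f"utilization with 4 threads = {used:.2f}")
```

In this model, a higher-latency access needs proportionally more resident threads to keep the pipeline busy, which is one reason the latency difference between allocation types matters.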



guide-gpu/2024-0/execution-model-overview.html







# Network Switch

## **Consistent, Repeatable Application Performance**

- Advanced congestion control
- Fine grained adaptive routing
- Very low average and tail latency

## **Extremely Scalable RDMA Performance**

- Connectionless protocol
- Fine grained flow control
- MPI HW tag matching & progress engine
- Dragonfly topology: 3 switch hops (typical)

## **Native Ethernet**

- Native IP, no encapsulation
- High-scale bandwidth integration to campus

## HPE Slingshot Switches - 64 ports @ 200 Gbps



## **HPE Slingshot NICs - 200 Gbps**



HPE NIC ASIC



**PCIe Adapters** 



## 100% DLC NIC Mezz







- 1-D Dragonfly topology: 175 total groups (166 compute + 8 IO + 1 Service)
- All the global links are optical; all the local links in compute groups are electrical
- 2 global links between any two compute groups
- 24 links between any two IO groups, 8 links between the Service group and each IO group
- Total injection bandwidth: 2.12 PB/s
- Total bisection bandwidth: 0.69 PB/s
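The group counts and per-pair link counts above determine how many global links the listed connections require, and the injection bandwidth follows from the node and NIC counts. The sketch below is a back-of-the-envelope check only. Links between compute groups and IO groups are not given on the slide and are left out, and 25 GB/s per NIC is taken as 200 Gb/s in one direction.

```python
from math import comb

# Group counts and links per pair of groups, as listed on the slide.
COMPUTE_GROUPS, IO_GROUPS, SERVICE_GROUPS = 166, 8, 1
LINKS_PER_COMPUTE_PAIR = 2
LINKS_PER_IO_PAIR = 24
LINKS_SERVICE_TO_EACH_IO = 8

total_groups = COMPUTE_GROUPS + IO_GROUPS + SERVICE_GROUPS
compute_global_links = LINKS_PER_COMPUTE_PAIR * comb(COMPUTE_GROUPS, 2)
io_global_links = LINKS_PER_IO_PAIR * comb(IO_GROUPS, 2)
service_io_links = LINKS_SERVICE_TO_EACH_IO * IO_GROUPS * SERVICE_GROUPS

# Injection bandwidth: every NIC on every compute node at line rate.
NODES = 10_624
NICS_PER_NODE = 8
NIC_GBS = 25  # 200 Gb/s per NIC, one direction
injection_pbs = NODES * NICS_PER_NODE * NIC_GBS / 1e6

print(f"groups: {total_groups}")
print(f"compute-compute global links: {compute_global_links:,}")
print(f"IO-IO links: {io_global_links}, service-IO links: {service_io_links}")
print(f"injection bandwidth: {injection_pbs:.2f} PB/s")
```

The injection figure reproduces the 2.12 PB/s on the slide. The bisection figure depends on how the links are cut and is not derived here.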



# Aurora Storage Systems

- DAOS provides Aurora's main "platform" high performance storage system
- Aurora leverages existing Lustre storage systems, Grand and Eagle, for center-wide data access and data sharing

| System      | Capacity                                       | Performance             |
|-------------|------------------------------------------------|-------------------------|
| Aurora DAOS | 230 PB @ EC16+2; 250 PB NVMe; 8 PB Optane PMEM | 31 TB/s Read & Write    |
| Eagle       | 100 PB @ RAID6                                 | > 650 GB/s Read & Write |
| Grand       | 100 PB @ RAID6                                 | > 650 GB/s Read & Write |



- Intel Coyote Pass System
  - (2) Xeon 5320 CPU (Ice Lake)
  - (16) 32 GB DDR4 DIMMs
  - (16) 15.3 TB Samsung PM1733
  - (2) HPE Slingshot NIC
- 1024 Total Servers
  - Each node will run 2 DAOS engines
  - 2048 DAOS engines
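The per-server configuration above can be multiplied out to check the aggregate DAOS figures. This is an illustrative sketch: the drive, DIMM, and NIC counts come from the slide, the 25 GB/s per NIC is the one-direction rate of a 200 Gb/s link, and decimal units are assumed. It does not model erasure-coding overhead, so it gives raw capacity only.

```python
# Aggregate DAOS resources from the per-server configuration.
SERVERS = 1024
ENGINES_PER_SERVER = 2
NVME_PER_SERVER = 16
NVME_TB = 15.3
DIMMS_PER_SERVER = 16
DIMM_GB = 32
NICS_PER_SERVER = 2
NIC_GBS = 25           # 200 Gb/s Slingshot NIC, one direction
AGGREGATE_TBS = 31     # quoted DAOS read and write bandwidth

engines = SERVERS * ENGINES_PER_SERVER
raw_nvme_pb = SERVERS * NVME_PER_SERVER * NVME_TB / 1e3
dram_per_server_gb = DIMMS_PER_SERVER * DIMM_GB

# Bandwidth each server must sustain to reach the aggregate figure,
# compared with what its two NICs can carry.
required_per_server_gbs = AGGREGATE_TBS * 1e3 / SERVERS
nic_limit_per_server_gbs = NICS_PER_SERVER * NIC_GBS

print(f"DAOS engines: {engines}")
print(f"raw NVMe: {raw_nvme_pb:.1f} PB")
print(f"DRAM per server: {dram_per_server_gb} GB")
print(f"per-server bandwidth needed: {required_per_server_gbs:.1f} GB/s "
      f"(NIC limit {nic_limit_per_server_gbs} GB/s)")
```

The raw NVMe total comes out near 250 PB, and the bandwidth each server has to deliver stays below what its two NICs can carry.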







# Aurora Storage Overview







#### Peak Performance

≥ 2 Exaflops DP

#### Intel GPU

Intel<sup>®</sup> Data Center GPU Max Series 1550

#### Intel Xeon Processor

Intel<sup>®</sup> Xeon Max Series 9470C CPU with High Bandwidth Memory

#### Platform

HPE Cray EX

#### Compute Node

- 2x Intel<sup>®</sup> Xeon Max Series processors
- 6x Intel<sup>®</sup> Data Center GPU Max Series
- 8x Slingshot 11 fabric endpoints

#### GPU Architecture

- Intel XeHPC architecture
- High Bandwidth Memory

#### Node Performance

Over 130 TF

#### System Size

- 166 Cabinets
- 10,624 Nodes
- 21,248 CPUs
- 63,744 GPUs

#### System Memory

- 1.36 PB HBM CPU Capacity
- 10.9 PB DDR5 Capacity
- 8.16 PB HBM GPU Capacity

#### System Memory Bandwidth

- 30.58 PB/s Peak HBM BW CPU
- 5.95 PB/s Peak DDR5 BW
- 208.9 PB/s Peak HBM BW GPU

#### High-Performance Storage

- 230 PB
- 31 TB/s DAOS bandwidth
- 1024 DAOS Nodes

#### System Interconnect

HPE Slingshot 11, Dragonfly topology with adaptive routing

#### System Interconnect BW

- Peak Injection BW: 2.12 PB/s
- Peak Bisection BW: 0.69 PB/s

#### Network Switch

- 25.6 Tb/s per switch (64x 200 Gb/s ports)
- Links with 25 GB/s per direction
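The system-level bandwidth and switch figures can be related to the per-node and per-port numbers with a few multiplications. The sketch below is a consistency check under two assumptions: the per-node peaks (19.66, 2.87, and 0.56 TB/s) scale linearly with the 10,624 nodes, and the 25.6 Tb/s switch figure counts both directions of all 64 ports.

```python
# Scale per-node peak memory bandwidth to the full system.
NODES = 10_624
NODE_PEAK_TBS = {"GPU HBM": 19.66, "CPU HBM": 2.87, "DDR5": 0.56}
system_pbs = {name: bw * NODES / 1e3 for name, bw in NODE_PEAK_TBS.items()}

# Switch and link bandwidth from the port count and port speed.
PORTS = 64
PORT_GBPS = 200
switch_tbps_one_way = PORTS * PORT_GBPS / 1e3
switch_tbps_both_ways = 2 * switch_tbps_one_way  # assumed: both directions counted
link_gbytes_per_s = PORT_GBPS / 8                # bits to bytes

for name, bw in system_pbs.items():
    print(f"{name}: {bw:.2f} PB/s")
print(f"switch: {switch_tbps_one_way:.1f} Tb/s one way, "
      f"{switch_tbps_both_ways:.1f} Tb/s both ways")
print(f"link: {link_gbytes_per_s:.0f} GB/s per direction")
```

The GPU HBM and DDR5 products match the quoted system totals after rounding. The CPU HBM product comes out near 30.5 PB/s rather than 30.58 PB/s, presumably because the per-node 2.87 TB/s is itself rounded.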

#### **Programming Environment**

- C/C++, Fortran
- SYCL/DPC++
- OpenMP 5.0
- Kokkos, RAJA



# **AURORA: SOFTWARE**



# Three Pillars of Aurora

| Simulation                                 | Data                   | Learning                 |
|--------------------------------------------|------------------------|--------------------------|
| HPC Languages                              | Productivity Languages | Productivity Languages   |
| Directives                                 | Big Data Stack         | DL Frameworks            |
| Parallel Runtimes                          | Statistical Libraries  | Statistical Libraries    |
| Solver Libraries                           | Databases              | Linear Algebra Libraries |
| Compilers, Performance Tools, Debuggers    |                        |                          |
| Math Libraries, C++ Standard Library, libc |                        |                          |
| I/O, Messaging                             |                        |                          |
| Containers, Visualization                  |                        |                          |
| Scheduler                                  |                        |                          |
| Linux Kernel, POSIX                        |                        |                          |



# Introducing oneAPI Ecosystem

**"oneAPI** is a cross-industry, open, standards-based unified programming model that delivers a common developer experience across accelerator architectures—for faster application performance, more productivity, and greater innovation."

## **Three Components**

- Language
  - DPC++
- Libraries
  - oneMKL, oneDAL, ...
- Hardware Abstraction Layer
  - Level Zero (L0)

Set of specifications that anyone can implement

Intel has its own implementations

https://software.intel.com/ONEAPI

https://www.intel.com/content/dam/develop/external/us/en/documents/oneapi-programming-guide.pdf



## Overcoming Separate CPU and GPU Software Stacks



https://www.intel.com/content/www/us/en/newsroom/resources/press-kit-architecture-day-2021.html



# Aurora Programming Models

- Aurora applications may use
  - DPC++/SYCL
  - OpenMP
  - Kokkos
  - RAJA
  - OpenCL
- Experimental
  - HIP
- Not available on Aurora
  - CUDA
  - OpenACC



Early Science Application Programming Model Distribution

- DPC++/SYCL
- HIPLZ
- Intel Python Framework
- Kokkos
- Kokkos/OpenMP
- Kokkos/SYCL
- LLVM-JIT
- MKL
- OCCA/SYCL
- OpenMP



# MPI on Aurora

- Based on open source MPICH with new features to support Aurora
- Uses OFI (Open Fabrics Interface) to communicate with the Slingshot interconnect
- Redesigned to reduce instruction counts and remove non-scalable data structures
- Innovative collective algorithms optimized for Dragonfly network topology
- GPU aware for Intel GPUs
- Built on top of oneAPI Level Zero
- Supports point-to-point, one-sided, and collective communication
- Supports different data types through the Yaksa library

- Intel GPUs and all-to-all connectivity across the GPUs inside the node
- Multiple NICs on the same node
  - Distribution of processes to NICs
  - Striping (a single rank distributes a single message across multiple NICS)
  - Hashing (a single rank sends different messages through different NICs, e.g., depending on the communicator or the target rank)
  - Efficient multithreading support to use multiple NICs
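Striping and hashing are two different ways to spread traffic over the NICs of a node. The sketch below only illustrates the idea and is not how the MPI library implements it. The function names `stripe` and `hash_nic`, the chunking rule, and the hash formula are all hypothetical.

```python
def stripe(message, num_nics):
    """Split one message into near-equal contiguous chunks, one per NIC."""
    base, extra = divmod(len(message), num_nics)
    chunks, start = [], 0
    for nic in range(num_nics):
        size = base + (1 if nic < extra else 0)
        chunks.append((nic, message[start:start + size]))
        start += size
    return chunks


def hash_nic(communicator_id, target_rank, num_nics):
    """Pick one NIC per (communicator, target) pair.

    A whole message travels over the chosen NIC; different pairs may
    land on different NICs. The formula is an arbitrary example.
    """
    return (communicator_id * 31 + target_rank) % num_nics


NUM_NICS = 8
payload = bytes(range(20))

# Striping: a single message is spread over all NICs.
pieces = stripe(payload, NUM_NICS)
reassembled = b"".join(chunk for _, chunk in pieces)
print([len(chunk) for _, chunk in pieces])

# Hashing: each message goes out whole, on a NIC chosen by its target.
for target in (3, 5, 12):
    print(f"target rank {target} -> NIC {hash_nic(0, target, NUM_NICS)}")
```

Striping helps a single large message use the bandwidth of several NICs, while hashing keeps each message on one NIC and balances load across many messages.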










# Launching jobs on Aurora

- Workload manager (WLM)
  - Handles allocations of nodes to jobs
  - PBS Pro
- Application Launcher
  - Provides a service to launch applications on the allocated nodes
  - HPE PALS
- Process Management
  - Process Management Interface - Exascale (PMIx)
    - Scalable workflow orchestration by defining an abstract set of interfaces


Launch Service (PALS)







# Communication complexity of HPC Applications

- HPC Applications exhibit a variety of communication patterns
- Problem decomposition across the system is critical to avoid load imbalance
- Two forms of data flow design in a typical workload
  - Point to Point
  - Collective
- Significant portion of application runtime spent on communication calls
- Special patterns show up while executing workloads
- Understanding data flow is critical for performance efficacy
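One common collective pattern is the butterfly (recursive doubling) exchange, in which every rank pairs with a partner at distance 1, 2, 4, and so on. The sketch below simulates a sum across ranks in plain Python to show the data flow. It is a generic textbook pattern, not code from any particular MPI library, and it assumes a power-of-two rank count.

```python
def butterfly_allreduce(values):
    """Sum across ranks with a butterfly (recursive doubling) exchange.

    `values[r]` is rank r's contribution. The rank count must be a power
    of two. Returns the per-rank results and the partner list per round.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("rank count must be a power of two")
    current = list(values)
    schedule = []
    distance = 1
    while distance < n:
        partners = [rank ^ distance for rank in range(n)]
        # Every rank exchanges with its partner and adds the received value.
        current = [current[rank] + current[partners[rank]] for rank in range(n)]
        schedule.append(partners)
        distance *= 2
    return current, schedule


result, rounds = butterfly_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
print(result)
for step, partners in enumerate(rounds):
    print(f"round {step}: partners {partners}")
```

With 8 ranks the exchange finishes in 3 rounds and every rank ends up holding the full sum, which shows why collectives of this kind scale with the logarithm of the rank count.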



D.G. Chester et al. / Electronic Notes in Theoretical Computer Science 340 (2018) 55-65

Understanding Communication Patterns in HPCG https://www.sciencedirect.com/science/article/pii/S1571066118300598





(a) Butterfly

https://hpc-tutorials.llnl.gov/mpi/collective_communication_routines/

# An alternate view: Data Flow Design

- Optimize data flow for scalable computations
- Borrow concepts from DataFlow Computer Architecture to understand logical limitations
  - Computation driven by latency of memory operation
- Design/implement algorithms that minimize data movement
- GPUs are optimized for data parallel operations
- CPUs are optimized for control flow
- Identify communication patterns
- Apply data flow centric optimizations
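One way to quantify "minimize data movement" is a roofline-style estimate, which is not part of the slides but is a widely used model. The sketch below takes the node peak (130 TF, used as a lower bound) and the peak GPU HBM bandwidth (19.66 TB/s) from the node characteristics. The function names and the DAXPY byte count are illustrative assumptions.

```python
NODE_PEAK_TFLOPS = 130.0   # ">= 130 TF" per node, used as a lower bound
NODE_GPU_HBM_TBS = 19.66   # peak GPU HBM bandwidth per node


def machine_balance(peak_tflops, bandwidth_tbs):
    """FLOPs the node can perform per byte it can move from HBM."""
    return peak_tflops / bandwidth_tbs


def attainable_tflops(flops_per_byte, peak_tflops, bandwidth_tbs):
    """Roofline bound: limited by compute or by memory traffic."""
    return min(peak_tflops, flops_per_byte * bandwidth_tbs)


balance = machine_balance(NODE_PEAK_TFLOPS, NODE_GPU_HBM_TBS)

# DAXPY (y = a*x + y) in double precision: 2 FLOPs per element and
# 24 bytes moved (read x, read y, write y), assuming no cache reuse.
daxpy_intensity = 2 / 24
daxpy_bound = attainable_tflops(daxpy_intensity, NODE_PEAK_TFLOPS, NODE_GPU_HBM_TBS)

print(f"machine balance: {balance:.2f} FLOP/byte")
print(f"DAXPY bound: {daxpy_bound:.2f} TF of {NODE_PEAK_TFLOPS:.0f} TF peak")
```

Under this model a kernel needs roughly 6.6 FLOPs per byte of HBM traffic before compute becomes the limit, so a streaming kernel such as DAXPY is bound by data movement rather than by arithmetic.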







Conceptual Data Flow Design

Matrix-Based Algorithms for DataFlow Computer Architecture: An Overview and Comparison https://link.springer.com/chapter/10.1007/978-3-030-13803-5\_4



# Aurora Exascale Compute Blade – Data Flow







https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html





# Intel Data Center GPU

https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2024-0/intel-xe-gpu-architecture.html

- Each GPU is composed of two stacks
- The PCIe endpoint is present in only one of the stacks
- Data movement between stacks happens through the stack-to-stack interconnect






# CPU – NIC PCIe Switch Interface











# CPU to GPU Data Flow









# GPU to GPU Connectivity










# Dragonfly topology

- Hierarchical design
- Several groups connected with a mesh
- Intra group topology provides different Dragonfly "flavors"
- Reduces number of long links
- Minimizes number of hops
- Adaptive routing
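The low hop count of a dragonfly can be seen on a small synthetic example. The sketch below builds a toy 1-D dragonfly with fully connected groups and one global link per pair of groups, then measures the longest shortest path with a breadth-first search. The group and switch counts and the placement of global links are made up for illustration and do not reflect Aurora's wiring.

```python
from collections import deque
from itertools import combinations


def build_dragonfly(groups, switches_per_group):
    """Return an adjacency map for a small 1-D dragonfly.

    Switches are (group, index) pairs. Each group is fully connected
    internally, and every pair of groups shares one global link. Which
    switch hosts each global link is a made-up round-robin choice.
    """
    adj = {(g, s): set() for g in range(groups) for s in range(switches_per_group)}
    for g in range(groups):
        for a, b in combinations(range(switches_per_group), 2):
            adj[(g, a)].add((g, b))
            adj[(g, b)].add((g, a))

    def gateway(src, dst):
        # Switch in group `src` that hosts the global link to group `dst`.
        others = [g for g in range(groups) if g != src]
        return (src, others.index(dst) % switches_per_group)

    for g1, g2 in combinations(range(groups), 2):
        u, v = gateway(g1, g2), gateway(g2, g1)
        adj[u].add(v)
        adj[v].add(u)
    return adj


def max_hops(adj):
    """Longest shortest path, in links, over all switch pairs (BFS)."""
    worst = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst


network = build_dragonfly(groups=5, switches_per_group=4)
diameter = max_hops(network)
print(f"{len(network)} switches, worst-case path: {diameter} links")
```

In this toy network no switch is more than three links from any other: one local link, one global link, and one more local link. Adaptive routing may choose longer, non-minimal paths to avoid congestion, which this sketch does not model.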



https://commons.wikimedia.org/wiki/File:Dragonfly-topology.svg





# Aurora Dragonfly Interconnect








# Conclusions

- Challenging design of an exascale supercomputer
- Intricate system design
- Compute performance vs. communication complexity
- Application scalability balanced by dense compute and hierarchical interconnect



www.anl.gov



# QUESTIONS? SERVESH@ANL.GOV