Exceptional service in the national interest











# Hardware/Software Co-Design for High Performance Interconnects for Extreme-Scale Systems

Ron Brightwell, R&D Manager, Center for Computing Research



Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

# **Portals Interconnect Programming Interface**

- Developed primarily by Sandia, U. New Mexico, Intel
- Deployed on several production massively parallel processing (MPP) and cluster systems
  - 1993: 1800-node Intel Paragon (SUNMOS)
  - 1997: 10,000-node Intel ASCI Red (Puma/Cougar)
  - 1999: 1800-node Cplant cluster (Linux)
  - 2005: 10,000-node Cray Sandia Red Storm (Catamount)
  - 2009: 18,688-node Cray XT5 ORNL Jaguar (Linux)
  - 2017: Bull BXI interconnect
- Focused on providing
  - Lightweight "connectionless" model for massively parallel systems
  - Low latency, high bandwidth
  - Independent progress
  - Overlap of computation and communication
  - Scalable buffering semantics
  - Protocol building blocks to support higher-level application protocols and libraries and system services
- At least three hardware implementations currently under development
- Portals' influence can be seen in InfiniBand APIs and libfabric (Intel & others)



http://www.cs.sandia.gov/Portals/

### Intel Paragon Node (Circa 1993)





Message Processor. When an application decides to send a message, the node's i860 XP message processor handles message-protocol processing and frees the application processor to continue with numeric computing. Messaging software is executed from the message processor's internal cache, enabling overlapped communication and application processing to occur without incurring expensive context-switching delays. The message processor is also used to implement efficient global operations such as synchronization, broadcasting and global reduction calculations (e.g., global sum).

Message Routing. The actual transmission of messages is carried out by an independent routing system of custom-designed Mesh Router Controllers (MRCs), one for each node, arranged in a two-dimensional mesh. These fixed-function



General-Purpose Node. Each GP node dedicates one i860 XP processor to user applications and one to message processing. The GP node's expansion port allows the addition of an I/O or networking interface.

## Intel ASCI Red (TFLOPS) Compute Node



# Paragon/TFLOPS Network Interface Controller (NIC)

- Attached to the memory bus
- Cache coherent with the processor(s)
- Programmed by the operating system (OS)
  - Device driver was embedded in the OS
  - Driver consisted of programming DMA engines and memory-mapped registers
- Interrupt-driven
  - An interrupt would be generated for:
    - Arrival of an incoming message
    - Completion of an incoming message
    - Completion of an outgoing message
- Message initiation via system call trap
- Source-routed, circuit-switched, wormhole-routed network
  - Message header contained route to destination
  - Message body was one contiguous block
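The source-routing idea above (the sender places the whole route in the message header) can be sketched as follows. This is a minimal illustration, not the Paragon/TFLOPS header format: the hop encoding (`+X`, `-X`, `+Y`, `-Y`), the dimension-order choice of route, and the names `build_route` and `deliver` are all assumptions made for the example.

```python
from typing import List, Tuple

Coord = Tuple[int, int]

# Hypothetical hop encoding: each entry moves the message one router over.
MOVES = {"+X": (1, 0), "-X": (-1, 0), "+Y": (0, 1), "-Y": (0, -1)}


def build_route(src: Coord, dst: Coord) -> List[str]:
    """Return the hop list a sender would place in the message header.

    Assumption: dimension-order routing (all X hops first, then all Y hops).
    """
    (sx, sy), (dx, dy) = src, dst
    x_hop = "+X" if dx >= sx else "-X"
    y_hop = "+Y" if dy >= sy else "-Y"
    return [x_hop] * abs(dx - sx) + [y_hop] * abs(dy - sy)


def deliver(src: Coord, route: List[str]) -> Coord:
    """Walk the mesh: each router strips the first hop and forwards the rest."""
    x, y = src
    remaining = list(route)
    while remaining:
        step_x, step_y = MOVES[remaining.pop(0)]
        x, y = x + step_x, y + step_y
    return (x, y)
```

Because the routers only consume the next hop from the header, they need no routing tables, which fits the fixed-function router design described for the Paragon.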

# Basic Assumptions About Networking for MPPs

- A single low-level network API is needed
  - Compute node may not have a TCP/IP stack
  - System is space-shared
    - Compute node application should own all network resources
- Applications will use multiple protocols simultaneously
  - Can't focus on just MPI
  - Runtime system, system call forwarding, I/O protocols too
- Need to support communication between unrelated processes
  - Client/server communication between application processes and system services
- Need to support general-purpose interconnect capabilities
  - Can't assume special collective network hardware
- Interconnect hardware limitations can't be fixed in software

# **Key Network Capabilities**

- Independent progress
  - Data should move without requiring polling from user-level library
  - Adhere to the strong progress rule interpretation of MPI
- Overlap
  - Decouple the host processors from the network as much as possible
  - Enable overlap of computation and communication as well as communication and communication
- Scalable use of memory resources
  - Buffer space for MPI unexpected messages
  - Memory use should be independent of the number of peers
- High performance
  - Maximize bandwidth by avoiding memory-to-memory copies
  - Minimize latency by avoiding OS interaction

# Design Philosophy: Don't Arbitrarily Constrain

- Connectionless
  - Easy to do connections over connectionless
  - Impossible to do vice-versa
- One-sided
  - Easy to do two-sided over one-sided
  - Hard to do vice-versa
- Matching
  - Needed to enable flexible independent progress
  - Otherwise matching and progress must be done by upper layers
- Offload
  - Straightforward to onload an API designed for offload
  - Hard to do vice-versa (see TCP Offload Engines [TOE])
- Progress
  - Must be implicit

# **Kernel-Level Networking**

- UNIX IP sockets (UDP/IP, TCP/IP)
- Kernel contains a ring of send and receive buffers
- Send
  - Application calls *write()* system call
  - Kernel copies data from user-space into kernel buffer or network device memory
  - System call returns number of bytes sent
- Receive
  - Network device interrupts OS when data arrives
  - Kernel copies data from network device into kernel buffers
  - Kernel copies data from kernel buffers to user memory during read()
  - System call returns number of bytes received
- Checking for incoming data
  - Use *select()* or *poll()* system calls to see if data can be read (or written)
  - Use sigaction()/fcntl() to receive a SIGIO signal when data can be read
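The kernel-level send, check, and receive steps above can be sketched with Python's standard `socket` and `select` modules. To stay self-contained, the example uses a local `socket.socketpair()` rather than a UDP/IP or TCP/IP connection; the system-call pattern is the same. The function name `echo_once` is made up for the example.

```python
import select
import socket


def echo_once(payload: bytes) -> bytes:
    """Send a payload through the kernel and read it back on the other end."""
    # socketpair() returns two connected kernel sockets; no network is needed.
    sender, receiver = socket.socketpair()
    try:
        # send(): the kernel copies the payload from user space into a
        # kernel buffer and returns the number of bytes accepted.
        sent = sender.send(payload)

        # select(): ask the kernel whether the receiver has data to read.
        readable, _, _ = select.select([receiver], [], [], 1.0)
        if receiver not in readable:
            raise TimeoutError("no data became readable within 1 second")

        # recv(): the kernel copies data from its buffer into user memory.
        return receiver.recv(sent)
    finally:
        sender.close()
        receiver.close()
```

Every step here is a system call, and the data is copied at least once on each side. Those are the costs that user-level networking tries to remove.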

# **User-Level Networking**

- Network device is directly controlled by application process after initial setup
- Send
  - Application process writes a command to the network device
  - Network device copies data directly from user-space onto the network
- Receive
  - Application process provides buffer(s) to the network device before data arrives
  - Network device copies data directly from network into user-space
- Checking for incoming data
  - Poll memory location (application memory or network device memory)
  - No mechanism for OS-generated signals
- Significant performance advantage over kernel-level approach
  - Increases bandwidth by eliminating memory copies
  - Decreases latency by avoiding system calls
  - Provides the opportunity to overlap data movement with computation
- Must coordinate with virtual memory system (page pinning)
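The user-level receive path (post a buffer before data arrives, then poll a memory location) can be mimicked in plain Python. `ToyNic` is a stand-in invented for this sketch and is not a real device or driver interface; a thread plays the role of the hardware writing into user memory.

```python
import threading
import time


class ToyNic:
    """Hypothetical stand-in for a user-level NIC."""

    def __init__(self):
        self._posted = []  # receive buffers handed over before data arrives

    def post_receive(self, buffer: bytearray, status: dict) -> None:
        """Application gives the device a buffer and a status word to update."""
        self._posted.append((buffer, status))

    def deliver(self, payload: bytes) -> None:
        """Simulates the device copying network data straight into user memory."""
        if not self._posted:
            raise RuntimeError("no receive buffer posted")
        buffer, status = self._posted.pop(0)
        count = min(len(payload), len(buffer))
        buffer[:count] = payload[:count]
        status["bytes"] = count  # completion is written where the app polls


def receive_by_polling(nic: ToyNic, size: int, incoming: bytes) -> bytes:
    buffer = bytearray(size)
    status = {"bytes": None}
    nic.post_receive(buffer, status)  # buffer is posted before data arrives

    # A thread stands in for the asynchronous arrival of data from the wire.
    wire = threading.Thread(target=nic.deliver, args=(incoming,))
    wire.start()

    while status["bytes"] is None:  # poll a memory location, no system call
        time.sleep(0)               # the application could compute here instead
    wire.join()
    return bytes(buffer[:status["bytes"]])
```

The polling loop is where overlap would happen in practice: the application is free to compute between checks of the status word.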

#### Programmable User-Level Networks Enabled API Exploration

- Myrinet (~1994)
  - First commercially available Gb/s standalone network
  - Based on technology developed for Intel MPP networks
  - Initially available for Sun SPARC SBus, later for PCI-based PCs
  - Custom embedded MIPS-based programmable processor (LANai)
  - Myrinet Control Program (MCP) software development environment
  - Destination routed, maximum message size (packets)
  - Numerous APIs and MCPs: AM, FM, GM, PM, MX
- Quadrics QSNet (~2001)
  - Outgrowth of technology developed for Meiko MPP networks
  - Offered several different APIs for user-level networking
  - Provided a development environment for running user-level functions on NIC



# Fixing Semantic Mismatch Between Layers

The majority of interconnect software R&D is spent dealing with the semantic mismatch between what the upper-layer protocols need and what the low-level network software and the underlying hardware provide.

#### RDMA (e.g. InfiniBand Verbs)

- One-sided
  - Allows process to read/write remote memory implicitly
- Zero-copy data transfer
  - No need for intermediate buffering in host memory
- Low CPU overhead
  - Decouples host processor from network
- Fixed memory resources
  - No unexpected messages
- Supports unstructured, non-blocking data transfer
  - Completion is a local event

#### **MPI Point-to-Point**

- Two-sided
  - Short messages are copied
  - Long messages need rendezvous
- CPU involved in every message
  - Message matching
- Unexpected messages
  - Need flow control
- Completion may be non-local
  - Need control messages
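The message matching and unexpected-message handling listed above can be sketched as two queues. This is a simplified model, not the code of any MPI implementation: `MatchEngine`, `post_recv`, `arrive`, and the `ANY` wildcard are names chosen for the example, and matching uses only a source and a tag.

```python
from collections import deque

ANY = -1  # wildcard, similar in spirit to MPI_ANY_SOURCE / MPI_ANY_TAG


class MatchEngine:
    def __init__(self):
        self.posted = deque()      # receives waiting for a message
        self.unexpected = deque()  # messages that arrived before their receive

    @staticmethod
    def _matches(want_src, want_tag, src, tag) -> bool:
        return want_src in (ANY, src) and want_tag in (ANY, tag)

    def post_recv(self, src, tag):
        """Post a receive; return the payload if a matching message is waiting."""
        for i, (m_src, m_tag, data) in enumerate(self.unexpected):
            if self._matches(src, tag, m_src, m_tag):
                del self.unexpected[i]
                return data
        self.posted.append((src, tag))
        return None

    def arrive(self, src, tag, data) -> bool:
        """Handle an incoming message; True if it matched a posted receive."""
        for i, (want_src, want_tag) in enumerate(self.posted):
            if self._matches(want_src, want_tag, src, tag):
                del self.posted[i]
                return True
        # No receive is posted yet, so the message must be buffered. The
        # unbounded growth of this queue is why flow control is needed.
        self.unexpected.append((src, tag, data))
        return False
```

When this logic runs in the MPI library on the host, the CPU is involved in every message. That is the mismatch with RDMA, which has no notion of matching or unexpected messages.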

#### How to Implement MPI over RDMA (2002-2008)

- Mvapich-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand, Int'l Conference on Parallel and Distributed Computing, Miami, FL, Apr. 2008
- Designing Passive Synchronization for MPI-2 One-Sided Communication to Maximize Overlap, Int'l Conference on Parallel and Distributed Computing, Miami, FL, Apr. 2008
- MPI-2 One Sided Usage and Implementation for Read Modify Write operations: A case study with HPCC, EuroPVM/MPI 2007, Sept. 2007.
- Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram, IEEE International Conference on Cluster Computing (Cluster'07), Austin, TX, September 2007.
- High Performance MPI over iWARP: Early Experiences, Int'l Conference on Parallel Processing, XiAn, China, September 2007.
- High Performance MPI Design using Unreliable Datagram for Ultra-Scale InfiniBand Clusters, 21st Int'l ACM Conference on Supercomputing, June 2007.
- Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach, Int'l Symposium on Cluster Computing and the Grid (CCGrid), Rio de Janeiro - Brazil, May 2007
- Hot-Spot Avoidance With Multi-Pathing Over InfiniBand: An MPI Perspective, Int'l Symposium on Cluster Computing and the Grid (CCGrid), Rio de Janeiro - Brazil, May 2007
- High Performance MPI on IBM 12x InfiniBand Architecture, International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS), held in conjunction with IPDPS '07, March 2007.
- High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis, SuperComputing (SC 06), November, 2006.
- Efficient Shared Memory and RDMA based design for MPI\_Allgather over InfiniBand, EuroPVM/MPI, September 2006.
- Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand, Hot Interconnect (HOTI 06), August, 2006.
- MPI over uDAPL: Can High Performance and Portability Exist Across Architectures?, Int'l Symposium on Cluster Computing and the Grid (CCGrid), Singapore, May 2006.
- Shared Receive Queue based Scalable MPI Design for InfiniBand Clusters, Int'l Parallel and Distributed Processing Symposium (IPDPS '06), April 2006, Rhodes Island, Greece.
- Adaptive Connection Management for Scalable MPI over InfiniBand, International Parallel and Distributed Processing Symposium
- Efficient SMP-Aware MPI-Level Broadcast over InfiniBand's Hardware Multicast, Communication Architecture for Clusters (CAC) Workshop, held in conjunction with Int'l Parallel and Distributed Processing Symposium (IPDPS '06), April 2006, Rhodes Island, Greece.
- RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits, Symposium on Principles and Practice of Parallel Programming,
- High Performance RDMA Based All-to-all Broadcast for InfiniBand Clusters, International Conference on High Performance Computing (HiPC 2005)
- Supporting MPI-2 One Sided Communication on Multi-Rail InfiniBand Clusters: Design Challenges and Performance Benefits, International Conference on High Performance Computing (HiPC 2005), December 18-21, 2005, Goa, India.

- Designing a Portable MPI-2 over Modern Interconnects Using uDAPL Interface, EuroPVM/MPI 2005, Sept. 2005.
- Efficient Hardware Multicast Group Management for Multiple MPI Communicators over InfiniBand, EuroPVM/MPI 2005, Sept. 2005.
- Design Alternatives and Performance Trade-offs for Implementing MPI-2 over InfiniBand, EuroPVM/MPI 2005, Sept. 2005.
- Can Memory-Less Network Adapters Benefit Next-Generation InfiniBand Systems?, Hot Interconnect (HOTI 05), August, 2005.
- Analysis of Design Considerations for Optimizing Multi-Channel MPI over InfiniBand, Workshop on Communication Architecture on Clusters
- Scheduling of MPI-2 One Sided Operations over InfiniBand, Workshop on Communication Architecture on Clusters (CAC 05) in conjunction with International Parallel and Distributed Processing Symposium
- Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation. SuperComputing Conference, Nov 6-12, 2004, Pittsburgh, Pennsylvania.
- Efficient Barrier and Allreduce on IBA clusters using hardware multicast and adaptive algorithms, IEEE Cluster Computing 2004, Sept. 20-23 2004, San Diego, California.
- Zero-Copy MPI Derived Datatype Communication over InfiniBand, EuroPVM/MPI 2004, Sept. 19-22 2004, Budapest, Hungary.
- Efficient Implementation of MPI-2 Passive One-Sided Communication on InfiniBand Clusters, EuroPVM/MPI 2004, Sept. 19-22 2004, Budapest, Hungary.
- Efficient and Scalable All-to-All Exchange for InfiniBand-based Clusters. International Conference on Parallel Processing (ICPP-04), Aug. 15-18, 2004, Montreal, Quebec, Canada.
- Design and Implementation of MPICH2 over InfiniBand with RDMA Support. Int'l Parallel and Distributed Processing Symposium (IPDPS 04), April, 2004.
- Fast and Scalable MPI-Level Broadcast using InfiniBand's Hardware Multicast Support. Int'l Parallel and Distributed Processing Symposium (IPDPS 04), April, 2004.
- High Performance Implementation of MPI Datatype Communication over InfiniBand. Int'l Parallel and Distributed Processing Symposium (IPDPS 04), April, 2004.
- Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand. Workshop on Communication Architecture for Clusters (CAC 04)
- High Performance MPI-2 One-Sided Communication over InfiniBand. IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 04), April, 2004.
- Fast and Scalable Barrier using RDMA and Multicast Mechanisms for InfiniBand-Based Clusters. Euro PVM/MPI Conference, September 29-Oct 2, 2003, Venice, Italy.
- High Performance RDMA-Based MPI Implementation over InfiniBand. 17th Annual ACM International Conference on Supercomputing. San Francisco Bay Area. June, 2003.
- Impact of On-Demand Connection Management in MPI over VIA, Cluster '02, Sept. 2002.

#### **Network Portability Abstraction Layers Abound**



#### Recent Efforts to Develop Lower-Level Transport APIs

Figure: Open MPI structure, a high-level overview.

# Motivation for Low-Level Transport APIs

- Targeting a single programming model is too limiting (top-down approach)
  - MPI: MPICH, Open MPI
  - PGAS: GASNet, OpenSHMEM
  - I/O
  - Big Data
- Desire to reduce development costs
  - Provide one network abstraction for all ULPs
  - Large porting effort is a strong indication of the semantic mismatch
- "Thin is in"
  - Optimize to a semantic mismatch
  - Get as close to the functionality you don't really want as possible
    - Communication as well as memory management (page pinning)
- Vendor differentiation
  - Which really defeats the portability goal

## **Some Fundamental Principles**

- More layers of software degrade performance (and scalability)
- Hardware almost always outperforms software
- Software fixes to hardware are usually really slow

# Red Storm – Prototype for Cray XT Series

- Architected by Sandia, engineered jointly with Cray
  - Sandia contributed to the design of the SeaStar network interface and router
- Sandia also developed
  - Lightweight kernel compute node OS
  - Scalable parallel job launching system
  - Portals high-performance interconnect programming interface
  - SeaStar firmware
- 140+ systems to 80 different customers worldwide
  - Including ORNL, NERSC, and LANL
- Following Red Storm, Cray's market share rose from 6% in 2002 to 21% in 2007\*
- Revenue of \$1B +
- Basis of Cray's business today

\*Source: IDC #209251 Technical Computing Systems: Competitive Analysis, November 2007



SeaStar was a PowerPC-Based System-on-a-Chip (SOC)



# Onload Versus Offload Argument (~2005)

- Why design a custom NIC for offload?
- Just dedicate a core
  - A 3 GHz Xeon will outperform a 500 MHz embedded processor on network protocol processing
  - A custom ASIC is way too expensive (especially for the small HPC market)
- Cost will go down as core count increases
- Cores won't be getting slower, right?

## Core Clock Frequency Stalled in ~2007




### **Cray Core Specialization**

- Dedicate "OS" cores to handle MPI progress
  - MPI progress threads run on a dedicated set of cores
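The structure of a dedicated progress thread can be sketched as below. This shows only the software pattern: `ProgressThread` and its methods are hypothetical names, the "communication work" is an arbitrary callable, and no core binding is done. On a real system the thread would be bound to one of the reserved cores, and Python's global interpreter lock means this sketch shows the structure rather than the performance benefit.

```python
import queue
import threading


class ProgressThread:
    """Background thread that advances communication while the app computes."""

    def __init__(self):
        self._work = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        # A real system would bind this thread to a reserved "OS" core here.
        while True:
            item = self._work.get()
            if item is None:  # shutdown sentinel
                break
            operation, done = item
            operation()       # advance one pending communication operation
            done.set()        # signal completion to the application thread

    def submit(self, operation) -> threading.Event:
        """Hand an operation to the progress thread; returns its completion event."""
        done = threading.Event()
        self._work.put((operation, done))
        return done

    def shutdown(self) -> None:
        self._work.put(None)
        self._thread.join()
```

The application thread calls `submit()`, keeps computing, and checks the returned event only when it needs the result, so progress does not depend on the application entering the communication library.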

Paper shown on the slide: "Leveraging the Cray Li... Specialization Feature to Realize MPI Asynchronous Progress ...", by Howard Pritchard, Duncan Roweth, David Henseler, and Paul Cassella.

Abstract (partial): Cray has enhanced the Linux operating system in ... differentiated use of the compute cores available on Cray ... are dedicated to running the parallel application while one ... The MPICH2 MPI implementation has been enhanced to ... independent progress. In this paper, we describe how the ... features of the XE Gemini Network Interface to obtain ... benchmarks and applications.
                                                                                                                                                                         | IE compute nodes. With CoreSpec, most cores on a node<br>or more cores are reserved for OS and service threads,<br>make use of this CoreSpec feature to better support MPI<br>MPI implementation uses CoreSpec along with hardware                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| Index Terms—MPI, CLE, core specialization, asynchronous                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                         | s progress                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| 1 INTRODUCTION<br>The importance of overlapping computation<br>with communication and hodpenderate programs<br>with communication and hodpenderate programs<br>between the second second second second second<br>transfer and the second second second second second<br>protection, many MPI applications are not struc-<br>tured to take advantage of auto-capabilities,<br>and the second second second second second second<br>for this capability, including hardware-based<br>second second second second second second second<br>protection and second second second second second<br>for this capability, including hardware-based<br>protection is which the network adapter itself<br>offsate of the AMP protocol form the applica-<br>dition of the AMP protocol form the applica-<br>tion and the AMP protocol form the applica-<br>dition of the AMP protocol form the applica-<br>dition of the AMP protocol form the applica-<br>tion of the AMP protocol form the application of the AMP<br>protocol form of the AMP protocol form the AMP protocol<br>form the AMP protocol form the AMP protocol form the AMP<br>protocol form of the AMP protocol form the AMP protocol<br>form of the AMP protocol form of the AMP protocol form of the AMP protocol<br>form of the AMP protocol form of the AMP protocol form of the AMP protocol<br>form of the AMP protocol form of the AMP protocol form of the AMP protocol<br>form of the AMP protocol form of the AMP protocol form of the AMP protocol<br>form of the AMP protocol form of the AMP protocol form of the AMP protocol<br>form of the AMP protocol form | alized host software-based approaches whit<br>take initiation of modern multi-core proce-<br>tions of the software of the software of the<br>work adapter has features intended to assit-<br>per the software of the software of the<br>progression of MH and for allowing for over<br>log of communication with computation. 
It<br>applies that and the software of the software<br>progression of MH and for allowing for over<br>log of communication with computation. It<br>applies that software have a paryoches. The<br>software of the software of the software of the<br>composition of the software of the software description<br>result in the software of the software description<br>of the software of the Core Specialization for the<br>composition, mode to MHCICL to realize the<br>composition of the software description of the<br>software of the Core Specialization for the<br>work are described in Section 3. Section<br>better support for independent progress as then with<br>better support for independent progress as<br>the software. The software of the core of the software of the<br>software of the software of the software as follows<br>first, an overview of the Core Specialization<br>docubes the approach Core has then with<br>better support for independent progress as<br>communication computation overlap. In section 1, software of the core of the software of the core specialization of the<br>work are described in Section 3. Section 3.<br>Section 3. |

#### S3D Time Step Summary

| # Application Threads | Progression disabled | Progression enabled |
|-----------------------|----------------------|---------------------|
| 14                    | 4.77                 | 3.93                |
| 15                    | 4.68                 | 4.05                |
| 16                    | 4.59                 | 4.06                |

#### MILC Run Time Summary (secs)

| Run Type                                            | 4096 ranks | 8192 ranks |
|-----------------------------------------------------|------------|------------|
| No progression                                      | 2165       | 1168       |
| Progression (phase 1)                               | 2121       | 1072       |
| Progression (phase 2)                               | 3782       | 2138       |
| Progression (phase 1), no reserved cores            | 3560       | 2210       |
| Progression (phase 1), reserve core but no corespec | 2930       | 2070       |
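To make the MILC numbers easier to compare, the sketch below normalizes each run type against the "No progression" baseline. The run times are copied from the table; the dictionary layout and variable names are illustrative choices, not part of the original study.

```python
# MILC run times in seconds, copied from the table: (4096 ranks, 8192 ranks).
milc = {
    "No progression": (2165, 1168),
    "Progression (phase 1)": (2121, 1072),
    "Progression (phase 2)": (3782, 2138),
    "Progression (phase 1), no reserved cores": (3560, 2210),
    "Progression (phase 1), reserve core but no corespec": (2930, 2070),
}

base_4096, base_8192 = milc["No progression"]

# Ratio of each run time to the baseline; values below 1.0 mean faster.
relative = {
    name: (t4096 / base_4096, t8192 / base_8192)
    for name, (t4096, t8192) in milc.items()
}

# Run types that beat the baseline at both scales.
faster_than_baseline = [
    name for name, (r4096, r8192) in relative.items()
    if r4096 < 1.0 and r8192 < 1.0
]

for name, (r4096, r8192) in relative.items():
    print(f"{name}: {r4096:.2f}x at 4096 ranks, {r8192:.2f}x at 8192 ranks")
print("Faster than baseline at both scales:", faster_than_baseline)
```

In this data, only phase 1 progression with reserved cores and CoreSpec beats the baseline at both scales; the other progression configurations run roughly 1.35x to 1.9x slower.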

# Portals 4 Reference Implementation

- OpenFabrics Verbs, UDP, shared memory transports
- Initial implementation by System Fabric Works
- Provides a high-performance reference implementation for experimentation
- Helps identify issues with API, semantics, performance, etc.
- Independent analysis of the specification
- Enables development of ULPs

# Reference implementation issues

- Extra layer of software between ULP and hardware
  - Impacts latency performance
  - We are violating one of our fundamental principles
- Needs a progress thread
  - Impacts latency performance
  - Issues when ULP wants a progress thread too
- Do we modify API for hardware we have or continue to design for hardware we need?
- Portals should be slow or we're not doing our job

#### **Portals Hardware and Co-Design Activities**



# Active Messages (AM)

- T. von Eicken, et al.: "Active Messages: A Mechanism for Integrated Communication and Computation" (1992)
- Lots of different flavors
  - Pure active messages
    - Origin sends message to target containing code and data
    - Target invokes code on that data
  - Generalized active messages
    - Origin sends message to target containing function id and data
    - Function id maps to existing code in target's address space
- Similar to remote procedure invocation without returning a result to origin
- Semantically equivalent to blocking a thread on an incoming message and invoking a handler when the message arrives

## **Issues with Active Messages**

- Data delivery
  - Who determines where the data goes: origin or target?
  - How much data can be delivered?
- Handlers
  - When are resources allocated: a priori or on arrival?
  - What can be called?
  - Where do they run (context)?
  - When do they run relative to message delivery?
  - How long do they run?
  - Why?
    - One-sided messages decouple processor from network
    - Active messages tightly couple processor and network
      - Active messages aren't one-sided
      - Memory is the endpoint, not the cores
    - Lightweight mechanism for sleeping/waking thread on memory update
      - Why go through the network API for this?
- Scheduling lots of unexpected thread invocations leads to flow control issues

# Is There a Better Way to Get AM Semantics?

- Cores are slower, more energy-efficient
  - Modern cores require 15-20 ns to access L3 cache
    - Haswell 34 cycles
    - Skylake 44 cycles
- Terabit per second networks are coming
  - 400 Gib/s can deliver a 64-byte message every 1.2 ns
- Need to remove processor from network processing path (offload)
- RDMA only supports data transfer between virtual memory spaces
  - Data is placed blindly into memory
  - Need varying levels of steering the data at the target

## streaming Processing In the Network (sPIN)



Hoefler, Di Girolamo, Taranov, Grant, Brightwell. "sPIN: High-Performance Streaming Processing in the Network," in Proceedings of SC'17, November 2017.

## sPIN is not Active Messages

- Tightly integrated NIC packet processing
- AMs are invoked on full messages
  - sPIN works on packets
  - Allows for pipelining packet processing
- AM uses host memory for buffering messages
  - sPIN stores packets in fast buffer memory on the NIC
  - Accesses to host memory are allowed but should be minimized
- AM messages are atomic
  - sPIN packets can be processed atomically

# sPIN Approach

- Handlers are executed on NIC Handler Processing Units (HPUs)
- Simple runtime manages HPUs
- Each handler owns shared memory that is persistent for the lifetime of a message
  - Handlers can use this memory to keep state and communicate
- NIC identifies all packets belonging to the same message
- Three handler types
  - Header handler: first packet in a message
  - Payload handler: all subsequent packets
  - Completion handler: after all payload handlers complete
- HPU memory is managed by the host OS
- Host compiles and offloads handler code to the HPU
- Handler code is only a few hundred instructions
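The three handler types and the per-message shared state can be modeled in a few lines. This is a toy host-side simulation, not the sPIN interface: `ToyNic`, `Packet`, `MessageState`, and the checksum logic are hypothetical, and the model assumes the first packet of a message is processed before its other packets.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Packet:
    msg_id: int      # lets the NIC match packets to a message
    seq: int         # position within the message (0 = first packet)
    total: int       # number of packets in the message
    payload: bytes


@dataclass
class MessageState:
    """Shared memory that persists for the lifetime of one message."""
    bytes_seen: int = 0
    checksum: int = 0
    packets_done: int = 0


def header_handler(pkt: Packet, state: MessageState) -> None:
    state.bytes_seen = len(pkt.payload)
    state.checksum = sum(pkt.payload) & 0xFFFF


def payload_handler(pkt: Packet, state: MessageState) -> None:
    state.bytes_seen += len(pkt.payload)
    state.checksum = (state.checksum + sum(pkt.payload)) & 0xFFFF


def completion_handler(msg_id: int, state: MessageState,
                       results: Dict[int, Tuple[int, int]]) -> None:
    results[msg_id] = (state.bytes_seen, state.checksum)


class ToyNic:
    def __init__(self) -> None:
        self.states: Dict[int, MessageState] = {}
        self.results: Dict[int, Tuple[int, int]] = {}

    def receive(self, pkt: Packet) -> None:
        if pkt.seq == 0:
            # First packet: allocate message state, run the header handler.
            state = self.states[pkt.msg_id] = MessageState()
            header_handler(pkt, state)
        else:
            # Subsequent packets: run the payload handler on existing state.
            state = self.states[pkt.msg_id]
            payload_handler(pkt, state)
        state.packets_done += 1
        if state.packets_done == pkt.total:
            completion_handler(pkt.msg_id, state, self.results)
            del self.states[pkt.msg_id]  # state ends with the message


nic = ToyNic()
# Two interleaved messages; payload packets of message 7 arrive out of order.
for pkt in [Packet(7, 0, 3, b"\x01\x02"), Packet(9, 0, 2, b"\x0a"),
            Packet(7, 2, 3, b"\x04"), Packet(9, 1, 2, b"\x0b"),
            Packet(7, 1, 3, b"\x03")]:
    nic.receive(pkt)
print(nic.results)
```

Because each handler touches only one packet plus the small shared state, packets from different messages can interleave, which is what allows packet processing to be pipelined.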

# sPIN Approach (cont'd)

- Handlers are written in standard C/C++ code
- No system calls or complex libraries
- Handlers are compiled to the specific Network ISA
- Handler resources are accounted for on a per-application basis
  - Handlers that run too long may stall NIC or drop packets
- Programmers need to ensure handlers run at line rate
- Handlers can start executing within a cycle after packet arrival
  - Assuming an HPU is available
- Handlers execute in a sandbox relative to host memory
  - They can only access application's virtual address space
  - Access to host memory is via DMA
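The line-rate requirement translates into a per-packet cycle budget. The sketch below estimates that budget under loudly hypothetical parameters: 4 HPUs, a 2.5 GHz HPU clock, and 4096-byte packets are assumptions chosen for illustration, and only the 400 Gib/s rate and 64-byte message size appear elsewhere in this talk. `cycle_budget` is an illustrative helper, not part of sPIN.

```python
def cycle_budget(packet_bytes: int, link_bits_per_s: float,
                 num_hpus: int, hpu_hz: float) -> float:
    """Average HPU cycles available per packet while keeping up with the link.

    With num_hpus handlers running in parallel, each handler may take up to
    num_hpus packet arrival intervals before the NIC falls behind.
    """
    arrival_s = packet_bytes * 8 / link_bits_per_s
    return num_hpus * arrival_s * hpu_hz


LINK = 400 * 2 ** 30   # 400 Gib/s, reading "Gib" as 2^30 bits (assumption)
HPUS = 4               # hypothetical HPU count
CLOCK = 2.5e9          # hypothetical HPU clock in Hz

large = cycle_budget(4096, LINK, HPUS, CLOCK)   # hypothetical packet size
small = cycle_budget(64, LINK, HPUS, CLOCK)     # 64-byte message from the talk

print(f"4096-byte packets: about {large:.0f} cycles per handler")
print(f"64-byte packets:   about {small:.0f} cycles per handler")
```

With these assumed numbers, large packets leave several hundred cycles per handler, while small packets leave only about a dozen. That is one way to see why handlers that run too long can stall the NIC or force it to drop packets.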

## **Expect More Hardware Specialization in HPC**

[Figure: block diagram of a quad-core ARM Cortex-A9 SoC, with blocks for System Control, CPU Platform, Multimedia (hardware graphics accelerators, video codecs, image processing unit), Connectivity, Power Management, Internal Memory, and Security]

Smartphone SoC circa 2016: dozens of kinds of integrated HW acceleration



[www.anandtech.com/show/8562/chipworks-a8]

Apple A8 SoC

#### **Open Source Hardware**



## DARPA Effort to Enable Open Source Hardware



# Acknowledgments

- Sandia
  - Ryan Grant
  - Scott Hemmert
  - Kevin Pedretti
- ETH Zurich
  - Torsten Hoefler
  - Salvatore Di Girolamo
  - Konstantin Taranov