## 2022 Argonne Training Program on Extreme-Scale Computing (ATPESC)



## Introduction to the AI Testbed and Hands-on

Zhen Xie, Murali Emani, Siddhisanket Raskar, Varuni Sastry, William Arnold, Bruce Wilson, Rajeev Thakur, Venkatram Vishwanath

Argonne Leadership Computing Facility (ALCF)

Argonne National Laboratory, Lemont, IL 60439

www.anl.gov

## **ALCF AI Testbed**

- The ALCF AI Testbed provides an infrastructure for the next generation of AI-accelerator machines.
  - The AI Testbed aims to help evaluate the usability and performance of machine learning-based high-performance computing applications running on these accelerators. The goal is to better understand how to integrate these accelerators with existing and upcoming supercomputers at the facility to accelerate scientific insights.



Cerebras CS-2 Wafer-Scale Deep Learning Accelerator



SambaNova Dataflow Accelerator



Graphcore MK1 Graphcore Intelligent Processing Unit (IPU)



Groq Tensor Streaming Processor



Habana Gaudi Tensor Processing Cores

https://www.alcf.anl.gov/support/ai-testbed-userdocs/index.html

- For CS: https://www.alcf.anl.gov/support/ai-testbed-userdocs/cerebras/System-Overview/index.html
- For SN: https://www.alcf.anl.gov/support/ai-testbed-userdocs/sambanova/System-Overview/index.html





**Hardware** 

Source: White Paper: AI Changes Everything - SambaNova Systems (https://sambanova.ai/wp-content/uploads/2021/06/SambaNova\_AlisChangingEverything2021\_Whitepaper\_English.pdf)




## Motivation of hardware design

- A Flexible Dataflow Substrate: Parallel Patterns
  - Looping abstractions with extra information on parallelism & access patterns
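To make the idea of parallel patterns concrete, here is a minimal sketch in plain Python. It is not SambaNova's API: the function names `dot_loop` and `dot_patterns` are hypothetical, and Python's built-in `map` and `functools.reduce` merely stand in for the kind of pattern a dataflow compiler can reason about. The point is that the pattern form states which work is independent and which combine step is associative, information that an ordinary loop hides.

```python
from functools import reduce
from operator import add, mul


def dot_loop(xs, ys):
    """Plain loop: iteration order and the running total are baked in."""
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total


def dot_patterns(xs, ys):
    """Same dot product expressed as an elementwise map followed by a reduce."""
    # Map over paired elements: every product is independent of the others.
    products = map(mul, xs, ys)
    # Reduce with an associative operator: free to be evaluated as a tree.
    return reduce(add, products, 0.0)


xs = [1.0, 2.0, 3.0]
ys = [4.0, 5.0, 6.0]
print(dot_loop(xs, ys), dot_patterns(xs, ys))
```

Both functions return the same value; only the second one exposes the parallelism and access pattern explicitly.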



## Motivation of hardware design

• Reconfigurable Dataflow Architecture (RDA)






• Reconfigurable Dataflow Unit (RDU)



#### SambaNova Systems Cardinal SN10 RDU

World's First Reconfigurable Dataflow Unit



• Rapid Dataflow Compilation to RDU





• DataScale SN10-8R: Scalable performance for training and inference






• Excelling at Model and Data Parallel Execution Models







**Software** 




Full-stack co-engineering delivers each optimization at the layer of the stack where it has the highest impact.





SambaFlow Open Software for DataScale Systems








- `t1 = conv(in)`, `t2 = pool(t1)`, `t3 = conv(t2)`, `t4 = norm(t3)`, `t5 = sum(t4)`
- Time: kernel-by-kernel execution



Traditional compilers map operations to processor instructions in time

Communication through the memory hierarchy is implicit and handled by hardware



Dataflow compilers map operations to instructions in time and in space, and program the communication between them.

SambaFlow eliminates overhead and maximizes utilization.
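The contrast between the two execution styles can be sketched in plain Python. This is an illustration only, not SambaFlow code: `conv`, `pool`, and `norm` are hypothetical arithmetic stand-ins for the real operators, and Python generators run sequentially, so they model only the streaming of data between stages, not execution spread across the chip.

```python
def conv(x):
    return x * 2  # stand-in for a convolution


def pool(x):
    return x + 1  # stand-in for pooling


def norm(x):
    return x / 4  # stand-in for normalization


def kernel_by_kernel(batch):
    """Each kernel finishes over the whole batch and materializes its
    output before the next kernel starts."""
    t1 = [conv(x) for x in batch]
    t2 = [pool(x) for x in t1]
    t3 = [conv(x) for x in t2]
    t4 = [norm(x) for x in t3]
    return sum(t4)


def stage(fn, upstream):
    """One pipeline stage: apply fn to items as they arrive."""
    for item in upstream:
        yield fn(item)


def dataflow(batch):
    """Stages are chained, so each item streams through all of them
    without an intermediate list being stored."""
    pipeline = iter(batch)
    for fn in (conv, pool, conv, norm):
        pipeline = stage(fn, pipeline)
    return sum(pipeline)


batch = [1.0, 2.0, 3.0]
print(kernel_by_kernel(batch), dataflow(batch))
```

Both versions compute the same result; they differ in whether `t1` through `t4` ever exist as full intermediate tensors.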



#### A bit about precision

The main idea of bfloat16 is to provide a 16-bit floating-point format with the same dynamic range as standard IEEE FP32, but with less precision. It does so by keeping the exponent field at the FP32 width of 8 bits and shrinking the fraction field from FP32's 23 bits to 7 bits. With values half the size of FP32, SambaNova can provide better training throughput with bfloat16.
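The bit layout described above can be checked with a small sketch that uses only the Python standard library. The helper names are hypothetical, and the conversion simply truncates the low 16 bits of the FP32 pattern. That is a simplifying assumption: real hardware typically rounds instead of truncating.

```python
import struct


def to_bfloat16_bits(value):
    """Return the 16-bit bfloat16 pattern obtained by truncating FP32."""
    (fp32_bits,) = struct.unpack(">I", struct.pack(">f", value))
    # Keep the sign bit, all 8 exponent bits, and the top 7 fraction bits.
    return fp32_bits >> 16


def from_bfloat16_bits(bits):
    """Expand a bfloat16 pattern back to a Python float via FP32."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits << 16))
    return value


def round_trip(value):
    return from_bfloat16_bits(to_bfloat16_bits(value))


print(hex(to_bfloat16_bits(1.0)))  # sign 0, exponent 0x7F, fraction 0
print(round_trip(3.14159265))      # only about 2-3 decimal digits survive
print(round_trip(1.0e38))          # large magnitudes stay representable
```

The second print shows the reduced precision, while the third shows that the FP32 exponent range is preserved.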







#### **Hardware**

Source: White Paper for Cerebras System (https://www.cerebras.net/category/whitepaper/)




## Motivation of hardware design

#### GPU approach

- Less than 10% of silicon area used for deep learning
  - More than 90% used for graphics: Raster Engines, Shaders, Texture Maps, Thread and Instruction Control
- Memory is far from graphics core
  - Little on-chip memory
  - Cache memory hierarchy
- Graphics cores not built for communication
  - On chip: low bandwidth, through memory
  - Off chip: even lower bandwidth, PCIe/NVLink
- Designed for dense-matrix operations
  - Sparsity devastates performance
  - Implemented as CPU co-processor





• Cerebras CS-2: The world's only purpose-built Deep Learning solution



**Cerebras WSE-2** 2.6 Trillion Transistors 46,225 mm<sup>2</sup> Silicon



Largest GPU 54.2 Billion Transistors 826 mm<sup>2</sup> Silicon

#### Cerebras Wafer Scale Engine (WSE), first generation

#### The Most Powerful Processor for AI

- 400,000 AI-optimized cores
- 46,225 mm<sup>2</sup> silicon
- 1.2 trillion transistors
- 18 Gigabytes of on-chip memory
- 9 PByte/s memory bandwidth
- 100 Pbit/s fabric bandwidth
- TSMC 16nm process





- The CS WSE architecture is built for deep learning
- AI-optimized **compute**
  - Fully-programmable core, ML-optimized extensions
    - e.g. arithmetic, logical, load/store, branch
  - Dataflow architecture optimized for sparse, dynamic workloads
    - Higher performance and efficiency for sparse NN





- The CS WSE architecture is built for deep learning
- AI-optimized **memory**
  - Traditional memory architectures place shared memory far from compute
  - The right answer is distributed, high performance, on-chip memory

**Traditional Memory Architecture** 



**Cerebras Memory Architecture** 



Memory uniformly distributed across cores



- The CS WSE architecture is built for deep learning
- AI-optimized **communication**
  - High bandwidth, low latency cluster-scale networking on chip
  - Fully-configurable to user-specified topology





• Cerebras CS-2: Comparison with NVIDIA A100 GPU

| Metric           | Cerebras WSE-2        | A100                | Cerebras Advantage |
|------------------|-----------------------|---------------------|--------------------|
| Chip size        | 46,225 mm<sup>2</sup> | 826 mm<sup>2</sup>  | 56 X               |
| Cores            | 850,000               | 6,912 + 432         | 123 X              |
| On-chip memory   | 40 Gigabytes          | 40 Megabytes        | 1,000 X            |
| Memory bandwidth | 20 Petabytes/sec      | 1,555 Gigabytes/sec | 12,862 X           |
| Fabric bandwidth | 220 Petabits/sec      | 600 Gigabytes/sec   | 45,833 X           |

Table 1. Overview of the magnitude of advancement made by the Cerebras WSE-2.
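The "Cerebras Advantage" column can be reproduced from the other two columns with a short calculation. The sketch below makes two assumptions that the table does not state: decimal unit prefixes throughout, and a core ratio that counts only the 6,912 figure for the A100. Dividing by 6,912 + 432 gives about 116 rather than the 123 shown. The fabric row needs a bits-versus-bytes conversion because the two systems are quoted in different units.

```python
PETA, GIGA, MEGA = 10**15, 10**9, 10**6
BITS_PER_BYTE = 8

# (Cerebras WSE-2, A100) pairs, converted to a common unit per row.
rows = {
    "Chip size (mm^2)": (46_225, 826),
    "Cores": (850_000, 6_912),
    "On-chip memory (bytes)": (40 * GIGA, 40 * MEGA),
    "Memory bandwidth (bytes/s)": (20 * PETA, 1_555 * GIGA),
    "Fabric bandwidth (bits/s)": (220 * PETA, 600 * GIGA * BITS_PER_BYTE),
}

advantage = {name: round(wse / a100) for name, (wse, a100) in rows.items()}

# Alternative core ratio if both A100 core counts are summed.
cores_all_a100 = round(850_000 / (6_912 + 432))

for name, ratio in advantage.items():
    print(f"{name}: {ratio:,} X")
print(f"Cores, counting 6,912 + 432: {cores_all_a100} X")
```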







• Cerebras Software Stack handles graph compilation



- **Extract**: Obtain a graph representation of the model from the framework and express it in the Cerebras intermediate form.
- **Match**: Consult the kernel library for kernels that implement portions of the model.
- **Place & Route**: Assign kernels to regions of the fabric, guided by graph connectivity and kernel performance functions.
- **Link**: Create executable output that can be loaded and run by the CS-1.



• Program using familiar ML Frameworks

The user starts as usual by developing their ML model.

Cerebras integrates with popular ML Frameworks so researchers can write their models using familiar tools.

TensorFlow, PyTorch



Model extraction from ML Framework -> LAIR

Cerebras LAIR (Linear Algebra Intermediate Representation) is the standard input into the Cerebras software stack.

We extract the explicit linear algebra graph representation of the model from the ML Framework and translate it into LAIR.







#### **Hands-on Section on SambaNova and Cerebras**



## **BERT (language model) in the hands-on section**

- Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based machine learning technique for natural language processing (NLP) pre-training developed by Google.
- The original English-language BERT has two models:
  - (1) BERT\_BASE: 12 encoders with 12 bidirectional self-attention heads;
  - (2) BERT\_LARGE: 24 encoders with 16 bidirectional self-attention heads.
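To give a feel for the size difference between the two models, the sketch below estimates their parameter counts. Only the encoder and head counts come from the slide. The hidden sizes (768 and 1024), the 30,522-token vocabulary, the 512 positions, and the 4x feed-forward width are assumptions taken from the published BERT configurations, and the `BertConfig` class and `parameter_count` function are hypothetical helpers, not part of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertConfig:
    layers: int                # number of encoders (from the slide)
    heads: int                 # self-attention heads (from the slide)
    hidden: int                # assumed hidden size
    vocab: int = 30_522        # assumed vocabulary size
    max_positions: int = 512   # assumed maximum sequence length
    type_vocab: int = 2        # assumed number of segment types


def parameter_count(cfg):
    """Count encoder weights and biases; the head count does not change it."""
    h = cfg.hidden
    ffn = 4 * h
    # Token, position, and segment embeddings plus one LayerNorm.
    embeddings = (cfg.vocab + cfg.max_positions + cfg.type_vocab) * h + 2 * h
    # Q, K, V, and output projections plus one LayerNorm.
    attention = 4 * (h * h + h) + 2 * h
    # Two dense layers plus one LayerNorm.
    feed_forward = (h * ffn + ffn) + (ffn * h + h) + 2 * h
    pooler = h * h + h
    return embeddings + cfg.layers * (attention + feed_forward) + pooler


BERT_BASE = BertConfig(layers=12, heads=12, hidden=768)
BERT_LARGE = BertConfig(layers=24, heads=16, hidden=1024)

for name, cfg in (("BERT_BASE", BERT_BASE), ("BERT_LARGE", BERT_LARGE)):
    print(f"{name}: {parameter_count(cfg):,} parameters")
```

Under these assumptions, BERT\_LARGE has roughly three times as many parameters as BERT\_BASE.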





## SambaNova

• 1. Log in to the SambaNova system:

`ssh ALCFUserID@sambanova.alcf.anl.gov`

`ssh sm-01`

• 2. SDK setup:

`source /software/sambanova/envs/sn_env.sh`

• 3. Copy scripts:

```
cp /var/tmp/Additional/slurm/Models/ANL_Acceptance_RC1_11_5/bert_train-inf.sh ~/
```

• 4. Run scripts:

```
cd ~; ./bert_train-inf.sh;
```



## Cerebras

• 1. Log in to the CS-2:

`ssh ALCFUserID@cerebras.alcf.anl.gov`

`ssh cs2-01-med1`

• 2. Copy scripts:

`cp -r /software/cerebras/model_zoo ~/`

`cd modelzoo/transformers/tf/bert`

In `configs/params_bert_large_msl128.yaml`, set `data_dir` to `"/software/cerebras/dataset/bert_large/msl128/"`.

• 3. Run scripts:

`MODELDIR=model_dir_bert_large_msl128_$(hostname)`

`rm -r $MODELDIR`

`time -p csrun_cpu python run.py --mode=train --compile_only --params configs/params_bert_large_msl128.yaml --model_dir $MODELDIR --cs_ip $CS_IP`

`time -p csrun_wse python run.py --mode=train --params configs/params_bert_large_msl128.yaml --model_dir $MODELDIR --cs_ip $CS_IP`

## Thanks!

