

# INTEL® ADVISOR AND ROOFLINE MODEL

## Contacts

Advisor Support Mail List vector.advisor@intel.com

Zakhar Matveev zakhar.a.matveev@intel.com

Intel Advisor Product Architect

Kirill Rogozhin kirill.rogozhin@intel.com

Intel Advisor Project Manager

Egor Kazachkov egor.kazachkov@intel.com

Intel Advisor Senior Developer



## What is Intel® Advisor





## Cache Simulator and MAP



## Python API

## Threading prototyping



print '{} {:.0f} GFLOPS'.format(roof.name, bandwidth



# **VECTORIZATION**

## Get Faster Code Faster! Intel® Advisor

## **Vectorization Optimization**

### Have you:

- Recompiled for AVX2 with little gain?
- Wondered where to vectorize?
- Recoded intrinsics for new arch.?
- Struggled with compiler reports?

### **Data Driven Vectorization:**

- What vectorization will pay off most?
- What's blocking vectorization? Why?
- Are my loops vector friendly?
- Will reorganizing data increase performance?
- Is it safe to just use #pragma omp simd?





# The Right Data At Your Fingertips

Get all the data you need for high impact vectorization



# 5 Steps to Efficient Vectorization

Intel® Advisor – Vectorization Advisor



### 1. Compiler diagnostics + Performance Data + SIMD efficiency information

| Function Call Sites and Loops | Self Time | Vector ISA | Efficiency ▼ | Gain | VL | Traits | Data Types |
|---|---|---|---|---|---|---|---|
| [loop in loopInit at LCALSSuite.cx… | 0.016s | AVX | 75% | 2.99x | 4 | Divisions; Type C… | Float64; … |
| [loop in loopInit at LCALSSuite.cxx:… | 0.016s | AVX | 75% | 2.99x | 4 | Divisions; Type C… | Float64; … |
| [loop in runCForallLambdaLoops a… | 0.672s | AVX; … | 70% | 5.60x | 2; 4; 8 | Extracts; FMA; Ty… | Float64; … |
| [loop in runCRawLoops at runCRaw… | 0.578s | AVX; … | 70% | 5.60x | 2; 4; 8 | Extracts; FMA; Ty… | Float64 |
| [loop in runOMPRawLoops\$omp\$p… | 0.953s | AVX | 69% | 2.75x | 4 | FMA | Float64 |
| [loop in runOMPRawLoops\$omp\$p… | 1.953s | AVX | 68% | 2.74x | 4 | | Float64; … |
| [loop in runARawLoops at runARaw… | 0.734s | AVX2 | 67% | 2.67x | 4 | Blends; Divisions; … | Float32; F… |
| [loop in runAForallLambdaLoops a… | 0.578s | AVX2 | 67% | 2.67x | 4 | Blends; Divisions; … | Float32; F… |

+ Binary Analysis

# Vector Efficiency: All The Data In One Place

## My "performance thermometer"

| Function Call Sites and Loops | Self Time | Vector ISA | Efficiency ▲ | Gain | VL | Traits |
|---|---|---|---|---|---|---|
| [loop in runCForallLambdaLoops at runCForallLar… | 0.734s | AVX; … | 26% | 2.11x | 4; 8 | Extracts; Inserts; Type Conversions |
| [loop in runCRawLoops at runCRawLoops.cxx:704… | 0.625s | AVX; … | 26% | 2.11x | 4; 8 | Extracts; Inserts; Type Conversions |
| [loop in runCForallLambdaLoops at runCForallLar… | 2.703s | AVX2 | 31% | 2.50x | 4; 8 | FMA; Inserts; Permutes; Unpacks |
| [loop in runCRawLoops at runCRawLoops.cxx:117… | 2.609s | AVX2 | 31% | 2.50x | 4; 8 | FMA; Inserts; Permutes; Unpacks |
| [loop in runOMPRawLoops\$omp\$parallel@135 at… | 0.453s | AVX2 | 45% | 1.80x | 4 | Blends; Divisions; FMA; Masked Stores; Square Roots |
| [loop in runAForallLambdaLoops at runAForallLar… | 0.234s | AVX2 | 45% | 1.82x | 4 | Blends; Divisions; FMA; Masked Stores; Square Roots |



Original (scalar) code efficiency. Corresponds to 1x speed-up.

Achieved efficiency.

Upper bound: 100% efficiency, 4x gain (VL=4).

- Auto-vectorization: affected <3% of code
  - With moderate speed-ups
- First attempt to simply put #pragma omp simd:
  - Introduced slow-down
- Look at Vector Issues and Traits to find out why
  - All kinds of "memory manipulations"
  - Usually an indication of "bad" access pattern

Survey: Find out if your code is "under vectorized" and why
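The Efficiency, Gain and VL columns in the Survey tables above appear to be related by a simple ratio: 2.99x at VL=4 is shown as 75%, and 2.11x at VL=8 as 26%. The sketch below reproduces that arithmetic, assuming efficiency is the estimated gain divided by the largest vector length; `vector_efficiency` is an illustrative name, not part of any Advisor API.

```python
def vector_efficiency(gain: float, vector_length: int) -> float:
    """Return efficiency in percent: achieved gain relative to the ideal gain (= VL)."""
    if vector_length <= 0:
        raise ValueError("vector_length must be positive")
    return 100.0 * gain / vector_length


# Gain and VL values copied from the Survey tables above.
print(round(vector_efficiency(2.99, 4)))  # 75
print(round(vector_efficiency(2.11, 8)))  # 26
```

A low value means the loop is vectorized but spends much of its time on overhead such as the extracts, inserts and permutes listed under Traits.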



# Vectorization tied to your code



## Don't Just Vectorize, Vectorize Efficiently

See detailed times for each part of your loops. Is it worth more effort?




# 2. Guidance: detect problem and recommend how to fix it

All Advisor-detectable issues: C++ | Fortran

#### Recommendation: Add data padding

The trip count is not a multiple of vector length. To fix, do one of the following:

- Increase the size of objects and add iterations so the trip count is a multiple of vector length.
- Increase the size of static and automatic objects, and use a compiler option to add data padding.

| Windows* OS               | Linux* OS                 |
|---------------------------|---------------------------|
| /Qopt-assume-safe-padding | -qopt-assume-safe-padding |
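To make the padding recommendation concrete, the sketch below rounds a trip count up to the next multiple of the vector length. It only illustrates the arithmetic; `padded_trip_count` is a hypothetical helper, and the values 110 and 8 are example inputs rather than measurements.

```python
def padded_trip_count(trip_count: int, vector_length: int) -> int:
    """Round trip_count up to the next multiple of vector_length."""
    if trip_count < 0 or vector_length <= 0:
        raise ValueError("trip_count must be >= 0 and vector_length > 0")
    # Ceiling division written with integer arithmetic only.
    return -(-trip_count // vector_length) * vector_length


n, vl = 110, 8  # example inputs
padded = padded_trip_count(n, vl)
print(padded, padded - n)  # 112 2
```

Here two extra iterations (and two extra array elements) would let every iteration run in the vector body.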



## Get Specific Advice For Improving Vectorization



All Advisor-detectable issues: C++ | Fortran



Issue: Ineffective peeled/remainder loop(s) present

Advisor shows hints to move iterations to vector body.

All or some source loop iterations are not executing in the loop body. Improve performance by moving source loop iterations from peeled/remainder loops to the loop body.



Add data padding

The trip count is not a multiple of vector length. To fix: Do one of the following:

- Increase the size of objects and add iterations so the trip count is a multiple of vector length.
- Increase the size of static and automatic objects, and use a compiler option to add data padding.



# Critical Data Made Easy

**Loop Trip Counts** 

Knowing the time spent in a loop is not enough!

| Function Call Sites and Loops | Self Time ▼ | Type | Trip Count (Average) | Trip Count (Min) | Trip Count (Max) | Call Count |
|---|---|---|---|---|---|---|
| [loop in runOMPRawLoops\$omp\$p… | 4.190s | Vectorized+Threaded (Body; Peeled; Remainder) | 2; 110; 2 | 1; 17; 1 | 3; 111; 3 | 112590000 |
| [loop in runOMPRawLoops\$om… | 3.768s | Remainder+Threaded (OpenMP) | 2 | 1 | 3 | 1125900000 |
| [loop in runOMPRawLoops\$om… | 0.406s | Vectorized (Body)+Threaded (OpenMP) | 110 | 17 | 111 | 12320000 |
| [loop in runOMPRawLoops\$om… | 0.016s | Peeled+Threaded (OpenMP) | 2 | 1 | 3 | 880000 |



Find trip counts for each part of a loop
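The peeled, body and remainder parts in the table above come from how a compiler typically splits a scalar loop around its vector body. The following sketch models that split under simple assumptions: the number of peeled iterations is given as an input (for example, to reach an aligned address), and `split_trip_count` is a hypothetical helper, not something Advisor exposes.

```python
def split_trip_count(trip_count: int, vector_length: int, peel: int = 0):
    """Split scalar iterations into (peeled, vector-body iterations, remainder).

    peel is the number of leading scalar iterations executed before the
    vector body; it is an input here, not derived from addresses.
    """
    peel = min(peel, trip_count)
    rest = trip_count - peel
    body_vector_iterations = rest // vector_length  # each covers vector_length elements
    remainder = rest % vector_length                # leftover scalar iterations
    return peel, body_vector_iterations, remainder


print(split_trip_count(100, 8, peel=3))  # (3, 12, 1)
```

When the body count is small compared with the peeled and remainder counts, most of the time goes to the less efficient parts, which is the situation the Survey flags.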

# Precise Repeatable FLOP Metrics

- FLOPS by loop and function
- All recent Intel processors

- Instrumentation (count FLOP) plus sampling (time with low overhead)
- Adjusted for masking with AVX-512 processors

| Function Call Sites and Loops | Self Time | Vector ISA | Efficiency | Gain | VL | Self GFLOPS ▼ | Self AI |
|---|---|---|---|---|---|---|---|
| [loop in runOMPRawLoops\$omp\$… | 1.984s | AVX; … | 100% | 4.30x | 4 | 204.298 | 0.17103 |
| [loop in runOMPRawLoops\$om… | 1.469s | AVX2 | | | 4 | 398.921 | 0.17574 |
| [loop in runOMPRawLoops\$om… | 0.078s | AVX | | | 4 | 20.633 | 0.06250 |
| [loop in runOMPRawLoops\$om… | 0.141s | | | | | 13.152 | 0.06250 |
| [loop in runOMPRawLoops\$om… | 0.234s | | | | | 12.797 | 0.14315 |
| [loop in runOMPRawLoops\$om… | 0.063s | | | | | 0.104 | 0.06250 |
| [loop in runOMPRawLoops\$omp\$… | 1.406s | AVX2 | 52% | 1.05x | 2 | 107.057 | 0.22428 |
| [loop in runOMPRawLoops\$omp\$… | 1.172s | AVX | 81% | 3.22x | 4 | 63.354 | 0.07500 |
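The two FLOPS columns are derived quantities. As a rough sketch of the definitions, GFLOPS is the counted floating-point operations divided by elapsed time, and arithmetic intensity (AI) is FLOP per byte of memory traffic. The function names and the input numbers below are made up for illustration and are not taken from the table.

```python
def gflops(flop_count: float, seconds: float) -> float:
    """Floating-point operations per second, in units of 1e9."""
    return flop_count / seconds / 1e9


def arithmetic_intensity(flop_count: float, bytes_moved: float) -> float:
    """FLOP executed per byte transferred."""
    return flop_count / bytes_moved


# Hypothetical loop: 4e9 FLOP, 16e9 bytes of traffic, 2 seconds of self time.
print(gflops(4e9, 2.0))                 # 2.0
print(arithmetic_intensity(4e9, 16e9))  # 0.25
```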



#### 2. Guidance: detect problem and recommend how to fix it



| Call Count | Self GFLOPS ▼ | Self AI |
|------------|---------------|---------|
| 5712000    | 427.516       | 0.22794 |



# Improve Vectorization

## Memory Access pattern analysis



2.1 Check Memory Access Patterns

Collect

Select loops of interest

Run Memory Access Patterns analysis, just to check how memory is used in the loop and the called function

# **Advisor Memory Access Pattern (MAP): know your access pattern**

#### Unit-Stride access

for (i=0; i<N; i++)
A[i] = C[i]\*D[i]

#### Constant stride access

for (i=0; i<N; i++)
point[i].x = x[i]

#### Variable stride access

for (i=0; i<N; i++)
A[B[i]] = C[i]\*D[i]
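One way to see the difference between the three patterns is to look at the sequence of element indices a loop touches. The sketch below classifies such a sequence; it is a simplified model for illustration (Advisor works on real addresses), and `classify_stride` and the sample index lists are assumptions.

```python
def classify_stride(indices):
    """Classify a sequence of accessed element indices as unit, constant or variable."""
    if len(indices) < 2:
        raise ValueError("need at least two accesses to measure a stride")
    strides = {b - a for a, b in zip(indices, indices[1:])}
    if strides == {1}:
        return "unit"
    if len(strides) == 1:
        return "constant"
    return "variable"


n = 8
print(classify_stride([i for i in range(n)]))       # unit:     A[i]
print(classify_stride([3 * i for i in range(n)]))   # constant: point[i].x with 3 fields per point
print(classify_stride([5, 0, 7, 2, 6, 1, 4, 3]))    # variable: A[B[i]]
```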



# Find vector optimization opportunities

## Memory Access pattern analysis



All Advisor-detectable issues: C++ | Fortran

# Recommendation: Refactor code with detected regular stride access patterns

The Memory Access Patterns Report shows the following regular stride access(es):

| Variable                                                         |           |
|------------------------------------------------------------------|-----------|
| block 0x2e23c404b80 allocated at cache aligned allocator.cpp:196 | Invariant |

See details in the Memory Access Patterns Report Source Details view.

To improve memory access: Refactor your code to alert the compiler to a regular stride access. Sometimes, it might be beneficial to use the ipo/Qipo compiler option to enable interprocedural optimization (IPO) between files.



# **ENABLING VECTORIZATION**

| Vector Issues                | Self Time ▼ | Total Time | Type            |
|------------------------------|-------------|------------|-----------------|
| 2 Assumed dependency present | 20.030s     | 20.030s    | Scalar Versions |
|                              | 13.508s     | 13.508s    | Scalar          |
|                              | 6.895s      | 27.750s    | Scalar          |

Check dependencies

Use #pragma simd



| Vector Issues    | Self Time ▼ | Total Time | Type            | Vector ISA | Efficiency | Gain  | VL |
|------------------|-------------|------------|-----------------|------------|------------|-------|----|
|                  | 10.507s     | 22.989s    | Scalar          |            |            |       |    |
| 2 Possible inef… | 1.762s      | 3.190s     | Vectorized Ver… | AVX512     | 73%        | 5.84x | 8  |

## Is It Safe to Vectorize?

## Loop-carried dependencies analysis verifies correctness



## Correctness – Is It Safe to Vectorize?

## Loop-carried dependencies analysis





Received recommendations to force vectorization of a loop:

1. Mark up the loop and check for REAL dependencies
2. Explore dependencies with code snippets

In this example 3 dependencies were detected:

- RAW Read After Write
- WAR Write After Read
- WAW Write After Write

This is NOT a good candidate to force vectorization!
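To illustrate why a real Read After Write dependency makes forced vectorization unsafe, the sketch below runs the same recurrence two ways: strictly in order, and in chunks where all reads of a chunk happen before its writes, which is a simplified model of what SIMD execution would do if the dependency were ignored. The loop, the function names and the chunk width are illustrative assumptions and are not taken from the analyzed application.

```python
def recurrence_scalar(a, b):
    """a[i] = a[i-1] + b[i], executed in order, so each read sees the previous write."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + b[i]
    return a


def recurrence_forced_simd(a, b, vl=4):
    """Same loop body, but each chunk of vl iterations loads its inputs before storing."""
    a = list(a)
    for start in range(1, len(a), vl):
        idx = range(start, min(start + vl, len(a)))
        loaded = [a[i - 1] for i in idx]   # all reads of the chunk happen first
        for i, prev in zip(idx, loaded):   # then all writes
            a[i] = prev + b[i]
    return a


a = [1, 0, 0, 0, 0]
b = [0, 1, 1, 1, 1]
print(recurrence_scalar(a, b))       # [1, 2, 3, 4, 5]
print(recurrence_forced_simd(a, b))  # [1, 2, 1, 1, 1]
```

The two results differ, which is the kind of silent wrong answer the Dependencies analysis is meant to catch before a loop is forced to vectorize.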

# Data Dependencies – Tough Problem #1

Is it safe to force the compiler to vectorize?



# ROOFLINE

## Questions to answer with Roofline: for your loops / functions

Am I doing well? How far am I from the peak?

(do I utilize hardware well or not?)

Where is the final bottleneck?

(where will be my limit after all optimizations?)

Long-term ROI, optimization strategy





## Automated Roofline Chart Generation in Advisor - CARM



Summarized memory-compute efficiency picture for the application



# Roofline picture



Roof configuration



Switch to grid representation

# Chart configuration



## Integrated Roofline Memory Traffic Data in Survey Grid

Review memory level and loads/stores distribution to see memory traffic for specific memory level



# Integrated Roofline. What is my current limit?

Performance is limited by minimum of intercepts (L2, LLC, DRAM, CPU)

In this case: by DRAM
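The "minimum of intercepts" rule can be written down directly: for each memory level, the attainable performance is that level's arithmetic intensity times its bandwidth, and the overall limit is the smallest of those values and the compute peak. The sketch below assumes this standard roofline formulation; the function name and all numbers are hypothetical and do not describe any particular machine or Advisor result.

```python
def roofline_limit(ai_by_level, bandwidth_gbs, peak_gflops):
    """Return (attainable GFLOPS, limiting roof) as the minimum over all intercepts."""
    limits = {level: ai_by_level[level] * bandwidth_gbs[level] for level in bandwidth_gbs}
    limits["CPU"] = peak_gflops
    limiter = min(limits, key=limits.get)
    return limits[limiter], limiter


# Hypothetical loop: FLOP/byte measured against each memory level's traffic.
ai = {"L2": 0.25, "LLC": 0.5, "DRAM": 1.0}
# Hypothetical machine: bandwidth in GB/s per level, compute peak in GFLOPS.
bw = {"L2": 1000.0, "LLC": 400.0, "DRAM": 100.0}
print(roofline_limit(ai, bw, peak_gflops=2000.0))  # (100.0, 'DRAM')
```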






# **NEW\***: Selecting Integer, Float or Mixed operations



## Integer Operations in Survey Grid and Loop Analytics





# Compare results

# Loaded results for two versions



Easy to check optimization progress

## Share with others



Standalone *interactive* HTML (limited functionality)

Share roofline by email! - with colleagues or your manager



#### Use the rest of the Advisor



#### A few words about callstacks



#### Exporting Integer and Integrated Roofline as HTML





#### **Command line:**

advixe-cl -report roofline
-data-type=float
-memory-level=L2
-memory-operation-type=load
-project-dir /path/to/project/dir

#### Possible

- data types: float, int, mixed
- memory levels: L1, L2, L3, DRAM
- memory operation types: load, store, all

- Exporting a Roofline from the command line does not need a GUI subsystem on clusters
- Useful for quickly exchanging rooflines
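When exporting many rooflines from a batch script, it can help to assemble the command programmatically and reject unsupported values early. The sketch below uses only the flags and values listed on this slide, and assumes the project directory is passed as a separate argument; the helper name is hypothetical, and the flags have not been checked here against any specific Advisor version.

```python
import shlex

DATA_TYPES = ("float", "int", "mixed")
MEMORY_LEVELS = ("L1", "L2", "L3", "DRAM")
MEMORY_OPERATIONS = ("load", "store", "all")


def roofline_report_command(project_dir, data_type="float",
                            memory_level="L2", memory_operation="load"):
    """Build the argument list for the roofline report command shown on the slide."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type}")
    if memory_level not in MEMORY_LEVELS:
        raise ValueError(f"unknown memory level: {memory_level}")
    if memory_operation not in MEMORY_OPERATIONS:
        raise ValueError(f"unknown memory operation type: {memory_operation}")
    return [
        "advixe-cl", "-report", "roofline",
        f"-data-type={data_type}",
        f"-memory-level={memory_level}",
        f"-memory-operation-type={memory_operation}",
        "-project-dir", project_dir,
    ]


# Print the command as a single shell-quoted line (not executed here).
print(shlex.join(roofline_report_command("/path/to/project/dir")))
```

The list form can be handed to `subprocess.run` on a system where Advisor is installed.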

#### In a few words

#### Using Intel® Advisor, you can

- Collect the data for the Hierarchical and Integrated Roofline
- Analyze the roofline picture
- Focus on data you are interested in
- Compare roofline for different runs
- Share roofline results
- and more



## **COLLECTORS**

#### Collections vs Analysis

| Analysis                     | Collections                                        |
|------------------------------|----------------------------------------------------|
| Vectorization (basic)        | Survey + Trip Counts                               |
| Vectorization (advanced)     | As above + MAP + Dependencies                      |
| Roofline (CARM)              | Survey + Trip Counts with FLOP                     |
| Roofline (Integrated)        | Survey + Trip Counts with FLOP and Cache Simulator |
| Threading                    | Survey + Suitability + Dependencies                |
| Custom Analysis (Python API) | Depends                                            |

Mix and match as you wish

More data comes at a cost

## HANDS-ON EXERCISE

#### **Activities**

- Activity 0: Building Stencil
- Activity 1: Doing Survey
- Activity 2: Dealing with data type conversions
- Activity 3: Checking for dependencies
- Activity 4: Adding threading and trying to enable vectorization
- Activity 5: Checking Memory Access Patterns
- Activity 6: Making unit stride explicit
- Activity 7: Doing Roofline analysis
- Activity 8: Splitting task to tiles
- Activity 9: Enabling AVX512
- Activity 10: Comparing roofline charts



# STENCIL



#### STENCIL CODE EXAMPLE

Consider solving a differential equation with the finite-difference method on a 3-dimensional grid.

Example: calculating the Laplace operator of a field

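```
#include <stdint.h>  /* uint64_t */
#include <stdlib.h>  /* malloc */

/* DIM (grid points per dimension) is assumed to be defined elsewhere. */
/* Input and output fields, each a DIM x DIM x DIM grid stored as a flat array. */
float * in = (float *)
    malloc(DIM*DIM*DIM * sizeof(float));
float * out = (float *)
    malloc(DIM*DIM*DIM * sizeof(float));

/* Distance, in elements, between neighbours along i, j and k. */
uint64_t iStride = 1;
uint64_t jStride = DIM;
uint64_t kStride = DIM * DIM;
```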
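The strides above are what let a stencil reach a point's neighbours in the flattened array. The slides do not show the lab's kernel itself, so the following is only a generic 7-point Laplace stencil written under that assumption, with the same stride convention (i fastest, then j, then k); `laplace_7pt` is an illustrative name.

```python
def laplace_7pt(grid, dim):
    """Apply a 7-point Laplace stencil to the interior of a flattened dim^3 grid."""
    i_stride, j_stride, k_stride = 1, dim, dim * dim
    out = [0.0] * (dim ** 3)
    for k in range(1, dim - 1):
        for j in range(1, dim - 1):
            for i in range(1, dim - 1):
                c = i * i_stride + j * j_stride + k * k_stride  # flat index of the centre
                out[c] = (grid[c - i_stride] + grid[c + i_stride]
                          + grid[c - j_stride] + grid[c + j_stride]
                          + grid[c - k_stride] + grid[c + k_stride]
                          - 6.0 * grid[c])
    return out


# Check on f(i, j, k) = i^2, whose discrete Laplacian is 2 at every interior point.
dim = 4
field = [float(i * i) for k in range(dim) for j in range(dim) for i in range(dim)]
result = laplace_7pt(field, dim)
print(result[1 + 1 * dim + 1 * dim * dim])  # 2.0
```

Accesses along i are unit-stride, while the j and k neighbours sit DIM and DIM*DIM elements away, which is why the later activities look at memory access patterns and tiling.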

We encourage you to try the following steps on your own code, using the slides as a guide



# Activity 0: Building STENCIL



#### **Build & Run**

#### Setup environment:

\$ source /soft/compilers/intel-2019/compilers\_and\_libraries/linux/bin/compilervars.sh intel64

#### Copy and unpack stencil sources:

\$ cp /projects/ATPESC19\_Instructors/advisor/advisor\_lab.tar.gz ~ && cd && tar xzf advisor\_lab.tar.gz

#### Go to working directory

\$ cd ~/advisor\_lab/src && git checkout ver0

#### **Build application**

\$ make

#### Run application

\$ ./stencil



#### Activity 0. Screenshot

```
[dayl@clx-2 src]$ source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
[dayl@clx-2 src]$ cd ~/advisor_lab/src && git checkout ver0
HEAD is now at 92efb0f... Initial commit
[dayl@clx-2 src]$ make
icc -Ofast -qopenmp -no-ipo -fno-inline-functions -g main.c bench_stencil.c -o stencil
[dayl@clx-2 src]$ ./stencil
    Naive: Dim= 512, nIterations= 10, Time= 4.102s, Useful GB/s= 5.297
[dayl@clx-2 src]$
```



# Activity 1: Doing Survey



#### Launch Advisor

#### Purpose: Run Survey analysis in Advisor to get the baseline version

#### Setup environment:

```
$ source /soft/compilers/intel-2019/advisor_2019/advixe-vars.sh
```

\$ export ADVIXE\_EXPERIMENTAL=int\_roofline,roofline\_guidance

#### **Launch Advisor GUI:**

\$ advixe-gui &

```
[day1@clx-2 src]$ source /opt/intel/advisor_2019/advixe-vars.sh
Copyright (C) 2009-2019 Intel Corporation. All rights reserved.
Intel(R) Advisor 2019 (build 591490)
[day1@clx-2 src]$ export ADVIXE_EXPERIMENTAL=int_roofline,roofline_guidance
[day1@clx-2 src]$ advixe-gui &
[1] 5336
```



#### **Create Advisor Project**





#### Set Up Project

Set the application to launch: ~/advisor\_lab/src/stencil

Press OK button





#### **Start Survey Analysis**

Press "Collect" button in "1. Survey Target" section





#### Activity 1. Screenshot





#### Create a snapshot





# Activity 2: Dealing with data type conversions



#### Look at the recommendations





#### Activity 2

#### Purpose: Dealing with data type conversions

Build a version with fixed conversions

```
$ git checkout ver1
```

\$ make

Re-run Survey analysis

Create a snapshot

Compare with previous activity



#### Activity 2. Screenshots







# Activity 3: Doing roofline analysis



#### Activity 3. Collect data to get Roofline chart

## Purpose: Characterize the application using roofline model

Select "With Callstacks" and "For all memory levels"

Press "Collect" button in "Run Roofline" section ~ 4 minutes

Create a snapshot




#### Activity 3. Screenshot










# Activity 4: Checking for dependencies



#### Activity 4. Collect data to get Dependencies

### Purpose: Find loop-carried dependencies

Select "loop in bench\_stencil at bench\_stencil.c:23"

Press "Collect" button in "2.2 Check Dependencies" section ~ 1 minute

Create a snapshot





#### Activity 4. Screenshot







# Activity 5: Adding threading and enabling vectorization



#### Activity 5

#### Purpose: Add threading and enable vectorization

Build a version with threading and vectorization

```
$ git checkout ver3
```

\$ make

Re-run Roofline analysis

Create a snapshot

Compare with previous activity



#### Activity 5. Screenshots





# Activity 6: Checking memory access patterns



#### Types of memory access patterns

#### **Unit-Stride access**

```
for (i=0; i<N; i++)
A[i] = C[i]*D[i]
```

#### Constant stride access

```
for (i=0; i<N; i++)
    point[i].x = x[i]
```

#### Variable stride access



#### Activity 6. Collect data to get Memory Access Patterns

#### Purpose: Calculate strides

Select "loop in bench\_stencil\$omp\$parallel\_for@23 at bench\_stencil.c:24"

Press "Collect" button in "2.1 Check Memory Access Patterns" section ~ 1 minute

Create a snapshot





#### Activity 6. Screenshot





# Activity 7: Splitting task to tiles



## Activity 7

#### Purpose: Improve memory access pattern

Build a version with tiling

\$ git checkout ver4

\$ make

Re-run Roofline analysis

Create a snapshot

Compare with previous activity
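The slides do not show how the ver4 source is tiled, so the sketch below only illustrates the general idea of loop tiling: the same iteration space is visited block by block, so that data touched within one block stays close together in memory and in cache. The function name, the 2D shape and the tile size are assumptions made for the example.

```python
def tiled_indices(dim, tile):
    """Yield (j, i) pairs of a dim x dim loop nest, one tile x tile block at a time."""
    for jj in range(0, dim, tile):            # loop over tiles
        for ii in range(0, dim, tile):
            for j in range(jj, min(jj + tile, dim)):      # loop inside one tile
                for i in range(ii, min(ii + tile, dim)):
                    yield j, i


order = list(tiled_indices(4, 2))
# Every point is still visited exactly once; only the order changes.
assert sorted(order) == [(j, i) for j in range(4) for i in range(4)]
print(order[:4])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```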





#### Activity 7. Screenshot





# Activity 8: Enabling AVX512



### **Activity 8**

#### Purpose: Fix compilation options to use the highest available ISA

Build a version with new compilation flags

```
$ git checkout ver5
```

\$ make clean && make

Re-run Survey analysis

Create a snapshot

Compare with previous activity

Review recommendations



#### Activity 8. Screenshots












# Activity 9: Disabling dynamic alignment



### **Activity 9**

#### Purpose: Exclude loop peel/remainder execution

Build a version with new compilation flags

```
$ git checkout ver6
```

\$ make

Re-run Roofline analysis

Create a snapshot

Compare with previous activity



#### Activity 9. Screenshots





# Activity 10: Comparing roofline charts



#### **ACTIVITY 10**

Purpose: Graph roofline chart for optimized version, and compare with initial chart

Turn off "Show different memory level relationships" at Guidance tab

Compare with results for versions of source code "ver3" and "ver0"







#### Activity 10. Screenshot



Speedup: ~12x





#### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804