
================================================================================
 Intel(R) Xeon Phi(TM) Processor X200 Product Family Performance Workloads: micp
================================================================================
Version:  2.2.0+xpps

Disclaimer and Legal Information:

You may not use or facilitate the use of this document in connection with
any infringement or other legal analysis concerning Intel products described
herein. You agree to grant Intel a non-exclusive, royalty-free license to
any patent claim thereafter drafted which includes subject matter disclosed
herein.

No license (express or implied, by estoppel or otherwise) to any intellectual
property rights is granted by this document.
All information provided here is subject to change without notice. Contact your
Intel representative to obtain the latest Intel product specifications and
roadmaps. The products described may contain design defects or errors known as
errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this
document may be obtained by calling 1-800-548-4725 or by visiting:
http://www.intel.com/design/literature.htm
Intel technologies' features and benefits depend on system configuration and
may require enabled hardware, software or service activation. Learn more at
http://www.intel.com/ or from the OEM or retailer.
No computer system can be absolutely secure.
Intel, Xeon, Xeon Phi and the Intel logo are trademarks of Intel Corporation
in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright 2012-2017, Intel Corporation, All Rights Reserved.

================================================================================
 Table of Contents
================================================================================
  1. Introduction

  2. Workloads
  2.1  (D/S)GEMM
  2.2  Linpack
  2.3  HPLinpack*
  2.4  HPCG*
  2.5  STREAM
  2.6  SHOC**
  2.7  Deepbench*
  2.8  Fio*
  2.9  (I)GEMM***

  3. The micp Python Package
  3.1  Features
  3.2  Executable Scripts
  3.2.1  micprun
  3.2.2  micpprint
  3.2.3  micpplot
  3.2.4  micpinfo
  3.2.5  micpcsv
  3.3  Reference Data

  4. Distributed Executable Binaries

  5. Source Code for Distributed Executables

  6. Additional Documentation

  *   Only available for the Intel(R) Xeon Phi(TM) Processor X200 Product Family
  **  Only available for the Intel(R) Xeon Phi(TM) Coprocessor X200 Product Family
      and the Intel(R) Xeon Phi(TM) Coprocessor X100 Product Family
  *** Only available for products intended for the machine learning market

================================================================================
1.  Introduction
================================================================================

This software package provides users with industry standard benchmarks for
measuring the performance of the Intel(R) Xeon Phi(TM) Processor X200 Product
Family (hereafter referred to as the processor), the Intel(R) Xeon Phi(TM)
Coprocessor X200 Product Family and the Intel(R) Xeon Phi(TM) Coprocessor X100
Product Family (hereafter referred to as the coprocessor). In addition to providing
compiled executable versions of the benchmarks, there is also a Python automation
infrastructure that will run the benchmarks and provide analysis of the results.
Some benchmark source code is provided.

As distributed, the micperf workloads consist of four core benchmarks:
GEMM, Linpack, STREAM, and SHOC.  GEMM and Linpack both
exercise basic dense matrix operations targeting floating point
performance on the processor or coprocessor.
STREAM is a test of memory bandwidth targeting memory performance on
the processor and the coprocessor.
SHOC tests the performance of the PCIe bus for transferring data between
the host system and the coprocessor.


================================================================================
2.  Workloads
================================================================================

2.1  (D/S)GEMM

Benchmark based on the Basic Linear Algebra Subroutines (BLAS) Level 3
operations SGEMM and DGEMM as implemented by the Intel(R) Math Kernel
Library (Intel(R) MKL).  These routines perform the multiplication of
two matrices in single (SGEMM) and double (DGEMM) precision.  Source
code for this benchmark is provided.
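The mathematical operation being timed can be illustrated with a minimal
pure-Python sketch. This is illustration only: the distributed benchmark
calls the multi-threaded Intel(R) MKL sgemm/dgemm routines, and a
benchmark would convert the elapsed time to GFLOPS using the 2*n*m*k
floating point operation count.

```python
def gemm(alpha, A, B, beta, C):
    """Dense multiply-accumulate C = alpha*A*B + beta*C on lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i][p] * B[p][j]
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

# A small 2x2 example: with alpha=1 and beta=0, C receives A*B.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
gemm(1.0, A, B, 0.0, C)
```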

For the processor, micperf does *not* provide any offload categories since
the benchmarks are executed directly on the host.

For the coprocessor, the native, pragma offload, and automatic offload
categories are available for GEMM in the distributed package. The native
version executes on the coprocessor only and does not exchange computational
data with the host. The pragma offload version executes on the host and is
offloaded to the coprocessor through offload pragmas. The automatic offload
version executes on the host, and some computations are offloaded to the
coprocessor through the automatic offload feature.

The micprun scaling parameter category for GEMM performs data scaling
by computing on a range of matrix sizes while running on all available
Intel(R) Xeon Phi(TM) processor cores.  The micprun optimal category runs
on all cores with a matrix size that yields high performance.


2.2  Linpack

The Linpack benchmark performs an in-place LU factorization of a matrix,
which is equivalent to solving a system of linear equations.  The Linpack
implementation included in this package is based on the Intel(R) Math Kernel
Library (Intel(R) MKL).  The computational efficiency of the Linpack benchmark
continues to improve as the problem size grows.
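The operation Linpack times can be sketched in pure Python as a dense
solve of A*x = b by Gaussian elimination with partial pivoting. This is a
teaching sketch only; the real benchmark uses the optimized,
multi-threaded Intel(R) MKL factorization.

```python
def solve(A, b):
    """Solve A*x = b in place; A is a list of rows, b a list of floats."""
    n = len(A)
    for col in range(n):
        # Partial pivoting: bring the row with the largest pivot to the top.
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for j in range(col, n):
                A[row][j] -= f * A[col][j]
            b[row] -= f * b[col]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x
```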

For the processor, micperf does *not* provide any offload categories since the
benchmarks are executed directly on the host.

For the coprocessor the native and auto offload categories are the only methods
available for Linpack in the distributed package. The native version executes
on the coprocessor only and does not exchange computational data with the host.
The automatic offload version is executed on the host and some computations
are offloaded to the coprocessor with the automatic compiler option.

The micprun scaling parameter category for Linpack runs on a range of matrix
sizes while running on all available Intel(R) Xeon Phi(TM) processor cores.
The micprun optimal category runs on all cores while inverting a matrix that
fills nearly all of the memory available on the coprocessor or processor.

The Linpack executable is part of the Intel(R) MKL distribution and as
such cannot be bundled with the Intel(R) MIC Performance Workloads
package. In order to run this benchmark through the micp Python
package, the Intel(R) Composer XE compilervars.sh or compilervars.csh
must be sourced. This will define the MKLROOT environment variable
that micp uses to locate the Intel(R) MKL SMP Linpack binary. Alternatively,
Linpack is distributed as part of the Intel(R) MKL Benchmarks test suite,
which is freely available online; see the INSTALL.txt file for further
instructions on where to download the Intel(R) MKL Benchmarks test suite
and how to define the MKLROOT environment variable.


2.3 HPLinpack

The High-Performance Linpack (HPL) benchmark solves a random dense system
of linear equations (Ax=b) in real*8 precision, measures the amount of time
it takes to factor and solve the system, converts that time into a performance
rate, and tests the results for accuracy. The Intel(R) Optimized MP LINPACK
Benchmark for Clusters (Intel(R) Optimized MP LINPACK Benchmark) is based on
modifications and additions to HPL 2.1 (http://www.netlib.org/benchmark/hpl)
from Innovative Computing Laboratories (ICL) at the University of Tennessee,
Knoxville.

micprun HPLinpack support is limited to the processor.

The micprun scaling parameter category for HPLinpack runs on a range of matrix
sizes while running on all available physical Intel(R) Xeon Phi(TM) processor
cores. The micprun optimal parameter category runs on all cores with a matrix
size that yields high performance. The micprun scaling_core parameter category
keeps the matrix size constant while the number of Intel(R) Xeon Phi(TM)
processor cores is gradually increased.

In order to execute the HPLinpack benchmark, micprun creates the proper
configuration file for the benchmark. This configuration file is never exposed
to the user; a copy is provided below for reference. The parameters have been
chosen to get the best performance on the second generation of Intel(R) Xeon
Phi(TM) processors.

    HPLinpack benchmark input file
    Innovative Computing Laboratory, University of Tennessee
    HPL.out      output file name (if any)
    6            device out (6=stdout,7=stderr,file)
    1            # of problems sizes (N)
    {problem_size}    Ns
    1            # of NBs
    {block_size}     NBs
    1            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    1            Ps
    1            Qs
    16.0         threshold
    1            # of panel fact
    1            PFACTs (0=left, 1=Crout, 2=Right)
    1            # of recursive stopping criterium
    4            NBMINs (>= 1)
    1            # of panels in recursion
    2            NDIVs
    1            # of recursive panel fact.
    1            RFACTs (0=left, 1=Crout, 2=Right)
    1            # of broadcast
    6            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM,6=Psh,7=Psh2)
    1            # of lookahead depth
    0            DEPTHs (>=0)
    0            SWAP (0=bin-exch,1=long,2=mix)
    1            swapping threshold
    1            L1 in (0=transposed,1=no-transposed) form
    1            U  in (0=transposed,1=no-transposed) form
    0            Equilibration (0=no,1=yes)
    8            memory alignment in double (> 0)

To get the best performance micprun will carefully adjust {problem_size}
and {block_size}. Interested users are encouraged to visit <REFERENCE>
for further details on how to configure the HPLinpack benchmark.
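The exact tuning logic is internal to micprun, but a common HPL sizing
heuristic can be sketched as follows: pick the largest N such that an
N x N double precision matrix fills a fraction of physical memory, rounded
down to a multiple of the block size NB. The memory fraction and the
NB value below are assumed example values, not micprun's actual settings.

```python
import math

def hpl_problem_size(mem_bytes, block_size, mem_fraction=0.9):
    """Largest N such that an N x N matrix of doubles (8 bytes each) fits
    in mem_fraction of mem_bytes, rounded down to a multiple of block_size."""
    n = int(math.sqrt(mem_bytes * mem_fraction / 8))
    return (n // block_size) * block_size

# e.g. 16 GB of memory with an assumed NB of 336
n = hpl_problem_size(16 * 1024**3, 336)
```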

The HPLinpack executable is part of the Intel(R) MKL distribution and as
such cannot be bundled with the Intel(R) MIC Performance Workloads
package. In order to run this benchmark through the micp Python
package, the Intel(R) Composer XE compilervars.sh or compilervars.csh
must be sourced. This will define the MKLROOT environment variable
that micp uses to locate the Intel(R) HPLinpack binary. Alternatively,
HPLinpack is distributed as part of the Intel(R) MKL Benchmarks test suite,
which is freely available online; see the INSTALL.txt file for further
instructions on where to download the Intel(R) MKL Benchmarks test suite
and how to define the MKLROOT environment variable.


2.4 HPCG

The HPCG benchmark implementation is based on a 3D regular 27-point discretization
of an elliptic partial differential equation. The 3D domain is scaled to fill
a 3D virtual process grid for all of the available MPI ranks. The preconditioned
conjugate gradient method (CG) is used to solve the intermediate systems of
equations and incorporates a local and symmetric Gauss-Seidel preconditioning
step that requires a triangular forward solve and a backward solve. A synthetic
multigrid V-cycle is used on each preconditioning step to make the benchmark
more similar to real world applications. The sparse matrix-vector multiplication
is implemented locally with an initial halo exchange between neighboring processes.
The benchmark exhibits irregular accesses to memory and fine-grain recursive
computations that dominate many scientific workloads
(http://www.sandia.gov/~maherou/docs/HPCG-Benchmark.pdf).

To execute HPCG, micprun creates the proper configuration file for the benchmark
in a temporary directory. After execution the configuration file is removed.
The contents of this file change depending on which parameters are given
to micprun. For instance, the parameters:

    --problem_size 32 --time 60 --omp_num_threads 32

will result in the following configuration:

    HPCG benchmark input file
    Sandia National Laboratories; University of Tennessee, Knoxville
    160 160 160
    60

In addition to the configuration file micprun also sets the required environment
variables (OMP_NUM_THREADS, KMP_AFFINITY and KMP_PLACE_THREADS), MPI and numactl
arguments to get the best processor performance.

The Intel(R) Optimized High Performance Conjugate Gradient Benchmark provides an
early implementation of the HPCG benchmark (http://hpcg-benchmark.org) optimized
for Intel(R) Advanced Vector Extensions (Intel(R) AVX), Intel(R) Advanced Vector
Extensions 2 (Intel(R) AVX2) enabled Intel(R) processors and Intel(R) Xeon Phi(TM)
coprocessors. The HPCG Benchmark is intended to complement the High Performance
LINPACK benchmark used in the TOP500 (http://www.top500.org) system ranking by
providing a metric that better aligns with a broader set of important cluster
applications.


2.5  STREAM

The STREAM benchmark measures sustainable bandwidth for data transfers
between off-die memory and the on-die processor caches.  This memory
bandwidth is the performance-limiting factor for low flop density
computational kernels (e.g. BLAS level 1).  STREAM measures the data
transfer rate for some of these simple vector kernels.  The
computational kernel reported in the rolled up statistics by micprun
for STREAM is "triad", which performs a = b + q*c where a, b, and c
are vectors and q is a scalar.  This operation is multi-threaded by
using OpenMP.  For best performance there is, at most, one thread
affinitized to each core.
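The triad kernel and the bytes-moved accounting behind the reported
bandwidth can be sketched in Python. The distributed benchmark is a
multi-threaded OpenMP C program; this single-threaded sketch only shows
the operation a = b + q*c and how a GB/s figure is derived from it
(triad touches three arrays of 8-byte elements per iteration).

```python
import time

def triad(b, c, q):
    """STREAM triad kernel: a[j] = b[j] + q * c[j]."""
    return [bj + q * cj for bj, cj in zip(b, c)]

n = 1_000_000
b = [1.0] * n
c = [2.0] * n

t0 = time.perf_counter()
a = triad(b, c, 3.0)
elapsed = time.perf_counter() - t0

# Triad reads b and c and writes a: 3 arrays * 8 bytes per element.
bandwidth_gbs = 3 * 8 * n / elapsed / 1e9
```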

For the processor, micperf does *not* provide any offload categories since
the benchmarks are executed directly on the host.

For the coprocessor, the native category is the only method available
for STREAM in the distributed package. The native version executes on the
coprocessor only and does not exchange computational data with the host.

The micprun scaling parameter category for STREAM varies the number of
OpenMP threads used to perform the computational kernel while
otherwise solving the same problem (strong scaling).  The micprun
optimal parameter category runs with the number of threads that
maximizes performance on tested processor and coprocessor SKUs.


2.6  SHOC

** Only available for the Intel(R) Xeon Phi(TM) Coprocessor X200 Product
   Family and the Intel(R) Xeon Phi(TM) Coprocessor X100 Product Family **

SHOC is a collection of benchmarks used to test heterogeneous
computing platforms.  Of this large suite of applications, only two
are included in the micperf workloads package: BusSpeedDownload and
BusSpeedReadback.  These applications measure the PCIe bus bandwidth
from host to coprocessor (BusSpeedDownload) and from device to host
(BusSpeedReadback) by transferring messages that range in size from
1KB to 64 MB.

For the coprocessor there are two offload categories available in the
distributed package: scif and pragma. The scif offload method uses Intel(R)
Symmetric Communications Interface (SCIF) library calls directly for
data transfer. The pragma offload method uses the Intel(R) Composer
XE pragma offload method for transferring data. In order to use the
pragma offload versions of the SHOC and GEMM benchmarks, the Intel(R)
Composer XE package must be installed and the user must source
compilervars.sh or compilervars.csh (located in the Intel(R) Composer
XE bin directory) before running.

Note:
  o It is important to use the version of Intel(R) Composer that is
    released in conjunction with the Intel(R) Manycore Platform Software
    Stack (Intel(R) MPSS) or the Intel(R) Xeon Phi(TM) Processor Software.
    For version 2.2.0+xpps for Linux, use Intel(R) Composer version
    l_comp_lib_2017.1.132_comp.cpp_redist.tgz.


Both the scaling and optimal categories call the executables without
any command line parameters and these executables always run through a
range of message sizes to transfer.  The difference between running
the scaling and optimal categories with micprun is that if using
verbosity level 1 or higher (or when running micpprint for post
processing) and the optimal parameter category is specified, only the
message size with the highest transfer rate is reprinted in the ROLLED
UP section of the output.
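The optimal-category rollup described above can be interpreted with a
small sketch: out of the measured (message size, transfer rate) pairs,
only the size with the highest rate is reported. The data values below
are made up for illustration.

```python
# Hypothetical SHOC BusSpeedDownload results: message size -> GB/s.
measurements = {
    "1KB": 0.8,
    "1MB": 5.9,
    "64MB": 6.8,
}

# The "optimal" rollup keeps only the best-performing message size.
best_size = max(measurements, key=measurements.get)
best_rate = measurements[best_size]
```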


2.7 Deepbench (libxsmm_conv, mkl_conv)

* Only available for the Intel(R) Xeon Phi(TM) Processor X200 Product
Family

The Deepbench benchmark suite is designed to test hardware using operations
that are heavily utilized by deep learning workloads.  The full suite
contains a wide set of tests covering basic operations that appear in the
vast majority of deep neural networks; Micperf incorporates only the subset
of those tests that is not already covered by the other kernels described in
this section.  Specifically, Micperf can run the Intel(R) MKL and libxsmm
convolution tests that are part of the Deepbench suite.  Note that
convolutions are the operations with the greatest impact on performance for
image, video, speech, and similar workloads.

Two tests were chosen for Micperf. Both perform convolutions, but use
different libraries - the 'mkl_conv' kernel utilizes the Intel(R)
MKL-DNN library, while the 'libxsmm_conv' kernel makes use of the libxsmm
library.

Micperf provides these benchmarks only for the Intel(R) Xeon Phi(TM)
processor, so offloading these benchmarks is not possible.

For more information about the benchmarks and libraries described in this
section please follow the links below:

https://github.com/baidu-research/DeepBench
https://github.com/01org/mkl-dnn
https://github.com/hfp/libxsmm

2.8 Fio (Flexible IO)

* Only available for the Intel(R) Xeon Phi(TM) Processor X200 Product
Family

Fio (Flexible IO) is a versatile IO benchmark that was designed to simulate a
variety of cases that are useful when examining performance of write and read
operations.

Micperf uses fio to estimate drive read bandwidth. The configuration of the
test performed by micperf is described by the configuration file below:

[global]
filename=T*
iodepth=32
stonewall
direct=1
numjobs=N**
size=S***
thread
[Read-4k-bw]
rw=read
bs=4k

The example above configures each of N** threads to read S*** of data in
4 kilobyte chunks from the drive defined by T*. By default N = 10,
S = 1 MB, and T is set to a partition on which the fio binary is installed.
The default values of these parameters can be changed with the -p argument
of micprun.

The fio benchmark requires read and write access to the drive ("/dev/sdx"),
so in order to run this benchmark the "--sudo" argument has to be added to
the micprun command line:
	$ micprun -k fio --sudo

The executing user has to be added to the sudoers list. Note that, depending
on system settings, the user may be prompted for the root password. In that
case, micprun execution will halt until the password is provided.

For more details please visit:

https://github.com/axboe/fio

2.9 (I)GEMM

Benchmark utilizing the integer matrix multiplication GEMM operation from the
Intel(R) Math Kernel Library (Intel(R) MKL). This routine performs the
multiplication of two matrices defined by the equation C += A * B. Source
code for this benchmark is provided in the source package.

This benchmark is provided only for selected devices.

================================================================================
3.  The micp Python Package
================================================================================

The micp Python Package includes a collection of executable scripts and
reference data files.

3.1  Features

  o Execute an individual workload with particular parameters.

  o Execute an individual workload with predefined parameter
    categories.

  o Aggregate execution of workloads with predefined parameter
    categories.

  o Collect and display performance results.

  o Plot statistics for scaling runs.

  o Record hardware and performance data to file.

  o Compare performance with previous runs.


3.2  Executable Scripts

The functionality of the package can be accessed through five
executable scripts: micprun, micpprint, micpcsv, micpplot, and
micpinfo.  For usage and extensive help information for any of the
scripts, pass --help as a command line parameter to the script, e.g.:

user_prompt> micprun --help

Note that on Windows these scripts have a .py extension:
e.g. micprun.py, micpprint.py, etc.  This documentation will refer to
these scripts without the .py extension, as they are distributed for
Linux.


3.2.1  micprun

The micprun script is used to execute the benchmarks.  In simple use,
micprun executes the benchmark and displays the standard output from
that benchmark.  In its most elaborate use, micprun repeats a sequence
of runs of a benchmark suite previously executed by the performance
validation team at Intel, and then compares the results to the Intel
reference data.  In this mode, micprun returns a non-zero return code
if a performance regression is detected, and displays gtest styled
pass/fail output.

If an output directory is specified during execution, micprun will
produce pickle files containing data collected during the run.  Use
the micpprint, micpcsv, micpplot, and micpinfo applications to inspect
the pickle files.


3.2.2  micpprint

The micpprint application prints the performance measurements from data
stored in one or more pickle files.


3.2.3  micpplot

The micpplot application plots the performance statistics from one or
more runs, from pickle files produced during the run.


3.2.4  micpinfo

The micpinfo application obtains and displays system information. When
run without parameters, micpinfo displays information about the
current system.  When run with a pickle file, it displays information
about the system that created the pickle file.


3.2.5  micpcsv

The micpcsv application extracts the performance data from a pickle
file and prints it to standard output in comma separated value (CSV)
format.  If it is run without specifying a pickle file, micpcsv prints
a table summarizing the results in the reference data included in the
distribution.  If an output directory is specified with the -o
parameter, then a set of csv files is created.


3.3  Reference Data

This package includes Python pickle files, which contain recorded
performance information measured by Intel on Intel systems.  Pickle
files are serialized Python objects.  In particular, these files
serialize the micp.stats.StatsCollection object, which is produced by
micprun.  The reference files begin with the name micp_run_stats and
are located in:

/usr/share/micperf/micp

All of the micp executable scripts can use a micp_run_stats pickle
file as input.  There is an additional README.txt provided in the data
directory.
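Reading a pickle file follows the standard Python pattern sketched below.
Note that unpickling the reference files re-creates a
micp.stats.StatsCollection object, so the micp package must be importable
at load time; the commented-out path is an example, not a guaranteed
file name.

```python
import pickle

def load_stats(path):
    """Deserialize a micp_run_stats pickle file (binary mode is required)."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Example (hypothetical file name under the reference data directory):
# stats = load_stats("/usr/share/micperf/micp/micp_run_stats_example.pkl")
```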


================================================================================
4.  Distributed Executable Binaries
================================================================================

The Intel(R) MIC Performance Workloads package comes with compiled workload
executables.  For the Linux distribution these are installed to:

/usr/libexec/micperf


There is a sub-directory (x86_64) for binaries compiled for the host, and
sub-directories ('k1om' for the Intel(R) Xeon Phi(TM) Coprocessor X100 Product
Family and 'x86_64_AVX512' for the Intel(R) Xeon Phi(TM) Coprocessor X200
Product Family) for binaries compiled for the coprocessor.
These executables generally bear the name given by the original developers
of the benchmark.


================================================================================
5.  Source Code for Distributed Executables
================================================================================

The source code for the benchmarks is distributed with the package. On
Windows, sources are installed with the package (details below). For the
Linux distribution, sources are packaged in a source RPM following the
standard Linux convention. To inspect the source code, "install" the source
RPM, change to the SOURCES directory, and untar the corresponding tarball, as
shown below. The example below is for the processor; for the coprocessor,
replace 'xppsl' with 'mpss'. Please note that all steps are performed as a
non-root user:

    $ rpm -ihv  xppsl-micperf-<version>-<release>.src.rpm
    $ cd ~/rpmbuild/SOURCES
    $ tar -xf xppsl-micperf-<version>.tar.gz
    $ cd xppsl-micperf-<version>


These directories contain the source code for the SHOC, GEMM, and
STREAM workloads. In the Linux distribution the sub-directory for each
workload contains a Makefile designed to work with the Intel(R)
Composer XE 2017 compiler package.
This version of the Intel(R) Composer XE will compile binaries for the
Intel(R) Xeon Phi(TM) Processor X200 Product Family.

The stream source directory includes a separate README.txt file, which
describes how to download the source for the STREAM benchmark, and the
steps to build with and without support for 2MB pages.  The STREAM
binary distributed with the package is built without 2MB page support
via libhugetlbfs.

To rebuild the source RPM on Linux and execute micprun with these
rebuilt workloads, the following commands can be used:

    $ source /PATH/TO/COMPOSER_XE_2017/bin/compilervars.sh intel64

For the processor:

    $ rpmbuild --rebuild xppsl-micperf-<version>-<release>.src.rpm

or in case of the coprocessor:

    $ rpmbuild --rebuild mpss-micperf-<version>-<release>.src.rpm

<version> and <release> vary depending on the current version of the package,
for instance: xppsl-micperf-1.5.0-0.src.rpm or mpss-micperf-4.3.0-0.src.rpm.
The new binary RPM will be stored in ~/rpmbuild/RPMS. To find its exact name
and location, look for the line:

    Wrote: /PATH/TO/BINARY/RPM.rpm

in rpmbuild's output, then install the new binary RPM. For more details on
how to rebuild the source RPM, please refer to the micperf_users_guide.pdf.

================================================================================
6.  Additional Documentation
================================================================================

The /usr/share/doc/micperf/ directory contains this README.txt and individual
license agreements for the open source benchmarks.

This package, also known as the "OEM Workloads", is licensed under the
Intel(R) MPSS license agreement, which on Linux is installed at:

/usr/share/doc/micperf/EULA

