================================================================================
 Intel(R) Xeon Phi(TM) Processor X200 Product Family Performance Workloads: micp
================================================================================

Disclaimer and Legal Information:

You may not use or facilitate the use of this document in connection with
any infringement or other legal analysis concerning Intel products described
herein. You agree to grant Intel a non-exclusive, royalty-free license to
any patent claim thereafter drafted which includes subject matter disclosed
herein.

No license (express or implied, by estoppel or otherwise) to any intellectual
property rights is granted by this document.
All information provided here is subject to change without notice. Contact your
Intel representative to obtain the latest Intel product specifications and
roadmaps. The products described may contain design defects or errors known as
errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this
document may be obtained by calling 1-800-548-4725 or by visiting:
http://www.intel.com/design/literature.htm. Intel technologies' features and
benefits depend on system configuration and may require enabled hardware,
software or service activation. Learn more at http://www.intel.com/ or from the
OEM or retailer.

No computer system can be absolutely secure.

Intel, Xeon, Xeon Phi and the Intel logo are trademarks of Intel Corporation
in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright 2012-2017, Intel Corporation, All Rights Reserved.

================================================================================
 Table of Contents
================================================================================
  1. Introduction

  2. Workloads
     2.1. Intel(R) MKL DGEMM and SGEMM
     2.2. Intel(R) MKL Linpack
     2.3. Intel(R) MKL HPLinpack
     2.4. Intel(R) MKL HPCG
     2.5. STREAM
     2.6. DeepBench Convolutions
     2.7. Fio

  3. The micp Python Package
     3.1. Features
     3.2. Executable Scripts
          3.2.1. micprun
          3.2.2. micpprint
          3.2.3. micpplot
          3.2.4. micpinfo
          3.2.5. micpcsv
     3.3. Reference Data

  4. Distributed Executable Binaries

  5. Source Code for Distributed Executables

  6. Additional Documentation

================================================================================
1.  Introduction
================================================================================

This software package provides users with industry standard benchmarks for
measuring the performance of the Intel(R) Xeon Phi(TM) product family
(hereafter referred to as the processor). In addition to providing compiled
executable versions of the benchmarks, the package includes a Python automation
infrastructure that runs the benchmarks and analyzes the results. Source code
is provided for some benchmarks.

As distributed, the micperf workloads comprise the benchmarks described in
section 2: GEMM, Linpack, HPLinpack, HPCG, STREAM, the DeepBench
convolutions, and fio.  GEMM and the Linpack variants exercise dense matrix
operations targeting floating point performance on the processor.  HPCG
exercises the irregular memory accesses and fine-grain computations typical
of scientific applications.  STREAM is a test of sustainable memory
bandwidth.  The DeepBench convolutions measure the performance of deep
learning primitives, and fio measures drive read bandwidth.

================================================================================
2.  Workloads
================================================================================
--------------------------------------------------------------------------------
2.1  Intel(R) MKL DGEMM and SGEMM
--------------------------------------------------------------------------------
Benchmark based on the Basic Linear Algebra Subroutines (BLAS) Level 3
operations SGEMM and DGEMM as implemented by the Intel(R) Math Kernel
Library (Intel(R) MKL).  These routines perform the multiplication of
two matrices in single (SGEMM) and double (DGEMM) precision.  Source
code for this benchmark is provided.

The micprun scaling parameter category for GEMM performs data scaling
by computing on a range of matrix sizes while running on all available cores.
The micprun optimal category runs on all cores with a matrix size that yields
high performance.
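
The rate reported for these kernels follows from the standard GEMM operation
count; a minimal sketch of that arithmetic in Python (the function name is
illustrative, not part of micperf):

```python
def gemm_gflops(m, n, k, seconds):
    """Achieved GFLOPS for an (m x k) by (k x n) matrix multiply.

    A general matrix multiply performs about 2*m*n*k floating point
    operations: one multiply and one add per inner-product term.
    """
    return 2.0 * m * n * k / seconds / 1e9

# A 10000 x 10000 x 10000 DGEMM finishing in 1 second sustains 2000 GFLOPS.
print(gemm_gflops(10000, 10000, 10000, 1.0))
```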
--------------------------------------------------------------------------------
2.2  Intel(R) MKL Linpack
--------------------------------------------------------------------------------
The Intel(R) MKL Linpack benchmark performs an in-place matrix inversion by LU
factorization which is equivalent to solving a system of linear equations. The
implementation included in this package is based on the Intel(R) Math Kernel
Library (Intel(R) MKL). The computational efficiency of the Linpack benchmark
continues to improve as the problem size grows.
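
The efficiency claim can be made concrete with the textbook Linpack operation
count: factoring and solving costs about 2/3*n^3 + 2*n^2 flops while the double
precision matrix occupies only 8*n^2 bytes, so arithmetic grows faster than
data as n increases. A sketch (the formulas are the standard counts, not
micperf code):

```python
def linpack_flops(n):
    """Standard Linpack operation count for factoring and solving
    an n x n double precision system."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2

def matrix_bytes(n):
    """Memory footprint of the n x n double precision matrix."""
    return 8.0 * n * n

# Flops per byte of matrix data grows roughly linearly with n, which is
# why larger problems run closer to peak floating point throughput.
for n in (1000, 10000, 100000):
    print(n, linpack_flops(n) / matrix_bytes(n))
```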

The micprun scaling parameter category for Intel(R) MKL Linpack runs on a range
of matrix sizes while running on all available cores. The micprun optimal
category runs on all cores while inverting a matrix that fills nearly all of the
memory available on the processor.

The Intel(R) MKL Linpack executable is part of the Intel(R) MKL distribution and
as such cannot be bundled with the Intel(R) MIC Performance Workloads package.
In order to run this benchmark through the micp Python package, the Intel(R)
Parallel Studio XE compilervars.sh or compilervars.csh script must be sourced.
This defines the MKLROOT environment variable that micp uses to locate the
Intel(R) MKL SMP Linpack binary. Alternatively, Linpack is distributed as part
of the Intel(R) MKL Benchmarks suite, which is freely available online; see the
INSTALL.txt file for instructions on where to download the Intel(R) MKL
Benchmarks suite and how to define the MKLROOT environment variable.
--------------------------------------------------------------------------------
2.3 Intel(R) MKL HPLinpack
--------------------------------------------------------------------------------
The High-Performance Linpack (HPL) benchmark solves a random dense system of
linear equations (Ax=b) in real*8 precision, measures the amount of time it
takes to factor and solve the system, converts that time into a performance
rate, and tests the results for accuracy. The Intel(R) Optimized MP LINPACK
Benchmark for Clusters (Intel(R) Optimized MP LINPACK Benchmark) is based on
modifications and additions to HPL 2.1 (http://www.netlib.org/benchmark/hpl)
from Innovative Computing Laboratories (ICL) at the University of Tennessee,
Knoxville.

The micprun scaling parameter category for HPLinpack runs on a range of matrix
sizes while running on all available cores. The micprun optimal parameter
category runs on all cores with a matrix size that yields high performance. The
micprun scaling_core parameter category keeps the matrix size constant while the
number of processor cores is gradually increased.

In order to execute the HPLinpack benchmark, micprun creates the proper
configuration file for the benchmark. This configuration file is never exposed
to the user; a copy is provided below for reference. Parameters have been
chosen to give the best performance on Intel(R) Xeon Phi(TM) processors.

    HPLinpack benchmark input file
    Innovative Computing Laboratory, University of Tennessee
    HPL.out      output file name (if any)
    6            device out (6=stdout,7=stderr,file)
    1            # of problems sizes (N)
    {problem_size}    Ns
    1            # of NBs
    {block_size}     NBs
    1            PMAP process mapping (0=Row-,1=Column-major)
    1            # of process grids (P x Q)
    1            Ps
    1            Qs
    16.0         threshold
    1            # of panel fact
    1            PFACTs (0=left, 1=Crout, 2=Right)
    1            # of recursive stopping criterium
    4            NBMINs (>= 1)
    1            # of panels in recursion
    2            NDIVs
    1            # of recursive panel fact.
    1            RFACTs (0=left, 1=Crout, 2=Right)
    1            # of broadcast
    6            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM,6=Psh,7=Psh2)
    1            # of lookahead depth
    0            DEPTHs (>=0)
    0            SWAP (0=bin-exch,1=long,2=mix)
    1            swapping threshold
    1            L1 in (0=transposed,1=no-transposed) form
    1            U  in (0=transposed,1=no-transposed) form
    0            Equilibration (0=no,1=yes)
    8            memory alignment in double (> 0)

To get the best performance micprun will carefully adjust {problem_size} and
{block_size}. Interested users are encouraged to visit <REFERENCE> for further
details on how to configure the Intel(R) MKL HPLinpack benchmark.
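
The {problem_size} and {block_size} markers in the listing are ordinary Python
format placeholders; filling them can be sketched as below (a simplification of
what micprun does internally, shown on a hypothetical two-entry excerpt of the
template):

```python
# Hypothetical excerpt of the HPL input template shown above; micprun
# substitutes the chosen sizes before handing the file to the benchmark.
TEMPLATE = """\
1            # of problems sizes (N)
{problem_size}    Ns
1            # of NBs
{block_size}     NBs
"""

def render_hpl_input(problem_size, block_size):
    """Fill the placeholders with str.format."""
    return TEMPLATE.format(problem_size=problem_size, block_size=block_size)

print(render_hpl_input(80000, 336))
```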

The benchmark executable is part of the Intel(R) MKL distribution and as such
cannot be bundled with the Intel(R) MIC Performance Workloads package. In order
to run this benchmark through the micp Python package, the Intel(R) Parallel
Studio XE compilervars.sh or compilervars.csh script must be sourced. This
defines the MKLROOT environment variable that micp uses to locate the Intel(R)
HPLinpack binary. Alternatively, Intel(R) MKL HPLinpack is distributed as part
of the Intel(R) MKL Benchmarks suite, which is freely available online; see the
INSTALL.txt file for instructions on where to download the Intel(R) MKL
Benchmarks suite and how to define the MKLROOT environment variable.
--------------------------------------------------------------------------------
2.4 Intel(R) MKL HPCG
--------------------------------------------------------------------------------
The Intel(R) MKL HPCG benchmark implementation is based on a 3D regular 27-point
discretization of an elliptic partial differential equation. The 3D domain is
scaled to fill a 3D virtual process grid for all of the available MPI ranks. The
preconditioned conjugate gradient method (CG) is used to solve the intermediate
systems of equations and incorporates a local and symmetric Gauss-Seidel
preconditioning step that requires a triangular forward solve and a backward
solve. A synthetic multigrid V-cycle is used on each preconditioning step to
make the benchmark more similar to real world applications. The multiplication
of matrices is implemented locally with an initial halo exchange between
neighboring processes. The benchmark exhibits irregular accesses to memory and
fine-grain recursive computations that dominate many scientific workloads
(http://www.sandia.gov/~maherou/docs/HPCG-Benchmark.pdf).

To execute this benchmark, micprun creates a proper configuration file for it in
a temporary directory and removes the file after execution. The contents of this
file depend on the parameters given to micprun. For instance, the parameters
--problem_size 32 --time 60 --omp_num_threads 32 result in the following
configuration:

    HPCG benchmark input file
    Sandia National Laboratories; University of Tennessee, Knoxville
    160 160 160
    60

In addition to creating the configuration file, micprun also sets the required
environment variables (OMP_NUM_THREADS, KMP_AFFINITY, and KMP_PLACE_THREADS) and
the MPI and numactl arguments that give the best processor performance.

The Intel(R) Optimized High Performance Conjugate Gradient Benchmark provides an
early implementation of the HPCG benchmark (http://hpcg-benchmark.org) optimized
for Intel(R) Advanced Vector Extensions (Intel(R) AVX) and Intel(R) Advanced
Vector Extensions 2 (Intel(R) AVX2) enabled Intel(R) processors. The HPCG
Benchmark is intended to complement the High Performance LINPACK benchmark used
in the TOP500 (http://www.top500.org) system ranking by providing a metric that
better aligns with a broader set of important cluster applications.
--------------------------------------------------------------------------------
2.5  STREAM
--------------------------------------------------------------------------------
The STREAM benchmark measures sustainable bandwidth for data transfers between
the off die memory and on die processor cache.  This memory bandwidth is the
performance limiting factor for low flop density computational kernels (e.g.
BLAS level 1). STREAM measures the data transfer rate for some of these simple
vector kernels. The computational kernel reported in the rolled up statistics by
micprun for STREAM is "triad", which performs a = b + q*c where a, b, and c are
vectors and q is a scalar.  This operation is multi-threaded by using OpenMP.
For best performance there is, at most, one thread affinitized to each core.

The micprun scaling parameter category for STREAM varies the number of OpenMP
threads used to perform the computational kernel while otherwise solving the
same problem (strong scaling). The micprun optimal parameter category runs with
the number of threads that maximizes performance on the tested processors.
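
The triad kernel and its bandwidth accounting can be sketched in plain Python
(a scalar illustration of the formula, not the threaded OpenMP C code micperf
runs; each element touched moves three 8-byte doubles):

```python
def triad(a, b, c, q):
    """STREAM triad kernel: a[i] = b[i] + q * c[i] for every element."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]

def triad_bandwidth_gbs(n, seconds):
    """Triad moves three 8-byte doubles per element (read b, read c,
    write a), i.e. 24*n bytes per pass."""
    return 3 * 8 * n / seconds / 1e9

n = 4
a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
triad(a, b, c, 3.0)
print(a)  # every element is 1.0 + 3.0 * 2.0 = 7.0
```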
--------------------------------------------------------------------------------
2.6 DeepBench (libxsmm_conv, mkl_conv)
--------------------------------------------------------------------------------
The DeepBench benchmark suite is designed to test hardware using operations
that are heavily exercised by deep learning workloads. The full suite contains
a wide set of tests covering the basic operations found in the vast majority of
deep neural networks; micperf incorporates only the subset of tests not already
covered by the other kernels described in this section. Specifically, micperf
can run the Intel(R) MKL and libxsmm convolution tests from the DeepBench
suite. Convolutions are operations with a major impact on performance for
image, video, speech, and similar workloads.

Two tests were chosen for micperf. Both perform convolutions, but use different
libraries - the 'mkl_conv' kernel utilizes the Intel(R) MKL-DNN library, while
the 'libxsmm_conv' kernel makes use of the libxsmm library.

For more information about the benchmarks and libraries described in this
section please follow the links below:

https://github.com/baidu-research/DeepBench
https://github.com/01org/mkl-dnn
https://github.com/hfp/libxsmm
--------------------------------------------------------------------------------
2.7 Fio (Flexible IO)
--------------------------------------------------------------------------------
Fio (Flexible IO) is a versatile IO benchmark designed to simulate a variety of
workloads, which makes it useful for examining the performance of read and
write operations.

Micperf uses fio to estimate drive read bandwidth. The configuration of the
test performed by micperf is described by the configuration file below:

[global]
filename=T*
iodepth=32
stonewall
direct=1
numjobs=N**
size=S***
thread
[Read-4k-bw]
rw=read
bs=4k

The example above configures each of N** threads to read S*** of data in
4 kilobyte chunks from the drive defined by T*. By default N = 10, S = 1MB, and
T is set to a partition on which the fio binary is installed. The default
values of these parameters can be changed with the -p argument to micprun.
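
Generating the job file above from the N, S, and T parameters can be sketched
as follows (render_fio_job is a hypothetical helper, not part of micperf; the
section names and option spellings are copied from the listing):

```python
def render_fio_job(filename, numjobs, size):
    """Return a fio job file matching the layout shown above."""
    return (
        "[global]\n"
        f"filename={filename}\n"
        "iodepth=32\n"
        "stonewall\n"
        "direct=1\n"
        f"numjobs={numjobs}\n"
        f"size={size}\n"
        "thread\n"
        "[Read-4k-bw]\n"
        "rw=read\n"
        "bs=4k\n"
    )

print(render_fio_job("/dev/sda", 10, "1m"))
```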

The fio benchmark requires read and write access to "/dev/sdx" devices, so the
"--sudo" argument must be added to the micprun command line in order to run
this benchmark:

    $ micprun -k fio --sudo

The executing user must be on the sudoers list. Note that, depending on system
settings, the user may be prompted for the root password; in that case the
micprun execution will halt until the password is provided.

For more details please visit:

https://github.com/axboe/fio

================================================================================
3.  The micp Python Package
================================================================================

The micp Python Package includes a collection of executable scripts and
reference data files.
--------------------------------------------------------------------------------
3.1  Features
--------------------------------------------------------------------------------
  o Execute an individual workload with particular parameters.
  o Execute an individual workload with predefined parameter categories.
  o Aggregate execution of workloads with predefined parameter categories.
  o Collect and display performance results.
  o Plot statistics for scaling runs.
  o Record hardware and performance data to file.
  o Compare performance with previous runs.
--------------------------------------------------------------------------------
3.2  Executable Scripts
--------------------------------------------------------------------------------
The functionality of the package is accessed through five executable scripts:
micprun, micpprint, micpcsv, micpplot, and micpinfo.  For usage and detailed
help for any of the scripts, pass --help as a command line parameter to the
script, e.g.:

    $ micprun --help

3.2.1  micprun

The micprun script is used to execute the benchmarks.  In simple use, micprun
executes the benchmark and displays the standard output from that benchmark. In
its most elaborate use, micprun repeats a sequence of runs of a benchmark suite
previously executed by the performance validation team at Intel, and then
compares the results to the reference data. In this mode, micprun returns a
non-zero return code if a performance regression is detected, and displays
gtest-style pass/fail output.

If an output directory is specified during execution, micprun will produce
pickle files containing data collected during the run. Use the micpprint,
micpcsv, micpplot, and micpinfo applications to inspect these pickle files.

3.2.2  micpprint

The micpprint application prints the performance measurements from data stored
in one or more pickle files.

3.2.3  micpplot

The micpplot application plots the performance statistics from one or more runs
based on data from pickle files produced during the run.

3.2.4  micpinfo

The micpinfo application obtains and displays system information. When run
without parameters, micpinfo displays information about the current system. When
run with a pickle file, it displays information about the system that created
the pickle file.

3.2.5  micpcsv

The micpcsv application extracts the performance data from a pickle file and
prints it to standard output in comma separated value (CSV) format. When run
without a pickle file, micpcsv prints a table summarizing the results in the
reference data included in the distribution. If an output directory is
specified with the -o parameter, a set of CSV files is created there instead.

3.3  Reference Data

This package includes Python pickle files, which contain recorded performance
information measured by Intel on Intel systems. Pickle files are serialized
Python objects. In particular, these files serialize the
micp.stats.StatsCollection object, which is produced by micprun. The reference
files begin with the name micp_run_stats and are located in the
/usr/share/micperf/micp directory.

All of the micp executable scripts can use a micp_run_stats pickle file as
input. There is an additional README.txt file provided in the data directory.

================================================================================
4.  Distributed Executable Binaries
================================================================================

The micperf package comes with compiled workload executables that can be found
in /usr/libexec/micperf.

================================================================================
5.  Source Code for Distributed Executables
================================================================================

The source code for the benchmarks is distributed with the package. Sources are
packaged in a source RPM following the standard Linux convention. To inspect
the source code, "install" the source RPM, change to the SOURCES directory, and
untar the corresponding tarball, as in the example below. Please note that all
steps are performed as a non-root user:

    $ rpm -ihv  xppsl-micperf-<version>-<release>.src.rpm
    $ cd ~/rpmbuild/SOURCES
    $ tar -xf xppsl-micperf-<version>.tar.gz
    $ cd xppsl-micperf-<version>

The sub-directory for each workload contains a Makefile designed to work with
the Intel(R) Parallel Studio XE compiler package.

This version of the Intel(R) Parallel Studio XE compilers will build binaries
for the Intel(R) Xeon Phi(TM) Processor X200 Product Family.

The stream source directory includes a separate README.txt file, which describes
how to download the source for the STREAM benchmark, and the steps to build it
with and without support for 2MB pages. The STREAM binary distributed with the
package is built without 2MB page support via libhugetlbfs.

To rebuild the source RPM and execute micprun with these rebuilt workloads, the
following commands can be used:

    $ source <Intel_Parallel_Studio_XE_path>/bin/compilervars.sh intel64
    $ rpmbuild --rebuild xppsl-micperf-<version>-<release>.src.rpm

<version> and <release> vary depending on the current version of the package.
The new binary RPM will be stored in ~/rpmbuild/RPMS; to learn its exact name
and location, look for the line shown below in rpmbuild's output.

    Wrote: <path_to_binary>/RPM.rpm

Finally, install the new binary rpm. For more details on how to rebuild the
source RPM please refer to the micperf_users_guide.pdf.

================================================================================
6.  Additional Documentation
================================================================================

The /usr/share/doc/micperf/ directory contains this README.txt and individual
license agreements for the open source benchmarks.

This package, also known as the "OEM Workloads", is licensed under the
Intel(R) MPSS license agreement, which on Linux is installed at:

/usr/share/doc/micperf-1.5.4/EULA