 
Package Contents
---------------
 
This package contains benchmarks for the MKL GEMM routines. It is
distributed under the End User License Agreement for the Intel(R)
Software Development Products (license.txt).


Dependencies
------------
For the Intel(R) Xeon Phi(TM) Coprocessor X100 Product Family:
    1. Needs parallel_studio_xe_2017_composer_edition or later
    2. MPSS 2430-9 or later (scripts were tested only with this MPSS version)

For the Intel(R) Xeon Phi(TM) Coprocessor X200 Product Family:
    1. Needs parallel_studio_xe_2017_composer_edition or later
    2. MPSS 4.0.0 or above

For the Intel(R) Xeon Phi(TM) Processor X200 Product Family:
    1. Needs parallel_studio_xe_2017_composer_edition or later
    2. micperf-4.0.0 or above. No MPSS is required.

Source
------
The sources are available in this directory.

To Build
--------
1. Set the environment variables using
   source /opt/intel/compilers_and_libraries_2016/linux/bin/compilervars.sh intel64
2. Run 'make clean', then 'make all'.
3. The resulting binaries are placed in this directory.
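
The build steps above can be sketched as a small guarded script. This is only
a sketch: the compilervars.sh path is copied from step 1 and may differ on
your system, and the COMPILERVARS variable name is introduced here purely for
illustration.

```shell
# Path copied from step 1; adjust it to match your installation.
COMPILERVARS=/opt/intel/compilers_and_libraries_2016/linux/bin/compilervars.sh

if [ -f "$COMPILERVARS" ]; then
    # Set up the compiler environment, then rebuild from scratch.
    source "$COMPILERVARS" intel64
    make clean && make all
else
    # Report a missing environment script instead of failing inside make.
    echo "compilervars.sh not found at $COMPILERVARS" >&2
fi
```

Checking for the script first makes a wrong installation path visible before
make reports a missing compiler.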

To Run
------
Prebuilt binaries are available in the bin directory.

A) Native

To run the benchmarks, invoke the following from the bin directory:

./rundgemm_mic  : Runs MKL-based MIC native dgemm benchmark for various default
                  matrix sizes
./runsgemm_mic  : Runs MKL-based MIC native sgemm benchmark for various default
                  matrix sizes
./runzgemm_mic  : Runs MKL-based MIC native zgemm benchmark for various default
                  matrix sizes
./runcgemm_mic  : Runs MKL-based MIC native cgemm benchmark for various default
                  matrix sizes

For custom native runs, upload the binary and the MIC version of libiomp5.so
to the card, set LD_LIBRARY_PATH in the card's shell, and execute the binary.

B) Offload (w/ data transfer)

To run the benchmarks, set up the environment variables using
   source /opt/intel/compilers_and_libraries_2016/linux/bin/compilervars.sh intel64
and invoke the following from the bin directory:

./dgemm_mkl_full_ofl.x  : Runs MKL-based MIC offload dgemm benchmark for
                          various default matrix sizes
./sgemm_mkl_full_ofl.x  : Runs MKL-based MIC offload sgemm benchmark for
                          various default matrix sizes
./zgemm_mkl_full_ofl.x  : Runs MKL-based MIC offload zgemm benchmark for
                          various default matrix sizes
./cgemm_mkl_full_ofl.x  : Runs MKL-based MIC offload cgemm benchmark for
                          various default matrix sizes

C) Automatic Offload / CPU

To run the benchmarks, set up the environment variables using
   source /opt/intel/compilers_and_libraries_2016/linux/bin/compilervars.sh intel64
Set MKL_MIC_ENABLE=1 if Automatic Offload (AO) benchmarking is desired, then
invoke the following from the bin directory:

./dgemm_mkl_native_cpu.x  : Runs MKL-based CPU / MIC automatic offload dgemm
                            benchmark for various default matrix sizes
./sgemm_mkl_native_cpu.x  : Runs MKL-based CPU / MIC automatic offload sgemm
                            benchmark for various default matrix sizes
./zgemm_mkl_native_cpu.x  : Runs MKL-based CPU / MIC automatic offload zgemm
                            benchmark for various default matrix sizes
./cgemm_mkl_native_cpu.x  : Runs MKL-based CPU / MIC automatic offload cgemm
                            benchmark for various default matrix sizes

Best performance is achieved when hyperthreading is turned off and the
following OpenMP settings are used:

* OMP_NUM_THREADS=<# of physical cores on the node>

* KMP_AFFINITY=scatter,granularity=fine
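
One possible way to apply these settings from a shell is sketched below. It
assumes a Linux host where lscpu is available. PHYS_CORES is a name introduced
here for illustration, and counting unique (core, socket) pairs is just one
way to estimate the number of physical cores.

```shell
# Count physical cores as unique (core, socket) pairs reported by lscpu.
PHYS_CORES=$(lscpu -p=Core,Socket 2>/dev/null | grep -v '^#' | sort -u | wc -l)

# Fall back to the logical CPU count if lscpu is missing or reports nothing.
if [ "$PHYS_CORES" -lt 1 ]; then
    PHYS_CORES=$(nproc)
fi

export OMP_NUM_THREADS=$PHYS_CORES
export KMP_AFFINITY=scatter,granularity=fine
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS KMP_AFFINITY=$KMP_AFFINITY"
```

With hyperthreading turned off, the logical CPU count from the fallback
equals the physical core count, so both paths give the same value.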

Dataset
-------
The built-in defaults start at square matrix size N=512, advance in steps of
512, and end at 10240 for DGEMM and 16384 for SGEMM.
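
To make the default sweep concrete, the sketch below lists the matrix sizes
implied by a start, an end, and a step. gemm_sizes is a hypothetical helper,
not part of this package, and it assumes the final size is included in the
sweep.

```shell
# Hypothetical helper: print every size from $1 up to and including $2
# in steps of $3. The inclusive end point is an assumption.
gemm_sizes() {
    n=$1
    while [ "$n" -le "$2" ]; do
        echo "$n"
        n=$((n + $3))
    done
}

# Number of sizes in the default DGEMM and SGEMM sweeps.
DGEMM_POINTS=$(gemm_sizes 512 10240 512 | wc -l | tr -d ' ')
SGEMM_POINTS=$(gemm_sizes 512 16384 512 | wc -l | tr -d ' ')
echo "DGEMM: $DGEMM_POINTS sizes, SGEMM: $SGEMM_POINTS sizes"
```

Under that assumption, the default DGEMM sweep covers 20 sizes and the
default SGEMM sweep covers 32.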

Benchmark command line options
------------------------------
-n <number of threads>
	Default value = -1

-i <minimum number of iterations>
	Default value = 2

-t <minimum time>
	Default value = 2.0

-f <initial size of testing matrix>
	Default value = 512

-l <final size of testing matrix>
	Default value = 16384 for SINGLE
	Default value = 10240 for DOUBLE
	Default value = 10240 for COMPLEX
	Default value =  8192 for DOUBLE COMPLEX

-s <step size>
	Default value = 512
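
As an illustration of how these options combine, the sketch below assembles a
shortened sweep. It assumes the options are passed directly to a benchmark
binary such as dgemm_mkl_native_cpu.x. The values are arbitrary examples, and
the BENCH, FIRST, LAST, STEP, and CMD names are introduced here only for
illustration.

```shell
# Assumption: the binary accepts the options listed above on its command line.
BENCH=./dgemm_mkl_native_cpu.x
FIRST=1024   # -f: initial matrix size
LAST=4096    # -l: final matrix size
STEP=1024    # -s: step size

CMD="$BENCH -f $FIRST -l $LAST -s $STEP"

if [ -x "$BENCH" ]; then
    # Run the shortened sweep when the binary is present.
    $CMD
else
    # Otherwise only show the command that would be run.
    echo "dry run: $CMD"
fi
```

The dry-run branch lets you check the assembled command line from outside the
bin directory before starting a long run.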

