ReproMPI Benchmark

Introduction

The ReproMPI Benchmark is a tool designed to accurately measure the run-time of MPI blocking collective operations. It provides multiple process synchronization methods and a flexible mechanism for predicting the number of measurements that are sufficient to obtain statistically sound results.

Installation

Prerequisites

an MPI library
CMake (version >= 2.6)
GSL libraries

Basic installation

cd $BENCHMARK_PATH
./cmake .
make

For specific configuration options check the Benchmark Configuration section.

Running the ReproMPI Benchmark

The ReproMPI code is designed to serve two specific purposes:

Benchmarking of MPI collective calls

The most common usage scenario of the benchmark is to specify an MPI collective function to be benchmarked, a (list of) message sizes and the number of measurement repetitions for each test, as in the following example.

mpirun -np 4 ./bin/mpibenchmark --calls-list=MPI_Bcast,MPI_Allgather 
             --msizes-list=8,1024,2048  --nrep=10

Estimating the number of measurement iterations

In this scenario, the user can generate an estimation of the number of measurements required for a stable result with respect to one or multiple prediction methods.

More details about the various methods that are supported and their usage can be found in:

S. Hunold, A. Carpen-Amarie, F.D. Lübbe and J.L. Träff, “Automatic Verification of Self-Consistent MPI Performance Guidelines”, EuroPar (2016)

This is an example of how to perform such an estimation:

mpirun -np 4 ./bin/mpibenchmarkPredNreps --calls-list=MPI_Bcast,MPI_Allgather 
              --msizes-list=8,1024,2048 --rep-prediction=min=1,max=200,step=5

Command-line Options

Common Options

-h print help
-v print run-times measured for each process
--msizes-list=<values> list of comma-separated message sizes in Bytes, e.g., --msizes-list=10,1024
--msize-interval=min=<min>,max=<max>,step=<step> list of power of 2 message sizes as an interval between 2^min and 2^max, with 2^step distance between values, e.g., --msize-interval=min=1,max=4,step=1
--calls-list=<args> list of comma-separated MPI calls to be benchmarked, e.g., --calls-list=MPI_Bcast,MPI_Allgather
--root-proc=<process_id> root node for collective operations
--operation=<mpi_op> MPI operation applied by collective operations (where applicable), e.g., --operation=MPI_BOR.
Supported operations: MPI_BOR, MPI_BAND, MPI_LOR, MPI_LAND, MPI_MIN, MPI_MAX, MPI_SUM, MPI_PROD
--datatype=<mpi_type> MPI datatype used by collective operations, e.g., --datatype=MPI_CHAR.
Supported datatypes: MPI_CHAR, MPI_INT, MPI_FLOAT, MPI_DOUBLE
--shuffle-jobs shuffle experiments before running the benchmark
--params=k1:v1,k2:v2 list of comma-separated key:value pairs to be printed in the benchmark output.
-f | --input-file=<path> input file containing the list of benchmarking jobs (tuples of MPI function, message size, number of repetitions). It replaces all the other common options.

Options Related to the Window-based Synchronization

--window-size=<win> window size in microseconds for Window-based synchronization

Specific options for synchronization methods based on a linear model of the clock drift

--fitpoints=<nfit> number of fitpoints (default: 20)
--exchanges=<nexc> number of exchanges (default: 10)

Specific Options for the ReproMPI Benchmark

--nrep=<nrep> set number of experiment repetitions
--summary=<args> list of comma-separated data summarizing methods (mean, median, min, max), e.g., --summary=mean,max

Specific Options for Estimating the Number of Repetitions

--rep-prediction=min=<min>,max=<max>,step=<step> set the total number of repetitions to be estimated between <min> and <max>, so that at each iteration i, the number of measurements (nrep) is either nrep(0) = <min>, or nrep(i) = nrep(i-1) + <step> * 2^(i-1), e.g., --rep-prediction=min=1,max=4,step=1
--pred-method=m1,m2 comma-separated list of prediction methods, i.e., rse, cov_mean, cov_median (default: rse)
--var-thres=thres1,thres2 comma-separated list of thresholds corresponding to the specified prediction methods (default: 0.01)
--var-win=win1,win2 comma-separated list of (non-zero) windows corresponding to the specified prediction methods; rse does not rely on a measurement window, however a dummy window value is required in this list when multiple methods are used (default: 10)

Supported Collective Operations:

MPI Collectives

MPI_Allgather
MPI_Allreduce
MPI_Alltoall
MPI_Barrier
MPI_Bcast
MPI_Exscan
MPI_Gather
MPI_Reduce
MPI_Reduce_scatter
MPI_Reduce_scatter_block
MPI_Scan
MPI_Scatter

Mockup Functions of Various MPI Collectives

GL_Allgather_as_Allreduce
GL_Allgather_as_Alltoall
GL_Allgather_as_GatherBcast
GL_Allreduce_as_ReduceBcast
GL_Allreduce_as_ReducescatterAllgather
GL_Allreduce_as_ReducescatterblockAllgather
GL_Bcast_as_ScatterAllgather
GL_Gather_as_Allgather
GL_Gather_as_Reduce
GL_Reduce_as_Allreduce
GL_Reduce_as_ReducescatterGather
GL_Reduce_as_ReducescatterblockGather
GL_Reduce_scatter_as_Allreduce
GL_Reduce_scatter_as_ReduceScatterv
GL_Reduce_scatter_block_as_ReduceScatter
GL_Scan_as_ExscanReducelocal
GL_Scatter_as_Bcast

Benchmark Configuration

Process Synchronization Methods

MPI_Barrier

This is the default synchronization method enabled for the benchmark.

Dissemination Barrier

To benchmark collective operations acorss multiple MPI libraries using the same barrier implementation, the benchmark provides a dissemination barrier that can replace the default MPI_Barrier to synchronize processes.

To enable the dissemination barrier, the following flag has to be set before compiling the benchmark (e.g., using the ccmake command).

ENABLE_BENCHMARK_BARRIER

Both barrier-based synchronization methods can alternatively use a double barrier before each measurement.

ENABLE_DOUBLE_BARRIER

Window-based Synchronization

The ReproMPI benchmark implements a window-based process synchronization mechanism, which estimates the clock offset/drift of each process relative to a reference process and then uses the obtained global clocks to synchronize processes before each measurement and to compute run-times.

It relies on one of the following clock synchronization methods:

HCA synchronization: this is the clock synchronization algorithm we propose in []. It computes a linear model of the clock drift of each process. The HCA method can be configured by setting the following flags before compilation.

ENABLE_WINDOWSYNC_HCA 
ENABLE_LOGP_SYNC

The ENABLE_LOGP_SYNC flag determines which variant of the HCA algorithm is used, i.e., either HCA1 (which computes the clock models in O(p) steps) or HCA2 (which requires only O(log p) rounds).

SKaMPI synchronization: it implements the SKaMPI clock synchronization algorithm. To enable it, set the following flag before compilation.

ENABLE_WINDOWSYNC_SK

Jones and Koenig synchronization: it implements the clock synchronization algorithm introduced by Jones and Koenig~[]. To enable it, set the following flag before compilation.

ENABLE_WINDOWSYNC_JK

Timing procedure

The MPI operation run-time is computed in a different manner depending on the selected clock synchronization method. If global clocks are available, the run-times are computed as the difference between the largest exit time and the first start time among all processes.

If a barrier-based synchronization is used, the run-time of an MPI call is computed as the largest local run-time across all processes.

However, the timing proceduce that relies on global clocks can be used in combination with a barrier-based synchronization when the following flag is enabled:

ENABLE_GLOBAL_TIMES

More information regarding the timing procedure can be found in [].

Clock resolution

The MPI_Wtime cll is used by default to obtain the current time. To obtain accurate measurements of short time intervals, the benchmark can rely on the high resolution RDTSC/RDTSCP instructions (if they are available on the test machines) by setting on of the following flags:

ENABLE_RDTSC
ENABLE_RDTSCP

Additionally, setting the clock frequency of the CPU is required to obtain accurate measurements:

FREQUENCY_MHZ                    2300

The clock frequency can also be automatically estimated (as done by the NetGauge tool) by enabling the following variable:

CALIBRATE_RDTSC

However, this method reduces the results accuracy and we advise to manually set the highest CPU frequency instead. More details about the usage of RDTSC-based timers can be found in our research report[].

List of Compilation Flags

This is the full list of compilation flags that can be used to control all the previously detailed configuration parameters.

CALIBRATE_RDTSC                  OFF   
COMPILE_BENCH_TESTS              OFF          
COMPILE_PRED_BENCHMARK           ON                
COMPILE_SANITY_CHECK_TESTS       OFF               
ENABLE_BENCHMARK_BARRIER         OFF             
ENABLE_DOUBLE_BARRIER            OFF             
ENABLE_GLOBAL_TIMES              OFF             
ENABLE_LOGP_SYNC                 OFF             
ENABLE_RDTSC                     OFF             
ENABLE_RDTSCP                    OFF           
ENABLE_WINDOWSYNC_HCA            OFF            
ENABLE_WINDOWSYNC_JK             OFF        
ENABLE_WINDOWSYNC_SK             OFF      
FREQUENCY_MHZ                    2300

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
cmake_modules		cmake_modules
platform_files		platform_files
src		src
tools/convert_tags_to_bench_code		tools/convert_tags_to_bench_code
.gitignore		.gitignore
AUTHORS		AUTHORS
CMakeLists.txt		CMakeLists.txt
ChangeLog		ChangeLog
Changes.txt		Changes.txt
LICENSE		LICENSE
README.org		README.org
create_package.sh		create_package.sh

License

duzhuqi/reprompi

Folders and files

Latest commit

History

Repository files navigation