Girih

This is a pre-alpha release. If you have issues or questions, contact me at tareq.m.malas@gmail.com

(Inspired by http://en.wikipedia.org/wiki/Girih)

Introduction

This tool serves as a test harness for optimization techniques that improve the performance of stencil computations on shared- and distributed-memory systems. Spatial and temporal blocking techniques are implemented for all supported stencil operators. The tool is mainly used to develop and analyze the performance of Multi-core Wavefront Diamond (MWD) tiling techniques, which perform temporal blocking. For MPI communication in distributed-memory setups, naive, halo-first, and diamond tiling options are implemented.

The project supports star stencil operators of different orders in space and time, with constant or variable coefficients and different coefficient symmetry options. The results of the optimized stencil kernels can be verified against the results of unoptimized serial reference implementations.

Timing routines are inserted manually in the code to measure the main components of a run: computation, communication, and idle time. The LIKWID performance tool is used to take hardware-counter measurements at regions of interest in the code. This lets the user measure memory bandwidth, energy consumption, TLB misses, and other performance groups supported by LIKWID.
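To make the verification idea concrete, here is a minimal sketch in pure Python. It is not Girih's code: the function names, the coefficients c0 and c1, the grid size, and the block size are all assumptions chosen for illustration. It applies a 7-point constant-coefficient star stencil with an unoptimized reference sweep and with a simple spatially blocked sweep (a stand-in for an optimized kernel, not MWD tiling), then compares the two results.

```python
"""Illustrative 7-point constant-coefficient stencil; not Girih's implementation."""


def make_grid(n):
    # Deterministic initial data so every run starts from the same state.
    return [[[float((7 * i + 3 * j + k) % 5) for k in range(n)]
             for j in range(n)] for i in range(n)]


def update_point(src, dst, i, j, k, c0, c1):
    # Star stencil: the center point plus its six face neighbors.
    dst[i][j][k] = c0 * src[i][j][k] + c1 * (
        src[i - 1][j][k] + src[i + 1][j][k]
        + src[i][j - 1][k] + src[i][j + 1][k]
        + src[i][j][k - 1] + src[i][j][k + 1])


def reference_sweep(src, dst, c0, c1):
    # Unoptimized serial sweep over all interior points.
    n = len(src)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                update_point(src, dst, i, j, k, c0, c1)


def blocked_sweep(src, dst, c0, c1, block=4):
    # Spatial blocking along j: same updates, visited block by block.
    n = len(src)
    for j0 in range(1, n - 1, block):
        j1 = min(j0 + block, n - 1)
        for i in range(1, n - 1):
            for j in range(j0, j1):
                for k in range(1, n - 1):
                    update_point(src, dst, i, j, k, c0, c1)


def run(sweep, n, steps, c0=0.4, c1=0.1):
    # Both buffers start identical, so boundary values stay fixed.
    src, dst = make_grid(n), make_grid(n)
    for _ in range(steps):
        sweep(src, dst, c0, c1)
        src, dst = dst, src  # swap buffers between time steps
    return src


def max_abs_diff(a, b):
    # Largest pointwise difference between two grids of equal shape.
    return max(abs(x - y)
               for plane_a, plane_b in zip(a, b)
               for row_a, row_b in zip(plane_a, plane_b)
               for x, y in zip(row_a, row_b))


if __name__ == "__main__":
    ref = run(reference_sweep, n=10, steps=3)
    opt = run(blocked_sweep, n=10, steps=3)
    print("max abs difference:", max_abs_diff(ref, opt))
```

Because the blocked sweep only reorders independent point updates, the two results should agree; a nonzero difference would point to a bug in the loop bounds of the optimized variant.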

Installation & compilation

Use the conf/make.conf file to set the Makefile variables to the desired compiler binaries and flags. The main make targets are:

dp: Used to set the problem data to double-precision.
debug: Used for debugging and verification builds.

The make command creates a build directory, named according to the selected precision, that holds the executable "mwd_kernel" and the object files. The dp target creates the "build_dp" directory, and the default build creates the "build" directory.

Usage

The list of available parameters can be printed by passing the argument --help. To see all available stencil operators and optimization techniques, pass the argument --list.

Examples

Run the 7-point constant-coefficient stencil on a grid of size 512^3 for 500 time steps, using the relaxed-synchronization wavefront MWD algorithm. (Auto-tuning may take time, because neither a tile size nor a thread group size is selected.)

export OMP_NUM_THREADS=12; ./build_dp/mwd_kernel --nx 512 --ny 512 --nz 512 --nt 500 --target-kernel 1 --mwd-type 2 --target-ts 2
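When the same run has to be repeated with different thread counts or grid sizes, it can help to assemble the command in a script. The sketch below is an assumption-laden illustration, not one of the project's scripts: build_run is a hypothetical helper, and it only reuses the executable path and the flags that appear in the example above, passing their values through unchanged.

```python
import os
import shlex


def build_run(threads, nx, ny, nz, nt,
              target_kernel=1, mwd_type=2, target_ts=2,
              exe="./build_dp/mwd_kernel"):
    """Return (env, argv) for one run; flags are copied from the README example."""
    # OMP_NUM_THREADS is set in a copy of the environment, not globally.
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    argv = [
        exe,
        "--nx", str(nx), "--ny", str(ny), "--nz", str(nz),
        "--nt", str(nt),
        "--target-kernel", str(target_kernel),
        "--mwd-type", str(mwd_type),
        "--target-ts", str(target_ts),
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_run(threads=12, nx=512, ny=512, nz=512, nt=500)
    # Print the command instead of running it; to launch, one could call
    # subprocess.run(argv, env=env, check=True) from the project root.
    print(f"OMP_NUM_THREADS={env['OMP_NUM_THREADS']}", shlex.join(argv))
```

Keeping the arguments as a list avoids shell quoting issues if the command is later passed to subprocess.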

Performance output summaries

Manual timing routines (using MPI_Wtime) are used to collect the time spent in:

Computation
Communication (including waiting time)
Waiting at MPI_Wait/MPI_Waitall and MPI_Barrier
Total time of the time stepper, including the waiting time at the barrier that is used to synchronize with other MPI processes after completing the time stepper work
Other parts of the code, computed as total - computation - communication. This value is useful for checking whether significant time is spent elsewhere
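The relation between these quantities can be sketched as follows. This is an illustration under assumptions, not the tool's code: the Timings class and its field names are hypothetical, and the values are made up. Since the communication time already includes waiting, the waiting time is not subtracted a second time.

```python
from dataclasses import dataclass


@dataclass
class Timings:
    """Per-process timing totals in seconds (hypothetical container)."""
    computation: float
    communication: float  # already includes the waiting time
    wait: float           # subset of communication, reported separately
    total: float          # full time stepper, including the final barrier

    @property
    def other(self):
        # Time not attributed to computation or communication.
        return self.total - self.computation - self.communication


if __name__ == "__main__":
    t = Timings(computation=7.5, communication=2.0, wait=0.5, total=10.0)
    print(f"other: {t.other:.3f} s ({100 * t.other / t.total:.1f}% of total)")
```

A large "other" share relative to the total suggests that some part of the run is not covered by the timers.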

Additional timing measurements are printed for the MWD implementation. They include details about how time is distributed across each thread and thread group during the run.

A successful experiment prints statistics for the timing results above: the maximum, minimum, and mean values across MPI processes, along with the timing results of rank 0.
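As a rough picture of what these statistics represent, the sketch below reduces one timing value per MPI process to the reported quantities. It is an assumption for illustration only: summarize is a hypothetical helper, and a plain Python list stands in for values that the real code would gather from the processes.

```python
from statistics import fmean


def summarize(per_rank):
    """Reduce one timing value per rank to max, min, mean, and the rank 0 value."""
    if not per_rank:
        raise ValueError("need at least one rank")
    return {
        "max": max(per_rank),
        "min": min(per_rank),
        "mean": fmean(per_rank),
        "rank0": per_rank[0],  # index 0 holds the value measured on rank 0
    }


if __name__ == "__main__":
    # Made-up computation times (seconds) for three ranks.
    stats = summarize([4.0, 2.0, 6.0])
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

A wide gap between the maximum and minimum is a quick indicator of load imbalance between processes.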

Helpful scripts

Several Python scripts (available under scripts/) are used to run experiments for performance measurements and verification regression tests.

verification/run_verification.py 0 1 2: Runs a regression test over the standard and MWD implementations. It covers combinations of topology sizes, domain sizes, and other parameters, with verification.

verification/run_short_verification_0to1.py: Runs a short version of the regression test over the naive and halo-first implementations.

verification/run_short_verification_intra_diamond.py: Runs a short version of the regression test over the MWD implementations.

parse.py exper-file-1 exper-file-2 ...exper-file-n: Takes multiple files, each containing the output of an individual experiment. The parsed results are written in CSV format to summary.csv.

Experiment submission scripts: Most of the remaining scripts run experiments on systems with a job scheduler (for example, Shaheen at KAUST). Execute these scripts from the root directory of the project.

For questions and inquiries, please contact tareq.m.malas@gmail.com
