Theano-MPI

Theano-MPI is a distributed framework for training deep learning models built in Theano, based on data parallelism. Data parallelism is implemented in two ways: Bulk Synchronous Parallel (BSP) and Elastic Averaging SGD (EASGD). This project is an extension of theano_alexnet, aiming to scale the training framework beyond 8 GPUs and across nodes. Please take a look at this technical report for an overview of the implementation details. To cite our work, please use the following bibtex entry.

@article{ma2016theano,
  title = {Theano-MPI: a Theano-based Distributed Training Framework},
  author = {Ma, He and Mao, Fei and Taylor, Graham~W.},
  journal = {arXiv preprint arXiv:1605.08325},
  year = {2016}
}

Theano-MPI can train models built with other framework libraries, e.g., Lasagne, Keras, or Blocks, as long as their model parameters can be exposed as Theano shared variables; see lib/base/models/ for details. You can also build your own models from scratch using basic Theano tensor operations and expose the parameters as Theano shared variables, as in the sketch below. See the wiki for a tutorial on building customized neural networks.
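The following is a minimal, hypothetical sketch (not taken from this repo) of a model whose parameters are exposed as Theano shared variables; the class name, attribute names, and hyperparameters are illustrative assumptions rather than a required interface.

import numpy as np
import theano
import theano.tensor as T

class SoftmaxModel(object):
    """Minimal model whose parameters are plain Theano shared variables."""
    def __init__(self, n_in=784, n_out=10, lr=0.1):
        # Parameters live in shared variables so a data-parallel framework
        # can read and overwrite their buffers during weight exchange.
        self.W = theano.shared(np.zeros((n_in, n_out), dtype=theano.config.floatX), name='W')
        self.b = theano.shared(np.zeros((n_out,), dtype=theano.config.floatX), name='b')
        self.params = [self.W, self.b]  # the exposed parameter list

        x = T.matrix('x')
        y = T.ivector('y')
        p_y = T.nnet.softmax(T.dot(x, self.W) + self.b)
        cost = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])
        grads = T.grad(cost, self.params)
        lr = np.asarray(lr, dtype=theano.config.floatX)
        updates = [(p, p - lr * g) for p, g in zip(self.params, grads)]
        self.train = theano.function([x, y], cost, updates=updates)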

Dependencies

How to run

Prepare image data batches

Follow the procedure in the theano_alexnet README.md to download image data from ImageNet, shuffle the training images, generate data batches, compute the mean image, and generate label files. The preprocessed data files are in hickle format, and each file contains 128 or more images; this is the file batch size B. Any divisor of B can be used as the batch size during training. Set dir_head, train_folder and val_folder in run/config.yaml to point to your preprocessed data.
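As a quick sanity check, you can load a single preprocessed batch with hickle; the path below is a hypothetical placeholder for wherever your batches were generated.

import hickle

# Hypothetical path to one preprocessed training batch file.
batch = hickle.load('/scratch/USERNAME/train_hkl/0000.hkl')
print(batch.shape)  # one axis is the file batch size B (128 or more images)
# Any divisor of B (e.g., 32 or 64 when B = 128) is a valid training batch size.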

Run training sessions on copper

    1. ssh copper.sharcnet.ca
    2. ssh to a compute node, e.g., cop3
    3. set ~/.theanorc to the following:

[global]
mode = FAST_RUN
floatX = float32
base_compiledir = /home/USERNAME/.theano

[cuda]
root = /opt/sharcnet/cuda/7.0.28/toolkit

    4. cd into run/, configure each section in config.yaml, and configure the yaml file for the chosen model, e.g., alexnet.yaml, googlenet.yaml, vggnet.yaml or customized.yaml.
  • to start a BSP training session:

      1. In the "weight exchange" section of config.yaml, choose:
    sync_rule: BSP
    
      2. Choose a parameter exchanging strategy from "ar", "asa32", "asa16" and "copper": "ar" uses Allreduce() from mpi4py (see the Allreduce sketch after this list), "asa32" and "asa16" use the Alltoall-sum-Allgather strategy with float32 and float16 respectively, and "copper" uses the binary reduction strategy designed for the copper GPU topology.
      3. Execute "./run_bsp_workers.sh N", where N is the desired number of workers. N must be a power of 2 when choosing the "asa32", "asa16" or "copper" strategies.
  • to start an EASGD training session:

      1. To start the server and workers in one communicator, configure config.yaml as follows:
    sync_rule: EASGD
    sync_start: True 
    avg_freq: 2 or desired value
    
    # randomness
    shuffle: True
    
      2. Check the example ./run_easgd_4w_sync_start.sh (or ./run_easgd_4w.sh if sync_start is False), decide how many workers to run and which hosts and GPUs to use for each worker and the server, and write your customized run.sh script (see the elastic-averaging sketch after this list for what each exchange does).
      3. Execute your ./run.sh.
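
The sketch below is a minimal mpi4py illustration (not this repo's code) of what the "ar" exchange option amounts to: each worker's contribution is summed across all workers with Allreduce and then averaged, which is the core of BSP data parallelism. Run it with, e.g., mpirun -n 4 python allreduce_sketch.py (a hypothetical file name).

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Stand-in for a flattened gradient or parameter delta computed by this worker.
local_update = np.random.randn(1000).astype(np.float32)

# Sum the contributions from all workers; every rank receives the same result.
summed = np.empty_like(local_update)
comm.Allreduce(local_update, summed, op=MPI.SUM)

averaged = summed / size  # all workers now apply the identical averaged update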

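For EASGD, every avg_freq iterations a worker and the server perform an elastic averaging step. The NumPy sketch below shows only the update rule (it is not this repo's implementation); alpha is a moving-rate hyperparameter assumed here for illustration.

import numpy as np

def easgd_exchange(w_worker, w_center, alpha=0.5):
    """Elastic averaging: worker and center parameters are pulled toward each other."""
    diff = alpha * (w_worker - w_center)
    return w_worker - diff, w_center + diff

w_worker = np.random.randn(4).astype(np.float32)  # one worker's parameters
w_center = np.zeros(4, dtype=np.float32)          # the server's center variable
w_worker, w_center = easgd_exchange(w_worker, w_center)
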
Performance Testing

BSP time per 5120 images in seconds [allow_gc = True]

Model           1 GPU    2 GPU    4 GPU    8 GPU    16 GPU
AlexNet-128b    31.20    15.65    7.78     3.90     -
GoogLeNet-32b   134.90   67.38    33.60    16.81    -
VGGNet-32b      410.3    216.0    113.8    64.7     38.5

How to customize your model

See the wiki for a tutorial on customizing an MLP model in the framework.

Also check out an example incorporation of the 16-layer VGGNet from the Lasagne model zoo to get an idea of how to import Lasagne models into Theano-MPI, sketched below.
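The general idea, shown in the sketch below under assumed layer sizes (this is not the VGG-16 definition from the model zoo), is that Lasagne already stores its parameters as Theano shared variables, so they can be collected and handed to the framework.

from lasagne.layers import InputLayer, DenseLayer, get_all_params

# Illustrative two-layer network; substitute the actual VGG-16 definition.
net = InputLayer((None, 3 * 224 * 224))
net = DenseLayer(net, num_units=1000)

params = get_all_params(net, trainable=True)  # list of Theano shared variables
print([p.name for p in params])               # e.g., ['W', 'b']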

Note

To get the best running speed, the memory cache may need to be cleared before running.

To get deterministic and reproducible results, turn off all randomness in the config 'random' section and use cudaconvnet from pylearn2.

Shuffling training examples before asynchronous training makes the training loss a lot smoother during convergence.

Some known bugs and possible enhancements are listed in Issues. We welcome all kinds of participation (bug reports, discussion, pull requests, etc.) in improving the framework.
