The data files are on cambridge.csail.mit.edu at /home/amirf/amir_superurop/machine-learning-flow-sizes.
The cachenet_experiments directory contains tcpdump captures and iteration measurements from distributed training of ResNet50, VGG19, and GPT-2 using KungFu.
The cerberus_experiments directory contains tcpdump captures and iteration measurements from distributed training of MobileNet, DenseNet121, InceptionV3, ResNet50, and VGG19 using Horovod.
The paper uses data from the Horovod training experiments in cerberus_experiments.
This repository does not directly contain the processed data files used for plotting. We instead chose to open-source the data collection and processing scripts so that others can replicate the experiments themselves. Each experiment and model has a run_steps.sh bash script which describes the precise experiment to be run on Google Cloud GPU servers. All details about these machines and their arrangement can be found in the paper.
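As a rough illustration, a wrapper like the following could drive the per-model run_steps.sh scripts. This is a sketch, not part of the repository: the directory layout (cerberus_experiments/<model>/run_steps.sh) and the lowercase model directory names are assumptions, so adjust the paths to match the actual checkout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: invoke each model's run_steps.sh experiment script.
# The base directory and per-model layout below are assumptions for
# illustration only; check the repository for the real structure.

run_all() {
  local base="${1:-cerberus_experiments}"   # assumed top-level directory
  local model script
  for model in mobilenet densenet121 inceptionv3 resnet50 vgg19; do
    script="${base}/${model}/run_steps.sh"
    if [[ -f "$script" ]]; then
      echo "running ${script}"
      bash "$script"                         # launches the actual experiment
    else
      echo "skipping ${model}: ${script} not found"
    fi
  done
}

run_all "$@"
```

Running each script on the Google Cloud GPU setup described in the paper should reproduce the corresponding tcpdump captures and iteration measurements.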
For any questions, please feel free to reach out to amirf at mit.edu :)