amirfarhat/machine-learning-flow-sizes

The data files are on cambridge.csail.mit.edu at /home/amirf/amir_superurop/machine-learning-flow-sizes

The cachenet_experiments directory houses tcpdump and iteration measurements collected during distributed learning of ResNet50, VGG19, and GPT-2 using KungFu.

The cerberus_experiments directory houses tcpdump and iteration measurements collected during distributed learning of MobileNet, DenseNet121, InceptionV3, ResNet50, and VGG19 using Horovod.

The paper uses data from training experiments done using Horovod as part of the cerberus_experiments.

This repository does not directly contain the processed data files used for plotting. Instead, we chose to open-source the data collection and processing scripts so that others can replicate the experiments themselves. Each experiment and model has a run_steps.sh bash script that describes the precise experiment to be run on Google Cloud GPU servers. Details about these machines and their arrangement can be found in the paper.
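For readers who want a feel for what processing a tcpdump capture into flow sizes might look like, here is a minimal sketch. It is not taken from the repository's scripts: the names `PACKET_RE`, `flow_sizes`, and `SAMPLE` are hypothetical, and it assumes the capture was printed as text with `tcpdump -nn`, so that each IPv4 packet line ends with a `length N` field.

```python
import re
from collections import defaultdict

# Assumed input: default text output of `tcpdump -nn` for IPv4 packets, e.g.
# 12:00:00.000001 IP 10.0.0.1.5000 > 10.0.0.2.6000: Flags [P.], seq 1:1449, ack 1, win 512, length 1448
PACKET_RE = re.compile(
    r"^\S+ IP (?P<src>\d+\.\d+\.\d+\.\d+\.\d+) > (?P<dst>\d+\.\d+\.\d+\.\d+\.\d+): "
    r".*\blength (?P<length>\d+)"
)


def flow_sizes(lines):
    """Sum reported payload bytes per directed (src, dst) pair of 'ip.port' strings."""
    totals = defaultdict(int)
    for line in lines:
        match = PACKET_RE.match(line)
        if match is None:
            continue  # skip lines that are not IPv4 packets with a length field
        totals[(match["src"], match["dst"])] += int(match["length"])
    return dict(totals)


# Hypothetical sample lines, made up for illustration only.
SAMPLE = [
    "12:00:00.000001 IP 10.0.0.1.5000 > 10.0.0.2.6000: Flags [P.], seq 1:1449, ack 1, win 512, length 1448",
    "12:00:00.000002 IP 10.0.0.1.5000 > 10.0.0.2.6000: Flags [P.], seq 1449:2897, ack 1, win 512, length 1448",
    "12:00:00.000003 IP 10.0.0.2.6000 > 10.0.0.1.5000: Flags [.], ack 2897, win 512, length 0",
    "12:00:00.000004 ARP, Request who-has 10.0.0.2 tell 10.0.0.1, length 28",
]

sizes = flow_sizes(SAMPLE)
for (src, dst), total in sizes.items():
    print(f"{src} -> {dst}: {total} bytes")
```

Each direction of a connection is counted as its own flow here. The actual definition of a flow used in the experiments is given by the repository's scripts and the paper, so treat this only as a starting point.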

For any questions, please feel free to reach out to amirf at mit.edu :)
