tusharkalecam/DistributedDeepLearning

 
 

Introduction

This repo contains a cookiecutter template for running distributed training of deep learning models using Azure Machine Learning. You can create clusters with 0 nodes, which incur no cost, and scale them up to hundreds of nodes. It is also possible to use low priority nodes to reduce costs even further.

The project contains the following:

Tensorflow Benchmark

This is a demo template that allows you to easily run tf_cnn_benchmarks on Azure ML. This is a great way to test performance as well as to compare against other platforms.

Tensorflow Imagenet

This is another demo template that shows you how to train a ResNet50 model using Imagenet on Azure. We include scripts for processing the Imagenet data, transforming them to TF Records as well as leveraging AzCopy to quickly upload the data to the cloud.

Tensorflow Template

This is a blank template you can use for your own distributed training projects. It allows you to leverage all the tooling built around the previous two demos to speed up the time it takes to run your model in a distributed fashion on Azure.

PyTorch Benchmark

This is a demo template that allows you to easily run a simple PyTorch benchmarking script on Azure ML. This is a great way to test performance as well as to compare against other platforms.

PyTorch Imagenet

This is another demo template that shows you how to train a ResNet50 model using Imagenet on Azure. We include scripts for processing the Imagenet data as well as leveraging AzCopy to quickly upload the data to the cloud.

PyTorch Template

This is a blank template you can use for your own distributed training projects. It allows you to leverage all the tooling built around the previous two demos to speed up the time it takes to run your model in a distributed fashion on Azure.

Prerequisites

Before you get started you need a PC running Ubuntu with the following installed:

  • Docker
  • Nvidia runtime for Docker [required for local execution]
  • Python >= 3.6
  • Cookiecutter
  • Git

Note: You will need to run docker without sudo. To do this run:

sudo usermod -aG docker $USER
newgrp docker 
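The prerequisites can be checked up front with a short script. The following is a minimal sketch, not part of the template: it only looks for the docker, git and cookiecutter executables on PATH and checks the Python version. It does not verify the Nvidia runtime or that docker works without sudo.

```python
import shutil
import sys

# Command-line tools named in the prerequisites list.
REQUIRED_TOOLS = ("docker", "git", "cookiecutter")
MIN_PYTHON = (3, 6)


def missing_prerequisites(which=shutil.which, python_version=None):
    """Return a list of problems; an empty list means every check passed."""
    version = tuple(python_version or sys.version_info[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append("Python >= 3.6 is required, found {}.{}".format(*version))
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append("'{}' was not found on PATH".format(tool))
    return problems


for problem in missing_prerequisites():
    print("MISSING:", problem)
```

If the script prints nothing, the basic tools are in place.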

Setup

Using the template

Once you have Cookiecutter installed you can either directly invoke project creation as follows:

cookiecutter gh:Microsoft/distributeddeeplearning

or clone locally and then invoke

git clone https://github.com/Microsoft/distributeddeeplearning.git
cookiecutter distributeddeeplearning

Cookiecutter will then ask you about a number of fields which it will use to construct your project. To accept a default, press Enter without typing or selecting anything. Many fields can be left at their default values; the ones that are absolutely necessary are highlighted.

project_title: The title of your project

project_name: The folder in which your project will be created. Make sure it is a valid Linux folder name

resource_group: The name of the resource group in Azure under which all the resources will be created. It is fine if it already exists

workspace: The AML workspace that the project will use. If it doesn't already exist it will be created

sub_id: The subscription id for your project. You can look this up on the portal or run a command in the cloud shell to get it. It isn't mandatory, though: the application will give you an option to select it later.

vm_size: The VM type to use for distributed training

minimum_number_nodes: The minimum number of nodes in the cluster. Set to 0 if you want it to scale down after use to reduce costs
maximum_number_nodes: The maximum number of nodes in the cluster

cluster_name: The name of the cluster to use. It will create it if it doesn't exist

container_registry: The name of your Docker Hub or other registry account to which you may want to push your control plane docker container. If you don't have one, or don't want to push the container to it, simply leave as default

type: The type of project you want:

  • all: All of them
  • template: Just create a template for distributed training
  • benchmark: Create a project that will run the Tensorflow benchmarks
  • imagenet: Create an example project that will run against the imagenet data. (You will need to download the imagenet data)

region: Which region to create Azure resources in

experiment_name: The name of the experiment

data: The absolute path on your computer where you will store the imagenet data. The location needs to have around 400GB of space

image_name: The name to give the control plane docker image

datastore_name: Name of the datastore that will be created as part of the project

container_name: The name of the container in your storage account that will hold the data
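The same fields can also be supplied without interactive prompts through Cookiecutter's Python API. The sketch below is one way to do that, under stated assumptions: the folder-name pattern is a conservative guess at "valid Linux folder name" rather than a rule the template enforces, and the values in the usage note are placeholders.

```python
import re

TEMPLATE = "gh:Microsoft/distributeddeeplearning"

# Conservative pattern for a folder name that is safe on Linux (an assumption
# of this sketch): letters, digits, '.', '_' and '-'.
_SAFE_NAME = re.compile(r"^[A-Za-z0-9._-]+$")


def build_context(project_name, resource_group, workspace, **overrides):
    """Collect answers for the fields described above into one dictionary."""
    if not _SAFE_NAME.match(project_name) or project_name in (".", ".."):
        raise ValueError("not a safe folder name: {!r}".format(project_name))
    context = {
        "project_name": project_name,
        "resource_group": resource_group,
        "workspace": workspace,
    }
    context.update(overrides)  # e.g. vm_size, cluster_name, region
    return context


def create_project(context):
    """Generate the project without interactive prompts."""
    # Imported lazily so build_context works without cookiecutter installed.
    from cookiecutter.main import cookiecutter

    # no_input=True takes the default from cookiecutter.json for every field
    # that is not overridden through extra_context.
    return cookiecutter(TEMPLATE, no_input=True, extra_context=context)
```

For example, create_project(build_context("ddl_demo", "ddl-rg", "ddl-ws", cluster_name="gpucluster")) would generate a project in a folder named ddl_demo, leaving every other field at its default.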

Once the project is created you will still be able to change many of the above options, as they will be present in the .env file that is created.
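To inspect those settings from a script, the .env file can be read as plain KEY=VALUE lines. This is a minimal sketch and not the project's own config.py module; the key names in the usage note are hypothetical, since the actual keys depend on the generated file.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def load_env(path=".env"):
    """Read and parse the .env file in the project root."""
    with open(path, encoding="utf-8") as handle:
        return parse_env(handle.read())
```

Given a file containing a line such as CLUSTER_NAME=gpucluster (a hypothetical key), load_env()["CLUSTER_NAME"] would return "gpucluster".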

Building environment

Distributed training is complex and often has a number of moving parts. To reduce the overhead of installing packages and managing environments we use a docker container to encapsulate our environment. Once you have created the project, simply navigate to the root folder created by cookiecutter and run:

make build

This will build your docker container. Inside the container will be an appropriately set up conda environment, a number of utilities such as AzCopy, and everything else you will need to run your distributed training job. Once your container is built run:

make run

This will put you in an environment inside your container in a tmux session (for a tutorial on tmux see here). The tmux control key has been mapped to ctrl+a rather than the standard ctrl+b so as not to interfere with an outer tmux session if you are already a tmux user. You can alter this in the tmux.conf file in the Docker folder. The docker container maps the location you launched it from to /workspace inside the container. Therefore you can edit files in the project folder outside the container and the changes will be reflected inside it.
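The folder mapping means that every file under the project folder has a predictable path inside the container. A small sketch of that translation follows; the host paths in the usage note are hypothetical.

```python
from pathlib import PurePosixPath

# The project folder is mapped to this location inside the container.
CONTAINER_ROOT = PurePosixPath("/workspace")


def to_container_path(project_root, host_path):
    """Translate a host path inside the project folder to its container path."""
    # relative_to raises ValueError when host_path is outside project_root,
    # which matches the fact that such files are not visible in the container.
    relative = PurePosixPath(host_path).relative_to(PurePosixPath(project_root))
    return CONTAINER_ROOT / relative
```

With the project at /home/me/ddl_demo, the file /home/me/ddl_demo/tasks.py appears in the container as /workspace/tasks.py.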

Imagenet data

If you selected all, tensorflow_imagenet or pytorch_imagenet for the type question during cookiecutter invocation, you will need to have ILSVRC2012_img_train.tar and ILSVRC2012_img_val.tar present in the directory you specified as your data directory. Go to the download page (you may need to register an account) and find the page for ILSVRC2012, then download these two files.
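Before running setup it can help to confirm that both archives are where they are expected. The sketch below checks for the two files and for disk space; treating "around 400GB" as required free space is an assumption of this sketch.

```python
import os
import shutil

REQUIRED_ARCHIVES = ("ILSVRC2012_img_train.tar", "ILSVRC2012_img_val.tar")
# "around 400GB" comes from the description of the data field; interpreting
# it as free space on the volume is an assumption.
RECOMMENDED_FREE_BYTES = 400 * 10**9


def check_data_dir(data_dir, isfile=os.path.isfile, disk_usage=shutil.disk_usage):
    """Return (missing archive names, whether free space meets the recommendation)."""
    missing = [name for name in REQUIRED_ARCHIVES
               if not isfile(os.path.join(data_dir, name))]
    enough_space = disk_usage(data_dir).free >= RECOMMENDED_FREE_BYTES
    return missing, enough_space
```

Calling check_data_dir with the path you gave for the data field returns the list of archives that still need to be downloaded.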

Template selection

Based on the option you selected for type during the cookiecutter invocation you will get all or one of the options below. Cookiecutter will create your project folder, which will contain the template folders. When inside your project folder make sure you have run the make build and make run commands as described in the Building environment section above. Once you run make run you will be greeted by a prompt; this is now your control plane. First you will need to set everything up. To do this run:

inv setup 

It will ask you to log in so follow the prompts in the terminal. If you selected all in the template type it will also prepare the imagenet data. Now you will be ready to run the templates.

Tensorflow Benchmark

This is a demo template that allows you to easily run tf_cnn_benchmarks on Azure ML. This is a great way to test performance as well as to compare against other platforms. To use this you must select either benchmark or all when invoking cookiecutter. Once setup is complete simply run:

inv tf-benchmark.submit.local.synthetic

to run things locally on a single GPU. Note that the first time you run things you will have to build the environment. To run things on a cluster simply run:

inv tf-benchmark.submit.remote.synthetic

Note that this will create the cluster if it wasn't created earlier and create the appropriate environment.
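Creating the cluster on demand can be sketched with the Azure ML Python SDK (azureml-core). This is an illustration of the general pattern, not the code in aml_compute.py; the VM size in the usage note is only an example, and the workspace object is assumed to have been obtained elsewhere.

```python
def cluster_settings(vm_size, min_nodes=0, max_nodes=4, low_priority=False):
    """Keyword arguments for AmlCompute.provisioning_configuration."""
    if min_nodes < 0 or max_nodes < max(min_nodes, 1):
        raise ValueError("need 0 <= min_nodes <= max_nodes and max_nodes >= 1")
    return {
        "vm_size": vm_size,
        "min_nodes": min_nodes,  # 0 lets the cluster scale down after use
        "max_nodes": max_nodes,
        "vm_priority": "lowpriority" if low_priority else "dedicated",
    }


def get_or_create_cluster(workspace, cluster_name, settings):
    """Return the named cluster, creating it only if it does not exist yet."""
    # Imported lazily: these require the azureml-core package.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        return ComputeTarget(workspace=workspace, name=cluster_name)
    except ComputeTargetException:
        config = AmlCompute.provisioning_configuration(**settings)
        cluster = ComputeTarget.create(workspace, cluster_name, config)
        cluster.wait_for_completion(show_output=True)
        return cluster
```

For instance, cluster_settings("Standard_NC24rs_v3", min_nodes=0, max_nodes=8, low_priority=True) describes a cluster that idles at zero nodes and uses low priority VMs.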

Tensorflow Imagenet

This is the second demo template that will train a ResNet50 model on imagenet. It allows the options of using synthetic data, image data as well as tfrecords. To use this you must select either tensorflow_imagenet or all when cookiecutter asks what type of project you want to create. To run things locally using synthetic data simply run:

inv tf-imagenet.submit.local.synthetic

To run things on a remote cluster with real data in tfrecords format simply run:

inv tf-imagenet.submit.remote.tfrecords

This only covers a small number of commands. To see the full list simply run inv --list.
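The submit commands follow a regular naming scheme: project, then submit, then local or remote, then the data source. A sketch of a helper that builds and runs these names is shown below; the table of data sources is copied from the inv --list output in the Options section, and the helper itself is hypothetical rather than part of the template.

```python
import subprocess

# Data sources available per project, taken from the `inv --list` output.
DATA_SOURCES = {
    "tf-benchmark": ("synthetic",),
    "tf-imagenet": ("images", "synthetic", "tfrecords"),
    "pytorch-benchmark": ("synthetic",),
    "pytorch-imagenet": ("images", "synthetic"),
}


def task_name(project, target, data):
    """Build an invoke task name such as 'tf-imagenet.submit.remote.tfrecords'."""
    if project not in DATA_SOURCES:
        raise ValueError("unknown project: {}".format(project))
    if target not in ("local", "remote"):
        raise ValueError("target must be 'local' or 'remote'")
    if data not in DATA_SOURCES[project]:
        raise ValueError("{} has no '{}' data source".format(project, data))
    return "{}.submit.{}.{}".format(project, target, data)


def submit(project, target, data, run=subprocess.run):
    """Run the matching invoke task; call this inside the control plane container."""
    return run(["inv", task_name(project, target, data)], check=True)
```

submit("tf-imagenet", "local", "synthetic") is then equivalent to typing inv tf-imagenet.submit.local.synthetic at the prompt.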

Tensorflow Experiment

This is the option that you should use if you want to run your own training script. It is up to you to add the appropriate training scripts and modify the tensorflow_experiment.py file to run the appropriate commands. If you want to see how to invoke things simply look at the other examples.

PyTorch Benchmark

This is a demo template that allows you to easily run a simple PyTorch benchmarking script on Azure ML. To use this you must select either benchmark or all when invoking cookiecutter. Once setup is complete simply run:

inv pytorch-benchmark.submit.local.synthetic

to run things locally on a single GPU. Note that the first time you run things you will have to build the environment. To run things on a cluster simply run:

inv pytorch-benchmark.submit.remote.synthetic

Note that this will create the cluster if it wasn't created earlier and create the appropriate environment.

PyTorch Imagenet

This is the second demo template that will train a ResNet50 model on imagenet. It allows the options of using synthetic data or image data. To use this you must select either pytorch_imagenet or all when cookiecutter asks what type of project you want to create. To run things locally using synthetic data simply run:

inv pytorch-imagenet.submit.local.synthetic

To run things on a remote cluster with real image data simply run:

inv pytorch-imagenet.submit.remote.images

PyTorch Experiment

This is the option that you should use if you want to run your own training script. It is up to you to add the appropriate training scripts and modify the pytorch_experiment.py file to run the appropriate commands. If you want to see how to invoke things simply look at the other examples.

Architecture

Below is a diagram that shows how the project is set up.

The docker container you created using make build is the control plane, and from there we can invoke jobs to execute either locally or in the cloud. Local execution is meant for debugging and will run on a single GPU. The mapping of data locations is handled by the control scripts. During local execution the appropriate location is mapped to the container. During remote execution the data store created during setup is mounted onto each of the VMs in the cluster.

Project structure

The original project structure is as shown below.

├── cookiecutter.json  <-- Cookiecutter json that holds all the variables for the projects  
├── hooks  
│  ├── post_gen_project.py  
│  └── pre_gen_project.py  
├── images  
│  └── demo.svg  
├── LICENSE  
├── README.md <-- This readme  
└── {{cookiecutter.project_name}}  
   ├── _dotenv_template <-- Template that is read and translated into .env file  
   ├── control <-- Holds all files for the control plane  
   │  ├── Docker <-- Contains the files used to build the control plane docker container
   │  │  ├── azure_requirements.txt <-- Azure python requirements
   │  │  ├── bash.completion <-- Completion script for invoke
   │  │  ├── dockerfile
   │  │  ├── environment.yml <-- Conda environment specification for control plane
   │  │  ├── jupyter_notebook_config.py 
   │  │  └── tmux.conf <-- Tmux configuration
   │  └── src
   │     ├── aml_compute.py <-- Module that holds methods for creating cluster and submitting experiments using Azure ML
   │     ├── config.py <-- Module for loading and working with .env config
   │     └── logging.conf <-- Logging configuration for control plane
   ├── Makefile <-- Makefile to build and run control plane
   ├── scripts
   │  ├── convert_imagenet_to_tf_records.py <-- Script for transforming imagenet data to tf records
   │  ├── image.py <-- Invoke module for working with images
   │  ├── imagenet_nounid_to_class.json <-- Imagenet nounid lookup
   │  ├── prepare_imagenet.py <-- Script for preparing imagenet data
   │  ├── storage.py <-- Invoke module for using Azure storage
   │  └── tfrecords.py <-- Invoke module for working with tf records
   ├── tasks.py <-- Main invoke module
   ├── PyTorch_benchmark <-- Template for running PyTorch benchmarks
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── pytorch_benchmark.py <-- Invoke module for running benchmarks
   │  └── src
   │     └── pytorch_synthetic_benchmark.py
   ├── PyTorch_imagenet
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── pytorch_imagenet.py <-- Invoke module for running benchmarks
   │  └── src
   │     ├── imagenet_pytorch_horovod.py
   │     ├── logging.conf
   │     └── timer.py
   ├── PyTorch_experiment <-- PyTorch distributed training template [Put your code here]
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── pytorch_experiment.py <-- Invoke module for running benchmarks
   │  └── src
   │     └── train_model.py
   ├── TensorFlow_benchmark <-- Template for running Tensorflow benchmarks
   │  ├── environment_cpu.yml 
   │  ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── src <-- Folder where tensorflow benchmarks code will be cloned into
   │  └── tensorflow_benchmark.py <-- Invoke module for running benchmarks
   ├── TensorFlow_experiment <-- Tensorflow distributed training template [Put your code here]
   │  ├── environment_cpu.yml
   │  ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
   │  ├── src
   │  │  ├── logging.conf
   │  │  └── train_model.py <-- Template file
   │  └── tensorflow_experiment.py <-- Invoke module for running template
   └── TensorFlow_imagenet
      ├── environment_cpu.yml
      ├── environment_gpu.yml <-- Conda specification file used by Azure ML to create environment to run project in
      ├── src <-- Code for training ResNet50 model on imagenet
      │  ├── data
      │  │  ├── __init__.py
      │  │  ├── images.py
      │  │  ├── synthetic.py
      │  │  └── tfrecords.py
      │  ├── defaults.py
      │  ├── imagenet_preprocessing.py
      │  ├── logging.conf
      │  ├── resnet_main.py <-- Main entry script
      │  ├── resnet_model.py
      │  ├── resnet_run_loop.py
      │  ├── timer.py
      │  └── utils.py
      └── tensorflow_imagenet.py <-- Invoke module for running imagenet experiment

Depending on the options chosen only certain branches will be moved over to your project.
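In a cookiecutter template this kind of selection is typically done in hooks/post_gen_project.py, which runs inside the generated project and can delete what is not needed. The sketch below illustrates the idea only; it is not the repository's actual hook, and the mapping from project type to folders is a guess based on the folder names in the tree above.

```python
import os
import shutil

# Hypothetical mapping from the `type` answer to the template folders kept.
KEEP = {
    "template": {"TensorFlow_experiment", "PyTorch_experiment"},
    "benchmark": {"TensorFlow_benchmark", "PyTorch_benchmark"},
    "imagenet": {"TensorFlow_imagenet", "PyTorch_imagenet"},
}
ALL_TEMPLATES = set().union(*KEEP.values())


def prune_templates(project_dir, project_type):
    """Delete template folders that the chosen project type does not need."""
    keep = ALL_TEMPLATES if project_type == "all" else KEEP[project_type]
    removed = []
    for name in sorted(ALL_TEMPLATES - keep):
        path = os.path.join(project_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(name)
    return removed
```

In a real hook the project type would come from the rendered cookiecutter context rather than a function argument.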

Options

These are the options available when using the template. They can differ depending on the type of project you choose to create. To see this list yourself simply run:

inv --list
  delete                                     Delete the resource group and all associated resources
  experiments                                Prints list of experiments
  interactive (i)                            Open IPython terminal and load in modules to work with AzureML
  login                                      Log in to Azure CLI
  runs                                       Prints information on last N runs in specified experiment
  select-subscription                        Select Azure subscription to use
  setup                                      Setup the environment and process the imagenet data
  tensorboard                                Runs tensorboard in a seperate tmux session
  pytorch-benchmark.submit.local.synthetic    Submit PyTorch training job using synthetic data for local execution
  pytorch-benchmark.submit.remote.synthetic   Submit PyTorch training job using synthetic data to remote cluster
  pytorch-imagenet.submit.local.images        Submit PyTorch training job using real imagenet data for local execution
  pytorch-imagenet.submit.local.synthetic     Submit PyTorch training job using synthetic imagenet data for local execution
  pytorch-imagenet.submit.remote.images       Submit PyTorch training job using real imagenet data to remote cluster
  pytorch-imagenet.submit.remote.synthetic    Submit PyTorch training job using synthetic imagenet data to remote cluster
  storage.create-resource-group
  storage.store-key                          Retrieves premium storage account key from Azure and stores it in .env file
  storage.image.create-container             Creates container based on the parameters found in the .env file
  storage.image.download-data                Download training and validation data from blob container specified in .env file
  storage.image.download-training            Download training data from blob container specified in .env file
  storage.image.download-validation          Download validation data from blob container specified in .env file
  storage.image.prepare-imagenet             Prepare imagenet data found in download_dir and push results to target_dir
  storage.image.upload-data                  Upload training and validation data to container specified in .env file
  storage.image.upload-training-data         Upload training data to container specified in .env file
  storage.image.upload-validation-data       Upload validation data to container specified in .env file
  storage.create-container                   Creates container based on the parameters found in the .env file
  storage.create-premium-storage             Creates premium storage account. By default the values are loaded from the local .env file
  storage.tfrecords.upload-validation-data   Upload tfrecords validation data to container specified in .env file
  tf-benchmark.submit.local.synthetic        Submits TensorFlow benchmark job using synthetic data for local execution
  tf-benchmark.submit.remote.synthetic       Submits TensorFlow benchmark job using synthetic data on remote cluster
  tf-experiment.submit.local.images          This command isn't implemented please modify to use.
  tf-experiment.submit.local.synthetic       This command isn't implemented please modify to use.
  tf-experiment.submit.remote.images         This command isn't implemented please modify to use.
  tf-experiment.submit.remote.synthetic      This command isn't implemented please modify to use.
  tf-imagenet.submit.local.images            Submit TensorFlow training job using real imagenet data for local execution
  tf-imagenet.submit.local.synthetic         Submit TensorFlow training job using synthetic imagenet data for local execution
  tf-imagenet.submit.local.tfrecords         Submit TensorFlow training job using real imagenet data as tfrecords for local execution
  tf-imagenet.submit.remote.images           Submit TensorFlow training job using real imagenet data to remote cluster
  tf-imagenet.submit.remote.synthetic        Submit TensorFlow training job using synthetic imagenet data to remote cluster
  tf-imagenet.submit.remote.tfrecords        Submit TensorFlow training job using real imagenet data as tfrecords to remote cluster
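The listing above is plain text, so it can be turned into a lookup table if you want to script against it. A minimal sketch, assuming the layout shown (indented task name, optional alias in parentheses, then the description):

```python
import re

# A task line is: indentation, the task name, an optional "(alias)" group,
# then two or more spaces and the description (which may be absent).
_TASK_LINE = re.compile(
    r"^\s+(?P<name>\S+)(?:\s\((?P<aliases>[^)]*)\))?(?:\s{2,}(?P<help>.*))?$"
)


def parse_task_list(output):
    """Map each task name in `inv --list` output to its description."""
    tasks = {}
    for line in output.splitlines():
        match = _TASK_LINE.match(line)
        if match:
            tasks[match.group("name")] = (match.group("help") or "").strip()
    return tasks
```

Lines without leading indentation, such as the command itself, are skipped, and tasks without a description map to an empty string.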

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
