alihitawala/TensorFlow_google

1. Matrix Multiplication

### How to run? - ./launch_bigmatrixmultiplication.sh

Time recorded for the complete multiplication: 145 sec (averaged over 5 runs)

Formula used

  $\operatorname{tr}(A^2) = \sum_i \sum_j \operatorname{tr}(A_{ij} A_{ji})$

Implementation trick:

We observed that the operation is faster when A(ij) and A(ji) are on the same machine. Hence we did the following:

  1. Always keep A(ij) and A(ji) on the same machine.
  2. Do not overload any single machine; distribute the trace computations evenly across the cluster.

We created a cache using a hashmap to keep track of the traces and used it in our implementation. A rough sketch of this placement scheme is shown below.
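
Below is a minimal sketch of the placement idea (not the actual script), assuming TensorFlow 1.x; the block size, number of blocks, device names, master address, and the random stand-in blocks are all illustrative placeholders.

```python
import tensorflow as tf

N_BLOCKS = 4      # blocks per side of A (illustrative)
BLOCK = 1000      # rows/cols per block (illustrative)
devices = ["/job:worker/task:%d" % i for i in range(5)]  # assumed 5-VM cluster

partial_traces = []
for i in range(N_BLOCKS):
    for j in range(i, N_BLOCKS):
        # Keep A(ij) and A(ji), and their product, on the same machine,
        # and pick that machine round-robin so the work is spread evenly.
        dev = devices[(i * N_BLOCKS + j) % len(devices)]
        with tf.device(dev):
            a_ij = tf.random_normal([BLOCK, BLOCK])  # stand-in for the real block
            a_ji = tf.random_normal([BLOCK, BLOCK])  # stand-in for the real block
            t = tf.trace(tf.matmul(a_ij, a_ji))
            if i != j:
                # tr(A_ij A_ji) == tr(A_ji A_ij), so compute the pair once.
                t = 2.0 * t
            partial_traces.append(t)

# tr(A^2) is the sum of the per-pair traces.
total_trace = tf.add_n(partial_traces)

with tf.Session("grpc://vm-1:2222") as sess:  # hypothetical master address
    print(sess.run(total_trace))
```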

2. Synchronous SGD

### How to run? - ./launch_synchronoussgd.sh

Learning rate : 0.01

Number of parallel tasks used : 20 (4 on each VM)

Time per iteration in sync mode with 20 tasks : 1.5 sec

In local mode the time for 100 iterations was 5 sec.

NOTE: We also have an implementation that uses 5 workers; its time per iteration was 1.2 sec.

## Implementation details and optimizations done

  1. Used 20 workers to parallelize the job. This gave us a 4x benefit in the number of training examples processed.
  2. The model 'w' was not transferred every time; instead, the index filter was sent to vm-1 (task-0) and the gather was done there to fetch only the filtered entries of 'w'. This removed the network traffic previously caused by shipping the dense vector w.
  3. Examples read from file were never converted to a dense tensor; all operations were done on the sparse tensor. We extracted the values directly from the sparse tensors and did all calculations on them, which sped up the computation by more than 100x.
  4. After the gradient calculation, the output was packed into a sparse tensor and sent back to vm-1. Using sparse tensors saved a lot of network bandwidth.
  5. On vm-1, sparse add was used to sum all the sparse tensors. Since sparse add returns a sparse tensor when both arguments are sparse, we avoided dense tensor additions and saved a lot of computation.
  6. Finally, we used scatter add to apply the aggregated sparse gradient to the dense tensor 'w', as sketched below.
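
The sketch below illustrates points 2-6 (it is not the actual script): it assumes TensorFlow 1.x and a hypothetical cluster layout, and the feature indices, values, loss, NUM_FEATURES, and LEARNING_RATE are made-up placeholders.

```python
import tensorflow as tf

NUM_FEATURES = 1000000    # illustrative dimensionality of the model 'w'
LEARNING_RATE = 0.01

with tf.device("/job:worker/task:0"):      # the model lives on vm-1 (task-0)
    w = tf.Variable(tf.zeros([NUM_FEATURES]), name="model")

sparse_grads = []
for task in range(1, 3):                   # only 2 workers shown for brevity
    with tf.device("/job:worker/task:%d" % task):
        # Non-zero feature indices/values of this worker's example (made up here).
        feat_idx = tf.constant([5, 42, 100], dtype=tf.int64)
        feat_val = tf.constant([1.0, 2.0, 0.5])
        label = tf.constant(1.0)
    with tf.device("/job:worker/task:0"):
        # Ship only the index filter to vm-1 and gather the filtered 'w' there,
        # instead of moving the whole dense vector over the network (point 2).
        w_filtered = tf.gather(w, feat_idx)
    with tf.device("/job:worker/task:%d" % task):
        # Work directly on the sparse values (point 3); the loss here is an
        # illustrative hinge-style stand-in, not the project's actual loss.
        margin = label * tf.reduce_sum(w_filtered * feat_val)
        grad_val = tf.cond(margin < 1.0,
                           lambda: -label * feat_val,
                           lambda: tf.zeros_like(feat_val))
        # Send the gradient back as a sparse tensor (point 4).
        sparse_grads.append(tf.SparseTensor(indices=tf.expand_dims(feat_idx, 1),
                                            values=grad_val,
                                            dense_shape=[NUM_FEATURES]))

with tf.device("/job:worker/task:0"):
    # sparse + sparse stays sparse, so aggregation avoids dense additions (point 5).
    total = sparse_grads[0]
    for g in sparse_grads[1:]:
        total = tf.sparse_add(total, g)
    # scatter_add touches only the non-zero coordinates of 'w' (point 6).
    train_op = tf.scatter_add(w, tf.reshape(total.indices, [-1]),
                              -LEARNING_RATE * total.values)
```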

## Testing examples and calculating accuracy

  1. Read the test examples one row at a time.
  2. Placed this computation on vm-1 for faster access to 'w' (see the sketch below).
  3. It was able to test 2000 examples in a few seconds.
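
A small sketch of this test path (assuming TensorFlow 1.x; the model size, placeholders, and +/-1 labels are illustrative, not the project's actual data format):

```python
import tensorflow as tf

with tf.device("/job:worker/task:0"):      # run next to 'w' for fast access
    w = tf.Variable(tf.zeros([1000000]))   # illustrative model size
    feat_idx = tf.placeholder(tf.int64, [None])
    feat_val = tf.placeholder(tf.float32, [None])
    label = tf.placeholder(tf.float32, [])
    # Sparse dot product: gather only the entries of 'w' the example touches.
    score = tf.reduce_sum(tf.gather(w, feat_idx) * feat_val)
    # Predict the sign and compare against a +/-1 label.
    correct = tf.equal(tf.sign(score), label)
```

Each of the 2000 test rows is then fed through this graph one at a time, and the accuracy is the fraction of rows for which `correct` comes back true.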

3. Asynchronous SGD

### How to run? - ./launch_asyncsgd.sh

Learning rate : 0.01

Number of parallel tasks used : 20 (4 on each VM)

Time per iteration on each worker in async mode : ~5 sec

## Implementation details and optimizations done

All the decisions made in sync mode apply here. In addition, we did the following:

  1. All the sessions were executed on vm-1 task-0. When we instead distributed the sessions uniformly across all the workers, the time per iteration degraded. Specifically:

Sessions on different VMs: 6.65 sec per iteration (aggregated average)

Session on vm-1 task-0: 3.45 sec per iteration (aggregated average)

The explanation for this is given in our submitted report. A rough sketch of the session placement is shown below.
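
This sketch is not the actual script: it assumes TensorFlow 1.x, and the gRPC address, task index, model size, iteration limit, and the placeholder update op are all illustrative.

```python
import tensorflow as tf

TASK_INDEX = 3                                  # this worker's id (illustrative)

with tf.device("/job:worker/task:0"):           # 'w' stays on vm-1 (task-0)
    w = tf.Variable(tf.zeros([1000000]), name="model")

with tf.device("/job:worker/task:%d" % TASK_INDEX):
    # ... build this worker's sparse gradient exactly as in the sync sketch ...
    update_op = w.assign_add(tf.zeros_like(w))  # placeholder for the real update

# Every worker script opens its session against vm-1 task-0 rather than
# against its own local master.
with tf.Session("grpc://vm-1:2222") as sess:    # hypothetical address of task-0
    sess.run(tf.global_variables_initializer())  # in practice run by one worker
    local_iterations = 0                         # iterations tracked locally
    while local_iterations < 1000:               # no barrier: fully asynchronous
        sess.run(update_op)
        local_iterations += 1
```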

### Accuracy and how to keep track of iterations done in async?

We kept count of the iterations locally on each worker. This is an approximate method, but it worked quite well for us; with a global tracker we ran into race conditions. We also distributed the task of calculating the error rate across the different tasks. This lets every task run for roughly the same amount of time: if we ran all the tests on a single worker, that worker would finish last, and the last few error-rate measurements would fall at approximately the same iteration count.

4. Batch synchronous SGD

### How to run? - ./launch_batchsynchronoussgd.sh

Learning rate : 0.01

Number of parallel tasks used : 5 (1 on each VM)

Time per iteration in sync mode with 20 tasks : 1.6 sec

## Implementation details and optimizations done

All the decisions made in sync mode apply here. In addition, we did the following:

  1. Read a batch of examples and, in a loop, slice out each example to calculate its local_gradient (see the sketch after this list).
  2. All operations were done on sparse tensors.
  3. Added all the local gradients to obtain the total_gradient, which was passed to vm-1 for aggregation.
  4. Aggregation was done the same way as in sync mode (with all the optimizations above).
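
A minimal sketch of the batch loop (assuming TensorFlow 1.x; the batch size, the stand-in per-example data, and the squared-error gradient are illustrative, and the real scripts slice the examples out of a batch read from file):

```python
import tensorflow as tf

NUM_FEATURES = 1000000    # illustrative
BATCH_SIZE = 100          # illustrative

def example_gradient(feat_idx, feat_val, label, w):
    # Illustrative per-example sparse gradient (squared-error stand-in for the
    # real loss), built entirely from sparse pieces as in the sync section.
    w_f = tf.gather(w, feat_idx)
    err = tf.reduce_sum(w_f * feat_val) - label
    return tf.SparseTensor(tf.expand_dims(feat_idx, 1),
                           err * feat_val,
                           dense_shape=[NUM_FEATURES])

w = tf.Variable(tf.zeros([NUM_FEATURES]))

# Stand-in for one batch read from file; each example is a small sparse record.
batch = [(tf.constant([5, 42], dtype=tf.int64),   # feature indices
          tf.constant([1.0, 2.0]),                # feature values
          tf.constant(1.0))                       # label
         for _ in range(BATCH_SIZE)]

# Loop over the batch, compute each local gradient, and accumulate; the sum of
# sparse tensors stays sparse, so only total_gradient crosses the network.
total_gradient = example_gradient(*batch[0], w=w)
for feat_idx, feat_val, label in batch[1:]:
    total_gradient = tf.sparse_add(total_gradient,
                                   example_gradient(feat_idx, feat_val, label, w))
# total_gradient is then sent to vm-1 and applied with scatter_add as in sync mode.
```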

5. Batch Asynchronous SGD

### How to run? - ./launch_asyncsgd.sh

Learning rate : 0.01

Number of parallel tasks used : 20 (4 on each VM)

Time per iteration on a worker in async mode : 5 sec

## Implementation details and optimizations done

Similar to a combination of async mode (without batching) and batch sync mode.
