Multi Node Multi GPU TensorFlow 2.0 Distributed Training Example

Romeo Kienzler ( 1 s t n a m e . l a s t n a m e a t c h . i b m . c o m )

Jerome Nilmeier ( l a s t n a m e a t us . i b m . c o m )

Description

Ported the TensorFlow example to run on Satori

Log in to a Satori login node
wget https://raw.githubusercontent.com/mit-satori/getting-started/master/tutorial-examples/tensorflow-2.x-multi-gpu-multi-node/multi_worker_with_keras_numpyArrays.py
chmod 755 multi_worker_with_keras_numpyArrays.py
wget https://raw.githubusercontent.com/mit-satori/getting-started/master/tutorial-examples/tensorflow-2.x-multi-gpu-multi-node/multi_worker_with_keras_runner.py
chmod 755 multi_worker_with_keras_runner.py
bsub -W 3:00 -q normalx -x -n 8 -gpu "num=4" -R "span[ptile=4]" -I "while (true) do ls > /dev/null; done" (replace 2586 with a number less than or equal to 256)
Log in to a new shell
nodes=`bjobs |grep 4*node |awk -F"*" '{print $2}' |awk -F"." '{print $1}'`
echo $nodes |python multi_worker_with_keras_runner.py
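
The last two steps pipe the worker host names into the runner script. For orientation, here is a minimal sketch of how a whitespace-separated host list can be turned into the TF_CONFIG environment variable that TensorFlow multi-worker training expects (a "cluster" entry listing all workers plus a "task" entry naming this worker). This is not the code of multi_worker_with_keras_runner.py, which may work differently; the function name build_tf_configs, the port 12345, and the host names are assumptions made for illustration.

```python
import json


def build_tf_configs(node_text, port=12345):
    """Return a dict mapping each host name to its TF_CONFIG JSON string.

    node_text: whitespace-separated host names, e.g. the output of `echo $nodes`.
    port: hypothetical port; any port that is free on all workers would do.
    """
    hosts = node_text.split()
    workers = [f"{host}:{port}" for host in hosts]
    configs = {}
    for index, host in enumerate(hosts):
        # Every worker sees the same cluster list but its own task index.
        configs[host] = json.dumps({
            "cluster": {"worker": workers},
            "task": {"type": "worker", "index": index},
        })
    return configs


# Hypothetical host names standing in for the nodes reported by bjobs.
for host, config in build_tf_configs("node0001 node0002").items():
    print(host, config)
```

Each worker process would then be started with its own string exported as TF_CONFIG before TensorFlow is imported.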

Wait until training starts, then open new terminals on your worker nodes to observe what is happening:

watch -n 0.1 nvidia-smi

Each script on each node starts a Service component that communicates with the other scripts in the background for parameter averaging.
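
To make the idea of parameter averaging concrete, here is a toy sketch in plain Python. It only illustrates the arithmetic: each worker holds its own copy of a parameter vector, and the copies are averaged element by element. The function name average_parameters and the sample values are assumptions made for illustration; in the actual scripts the exchange is presumably handled by TensorFlow's distribution machinery rather than by code like this.

```python
def average_parameters(worker_params):
    """Average parameter vectors element-wise across workers.

    worker_params: list with one equally sized list of floats per worker.
    """
    n_workers = len(worker_params)
    # zip(*...) groups the i-th parameter of every worker together.
    return [sum(values) / n_workers for values in zip(*worker_params)]


# Two hypothetical workers, each with two parameters.
averaged = average_parameters([[1.0, 2.0], [3.0, 6.0]])
print(averaged)
```

After such a step every worker continues from the same averaged values, which keeps the model replicas in sync.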