In a previous post, we showed examples of using multiple GPUs to train a deep neural network (DNN) using the Torch machine learning library. In this post, we will focus on performing multi-GPU training using TensorFlow.

In particular, we will explore data-parallel GPU training with multi-GPU and multi-node configurations on Rescale. We will leverage Rescale’s existing MPI configured clusters to easily launch TensorFlow distributed training workers. For a basic example of training with TensorFlow on a single GPU, see this previous post.

Preparing Data
To make our multi-GPU training sessions more interesting, we will be using some larger datasets. Later, we will show a training job on the popular ImageNet image classification dataset. Before we start with this 150 GB dataset, we will prepare a smaller dataset to be in the same format as ImageNet and test our jobs with that in order to make sure the TensorFlow trainer is working properly. In order to keep data local during training, Rescale syncs the dataset to local storage on the GPU nodes before training starts. Waiting for a large dataset like ImageNet to sync when iteratively developing a model using just a few examples is wasteful. For this reason, we will start with the smaller Flowers dataset and then move to ImageNet once we have a working example.

TensorFlow processes images that are formatted as TFRecords so first let’s download and preprocess the pngs from the flowers dataset to be in this format. All the examples we will be showing today come out of the inception module in the tensorflow/models repository on GitHub, so we start by cloning that repository:

Now we use the bazel build tool to make the flowers download-and-preprocess script and then run that script:

This should download a ~220MB archive and create something like this:

We assemble this archive and then upload it to Rescale. Optionally, you can delete the raw-data and archive file since all the necessary information is now encoded as TFRecords.

We have assembled all these operations in a preprocessing job on Rescale here for you to clone and run yourself.

Next, let’s take the flowers.tar.gz file we just produced and convert it to an input file for the next step:

Now we have our preprocessed flower image TFRecords ready for training.

Single Node – Multiple GPUs
The next step is to take this input dataset and train a model with it. We will be using the Inception v3 DNN architecture from the tensorflow/models repository as mentioned above. Training on a single node with multiple GPUs looks something like this:


We will first create a Rescale job that runs on a single node, since that has fewer moving parts than the multi-node case. We will actually run 3 processes in the job:

  • Main GPU-based model training
  • CPU-based evaluation of checkpointed models on the validation set
  • TensorBoard visualization tool

So let’s get started! First build the training and evaluation scripts.

Next, create some output directories and start the main training process: 

$RESCALE_GPUS_PER_SLOT is a variable set on all Rescale job environments. In this command line, we point to the flowers directory with our training images and the empty out/train directory where TensorFlow will output logs and models files.

Evaluation of accuracy on the validation set can be done separately and does not need GPU acceleration:

imagenet_eval sits in a loop and wakes up every eval_interval_secs to evaluate the accuracy of the most recently trained model checkpoint in out/train against validation TFRecords in the flowers directory. Accuracy results are logged to out/eval. CUDA_VISIBLE_DEVICES is an important parameter here. TensorFlow will by default always load itself into GPU memory, even if it is not going to make use of the GPU. Without this parameter, both the training and evaluation processes will together exhaust all the memory on the GPU and cause the training to fail.

Finally, TensorBoard is a handy tool for monitoring TensorFlow’s progress. TensorBoard runs its own web server to show plots of training progress, a graph of the model, and may other visualizations. To start it, we just have to point it to the out directory where our training and evaluation processes are outputting:

TensorBoard will pull in logs in all subdirectories of logdir so it will show training and evaluation data together.

Putting this all together:

Since this all runs in a single shell, we background the TensorBoard and evaluation processes. We also delay start of the evaluation process since the training process needs a few minutes to initialize and create the first model checkpoint.

You can run this training job on Rescale here.

Since TensorBoard runs its own web server without any authentication, access is blocked by default on Rescale. The easiest way to get access to TensorBoard is to open an SSH tunnel to the node and forward port 6006:

Now navigate to http://localhost:6006 and you should see something like this:

Multiple Nodes
The current state-of-the-art limits the total GPU cards that can fit on a node to something around 8. Additionally, the CUDA peer-to-peer system, the mechanism a TensorFlow process uses to distribute work amongst GPUs is currently limited to 8 GPU devices. While these numbers will continue to increase, it is still convenient to have a mechanism to scale your training out for large models and datasets. TensorFlow distributed training synchronizes updates between different training processes over the network, so it can be used with any network fabric and not be limited by CUDA implementation details. Distributed training consists of some number of workers and parameter servers as shown here:


Parameter servers provide model parameters which are then used to evaluate an input batch. After the batch on each worker is complete, the error gradients are fed back into the parameter server which then uses them to produce new model parameters. In the context of a GPU cluster, we could run a worker process to use each GPU in our cluster and then pick enough parameter servers to keep up with processing of the gradients.

Following the instructions here, we set up a worker per GPU and a parameter server per node. We will take advantage of the MPI configuration that comes with every Rescale cluster.

To start, we need generate the host strings that will be passed to each parameter server and worker, each process getting a unique hostname:port combination, for example:

We want a single entry per host for the parameter servers and a single entry per GPU for the workers. We take advantage of the machine files that are automatically set up on every Rescale cluster. $HOME/machinefile just has a list of hosts in the cluster and $HOME/machinefile.gpu has a list of hosts annotated with the number of GPUs on each host. We parse them to generate our host strings in a python script:

Next we have a script that takes these host strings and launches the imagenet_distributed_train script with the proper task ID and GPU whitelist, will be run with OpenMPI mpirun so $OMPI* environment variables are automatically injected. We use $OMPI_COMM_WORLD_RANK to get a global task index and $OMPI_COMM_WORLD_LOCAL_RANK to get a node local GPU index.

Now, putting it all together:

We start with a bunch of the same directory creation and bazel build boilerplate. The 2 exceptions are:

1. We move all the input directories to the shared/ subdirectory so it is shared across nodes.
2. We now call the bazel build command with the --output_base so that bazel doesn’t symlink the build products to $HOME/.cache and instead makes them available on the shared filesystem.

Next we launch TensorBoard and imagenet_eval locally on the MPI master. These 2 processes don’t need to be replicated across nodes with mpirun.

Finally, we launch the parameter servers with ps and the single entry per node machinefile and then the workers with worker with GPU-ranked machinefile.gpu.

Here is an example job performing the above distributed training on the flowers dataset using 2 Jade nodes (8 K520 GPUs). Note that since we are using the MPI infrastructure already set up on Rescale, we can use this same example for any number of nodes or GPUs-per-node. Using the appropriate machinefiles, the number of workers and parameter servers are set automatically to match the resources.

Training on ImageNet
Now that we have developed the machinery to launch a TensorFlow distributed training job on the smaller flowers dataset, we are ready to train on the full ImageNet dataset. Downloading of ImageNet requires permission here. You can request access and upon acceptance, you will be given a username and password to download the necessary tarballs.

We can then run a preparation job similar to the flowers job above to download the dataset and format the images into TFRecords:

You can clone and run this preparation job on Rescale here.

If you have already downloaded the 3 necessary inputs from the ImageNet site (ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar, and ILSVRC2012_bbox_train_v2.tar.gz) and have placed them somewhere accessible to via HTTP (like an AWS S3 bucket), you can customize models/inception/inception/data/ in the tensorflow/models repository to download from your custom location:

Clone and run this version of the preparation job here.

Finally, we can make some slight modifications to our multi-GPU flowers jobs to take the imagenet-data dataset directory instead of the flowers directory by changing the $DATASET variable at the top:

And the distributed training case:

We have gone through all the details for performing multi-GPU, single and multi-node model training with TensorFlow. In an upcoming post, we will discuss the performance ramifications of distributed training, and look at how well it scales on different server configurations.

Rescale Jobs
Here is a summary of the Rescale jobs used in this example. Click the links to import the jobs to your Rescale account.

Flowers dataset preprocessing:

Single node flowers training:

Multiple nodes flowers training:

ImageNet ILSVRC2012 download and preprocessing:

ImageNet ILSVRC2012 download from existing S3 bucket:

This article was written by Mark Whitney.


While still in its infancy, Deep Learning has already significantly advanced fields such as autonomous driving, robotics control, machine translation, and facial recognition. By being able to abstract large volumes of data, recognize patterns, model and classify them, Deep Learning looks poised to drive disruption and innovation in the years ahead. So what questions do you need to ask to evaluate whether deep learning and deep neural networks are the right solution for your complex task?

Do I have enough data? There are a variety of deep learning model types and architectures, but the common theme across all of them is deep, layered structure. Deep means many interdependent model parameters.  In order for your optimizer to come up with good values of all these parameters, it needs multiple training examples of the task you want it to do. With some exceptions applied to the case of transfer learning, if you do not have a large quantity of data or are unable to generate a large number of examples quickly, you are better off training a smaller, “shallow” model.

Supervised, semi-supervised, or unsupervised? Is the data you have collected labeled with the desired result of the task or not? If the data is all labeled, then you can take advantage of supervised learning techniques. This is the “traditional” application for deep neural networks, and is used for tasks like image recognition, natural language translation, and voice recognition. Convolutional networks are typically used for image-based tasks whereas recurrent networks are used for language-based tasks.

If none of your data is labeled, you can still take advantage of unsupervised learning to learn hidden features and structure within your data. Denoising autoencoders are an example of an unsupervised deep learning model.

The final category, semi-supervised or reinforcement deep learning is newest in the space, pioneered by the work at DeepMind. In this case, your data is sparsely labeled and your model might be able to test new inputs to the system to get feedback. Defining example was learning to play Atari games, but since then applications have been made to robotics and autonomous vehicles.

 Does your system have to explain its decisions? Deep learning models have traditionally been considered black boxes with respect to the predictions it makes. Given the number of parameters that are trained in a deep model, it is generally impossible to reconstruct the “reasoning” behind the answer a model gives. If you need to provide a “why?” along with a “what?”, you are better off choosing a model like decision trees or random forests, which give a set of decisions made to provide a particular answer.

If deep learning models are the right choice, you now face a host of new challenges:

  • Deep learning requires specialized GPU hardware and a lot of it. Many IT organizations new to machine learning just have not yet made the capital investment to have appropriate GPU resources on-premise.
  • Deep learning libraries are evolving very quickly, resulting in the need for frequent updates to stay current. Your deep learning pipeline can quickly become a stack of interdependent software packages that are difficult to keep in sync.
  • How do you manage your large datasets? Where does all that data live?

Rescale’s Deep Learning Cloud, in partnership with IBM, provides an integrated platform to solve the above problems. Leveraging IBM Cloud’s bare metal NVIDIA K80 servers, Rescale’s interactive Desktops provide you with powerful hardware to visualize and explore large datasets and design deep neural network models. When you are ready to scale up and train on large datasets, you can get instant access to GPU compute clusters and only pay for what you use with hourly pricing.

Deep Learning Cloud comes configured with the latest versions of popular deep learning software like TensorFlow and Torch, as well as IBM licensed analytics products such as SPSS. All software is already configured to take full advantage of NVIDIA GPUs via CUDA and cuDNN.

Finally, Rescale’s workflow management and collaboration tools combined with IBM storage and data transfer technology ease the burdens of migrating large datasets to the cloud and managing that data once it is there.

So what does running a deep learning task on Rescale look like? Here are the steps taken by a user to train a new deep neural network from scratch:

Dataset: Upload your image dataset using our optimized data transfer tools, or if your data is already hosted in IBM cloud, you can attach it directly.

Configuration: Set up a cluster through the Rescale web interface, configure the number of IBM Cloud GPUs you want to train on, the deep learning software you want to use, and the training script you want to run.

Launch: Within 30 minutes, your training cluster will be available and running your training script.

Monitor: View training progress via the web or direct SSH access, connect to GUIs such as TensorBoard (part of TensorFlow), and stop your training cluster whenever you want.

Review: Training results are automatically synced back to persistent storage. You can review results from the Rescale portal, download models to use, sync back to your own IBM Cloud storage account, or just use Rescale to run further inference and training on the existing model.

Try Rescale powered by IBM Cloud for free today at

This article was written by Mark Whitney.

With the recent release of Rescale Deep Learning Cloud, we will present an example here that makes use of our new interactive notebook feature to develop deep neural networks. This feature enables an iterative workflow alternating between interactive data preprocessing and analysis, and batch training of neural networks.


In this article we will start with an image classification data set (CIFAR10), try a few different neural network designs in our interactive notebook, and then launch a batch training cluster to train that network for more epochs.

Starting a Jupyter Notebook
To get started, you first need to start up a Rescale Linux Desktop with a NVIDIA K80 GPU:

Here we have chosen a desktop configuration with a single NVIDIA K80 GPU. While you wait for the notebook to finish booting, you can clone and save the job that holds the CIFAR10 image dataset and the notebook code you will run. Follow this link and then save the job it creates (you do not need to submit it to run, you will just use the job to stage the notebook and dataset input files): CIFAR10 TensorFlow notebook.

Once the desktop finishes booting, attach TensorFlow software and the job with the notebook code.



Once the software and job are attached, open the notebook URL and enter the password when prompted:

Next, navigate into the attach_jobs directory, then the directory of the job you attached, and then to .ipynb file.

The code in this notebook was adapted from the TensorFlow CIFAR10 training example.
We have already added another inference function to the example: inference_3conv, with a 3rd convolutional layer. You can try training the 2 convolution layer network by running all the cells as-is. To run the 3 conv layer version, replace the call to inference_2conv with inference_3conv,restart the kernel (ESC-0-0), and then run all the cells again.


You can also access TensorBoard, TensorFlow’s built-in GUI, on the desktop via SSH tunnel. To configure your own SSH keys follow the instructions here. Just download one of the connection scripts in the Node Access section of the Desktop panel:


and take the username and IP address out of the script. Then forward port 6006 to your localhost and run TensorBoard:

ssh -L 6006:localhost:6006 @ tensorboard –logdir=/tmp/cifar10_train

You should now be able to access it from your local browser at http://localhost:6006. The particular training example we are using already set /tmp/cifar10_train as the default location for training logs. Here are the 2 network graphs as they appear in TensorBoard. Two convolutional layers:


Three convolutional layers:


Batch Training
If you train the 2-layer and 3-layer convolutional networks on the notebook GPU for 10-20 epochs, you will see the loss does indeed drop faster for the 3-layer network. We would now like to see whether the deeper network yields better accuracy when trained longer or if it reaches the same accuracy in less training time.

You can launch a batch training job with your updated 3-convolutional-layer code directly from the notebook. First, save your notebook (Ctrl-S), then there is a shell command shortcut which will automatically export your notebook to regular python and launch a job with all the files in the same directory as the notebook. For example:


The syntax is as follows:

This can be run from the command line on the desktop or within the notebook with the IPython shell magic ! syntax.

Some GPU core types you can choose from when launching from the notebook:

Jade: NVIDIA Kepler K520s
Obsidian: NVIDIA Tesla K80s

Once the job starts running, you can attach it to your desktop and the job files will be accessible on the notebook as part of a shared filesystem. First, the attach:

Then, on the desktop, in addition to opening and viewing files, you can also open a terminal:

From the terminal, you can tail files, etc.

Alternatively, you can navigate to the job in the Rescale web portal and live tail files in your browser. This allows you to shut down your Rescale desktop and still monitor training progress, or enables monitoring of your batch job on a mobile device while you are away from your workstation.


Iterative Development
Above, you have just completed a single development iteration of our CIFAR10 training example, but you do not need to stop once the batch training is done. You can stop the batch training job anytime, review training logs in more depth from your notebook, then submit new training jobs.

The advantage here is that you can develop and test your code on similar hardware, the same software configuration, and the same training data as the batch training cluster we used. This eliminates the headache of bugs due to differences in software or hardware configuration between testing you might do on your local workstation and the training cluster in the cloud.

Additionally, if you prefer to do more compute heavy workloads directly in the notebook environment, we have Rescale Desktop configurations available with up to 8 K80 GPUs (4 K80 cards), email for access to those.

To try out the workflow above, sign up here and immediately start doing deep learning on Rescale today.

Edit (2016-10-31): Added link for setting user SSH keys.

This article was written by Mark Whitney.


Today we will discuss how to make use of multiple GPUs to train a single neural network using the Torch machine learning library. This is the first in a series of articles on techniques for scaling up deep neural network (DNN) training workloads to use multiple GPUs and multiple nodes.

In this series, we will be focusing on parallelizing the training of a single network. For more about the embarassingly parallel problem of training multiple networks efficiently to optimize configuration parameters, see this earlier post on hyper-parameter optimization.

About Torch
Torch is a lightweight, flexible tensor library built on top of the Lua programming language. Torch is popular with machine learning researchers, so many new deep neural network ideas are first implemented in Torch and made available as open source extensions. Thus, the state-of-the-art in deep learning is often first available to use in Torch.

The downside of this is that Torch documentation often falls behind implementation, so unless you find an example on github for exactly what you want to do, it can be a challenge to figure out which Torch modules you should be using and how to use them.

One example of this is how to get Torch to train your neural networks using multiple GPUs. Searching for “multi gpu torch” on the internet yields this github issue as one of the top results. From this, we know we can access more than one GPU from the torch environment, but how do we use this low-level construct to train a complex network?

Data vs. Model Parallelism
When parallelizing the work to train a single neural network, we have 2 choices on how to split up the work: Model Parallelism and Data Parallelism.


With Model Parallelism, each GPU runs a chunk of the nodes in the network for a given batch of data.


With Data Parallelism, each GPU runs the entire network for different batches of data.

This distinction is discussed in detail in this paper, but the choice between using one or the other impacts what kind of synchronization is required between GPUs. Data parallelism requires synchronization of model parameters, model parallelism requires synchronizing input and output values between the chunks.

Simple Torch Example
We will now look at a simple example of training a convolutional neural network based on a unit test in Torch itself. This network has 2 convolutions layers and 2 rectifier layers. We do a simple forward and backward pass over the network. Instead of actually computing error gradients for training, we just set them to a random vector to keep things simple.

Now let’s convert it to run on a GPU (this example will only run if you have a CUDA-compatible GPU):

To run this on a GPU, we call cuda()on the network and then make the input a CudaTensor.

Now let’s distribute the model across 2 GPUs (as an example of the model parallel paradigm). We iterate over the GPU device IDs and use the cutorch.withDevice to place each layer on a particular GPU.

This puts a convolutional layer and a ReLU layer on each GPU. The forward and backward passes must propagate the outputs between GPU 1 and GPU 2.

Next, we use nn.DataParallelTable to distribute batches of data to copies of the whole network running on multiple GPUs. DataParallelTable is a Torch Container that wraps multiple Containers and distributes the input across them.


So instead of running the forward and backward passes over the original Sequential container, we now run it on the DataParallelTable container and the data is distributed to copies of the network on each GPU.

Here is a job on Rescale you can clone and run yourself with all the above code:

A Larger Example
Let’s now look at a use of DataParallelTable in action when training a real DNN. We will be using Sergey Zagoruyko’s implementation of Wide Residual Networks on CIFAR10 on github.

In train.lua, we see all the parallelization of the base neural network is applied by a helper function:

Delving into makeDataParallelTable, we see a similar structure to our last example above using nn.DataParallelTable:add

You can clone these jobs and run the training yourself on Rescale:

CIFAR10 Wide ResNet, 1 GPU:

CIFAR10 Wide ResNet, 4 GPUs:

After running the training for 10 epochs, we see the 4 GPU job runs about 3.33 times faster than the single GPU job. Pretty good scale up!

In this article, we have given example implementations of model and data parallel DNN training using Torch. In future posts, we will cover multi-GPU training usage using other neural network libraries as well as multi-node scaling.

This article was written by Mark Whitney.