With the recent release of Rescale Deep Learning Cloud, we will present an example here that makes use of our new interactive notebook feature to develop deep neural networks. This feature enables an iterative workflow alternating between interactive data preprocessing and analysis, and batch training of neural networks.


In this article we will start with an image classification data set (CIFAR10), try a few different neural network designs in our interactive notebook, and then launch a batch training cluster to train that network for more epochs.

Starting a Jupyter Notebook
To get started, you first need to start up a Rescale Linux Desktop with a NVIDIA K80 GPU:

Here we have chosen a desktop configuration with a single NVIDIA K80 GPU. While you wait for the notebook to finish booting, you can clone and save the job that holds the CIFAR10 image dataset and the notebook code you will run. Follow this link and then save the job it creates (you do not need to submit it to run, you will just use the job to stage the notebook and dataset input files): CIFAR10 TensorFlow notebook.

Once the desktop finishes booting, attach TensorFlow software and the job with the notebook code.



Once the software and job are attached, open the notebook URL and enter the password when prompted:

Next, navigate into the attach_jobs directory, then the directory of the job you attached, and then to .ipynb file.

The code in this notebook was adapted from the TensorFlow CIFAR10 training example.
We have already added another inference function to the example: inference_3conv, with a 3rd convolutional layer. You can try training the 2 convolution layer network by running all the cells as-is. To run the 3 conv layer version, replace the call to inference_2conv with inference_3conv,restart the kernel (ESC-0-0), and then run all the cells again.


You can also access TensorBoard, TensorFlow’s built-in GUI, on the desktop via SSH tunnel. To configure your own SSH keys follow the instructions here. Just download one of the connection scripts in the Node Access section of the Desktop panel:


and take the username and IP address out of the script. Then forward port 6006 to your localhost and run TensorBoard:

ssh -L 6006:localhost:6006 @ tensorboard –logdir=/tmp/cifar10_train

You should now be able to access it from your local browser at http://localhost:6006. The particular training example we are using already set /tmp/cifar10_train as the default location for training logs. Here are the 2 network graphs as they appear in TensorBoard. Two convolutional layers:


Three convolutional layers:


Batch Training
If you train the 2-layer and 3-layer convolutional networks on the notebook GPU for 10-20 epochs, you will see the loss does indeed drop faster for the 3-layer network. We would now like to see whether the deeper network yields better accuracy when trained longer or if it reaches the same accuracy in less training time.

You can launch a batch training job with your updated 3-convolutional-layer code directly from the notebook. First, save your notebook (Ctrl-S), then there is a shell command shortcut which will automatically export your notebook to regular python and launch a job with all the files in the same directory as the notebook. For example:


The syntax is as follows:

This can be run from the command line on the desktop or within the notebook with the IPython shell magic ! syntax.

Some GPU core types you can choose from when launching from the notebook:

Jade: NVIDIA Kepler K520s
Obsidian: NVIDIA Tesla K80s

Once the job starts running, you can attach it to your desktop and the job files will be accessible on the notebook as part of a shared filesystem. First, the attach:

Then, on the desktop, in addition to opening and viewing files, you can also open a terminal:

From the terminal, you can tail files, etc.

Alternatively, you can navigate to the job in the Rescale web portal and live tail files in your browser. This allows you to shut down your Rescale desktop and still monitor training progress, or enables monitoring of your batch job on a mobile device while you are away from your workstation.


Iterative Development
Above, you have just completed a single development iteration of our CIFAR10 training example, but you do not need to stop once the batch training is done. You can stop the batch training job anytime, review training logs in more depth from your notebook, then submit new training jobs.

The advantage here is that you can develop and test your code on similar hardware, the same software configuration, and the same training data as the batch training cluster we used. This eliminates the headache of bugs due to differences in software or hardware configuration between testing you might do on your local workstation and the training cluster in the cloud.

Additionally, if you prefer to do more compute heavy workloads directly in the notebook environment, we have Rescale Desktop configurations available with up to 8 K80 GPUs (4 K80 cards), email support@rescale.com for access to those.

To try out the workflow above, sign up here and immediately start doing deep learning on Rescale today.

Edit (2016-10-31): Added link for setting user SSH keys.

This article was written by Mark Whitney.


Today we will discuss how to make use of multiple GPUs to train a single neural network using the Torch machine learning library. This is the first in a series of articles on techniques for scaling up deep neural network (DNN) training workloads to use multiple GPUs and multiple nodes.

In this series, we will be focusing on parallelizing the training of a single network. For more about the embarassingly parallel problem of training multiple networks efficiently to optimize configuration parameters, see this earlier post on hyper-parameter optimization.

About Torch
Torch is a lightweight, flexible tensor library built on top of the Lua programming language. Torch is popular with machine learning researchers, so many new deep neural network ideas are first implemented in Torch and made available as open source extensions. Thus, the state-of-the-art in deep learning is often first available to use in Torch.

The downside of this is that Torch documentation often falls behind implementation, so unless you find an example on github for exactly what you want to do, it can be a challenge to figure out which Torch modules you should be using and how to use them.

One example of this is how to get Torch to train your neural networks using multiple GPUs. Searching for “multi gpu torch” on the internet yields this github issue as one of the top results. From this, we know we can access more than one GPU from the torch environment, but how do we use this low-level construct to train a complex network?

Data vs. Model Parallelism
When parallelizing the work to train a single neural network, we have 2 choices on how to split up the work: Model Parallelism and Data Parallelism.


With Model Parallelism, each GPU runs a chunk of the nodes in the network for a given batch of data.


With Data Parallelism, each GPU runs the entire network for different batches of data.

This distinction is discussed in detail in this paper, but the choice between using one or the other impacts what kind of synchronization is required between GPUs. Data parallelism requires synchronization of model parameters, model parallelism requires synchronizing input and output values between the chunks.

Simple Torch Example
We will now look at a simple example of training a convolutional neural network based on a unit test in Torch itself. This network has 2 convolutions layers and 2 rectifier layers. We do a simple forward and backward pass over the network. Instead of actually computing error gradients for training, we just set them to a random vector to keep things simple.

Now let’s convert it to run on a GPU (this example will only run if you have a CUDA-compatible GPU):

To run this on a GPU, we call cuda()on the network and then make the input a CudaTensor.

Now let’s distribute the model across 2 GPUs (as an example of the model parallel paradigm). We iterate over the GPU device IDs and use the cutorch.withDevice to place each layer on a particular GPU.

This puts a convolutional layer and a ReLU layer on each GPU. The forward and backward passes must propagate the outputs between GPU 1 and GPU 2.

Next, we use nn.DataParallelTable to distribute batches of data to copies of the whole network running on multiple GPUs. DataParallelTable is a Torch Container that wraps multiple Containers and distributes the input across them.


So instead of running the forward and backward passes over the original Sequential container, we now run it on the DataParallelTable container and the data is distributed to copies of the network on each GPU.

Here is a job on Rescale you can clone and run yourself with all the above code:

A Larger Example
Let’s now look at a use of DataParallelTable in action when training a real DNN. We will be using Sergey Zagoruyko’s implementation of Wide Residual Networks on CIFAR10 on github.

In train.lua, we see all the parallelization of the base neural network is applied by a helper function:

Delving into makeDataParallelTable, we see a similar structure to our last example above using nn.DataParallelTable:add

You can clone these jobs and run the training yourself on Rescale:

CIFAR10 Wide ResNet, 1 GPU:

CIFAR10 Wide ResNet, 4 GPUs:

After running the training for 10 epochs, we see the 4 GPU job runs about 3.33 times faster than the single GPU job. Pretty good scale up!

In this article, we have given example implementations of model and data parallel DNN training using Torch. In future posts, we will cover multi-GPU training usage using other neural network libraries as well as multi-node scaling.

This article was written by Mark Whitney.