Today we will discuss how to make use of multiple GPUs to train a single neural network using the Torch machine learning library. This is the first in a series of articles on techniques for scaling up deep neural network (DNN) training workloads to use multiple GPUs and multiple nodes.

In this series, we will be focusing on parallelizing the training of a single network. For more about the embarrassingly parallel problem of training multiple networks efficiently to optimize configuration parameters, see this earlier post on hyper-parameter optimization.

About Torch
Torch is a lightweight, flexible tensor library built on top of the Lua programming language. Torch is popular with machine learning researchers, so many new deep neural network ideas are first implemented in Torch and made available as open source extensions. Thus, the state-of-the-art in deep learning is often first available to use in Torch.

The downside of this is that Torch documentation often falls behind implementation, so unless you find an example on github for exactly what you want to do, it can be a challenge to figure out which Torch modules you should be using and how to use them.

One example of this is how to get Torch to train your neural networks using multiple GPUs. Searching for “multi gpu torch” on the internet yields this github issue as one of the top results. From this, we know we can access more than one GPU from the torch environment, but how do we use this low-level construct to train a complex network?

Data vs. Model Parallelism
When parallelizing the work to train a single neural network, we have two choices for how to split up the work: model parallelism and data parallelism.


With Model Parallelism, each GPU runs a chunk of the nodes in the network for a given batch of data.


With Data Parallelism, each GPU runs the entire network for different batches of data.

This distinction is discussed in detail in this paper, and the choice between the two determines what kind of synchronization is required between GPUs: data parallelism requires synchronizing model parameters, while model parallelism requires synchronizing input and output values between the chunks.

Simple Torch Example
We will now look at a simple example of training a convolutional neural network based on a unit test in Torch itself. This network has two convolutional layers and two rectifier (ReLU) layers. We do a simple forward and backward pass over the network. Instead of actually computing error gradients for training, we just set them to a random vector to keep things simple.
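A minimal sketch of such a network and its forward/backward pass (the layer sizes here are illustrative choices, not the unit test's exact values):

```lua
require 'nn'

-- Two convolutional layers, each followed by a rectifier (ReLU) layer
local net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 16, 5, 5))
net:add(nn.ReLU())
net:add(nn.SpatialConvolution(16, 32, 5, 5))
net:add(nn.ReLU())

local input = torch.randn(8, 3, 32, 32)   -- a batch of 8 RGB images
local output = net:forward(input)

-- Instead of a real loss gradient, backpropagate a random one
net:backward(input, torch.randn(output:size()))
```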

Now let’s convert it to run on a GPU (this example will only run if you have a CUDA-compatible GPU):
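A sketch of the conversion, assuming net and input are the network and batch described above (cunn pulls in cutorch and the CUDA implementations of the nn modules):

```lua
require 'cunn'

net:cuda()   -- copy the network's parameters to the current GPU
local input = torch.randn(8, 3, 32, 32):cuda()   -- a CudaTensor batch
local output = net:forward(input)
net:backward(input, torch.randn(output:size()):cuda())
```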

To run this on a GPU, we call cuda() on the network and then make the input a CudaTensor.

Now let’s distribute the model across 2 GPUs (as an example of the model parallel paradigm). We iterate over the GPU device IDs and use cutorch.withDevice to place each layer on a particular GPU.
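A sketch of that placement, assuming two GPUs (the channel sizes are illustrative):

```lua
require 'cunn'

local net = nn.Sequential()
local planes = { {3, 16}, {16, 32} }   -- input/output planes for each GPU's chunk
for gpu = 1, 2 do
   cutorch.withDevice(gpu, function()
      net:add(nn.SpatialConvolution(planes[gpu][1], planes[gpu][2], 5, 5):cuda())
      net:add(nn.ReLU():cuda())
   end)
end
```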

This puts a convolutional layer and a ReLU layer on each GPU. The forward and backward passes must propagate the outputs between GPU 1 and GPU 2.

Next, we use nn.DataParallelTable to distribute batches of data to copies of the whole network running on multiple GPUs. DataParallelTable is a Torch Container that wraps multiple Containers and distributes the input across them.
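A sketch, again assuming net is the Sequential container defined earlier:

```lua
require 'cunn'

local dpt = nn.DataParallelTable(1)   -- split the input along dimension 1 (the batch)
dpt:add(net, {1, 2})                  -- replicate the network on GPUs 1 and 2

local input = torch.randn(8, 3, 32, 32):cuda()
local output = dpt:forward(input)
dpt:backward(input, torch.randn(output:size()):cuda())
```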


So instead of running the forward and backward passes over the original Sequential container, we now run it on the DataParallelTable container and the data is distributed to copies of the network on each GPU.

Here is a job on Rescale you can clone and run yourself with all the above code.

A Larger Example
Let’s now look at DataParallelTable in action when training a real DNN. We will be using Sergey Zagoruyko’s implementation of Wide Residual Networks on CIFAR10, available on github.

In train.lua, we see all the parallelization of the base neural network is applied by a helper function:
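The call in train.lua looks roughly like this (a sketch of the repository's structure, not an exact quote):

```lua
-- In train.lua: wrap the base model for multi-GPU training when requested
if opt.nGPU > 1 then
   model = makeDataParallelTable(model, opt.nGPU)
end
```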

Delving into makeDataParallelTable, we see a similar structure to our last example above, built up with nn.DataParallelTable's add method.

You can clone these jobs and run the training yourself on Rescale:

CIFAR10 Wide ResNet, 1 GPU

CIFAR10 Wide ResNet, 4 GPUs

After running the training for 10 epochs, we see the 4 GPU job runs about 3.33 times faster than the single GPU job. Pretty good scale up!

In this article, we have given example implementations of model and data parallel DNN training using Torch. In future posts, we will cover multi-GPU training with other neural network libraries, as well as multi-node scaling.

This article was written by Mark Whitney.


Rescale’s Design-of-Experiments (DOE) framework is an easy way to optimize the performance of machine learning models.  This article will discuss a workflow for doing hyper-parameter optimization on deep neural networks.  For an introduction to DOEs on Rescale, see this webinar.

Deep neural networks (DNNs) are a popular machine learning model used today in many applications including robotics, self-driving cars, image search, facial recognition, and speech recognition.  In this article we will train some neural networks to do image classification and show how to use Rescale to maximize the performance of your DNN models.


For an introduction on training an image classification DNN on Rescale, please see this previous article.  This article will expand on the basic model training example and show how to improve the performance of your network through a technique called hyper-parameter optimization.

Hyper-Parameter Optimization

A typical starting point for a task using DNNs is to select a model published in the literature and implement it in the neural network training framework of your choice, or even easier, download an already-implemented model from the Caffe Model Zoo.  You might then train the model on your training dataset and find that the performance (classification accuracy, training time, etc.) is not quite good enough.  At this point, you could go back and look for a whole new model to train, or you can try tweaking your current model to get the additional performance you desire.  The process of tweaking parameters for a given neural network architecture is known as hyper-parameter optimization.  Here is a brief list of hyper-parameters people often vary:

  • Learning rates
  • Batch size
  • Training epochs
  • Image processing parameters
  • Number of layers
  • Convolutional filters
  • Convolutional kernel size
  • Dropout fractions


Given the large list of hyper-parameter choices above, even once we set the model architecture, there are still a large number of similar neural network variants.  Our task is to find a variant that performs well enough for our needs.

Randomized Hyper-Parameter Search DOE

Using the Rescale platform, we are now going to create a Design-of-Experiments job to sample hyper-parameters and then train these different network variants using GPUs.

For our first example, we will start with one of the example convolutional networks from the Keras github repository to train an MNIST digit classifier.  Using this network, we will vary the following network parameters:

  • nb_filters: Number of filters in our convolutional layer
  • nb_conv_size: Size of convolutional kernel

We modified the example to have the template values above and here is an excerpt from the script with the templated variables:
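For example, the layer definitions in the script might read as follows (the surrounding code is adapted from the Keras mnist_cnn.py example of that era; the ${} placeholders are the point here):

```
nb_filters = ${nb_filters}
nb_conv = ${nb_conv_size}

model.add(Convolution2D(nb_filters, nb_conv, nb_conv,
                        border_mode='valid', input_shape=(1, 28, 28)))
model.add(Activation('relu'))
```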

Note the nb_filters and nb_conv_size parameters surrounded by ${}.  We are now ready to create a DOE job on the Rescale platform that uses this template.

If you want to follow along, you can clone the example job:



First, we select the DOE job type in the top right corner and upload the mnist.pkl.gz dataset from the Keras github repository.


Next, we specify that we are running a Monte Carlo DOE (lower left), that we want to do 60 different training runs, and we specify our template variables.  We somewhat arbitrarily choose both parameters to be sampled from a uniform distribution.  The convolutional kernel size ranges from 1 to 10 (each image is 28×28 pixels, so a kernel larger than 28 would not work), and the number of convolutional filters ranges from 16 to 256.

Note, we could specify our template variable ranges with a CSV instead.  For this example, we would manually sample random values for our variables and the CSV would look like this:
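With hypothetical sampled values, such a CSV might look like:

```
nb_filters,nb_conv_size
37,3
201,7
64,9
118,2
```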


Next, we need to upload the Keras script template that we created above.  At this point, we can also specify a new name for the script with the values inserted.  It is also fine if you just want to keep the name the same, as we have done here.  The materialized version will not overwrite the template.


Time to select the training library we want to use. Search for “Keras” in the search box and select it. Then, we select the K520/Theano compatible version.  Finally, we enter the command line to run the training script.  The first part of the command is just copying the MNIST dataset archive into a place where Keras can find it.  Then, we are just calling the python script.


Since we selected a version of Keras configured to work with K520s, our hardware type is already narrowed down to Jade.  We can now size each training cluster on the left.  The count here is in number of CPU cores.  For the Jade hardware type, there are 4 CPU cores for every GPU.  On the right, we set the number of separate training clusters we will provision.  In this case we are using 2 training clusters with a single GPU per cluster.


The last step is to specify the post-processing script.  This script is used to parse the output of our training jobs and make any metrics available to Rescale to show in the results page we will see later.  The expected output format is a line for each value:
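For example (we assume a simple name/value pair on each line):

```
accuracy	0.9912
```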

Since the training script already prints the properly formatted accuracy as the last line of its output, result.out, our post-processing script just needs to parse out that last line.
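A post-processing script in that spirit might look like the following minimal sketch (the file name result.out comes from the job setup described above):

```python
# post_process.py: pull the final accuracy line out of the training
# output so Rescale records it as a result metric.

def extract_result(path="result.out"):
    """Return the last non-empty line of the training script's output."""
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    return lines[-1]
```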

You can now submit the job in the upper right and we move to the real-time cluster status page.


Once the clusters are provisioned and the job starts, we can view the progress of our training jobs.  In the live tailing window, select a run and select process_output.log.  If the training has already started, you can view the training progress and the current training accuracy.  Individual runs can be terminated manually by selecting the “x” next to the run number.  This enables the user to stop a run early if the parameters are clearly yielding inferior accuracy.


Once our job completes, the results page summarizes the hyper-parameters used for each run and the accuracy results.  In the above case, the initial model we took from the Keras examples had an accuracy of 99.1% and the best result we obtain has an accuracy of about 99.4%, a small improvement.  If we want to download the weights for the most accurate model, we can sort by accuracy and then select the run details.


Among the results files are mnist_model.json and mnist_model.h5, the model architecture and model weights files needed to reload the model back into Keras.  We can also download all the data from all runs as one big archive or download the results table as a CSV.


Our accuracy results can also be visualized in the charting tab.

Bring Your Own Hyper-Parameter Optimizer

Rescale supports use of 3rd party optimization software as detailed here.  We will now discuss creating a Rescale optimization job to run a black-box optimizer from the machine learning literature.

Using the Rescale optimization SDK, we choose to plug in the Sequential Model-based Algorithm Configuration (SMAC) optimizer from the University of British Columbia.  The SMAC optimizer builds a random forest model of our neural network performance, based on the hyper-parameter choices.  We use version 2.10.03, available here.

The optimizer is a Java application which takes the following configuration:

  • “scenario” file: specifies the command line for the optimizer to run, in this case this is our network training script
  • parameter file: specifies the names of our hyper-parameters and ranges of values

When SMAC runs the training script, it passes the current hyper-parameter selections to evaluate as command line flags.  It expects to receive results from the run in stdout, formatted as a string like this:

Result of this algorithm run: &lt;status&gt;, &lt;runtime&gt;, &lt;runlength&gt;, &lt;quality&gt;, &lt;seed&gt;, &lt;additional data&gt;

To start, we create a parameter file for the hyper-parameters we will vary in this experiment:
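Such a parameter file might look like the following (SMAC 2.x uses the `name [min, max] [default]` syntax, with a trailing `i` marking integer parameters; the exact names, ranges, and defaults here are our assumptions):

```
nb_filters [16, 256] [32]i
nb_conv_size [1, 10] [3]i
dropout1 [0.0, 0.75] [0.25]
dropout2 [0.0, 0.75] [0.5]
nb_pool [1, 4] [2]i
```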

Now, in addition to varying the convolutional filter count and convolutional kernel size, we are also varying the dropout fractions and the size of the pooling layer.

Now we look at modifications made to our previous training script to accommodate SMAC input and results:

Rather than injecting hyper-parameter values in as a template, we now just use argparse to parse the flags that are provided by SMAC.
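A sketch of that flag parsing (parameter names match the parameter file; the defaults are our own, and parse_known_args lets any extra arguments SMAC appends pass through):

```python
import argparse

def parse_smac_flags(argv=None):
    """Parse the "-name value" flags SMAC passes for each hyper-parameter."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-nb_filters', type=int, default=32)
    parser.add_argument('-nb_conv_size', type=int, default=3)
    parser.add_argument('-dropout1', type=float, default=0.25)
    parser.add_argument('-dropout2', type=float, default=0.5)
    parser.add_argument('-nb_pool', type=int, default=2)
    args, extra = parser.parse_known_args(argv)
    return args
```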

Since we are feeding errors into an optimizer that then selects new parameters based on that error, we now keep separate validation and test datasets.  Above we are splitting the original training data into a training and validation set.  It is important to hold out a separate test dataset so we have an error metric to evaluate that is not in danger of being overfit by the optimizer.  The optimization algorithm only gets to see the validation error.

Here we train the model, evaluate it on the validation and test datasets, and then output the results in the SMAC-specific format (“Result of algorithm…”).  Note, we also catch any errors in training and validation and mark that run as “UNSAT” to SMAC so that SMAC knows that it is an invalid combination of parameters.
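The result-reporting step can be sketched as a small helper (field order follows the SMAC 2.x output format quoted earlier; reporting the validation error as the quality, and 0 for the unused runlength, are our choices):

```python
def smac_result_line(status, runtime, quality, seed):
    """Format the stdout line SMAC parses for a run's result.

    status is "SAT" for a successful run, or "UNSAT" when the sampled
    parameters produced an invalid network (e.g. a pooling size larger
    than its input). quality is the value SMAC minimizes; we use the
    validation error, 1 - accuracy.
    """
    return ("Result of this algorithm run: %s, %.2f, 0, %.4f, %d"
            % (status, runtime, quality, seed))

# After a successful training run, the script would print something like:
#   print(smac_result_line("SAT", runtime, 1.0 - val_accuracy, seed))
```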


In order for SMAC to call the Rescale python SDK, we write a wrapper script and specify it in the scenario file so that SMAC calls it for each run.  The wrapper then submits the training script to be run.

Some important excerpts from the wrapper script:

To start, we take all the command line flags passed into the script and pass them directly to our objective function.  This is the objective function that will be called for each set of hyper-parameters.

To start the objective function, we package up the input files for this training run into a .zip file.

Here, we format the training script command we will run.  Note, that we are just passing the flags from SMAC (variable X) through to the training script.  Then, we call the submit command which sends the input files to the training cluster and starts training.  We also now call the format_results script ourselves.

We parse the output file from the training script to get the expected SMAC results line (“Result of algorithm run…”), as well as the error on the test dataset.

Finally, we need to specify the scenario file that tells SMAC to call our wrapper script.
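A minimal scenario file along these lines (the file names are hypothetical):

```
algo = python smac_wrapper.py
pcs-file = network-params.pcs
run-obj = QUALITY
numberOfRunsLimit = 60
check-sat-consistency = false
```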

The important parts here are:

pcs-file: specifies the parameters
algo: wrapper script to run
numberOfRunsLimit: sets the number of training runs
check-sat-consistency: tells the optimizer that for the same training dataset, different parameter selections might lead to a feasible or infeasible model

So now that we have all of our input files, we are ready to create a job.

You can clone the job to run yourself here:

Keras MNIST SMAC Optimizer


The input.tar.gz archive consists of an input/ directory with our training script and the post-processing script.

We select the same software as before, Keras configured for Theano on a K520.


Hardware selection is roughly the same as well. We again select 2 task slots to train 2 networks in parallel.


For the optimizer, we select “Custom Optimization” and then enter the command line to run SMAC.  This command is made complicated by the fact that each SMAC process only runs one iteration at a time.  To run multiple trainings in parallel, we must use SMAC’s “shared model mode”.  Turning on this mode tells SMAC to periodically check its current directory for optimizer results from other SMAC processes and incorporate those results.  This mode requires we set the “-seed” to a different value for each SMAC process.

Since we need to run multiple optimizer processes at once, we background all the calls and then sleep for the maximum amount of time we would want to run this optimizer for.  In this case, we are waiting for 4 hours.

This job is now ready to submit.  Once running, you can live tail current iterations, and at the end, view results just as in the previous randomized case.  The results will now show both the validation and test errors for each set of hyper-parameters.


In this article, we showed 2 ways to search the space of hyper-parameters for a neural network.  One used randomized search and the other used a more sophisticated optimization tool, making use of the Rescale optimization SDK.

This article was written by Mark Whitney.


Rescale now supports running a number of neural network software packages including the Theano-based Keras. Keras is a Python package that enables a user to define a neural network layer-by-layer, train, validate, and then use it to label new images. In this article, we will train a convolutional neural network (CNN) to classify images based on the CIFAR10 dataset. We will then use this trained model to classify new images.

We will be using a modified version of the Keras CIFAR10 CNN training example and will start by going step-by-step through our modified version of the training script.

CIFAR10 Dataset

The CIFAR10 image classification dataset can be downloaded here. It consists of 60,000 32×32 pixel images, each labeled with one of 10 categories. You can either download the python version of the dataset directly or use Keras’ built-in dataset downloader (more on this later).

We will load this dataset with the following code:
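The scaling and one-hot encoding can be sketched with numpy alone (to_categorical here mirrors what Keras' np_utils.to_categorical does; the loader call is shown as a comment):

```python
import numpy as np

def to_categorical(y, nb_classes):
    """One-hot encode integer class labels."""
    onehot = np.zeros((len(y), nb_classes), dtype="float32")
    onehot[np.arange(len(y)), y] = 1.0
    return onehot

def preprocess(X, y, nb_classes=10):
    """Scale 8-bit RGB values to the 0-1.0 range and one-hot the labels."""
    return X.astype("float32") / 255.0, to_categorical(y, nb_classes)

# In the training script itself, the raw arrays come from Keras' loader:
#   (X_train, y_train), (X_test, y_test) = cifar10.load_data()
#   X_train, Y_train = preprocess(X_train, y_train.flatten())
```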

We are using the cifar10 data loader here, converting the category labels to a one-hot encoding, then scaling the 8-bit RGB values to a 0-1.0 range.

The X_train and X_test outputs are numpy matrices of RGB pixel values for every image in the training and test set. Since there are 50000 training images and 10000 test images and each image is 32×32 pixels with 3 color channels, the shape of each matrix is as follows:

X_train (50000, 3, 32, 32)
Y_train (50000)
X_test (10000, 3, 32, 32)
Y_test (10000)

The Y matrices correspond to an ordinal value representing one of the 10 image classes for the 50000 and 10000 image groups:

airplane 0
automobile 1
bird 2
cat 3
deer 4
dog 5
frog 6
horse 7
ship 8
truck 9

For the sake of simplicity, we do no further pre-processing on the correctly sized images in this example. In a real image recognition problem, we would do some sort of normalization, ZCA whitening, and/or jittering. Keras integrates some of this pre-processing with the ImageDataGenerator class.

Defining the Network

The next step is to define the neural network architecture we wish to train:

This network has 4 convolutional layers followed by 2 dense layers. Additional layers can be added and layers can be removed or changed, but the first layer must have the same size as an input image (3, 32, 32) and the last dense layer must have the same number of outputs as number of classes we are using as labels (10). After the final dense layer is a softmax layer that squashes the output to a (0, 1) range that sums to 1.

Training and Testing

Next we train and test the network:

Here we have chosen stochastic gradient descent as our optimization method with a cross entropy loss. Then we train the model using the fit() method. We specify the number of training epochs (times we iterate through the data) and the size of our batches (the number of inputs to evaluate on the network at once). Larger batch sizes correspond to more memory usage while training. After our network is trained, we evaluate the model against the test data set and print the accuracy.

Saving the Model

Finally, we save our trained model to files so that we can later re-use it:

Keras distinguishes between saving the model architecture (in our case, what is output from make_network()) and the trained weights. The weights are saved in HDF5 format.

Note that Keras does not guarantee that the saved model is compatible across different versions of Keras and Theano. We recommend you try to load saved models with the same version of Keras and Theano if possible.

Rescale Training Job

Now that we have explained the contents of the training script we will be running, we will create a Rescale job to run it on a GPU node, using a version of Keras already optimized for NVIDIA GPUs. This job is publicly available on Rescale. First, we upload the training script and CIFAR10 dataset:


Here we are uploading the pre-processed version of the CIFAR10 images as downloaded by Keras to avoid re-downloading the dataset from the CIFAR site every time the job is run. This step is optional and we could instead just upload the script.

Next, we select Keras and specify the command line:


We select Keras from the software picker and then the Theano-backed K520 GPU version of Keras. The command line re-packs the dataset we uploaded and then moves the archive into the default Keras dataset location at ~/.keras/datasets. Then it calls the training script. If we opted to not upload the CIFAR10 set ourselves, we could omit all the archive manipulation commands and just run the training script. The dataset would then be automatically downloaded to the job cluster.

In the last step, we select the GPU hardware we want to run on:


Here we have selected the Jade core type and the minimum 4 cores for this type. Finally, submit the job.

It will take about 15 minutes to provision the cluster and compile the network before training begins. Once it starts, you can view the progress by selecting process_output.log.


Once the job completes, you can use your trained model files. You can either download them from the job results page or use them in a new job, as we will show now.

Classifying New Images

We used a pre-processed numpy-formatted dataset for our training job, so what do we do if we want to take some real images off the internet and classify those? Since dog and cat are 2 of the 10 classes of images represented in CIFAR10, we pick a dog and cat image off the internet and try to classify them:

standing-cat        dog-face

We start by loading and scaling down the images:
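The reshaping step can be sketched with numpy alone (the scipy imread/imresize calls described below are shown as a comment):

```python
import numpy as np

def to_input_tensor(images):
    """Stack channels-last (32, 32, 3) RGB images into the channels-first
    (N, 3, 32, 32) float tensor the network expects, scaled to 0-1.0."""
    # Each image would come from something like:
    #   img = scipy.misc.imresize(scipy.misc.imread("standing-cat.jpg"), (32, 32))
    stacked = np.stack([img.transpose(2, 0, 1) for img in images])
    return stacked.astype("float32") / 255.0
```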

We use scipy’s imread to load the JPGs and then resize the images to 32×32 pixels. The resulting image tensor has dimensions of (32, 32, 3) and we want the color dimension to be first instead of last, so we take the transpose. Finally, combine the list of image tensors into a single tensor and normalize the levels to be between 0-1.0 as we did before. After processing, the images are smaller:

standing-cat-small                                                                   dog-face-small

Note that we performed the simplest resizing here which does not even preserve the aspect ratio of the original image. If we had done any normalization on the training images, we would also want to apply these transformations to these images as well.

Loading the Model and Labeling

Assembling the saved model is a 2 step process shown here:

Putting it together, we take the model we loaded and call predict_classes to get the class ordinal values for our 2 images.

Rescale Labeling Job

Now to put our labeling script in a job and label our example images. This job is publicly available on Rescale. We start by selecting the trained models we created: choose “Use files from cloud storage” and then select the JSON and HDF5 model files created by the training job:


Then upload our new labeling script and dog and cat images.


Select the Keras GPU software and run the labeling script. In this case, the dog and cat images are loaded from the current directory where the job is run from, so no files need to be moved around.


The labels will then be shown in process_output.log when the job completes.



The output is [3, 5] which corresponds to cat and dog from our image class table above.

That wraps up this tutorial. We successfully trained an image recognition convolutional neural network on Rescale and then used that network to label additional images. Coming soon in another post, we will talk about using more complex Rescale workflows to optimize network training.

This article was written by Mark Whitney.


Rescale is a valuable regression and performance testing resource for software vendors. Using our API and command line tools, we will discuss how you can run all or a subset of your in-house regression tests on Rescale. The advantages of testing on Rescale are as follows:

  1. Compute resources are on-demand, so you only pay when you are actually running tests
  2. Compute resources are also scalable, enabling you to run a large test suite in parallel and get feedback sooner
  3. Heterogeneous resources are available to test software performance on various hardware configurations (e.g. Infiniband, 10GigE, Tesla and Phi coprocessors)
  4. Testing your software on Rescale can then optionally enable you to provide private beta releases to other customers on Rescale

Test Setup

For the remainder of this article, we will assume you have the following sets of files:

  1. Full build tree of your software package, in a commonly supported archive format (tar.gz, zip, etc.)
  2. Archived set of reference test inputs and expected outputs
  3. Script or command line to run your software build against one or more test inputs
  4. Script to evaluate actual test output with expected output
  5. (Optional) Smaller set of incremental build products to overlay on top of the full build tree

In the examples below, we will be using our python SDK. A selection of the examples below is available in the SDK repo here. The SDK just wraps our REST API, so you can port these examples to other languages by using the endpoints referenced in our API documentation.

Note that all these examples require:

  1. An account on the Rescale platform
  2. A local RESCALE_API_KEY environment variable set to your API key found in Settings->API from the main platform page

Running tests from a single job

We will start with the simplest example, uploading a full build and test reference data as job input files, running the tests serially, and comparing the results. Let’s start with some example “software” which we will upload and run. Here is a list of the software package and test files:

Each software build and test case is archived separately. Here are the steps to prepare and run our test suite job:

  1. Upload build, reference test data, and results comparison script using the Rescale python SDK:

RescaleFile uploads the local file contents to Rescale and returns metadata to reference that file. At this point, you can view these files on the Rescale web portal.

      2. Create the test suite job:

RescaleJob creates a new job, which you can now view on the Rescale web portal. Note that here we are running the job on a single Marble core. You can opt to run on more cores by increasing coresPerSlot, or change the core type by selecting a different core type code from RescaleConnect.get_core_types().

Note that the command and postProcessScriptCommand fields can be any valid bash script, so there is quite a bit of flexibility in how you run your test and evaluate the results. In our very simple example, the post-test command comparison just diffs the out and expected_out files in each test case directory.
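A comparison script in that spirit might be (a minimal sketch, written as a reusable function; the out/expected_out file names come from the test layout described above, while the test_*/ directory pattern is our assumption):

```shell
# Compare each test case's actual output against its expected output;
# report per-case pass/fail and return nonzero if anything differed.
compare_results() {
  local status=0 d
  for d in test_*/; do
    if diff -q "${d}out" "${d}expected_out" > /dev/null 2>&1; then
      echo "${d%/} PASS"
    else
      echo "${d%/} FAIL"
      status=1
    fi
  done
  return $status
}
```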

  3. Submit the job for execution and wait for it to complete:

Once the job cluster is provisioned, the input files are transferred to the cluster, decrypted, and then uncompressed in the work directory. Next, the TEST_COMMAND is run, followed by the POST_RUN_COMPARE_COMMAND.

  4. Download the test results. All Rescale job commands have stdout redirected to process_output.log, so let’s just download that one file to get the test result summary.

It is important to note here that by doing our test result comparison as a post-processing step in our job, we avoid downloading potentially large output files, which would delay how long it takes to get test results. This doesn’t address the issue that we still need to upload the test reference outputs, which are often similar in size to the actual output. The key is that we only need to upload files to Rescale when they change, not for every test job we launch. Assuming our reference test cases do not change very often, we can now reuse the files we just uploaded to Rescale in later test runs.

You can find this example in full here.

Reusing reference test data

We will now modify the above procedure to avoid uploading reference test data for every submitted test job.

  1. (modified) Find test file metadata on Rescale and use as input file to subsequent jobs:

RescaleFile.get_newest_by_name just retrieves metadata for the test file that was already uploaded to Rescale. Note that if you uploaded multiple test archives with the same name, this will pick the most recently uploaded one.

Steps 2 through 4 are the same as the previous example.

Parallelize long running tests

The previous examples run all of your tests sequentially; let’s now run some in parallel. For this example, we assume our tests are partitioned into “short” and “long” tests. The short tests are in an archive called all_short_tests.tar.gz and each long test is in a separate archive called long_test_&lt;name&gt;.tar.gz.

We will now launch a single job for all the short tests and a job per test for the long tests. We assume these test files have already been uploaded to Rescale, as was done in the first example.

In this example, we launched our short test job with a single Marble core and each of our long tests with a 32 core (2 nodes) Nickel MPI cluster.

This test job configuration is particularly appropriate for performance tests. To test that a particular build + test case combination scales, you might launch 4 jobs with 1, 2, 4, and 8 nodes respectively.

This example can be found here.

Incremental builds

In the above, we avoided re-uploading test data for each test run by reusing the same data already stored on Rescale. If we have a large software build we need to test, we would like to also reuse already uploaded data, but each build tested will generally be different. In many cases though, only a small subset of files from the whole package will change from build to build.

To leverage the similarity in builds, we can supply an incremental build delta that will be uncompressed on top of the base build tree we uploaded in the first job. There are just 2 requirements:

  1. The build delta must have the same directory structure as the base build
  2. We need to specify the build delta archive as an input file AFTER the base build archive

In the above, base_build_input comes from the file already on Rescale and incremental_build_input is uploaded each time.

Parallelism in Design-of-Experiment (DOE) jobs

Another way to run tests is to group multiple tests into a single DOE job. The number of tests that can run in parallel is then defined by the number of task slots you configure for your job. You would then structure your test runs so that they can be parameterized by a templated configuration file, as in the templated DOE example earlier in this series.

This method has the advantage of eliminating job cluster setup time for each test, compared to the multi-job case. The disadvantage is that each test run is limited to the same hardware configuration you define for a task slot. For an example of how to set up a DOE job with the python SDK, see the examples in the SDK repository.

Large file uploads

In the above examples, we have uploaded our input files with a simple PUT request. This will be slow and/or often not work for multi-gigabyte files. An alternative is to use the Rescale CLI tool, which provides bandwidth optimized file uploads and downloads and can resume transfers if they are interrupted.

For more info, see the Rescale CLI documentation.

Running tests on Rescale is a great way to reduce testing time and strain on internal compute resources for large regression and performance test suites. The Rescale API provides a very flexible way to launch groups of tests on diverse hardware configurations, including high-memory, high-storage, InfiniBand, and GPU-enabled clusters. If you are interested in doing your own testing on Rescale, check out our SDK and example scripts, or contact us.

This article was written by Mark Whitney.