In a previous post, we showed examples of using multiple GPUs to train a deep neural network (DNN) using the Torch machine learning library. In this post, we will focus on performing multi-GPU training using TensorFlow.

In particular, we will explore data-parallel GPU training with multi-GPU and multi-node configurations on Rescale. We will leverage Rescale’s existing MPI configured clusters to easily launch TensorFlow distributed training workers. For a basic example of training with TensorFlow on a single GPU, see this previous post.

Preparing Data
To make our multi-GPU training sessions more interesting, we will be using some larger datasets. Later, we will show a training job on the popular ImageNet image classification dataset. Before we start with this 150 GB dataset, we will prepare a smaller dataset to be in the same format as ImageNet and test our jobs with that in order to make sure the TensorFlow trainer is working properly. In order to keep data local during training, Rescale syncs the dataset to local storage on the GPU nodes before training starts. Waiting for a large dataset like ImageNet to sync when iteratively developing a model using just a few examples is wasteful. For this reason, we will start with the smaller Flowers dataset and then move to ImageNet once we have a working example.

TensorFlow consumes images formatted as TFRecords, so first let’s download the flowers dataset and preprocess its images into that format. All the examples we will be showing today come out of the inception model in the tensorflow/models repository on GitHub, so we start by cloning that repository:
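The clone step looks like this (the inception code lives under models/inception as of this writing; paths in the repository may have moved since):

```shell
git clone https://github.com/tensorflow/models.git
cd models/inception
```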

Now we use the bazel build tool to build the flowers download-and-preprocess script and then run it:
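Following the inception README, that looks like the commands below (the output directory name is our choice):

```shell
# build the download-and-preprocess helper, then run it,
# pointing it at a directory for the processed dataset
bazel build inception/download_and_preprocess_flowers
bazel-bin/inception/download_and_preprocess_flowers "$HOME/flowers-data"
```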

This should download a ~220MB archive and create something like this:
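The resulting layout is roughly the following (shard names and counts are illustrative and may differ):

```
flowers-data/
├── raw-data/                    # original images, one subdirectory per label
├── train-00000-of-00002         # TFRecord shards for training
├── train-00001-of-00002
├── validation-00000-of-00002    # TFRecord shards for validation
└── validation-00001-of-00002
```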

We assemble this archive and then upload it to Rescale. Optionally, you can delete the raw-data and archive file since all the necessary information is now encoded as TFRecords.
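As a sketch (directory name as above; the upload itself happens through the Rescale web UI):

```shell
tar czf flowers.tar.gz flowers-data
# after uploading flowers.tar.gz to Rescale, optionally reclaim space,
# since the TFRecords hold everything we need:
rm -rf flowers-data/raw-data flowers.tar.gz
```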

We have assembled all these operations in a preprocessing job on Rescale here for you to clone and run yourself.

Next, let’s take the flowers.tar.gz file we just produced and convert it to an input file for the next step:


Now we have our preprocessed flower image TFRecords ready for training.

Single Node – Multiple GPUs
The next step is to take this input dataset and train a model with it. We will be using the Inception v3 DNN architecture from the tensorflow/models repository as mentioned above. Training on a single node with multiple GPUs looks something like this:

(from https://github.com/tensorflow/models/tree/master/inception#how-to-train-from-scratch)


We will first create a Rescale job that runs on a single node, since that has fewer moving parts than the multi-node case. We will actually run 3 processes in the job:

  • Main GPU-based model training
  • CPU-based evaluation of checkpointed models on the validation set
  • TensorBoard visualization tool

So let’s get started! First build the training and evaluation scripts.
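The build step, following the inception README:

```shell
cd models/inception
bazel build inception/imagenet_train inception/imagenet_eval
```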

Next, create some output directories and start the main training process: 
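A sketch of the training launch (the batch size here is an assumption; tune it for your GPU memory):

```shell
mkdir -p out/train out/eval
bazel-bin/inception/imagenet_train \
  --num_gpus=$RESCALE_GPUS_PER_SLOT \
  --batch_size=32 \
  --train_dir=out/train \
  --data_dir=flowers
```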

$RESCALE_GPUS_PER_SLOT is a variable set in all Rescale job environments. In this command line, we point to the flowers directory with our training images and the empty out/train directory where TensorFlow will output logs and model files.

Evaluation of accuracy on the validation set can be done separately and does not need GPU acceleration:
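A sketch of the evaluation launch (the evaluation interval is an assumption):

```shell
# empty CUDA_VISIBLE_DEVICES keeps this process off the GPU entirely
CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_eval \
  --checkpoint_dir=out/train \
  --eval_dir=out/eval \
  --eval_interval_secs=600 \
  --data_dir=flowers
```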

imagenet_eval sits in a loop and wakes up every eval_interval_secs seconds to evaluate the accuracy of the most recently trained model checkpoint in out/train against the validation TFRecords in the flowers directory. Accuracy results are logged to out/eval. CUDA_VISIBLE_DEVICES is an important parameter here: by default, TensorFlow claims GPU memory for itself even when it will not make use of the GPU. Without this parameter, the training and evaluation processes together would exhaust all the memory on the GPU and cause the training to fail.

Finally, TensorBoard is a handy tool for monitoring TensorFlow’s progress. TensorBoard runs its own web server to show plots of training progress, a graph of the model, and many other visualizations. To start it, we just have to point it at the out directory where our training and evaluation processes are writing their output:
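For example (6006 is TensorBoard's default port):

```shell
tensorboard --logdir=out --port=6006
```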

TensorBoard pulls in logs from all subdirectories of logdir, so it will show training and evaluation data together.

Putting this all together:
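A sketch of the full single-node run script (paths, batch size, and the startup delay are assumptions):

```shell
cd models/inception
mkdir -p out/train out/eval
bazel build inception/imagenet_train inception/imagenet_eval

# monitoring and evaluation run in the background
tensorboard --logdir=out --port=6006 &
(sleep 600; CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_eval \
  --checkpoint_dir=out/train --eval_dir=out/eval --data_dir=flowers) &

# main training process runs in the foreground
bazel-bin/inception/imagenet_train \
  --num_gpus=$RESCALE_GPUS_PER_SLOT \
  --batch_size=32 \
  --train_dir=out/train \
  --data_dir=flowers
```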

Since this all runs in a single shell, we background the TensorBoard and evaluation processes. We also delay start of the evaluation process since the training process needs a few minutes to initialize and create the first model checkpoint.

You can run this training job on Rescale here.

Since TensorBoard runs its own web server without any authentication, access is blocked by default on Rescale. The easiest way to get access to TensorBoard is to open an SSH tunnel to the node and forward port 6006:
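For example (the user and node address placeholders come from the Rescale job's connection details):

```shell
# forward local port 6006 to TensorBoard on the cluster node
ssh -L 6006:localhost:6006 <user>@<node-address>
```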


Now navigate to http://localhost:6006 and you should see something like this:


Multiple Nodes
The current state of the art limits the number of GPU cards that fit on a single node to around 8. Additionally, the CUDA peer-to-peer system, the mechanism a TensorFlow process uses to distribute work amongst GPUs, is currently limited to 8 GPU devices. While these numbers will continue to increase, it is still convenient to have a mechanism to scale training out for large models and datasets. TensorFlow distributed training synchronizes updates between different training processes over the network, so it can be used with any network fabric and is not limited by CUDA implementation details. Distributed training consists of some number of workers and parameter servers, as shown here:

(from https://github.com/tensorflow/models/tree/master/inception#how-to-train-from-scratch)


Parameter servers provide the model parameters, which the workers use to evaluate an input batch. After each worker completes its batch, the error gradients are fed back to the parameter servers, which use them to produce new model parameters. In the context of a GPU cluster, we could run one worker process per GPU in our cluster and then pick enough parameter servers to keep up with processing the gradients.

Following the instructions here, we set up a worker per GPU and a parameter server per node. We will take advantage of the MPI configuration that comes with every Rescale cluster.

To start, we need to generate the host strings that will be passed to each parameter server and worker, with each process getting a unique hostname:port combination, for example:
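On a hypothetical two-node cluster with 4 GPUs per node, the host strings might look like this (the ports are arbitrary unprivileged choices):

```
parameter servers (one per node):  node1:2222,node2:2222
workers (one per GPU):             node1:2223,node1:2224,node1:2225,node1:2226,
                                   node2:2223,node2:2224,node2:2225,node2:2226
```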

We want a single entry per host for the parameter servers and a single entry per GPU for the workers. We take advantage of the machine files that are automatically set up on every Rescale cluster. $HOME/machinefile just has a list of hosts in the cluster and $HOME/machinefile.gpu has a list of hosts annotated with the number of GPUs on each host. We parse them to generate our host strings in a Python script: make_hoststrings.py
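A sketch of what make_hoststrings.py does (the real script ships with the Rescale job; the machinefile line format and port numbers here are assumptions):

```python
#!/usr/bin/env python
# make_hoststrings.py (sketch): build TensorFlow ps/worker host strings
# from MPI machinefiles. Assumes lines like "node1" or "node1 slots=4".
import sys

PS_PORT = 2222           # one parameter server per node
WORKER_BASE_PORT = 2223  # one worker per GPU, consecutive ports per node

def parse_machinefile(path):
    """Return a list of (host, slots) tuples."""
    entries = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            slots = 1
            for p in parts[1:]:
                if p.startswith('slots='):
                    slots = int(p.split('=', 1)[1])
            entries.append((parts[0], slots))
    return entries

def hoststrings(machinefile, gpu_machinefile):
    """Build the comma-separated ps_hosts and worker_hosts strings."""
    ps = ','.join('%s:%d' % (host, PS_PORT)
                  for host, _ in parse_machinefile(machinefile))
    workers = ','.join('%s:%d' % (host, WORKER_BASE_PORT + i)
                       for host, slots in parse_machinefile(gpu_machinefile)
                       for i in range(slots))
    return ps, workers

if __name__ == '__main__' and len(sys.argv) == 3:
    ps, workers = hoststrings(sys.argv[1], sys.argv[2])
    print('%s %s' % (ps, workers))
```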

Next we have a script that takes these host strings and launches the imagenet_distributed_train script with the proper task ID and GPU whitelist, tf_mpistart.sh:
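A sketch of that launcher (the helper script name and the data/train paths are assumptions carried over from the earlier steps):

```shell
#!/bin/bash
# tf_mpistart.sh <ps|worker> -- run under mpirun so that OpenMPI
# populates the $OMPI_COMM_WORLD_* variables.
JOB_NAME=$1

# hypothetical helper from the previous step: prints "ps_hosts worker_hosts"
read PS_HOSTS WORKER_HOSTS <<< \
  "$(python make_hoststrings.py $HOME/machinefile $HOME/machinefile.gpu)"

if [ "$JOB_NAME" = "worker" ]; then
  # pin this worker to one GPU using its node-local rank
  export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
else
  # keep parameter servers off the GPUs entirely
  export CUDA_VISIBLE_DEVICES=
fi

exec bazel-bin/inception/imagenet_distributed_train \
  --job_name="$JOB_NAME" \
  --task_id=$OMPI_COMM_WORLD_RANK \
  --ps_hosts="$PS_HOSTS" \
  --worker_hosts="$WORKER_HOSTS" \
  --data_dir=shared/flowers \
  --train_dir=shared/out/train
```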

tf_mpistart.sh will be run with OpenMPI’s mpirun, so the $OMPI_* environment variables are automatically injected. We use $OMPI_COMM_WORLD_RANK to get a global task index and $OMPI_COMM_WORLD_LOCAL_RANK to get a node-local GPU index.

Now, putting it all together:
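A sketch of the full multi-node run script (directory layout, ports, and the startup delay are assumptions):

```shell
cd models/inception
mkdir -p shared/out/train shared/out/eval
mv "$HOME/flowers" shared/

# build on the shared filesystem instead of symlinking into $HOME/.cache
bazel --output_base="$PWD/shared/bazel" build \
  inception/imagenet_distributed_train inception/imagenet_eval

# TensorBoard and evaluation run only here, on the MPI master
tensorboard --logdir=shared/out --port=6006 &
(sleep 600; CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_eval \
  --checkpoint_dir=shared/out/train \
  --eval_dir=shared/out/eval \
  --data_dir=shared/flowers) &

# one parameter server per node, one worker per GPU
mpirun -machinefile $HOME/machinefile ./tf_mpistart.sh ps &
mpirun -machinefile $HOME/machinefile.gpu ./tf_mpistart.sh worker
```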

We start with a bunch of the same directory creation and bazel build boilerplate. The 2 exceptions are:

1. We move all the input directories to the shared/ subdirectory so it is shared across nodes.
2. We now call the bazel build command with the --output_base option so that bazel doesn’t symlink the build products into $HOME/.cache and instead makes them available on the shared filesystem.

Next we launch TensorBoard and imagenet_eval locally on the MPI master. These 2 processes don’t need to be replicated across nodes with mpirun.

Finally, we launch the parameter servers with tf_mpistart.sh ps and the one-entry-per-node machinefile, and then the workers with tf_mpistart.sh worker and the per-GPU machinefile.gpu.

Here is an example job performing the above distributed training on the flowers dataset using 2 Jade nodes (8 K520 GPUs). Note that since we are using the MPI infrastructure already set up on Rescale, we can use this same example for any number of nodes or GPUs-per-node. Using the appropriate machinefiles, the number of workers and parameter servers are set automatically to match the resources.

Training on ImageNet
Now that we have developed the machinery to launch a TensorFlow distributed training job on the smaller flowers dataset, we are ready to train on the full ImageNet dataset. Downloading ImageNet requires permission here. You can request access, and upon acceptance you will be given a username and password to download the necessary tarballs.

We can then run a preparation job similar to the flowers job above to download the dataset and format the images into TFRecords:
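A sketch of the preparation step, using the credentials issued by the ImageNet site (the output directory name is our choice):

```shell
# the download script reads these environment variables
export IMAGENET_USERNAME=your_username
export IMAGENET_ACCESS_KEY=your_access_key

cd models/inception
bazel build inception/download_and_preprocess_imagenet
bazel-bin/inception/download_and_preprocess_imagenet "$HOME/imagenet-data"
```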

You can clone and run this preparation job on Rescale here.

If you have already downloaded the 3 necessary inputs from the ImageNet site (ILSVRC2012_img_train.tar, ILSVRC2012_img_val.tar, and ILSVRC2012_bbox_train_v2.tar.gz) and have placed them somewhere accessible via HTTP (like an AWS S3 bucket), you can customize models/inception/inception/data/download_imagenet.sh in the tensorflow/models repository to download from your custom location:
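The change amounts to something like the following inside download_imagenet.sh (the bucket name here is hypothetical):

```shell
# replace the authenticated image-net.org URLs with your own mirror
BASE_URL="https://my-imagenet-mirror.s3.amazonaws.com"
wget "${BASE_URL}/ILSVRC2012_img_train.tar"
wget "${BASE_URL}/ILSVRC2012_img_val.tar"
wget "${BASE_URL}/ILSVRC2012_bbox_train_v2.tar.gz"
```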

Clone and run this version of the preparation job here.

Finally, we can make some slight modifications to our multi-GPU flowers jobs to take the imagenet-data dataset directory instead of the flowers directory by changing the $DATASET variable at the top:
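For the single-node case, that is a one-line change at the top of the run script:

```shell
DATASET=imagenet-data   # previously: DATASET=flowers
```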

And the distributed training case:

We have gone through all the details for performing multi-GPU, single and multi-node model training with TensorFlow. In an upcoming post, we will discuss the performance ramifications of distributed training, and look at how well it scales on different server configurations.

Rescale Jobs
Here is a summary of the Rescale jobs used in this example. Click the links to import the jobs to your Rescale account.

Flowers dataset preprocessing:
https://platform.rescale.com/tutorials/tensorflow-flowers-preprocess/clone/

Single node flowers training: 
https://platform.rescale.com/tutorials/tensorflow-flower-train-single-node/clone/

Multiple nodes flowers training: 
https://platform.rescale.com/tutorials/tensorflow-flower-train-distributed/clone/

ImageNet ILSVRC2012 download and preprocessing: 
https://platform.rescale.com/tutorials/tensorflow-imagenet-preprocess/clone/

ImageNet ILSVRC2012 download from existing S3 bucket: 
https://platform.rescale.com/tutorials/tensorflow-imagenet-noauth-preprocess/clone/

This article was written by Mark Whitney.

How does a large automotive manufacturer leverage the elasticity of the cloud, and what do they look for in a provider?

To find out, we asked a CAE manager at American Axle & Manufacturing, a Tier 1 automotive supplier of driveline and drivetrain systems that operates in 13 countries globally with annual revenues of $3.9 billion. Read below for his take on how Rescale has added value to his organization.

Rescale:  Alexy, can you start off by introducing yourself and American Axle & Manufacturing?

Alexy: My name is Alexy Kolesnikov. I manage computational fluid dynamics and thermal projects for American Axle & Manufacturing. American Axle & Manufacturing is a leading global Tier 1 automotive supplier. Our clients are all the major automotive companies. We supply driveline and drivetrain systems, for example front axles, rear axles, PTUs, and eDrive units.

Rescale: Tell us more about your role.

Alexy: I work in a CAE department, and we simulate all these components. In our products, we simulate oil flows, heating and cooling, those kinds of things. Simulation is like a design tool; instead of testing any given product 20 times and ordering parts, you pick an advanced optimizer to run simulations.

Rescale: Can you describe your simulation needs and the computing environment that you were operating with before you started using Rescale?

“What Rescale provides is essentially the on-demand capability to scale up your hardware resources almost infinitely. You can do large numbers of runs in parallel in a very limited number of days. That’s what Rescale gives us.”

– Alexy Kolesnikov, Manager CFD & Thermal Analytical Engineering

Alexy:  We use computational fluid dynamics to simulate the lubrication process. We want to optimize the way oil goes onto bearings or gears inside our products. These simulations are fairly complex. It’s fairly hard to simulate how oil is going to move. Because it is a multi-phased problem, high-fidelity, accurate simulation requires significant computing resources. Usually our jobs, our runs, our models require anywhere between 32 and 150 cores to run, which dictates the size of the server you need. Before Rescale, we had a single server with 64 cores.

Rescale: What were the pain points that led you to consider Rescale or another cloud HPC solution?

Alexy: Well, it was the scalability of our hardware requirements. For us to optimize a specific axle, be it a front axle or an eDrive unit, we need to conduct several runs. (By runs, I mean different angles of inclination of a moving car, for example, or an SUV driving at different speeds.) You need to look at a number of these conditions, which dictate how the oil moves. Depending on how soon you need an answer, you might want to run a number of these jobs simultaneously, instead of waiting to run them one by one. What Rescale provides is essentially the on-demand capability to scale up your hardware resources almost infinitely. You can do large numbers of runs in parallel in a very limited number of days. That’s what Rescale gives us.

Rescale: Did you consider other cloud HPC solutions, and if so why did you choose Rescale?

“Rescale has an established relationship with CD-adapco, which allows us to use the software licenses provided by CD-adapco efficiently in conjunction with the hardware provided by Rescale. It’s called the Power-on-Demand license. We only utilize it when we have to run a large number of projects, and we are not billed when we are not running. That was a very efficient way for us to use our software and hardware resources.”

– Alexy Kolesnikov, Manager CFD & Thermal Analytical Engineering

Alexy: We did. One of the important factors in our decision was the availability of a code from CD-adapco called STAR-CCM+. We use it in-house to model lubrication and a number of other things. Rescale has an established relationship with CD-adapco, which allows us to use the software licenses provided by CD-adapco efficiently in conjunction with the hardware provided by Rescale. It’s called the Power-on-Demand license. We only utilize it when we have to run a large number of projects, and we are not billed when we are not running. That was a very efficient way for us to use our software and hardware resources.

Another thing we like about Rescale is the willingness of your company to install new software. For example, when we wanted to try different software packages that you guys didn’t have, you installed them for us. That kind of willingness to work with a client is pretty impressive.

Rescale: Could you give us an overview of how you use Rescale? Has Rescale changed the way you approach simulation?

Alexy: It has allowed us to efficiently deliver answers to internal or external clients. All clients and all problems and all projects are different. Sometimes you have weeks or months to deliver results, but sometimes all of a sudden an issue comes up and you have to have the answers tomorrow. Rescale gives us the flexibility to satisfy both needs. It fits into the structure of how we run our analysis. For example, we use Rescale anytime we need to run five jobs at the same time to deliver results in two days. We can run them in parallel on Rescale, which would require us otherwise to have five servers in-house. This on-demand scalability enables us to deliver results that are urgently needed.

“Another thing we like about Rescale is the willingness of your company to install new software. For example, when we wanted to try different software packages that you guys didn’t have, you installed them for us. That kind of willingness to work with a client is pretty impressive.”

– Alexy Kolesnikov, Manager CFD & Thermal Analytical Engineering

Rescale: How has using Rescale impacted the business overall?

Alexy: It clearly made CAE simulation way more efficient. It’s hard to estimate the advantage of having the ability to deliver results on-demand. We’ve been using Rescale for a year and a half, and at this point the main advantage is the ability to scale up and scale down. Going forward, we will probably use additional types of software for different types of analyses. But for right now, the main impact is this ability to quickly deliver results when an urgent problem comes up.

Rescale: How do you think you’ll use Rescale in the future?

Alexy: We are considering using Rescale not just for bursting, but also to substitute the in-house server a little bit. As we look forward to determine how many servers we need in-house to run 24/7 jobs, we might consider using Rescale for those jobs, in addition to on-demand, scale-up jobs. We have to look at how much it costs to maintain our server in-house compared to how much it costs to run on the cloud.

Click here for more information on using STAR-CCM+ on Rescale.

Sign up for a free trial to see how Rescale can accelerate your simulations or add complexity to your models.

This article was written by Rescale.