Three years ago we tested the networking performance of Google’s IaaS offering, Google Compute Engine (GCE), and Ryan posted the results in his blog post. Back then, the conclusion was that GCE instances were better suited to typical web-hosting workloads, and that there was still room for performance tuning for HPC applications. Recently, we revisited GCE with its latest instance offerings.

Benchmark Tools
To keep the results roughly comparable with the old ones, we are still using the OSU Micro Benchmarks, now at the latest version, 5.3.2. Of all the benchmarks in the suite, we picked the two most critical ones: osu_latency for the latency test and osu_bibw for the bidirectional bandwidth test.
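As a rough illustration of how a single figure can be pulled out of a benchmark run, the sketch below parses two-column text of the kind these tools print (message size, then measured value). The layout of the sample text is an assumption based on typical OSU Micro Benchmarks output, and parse_osu_output is a hypothetical helper, not part of the benchmark suite.

```python
def parse_osu_output(text):
    """Map message size in bytes to the reported value.

    Assumes header lines start with '#' and every data line has exactly
    two fields, '<size> <value>'. Anything else is skipped.
    """
    results = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 or fields[0].startswith("#"):
            continue
        try:
            results[int(fields[0])] = float(fields[1])
        except ValueError:
            continue  # not a data row
    return results


# Assumed layout. The single data row reuses the first n1-standard-32
# latency trial reported in this post.
sample = """# OSU MPI Latency Test
# Size          Latency (us)
0                       45.68
"""
latency_by_size = parse_osu_output(sample)
print(latency_by_size[0])
```

Keying the result by message size makes it easy to pick out the 0-byte row for latency or the 1,048,576-byte row for bandwidth.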

Test Environment
Operating System: Debian GNU/Linux 8 (jessie)

MPI Flavor: MPICH3

Test Instances
Since we are testing interconnect performance between VM instances, we want to make sure the instances we launch actually sit on different physical hosts, so that traffic goes through the underlying network rather than through the host machine’s memory.

So we picked the biggest instance of each series:

n1-standard-32, n1-highmem-32 and n1-highcpu-32

Test Results
For latency (in microseconds):

Instance Type     Trial #1   Trial #2   Trial #3   Average
n1-standard-32       45.68      47.03      48.46     47.06
n1-highmem-32        43.17      43.08      36.87     41.04
n1-highcpu-32        47.11      48.51      48.17     47.93

(message size: 0 bytes)

For bidirectional bandwidth (in MB/s):

Instance Type     Trial #1   Trial #2   Trial #3   Average
n1-standard-32      808.28     864.91     872.36    848.52
n1-highmem-32      1096.35    1077.33    1055.20   1076.29
n1-highcpu-32       847.68     791.16     900.32    846.39

(message size: 1,048,576 bytes)

Summary of Results
For network latency, the averages range from about 41 to 48 microseconds, roughly 4x lower than the previous result of around 180 microseconds. The new latency is also fairly consistent across the smaller instance types.
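The averages and the improvement over the earlier test can be re-derived from the trial values in the latency table. The short sketch below does that arithmetic; the 180-microsecond baseline is the approximate figure quoted above, so the ratios are approximate too.

```python
from statistics import mean

# Trial latencies in microseconds, copied from the latency table.
latency_us = {
    "n1-standard-32": [45.68, 47.03, 48.46],
    "n1-highmem-32": [43.17, 43.08, 36.87],
    "n1-highcpu-32": [47.11, 48.51, 48.17],
}
PREVIOUS_LATENCY_US = 180.0  # approximate result from the earlier test

averages = {name: mean(trials) for name, trials in latency_us.items()}
for name, avg in averages.items():
    ratio = PREVIOUS_LATENCY_US / avg
    print(f"{name}: {avg:.2f} us, about {ratio:.1f}x lower than before")
```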

For bandwidth, we don’t have a previous result to compare against, but among all the GCE instance types we found that n1-highmem-32 performs best, averaging about 1076 MB/s across our three trials. This result is in line with GCE’s official documentation: https://cloud.google.com/compute/docs/networks-and-firewalls#egress_throughput_caps.
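Network limits are usually quoted in bits per second, so it can help to convert the benchmark figure before comparing it with the documentation. The sketch below is only a unit conversion. It assumes that one MB in the benchmark output means 10^6 bytes, which is worth checking against the benchmark source, and mb_per_s_to_gbit_per_s is a hypothetical helper name.

```python
def mb_per_s_to_gbit_per_s(mb_per_s, bytes_per_mb=1_000_000):
    """Convert MB/s to Gbit/s. bytes_per_mb is an assumption (10**6 here)."""
    return mb_per_s * bytes_per_mb * 8 / 1e9


best_avg_mb_s = 1076.29  # n1-highmem-32 average from the bandwidth table
print(f"{mb_per_s_to_gbit_per_s(best_avg_mb_s):.2f} Gbit/s (bidirectional figure)")
```

Passing bytes_per_mb=1024 * 1024 instead gives the figure under the binary interpretation of MB.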

This article was written by Irwen Song.


Last November, Google released TensorFlow (www.tensorflow.org), an open source machine learning library that attracted huge attention in the field of AI. TensorFlow is also known as “Machine Learning for Everyone” since it is relatively easy to get hands-on with, even for those who don’t have much experience in machine learning. Today we are excited to announce that TensorFlow is now available on Rescale’s platform. This means you can learn to create and train your machine learning models using TensorFlow with just a web browser. I’ll walk you through how in this blog post.

Let’s Start With a Simple Case

We’ll start from the first official TensorFlow tutorial: MNIST for ML beginners. It introduces the MNIST dataset and shows how to model and train on it with softmax regression, a basic machine learning method, in TensorFlow. Here we’ll be focusing on how to set the job up and run it on the Rescale platform.
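For readers new to softmax regression, here is a minimal pure-Python sketch of its two core pieces: the softmax function, which turns raw class scores into probabilities, and the cross-entropy loss for a single example. This is not the tutorial’s TensorFlow code or the script used in this post; the function names and sample scores are made up for illustration.

```python
import math


def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Loss for one example: negative log-probability of the correct class."""
    return -math.log(probs[true_class])


# Illustrative scores for a 3-class problem (MNIST itself has 10 classes).
probs = softmax([1.0, 2.0, 3.0])
print(probs, cross_entropy(probs, true_class=2))
```

Training then amounts to adjusting the weights that produce the scores so that the average cross-entropy over the training set goes down.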

Create the Python script in a local editor and save it as mnist_for_beginners.py. The script simply puts all of the tutorial’s snippets together. Now, we need to run it on Rescale’s GPU hardware.

First, you need a Rescale account. If you don’t have one yet, create one before continuing.

If you want to skip the hassle of setting up the job step by step, you can instead view the tutorial job and clone it into your own account.

After registering, log in to Rescale and click the “+ New Job” button at the top left to create a new job.


Click “upload from this computer” and upload your Python script to Rescale.


Click “Next” to go to the Software Settings page and choose TensorFlow from the software list.  Currently 0.71 is the only supported version on Rescale, so choose this version and type “python ./mnist_for_beginners.py” in the Command field.  Select “Next” to go to the Hardware Settings page.


In Hardware Settings, choose core type Jade and select 4 cores.  This job is not very compute intensive, so we choose the minimum valid number of cores.  We can skip the post-processing for this example, and click “Submit” on the Review page to submit the job.


It takes about 4 to 5 minutes to launch the server and about 1 minute to run the job. While the job is running, you can use Rescale’s live tailing feature to monitor the files in the working directory.

After the job is finished, you can view the files from the results page. Let’s take a look at process_output.log, which holds the output of the Python script we uploaded. On the third line from the bottom, we can verify that the accuracy is 91.45%.
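If you download process_output.log, the same check can be scripted. The sketch below is a hypothetical helper that reads the first number on a given line counted from the end of the log. It assumes blank lines are ignored and that the accuracy is printed as a plain number; the exact layout of the log is not specified here, so adjust it to what you actually see.

```python
def accuracy_from_log(log_text, lines_from_bottom=3):
    """Return the first number on the chosen line (counted from the end), or None."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    if len(lines) < lines_from_bottom:
        return None
    for token in lines[-lines_from_bottom].split():
        try:
            return float(token)
        except ValueError:
            continue  # not a number, try the next token
    return None


# Usage, once the log has been downloaded next to this script:
# with open("process_output.log") as f:
#     print(accuracy_from_log(f.read()))
```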


A More Advanced Model

In the second TensorFlow tutorial, a more advanced model is built with a multilayer convolutional network to increase the accuracy to 99.32%.

To run this advanced model on Rescale, simply repeat the steps from the first example, replacing the Python script with the new model from the tutorial. Alternatively, you can view and clone an existing job.

Single GPU vs. Multiple GPU Performance Speedup Test

If you have more than one GPU on your machine, TensorFlow can utilize all of them for better performance. In this section, we benchmark a machine with a single K520 GPU against a machine with four K520 GPUs and measure the speedup.

The CIFAR10 Convolutional Neural Network example is used as our benchmarking job. Our results show that with 4 times the number of GPUs, the number of examples processed per second is only 2.37 times the single-GPU figure.
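One common way to read this number is as scaling efficiency, the measured speedup divided by the number of GPUs. The sketch below applies that definition to the 2.37x figure; scaling_efficiency is a hypothetical helper name.

```python
def scaling_efficiency(speedup, n_devices):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling)."""
    return speedup / n_devices


measured_speedup = 2.37  # examples/sec on four GPUs relative to one GPU
n_gpus = 4
efficiency = scaling_efficiency(measured_speedup, n_gpus)
print(f"Scaling efficiency on {n_gpus} GPUs: {efficiency:.0%}")
```

In other words, each of the four GPUs contributes a bit under 60% of what a single GPU delivers on its own.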


Work Ahead

TensorFlow released a new distributed version (v0.8) on 4/13/2016, which can distribute the workload across the GPUs on multiple machines. It would be very interesting to see how it performs on a multi-node, multi-GPU cluster. Before that, we’ll make launching a multi-node, multi-GPU cluster with TensorFlow support on Rescale as simple as possible.

This article was written by Irwen Song.