It has been about a year and a half since we released a reusable Azure Cloud Service for provisioning a simple Windows MS-MPI cluster without having to install HPC Pack. Azure has undergone a lot of changes since that time and we thought it would be worth revisiting this topic to see what the current landscape looks like for running Windows MPI applications in the cloud.

First, Cloud Services and the so-called “IaaS v1 Virtual Machines” have been relegated to “Classic” status in the Azure portal. Microsoft now recommends that all new deployments use Azure Resource Manager (ARM). Azure Resource Manager allows clients to submit a declarative template written in JSON that defines all of the cloud resources, such as VMs, load balancers, and network interfaces, that need to be created as part of an application or cluster. Dependencies can be defined between resources, and the Resource Manager is smart enough to parallelize resource deployment where it can. This can make deploying a new cluster or application much faster than under the old model. Azure Resource Manager is essentially the equivalent of CloudFormation on AWS. There are some additional niceties here, though, such as being able to specify loops in the template. Dealing with conditional resource deployment, however, is clunkier in ARM templates than in CloudFormation. Both services suffer from trying to support programming logic from within JSON. All in all, however, ARM deployments are much easier to manage than Classic ones.
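
To make the template structure concrete, here is a minimal sketch in Python that builds a skeletal ARM template as a dictionary and prints it as JSON. It only illustrates the `copy` loop and `dependsOn` ideas described above and is not deployable as written: the resource `properties` are left empty, and the parameter name `nodeCount`, the resource names (`nic0`, `N0`, ...), and the `apiVersion` value are assumptions made for illustration, not taken from the actual template.

```python
import json


def make_cluster_template(api_version="2015-06-15"):
    """Return a skeletal ARM template as a dict (illustrative, not deployable as-is)."""
    nic = {
        "type": "Microsoft.Network/networkInterfaces",
        "apiVersion": api_version,  # assumed value, adjust to a supported version
        # copyIndex() yields 0, 1, ... so the loop creates nic0, nic1, ...
        "name": "[concat('nic', copyIndex())]",
        "location": "[resourceGroup().location]",
        "copy": {"name": "nicLoop", "count": "[parameters('nodeCount')]"},
        "properties": {},  # IP configuration omitted in this sketch
    }
    vm = {
        "type": "Microsoft.Compute/virtualMachines",
        "apiVersion": api_version,
        "name": "[concat('N', copyIndex())]",
        "location": "[resourceGroup().location]",
        "copy": {"name": "vmLoop", "count": "[parameters('nodeCount')]"},
        # Each VM waits only for its own NIC, so independent resources
        # can still be deployed in parallel.
        "dependsOn": [
            "[concat('Microsoft.Network/networkInterfaces/nic', copyIndex())]"
        ],
        "properties": {},  # hardware, OS, and network profiles omitted
    }
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {"nodeCount": {"type": "int", "defaultValue": 2}},
        "resources": [nic, vm],
    }


template = make_cluster_template()
print(json.dumps(template, indent=2))
```

Building the template in a general-purpose language and serializing it is one way to sidestep the awkwardness of expressing logic directly in JSON, at the cost of an extra generation step.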

The Azure Quickstart Templates project on GitHub is a great resource for finding ARM templates. Deploying an application is as simple as clicking a Deploy to Azure button and filling in a few template parameter values. On the HPC front, there is a handy HPC Pack example available that can be used to provision and set up the scheduler.

However, as we touched on in our original blog post, using HPC Pack may not be the best choice if you are getting started with MPI and simply want to spin up a new MPI cluster, test your application, and then shut everything back down again. While HPC Pack provides the capabilities of a full-blown HPC scheduler, this additional power comes at the cost of some resource overhead on the submit node (setting up Active Directory, installing SQL Server, etc.). This can be overkill if you just want a one-off cluster to run an MPI application.

Another potentially lighter-weight option for running Windows MPI applications in the cloud is the Azure Batch service. Recently, Microsoft announced support for running multi-instance MPI tasks on a pool of VMs. This looks to be a useful option for those who are interested in automating the execution of MPI jobs; however, it does require some investment of developer resources to become familiar with the service before MPI jobs can be run.

We feel there is still room for an Azure Resource Manager template that 1) launches a bare-bones Windows MPI cluster without the overhead of HPC Pack and 2) allows MPI jobs to be run from the command line or a batch script from any operating system.

On that second point above, another interesting development since our original post is that Microsoft has decided to officially support SSH for remote access. Since that announcement, the pre-release version of the code has been made available on GitHub.

So, given those pieces, we decided to put together a simple ARM template to accomplish both of those goals. For someone getting started with MS-MPI, we feel this is a simpler way to get your code running on a Windows cluster in Azure.

Here is a basic usage example:

  1. Click the Deploy To Azure button from the GitHub project. Fill in the template parameters. Here, a 2-node Standard_D2 cluster is being provisioned:

  2. Make a note of the public IP address assigned to the cluster when the deployment completes.
  3. The template will enable SSH and SFTP on all of the nodes. Upload your application to the first VM in the cluster (N0). Here we are using the hello world application from this blog post.
  4. SSH into N0, copy the MPI binary into the shared SMB directory (C:\shared), and run it. Enter your password as the argument to the -pwd switch (redacted below). The -savecreds command line argument will securely save your credentials on the compute nodes so you don’t have to specify the password in future mpiexec calls. See here for more details.
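
The shape of the mpiexec call in step 4 can be sketched with a small Python helper that assembles the argument list. This is an illustration under stated assumptions rather than the exact command used above: the binary name `hello.exe` and the second node name `N1` are hypothetical, the helper `build_mpiexec_command` is introduced here for illustration only, and the use of the MS-MPI `-hosts` option with one process per node is an assumption about how the nodes are selected.

```python
def build_mpiexec_command(binary, hosts, password=None, save_creds=False):
    """Build an MS-MPI mpiexec argument list that runs one process per host."""
    # -hosts takes the host count, then each host name followed by its process count.
    cmd = ["mpiexec", "-hosts", str(len(hosts))]
    for host in hosts:
        cmd += [host, "1"]
    if password is not None:
        cmd += ["-pwd", password]  # only needed until credentials are saved
    if save_creds:
        cmd.append("-savecreds")
    cmd.append(binary)
    return cmd


# First run: supply the password and ask mpiexec to remember it.
first_run = build_mpiexec_command(
    r"C:\shared\hello.exe", ["N0", "N1"],
    password="<your-password>", save_creds=True)

# Later runs: the saved credentials make -pwd unnecessary.
later_run = build_mpiexec_command(r"C:\shared\hello.exe", ["N0", "N1"])

print(" ".join(first_run))
print(" ".join(later_run))
```

On N0, a list like this could be handed to `subprocess.run` from a script, which is one way to drive repeated runs from a batch job.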

And that’s it! For those who are more GUI-inclined, RDP is also opened up to all of the instances in the MPI cluster. Head on over to the GitHub project page for more details.

This article was written by Ryan Kaneshiro.


Rescale now supports running a number of neural network software packages including the Theano-based Keras. Keras is a Python package that enables a user to define a neural network layer-by-layer, train, validate, and then use it to label new images. In this article, we will train a convolutional neural network (CNN) to classify images based on the CIFAR10 dataset. We will then use this trained model to classify new images.

We will be using a modified version of the Keras CIFAR10 CNN training example and will start by going step-by-step through our modified version of the training script.

CIFAR10 Dataset

The CIFAR10 image classification dataset can be downloaded here. It consists of 60000 32×32 pixel images, each labeled with one of 10 categories. You can either download the Python version of the dataset directly or use Keras’ built-in dataset downloader (more on this later).

We will load this dataset with the following code:


We are using the cifar10 data loader here, converting the category labels to a one-hot encoding, then scaling the 8-bit RGB values to a 0-1.0 range.
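
A minimal sketch of these steps follows. The one-hot encoding and pixel scaling are written in plain NumPy so they can be checked without Keras installed; the helper names `to_one_hot`, `scale_pixels`, and `load_cifar10` are introduced here for illustration and are not part of the training script. Only `load_cifar10` touches Keras, through `keras.datasets.cifar10.load_data()`, and the axis order of the arrays it returns depends on the Keras version and backend configuration.

```python
import numpy as np

NB_CLASSES = 10


def to_one_hot(labels, nb_classes=NB_CLASSES):
    """Turn ordinal labels (any shape) into an (n, nb_classes) one-hot array."""
    labels = np.asarray(labels).ravel().astype(int)
    one_hot = np.zeros((labels.size, nb_classes), dtype="float32")
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot


def scale_pixels(X):
    """Scale 8-bit pixel values into the 0-1.0 range."""
    return np.asarray(X).astype("float32") / 255.0


def load_cifar10():
    """Load CIFAR10 through Keras and apply both transformations."""
    from keras.datasets import cifar10  # imported lazily; requires Keras
    (X_train, y_train), (X_test, y_test) = cifar10.load_data()
    return ((scale_pixels(X_train), to_one_hot(y_train)),
            (scale_pixels(X_test), to_one_hot(y_test)))


# Quick check on stand-in data: labels 3 (cat) and 5 (dog).
print(to_one_hot([3, 5]))
print(scale_pixels(np.array([0, 128, 255], dtype="uint8")))
```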

The X_train and X_test outputs are numpy arrays of RGB pixel values for every image in the training and test set. Since there are 50000 training images and 10000 test images and each image is 32×32 pixels with 3 color channels, the shape of each array is as follows:

X_train (50000, 3, 32, 32)
Y_train (50000)
X_test (10000, 3, 32, 32)
Y_test (10000)

The Y arrays hold one ordinal value per image, identifying which of the 10 image classes each of the 50000 training and 10000 test images belongs to:

airplane 0
automobile 1
bird 2
cat 3
deer 4
dog 5
frog 6
horse 7
ship 8
truck 9

For the sake of simplicity, we do no further pre-processing on the correctly sized images in this example. In a real image recognition problem, we would do some sort of normalization, ZCA whitening, and/or jittering. Keras integrates some of this pre-processing with the ImageDataGenerator class.

Defining the Network

The next step is to define the neural network architecture we wish to train:

This network has 4 convolutional layers followed by 2 dense layers. Additional layers can be added and layers can be removed or changed, but the input shape of the first layer must match the size of an input image (3, 32, 32) and the last dense layer must have as many outputs as there are classes used as labels (10). After the final dense layer is a softmax layer that squashes the outputs into the (0, 1) range so that they sum to 1.
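
To illustrate what the final softmax layer does, here is a small NumPy sketch, independent of Keras. The ten input scores are made-up values standing in for the outputs of the last dense layer.

```python
import numpy as np


def softmax(logits):
    """Map raw scores to positive values that sum to 1 along the last axis."""
    logits = np.asarray(logits, dtype="float64")
    # Subtracting the row maximum avoids overflow and does not change the result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# One row of 10 made-up scores, one per CIFAR10 class.
scores = np.array([[2.0, -1.0, 0.5, 4.0, 0.0, 3.0, -2.0, 1.0, 0.0, -0.5]])
probs = softmax(scores)
print(probs.round(3))
print(probs.sum())
```

The largest score (index 3 here) receives the largest probability, which is why the index of the maximum output can be read as the predicted class.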

Training and Testing

Next we train and test the network:

Here we have chosen stochastic gradient descent as our optimization method with a cross entropy loss. Then we train the model using the fit() method. We specify the number of training epochs (times we iterate through the data) and the size of our batches (the number of inputs to evaluate on the network at once). Larger batch sizes correspond to more memory usage while training. After our network is trained, we evaluate the model against the test data set and print the accuracy.
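
The roles of epochs, batch size, and the cross entropy loss can be seen in a stripped-down sketch. This is not the CNN training code: it swaps in a plain softmax classifier trained with mini-batch stochastic gradient descent in NumPy, on small synthetic data, purely to show the loop structure that a `fit()` call hides. The function name `fit_sgd` and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, Y):
    """Mean categorical cross entropy for one-hot targets Y."""
    return float(-np.mean(np.sum(Y * np.log(probs + 1e-12), axis=1)))


def fit_sgd(X, Y, nb_epoch=5, batch_size=32, lr=0.1, seed=0):
    """Train a linear softmax classifier with mini-batch SGD."""
    rng = np.random.RandomState(seed)
    W = np.zeros((X.shape[1], Y.shape[1]))
    history = []
    for epoch in range(nb_epoch):            # one epoch = one pass over the data
        order = rng.permutation(len(X))      # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            probs = softmax(X[idx].dot(W))
            # Gradient of the cross entropy loss with respect to W.
            grad = X[idx].T.dot(probs - Y[idx]) / len(idx)
            W -= lr * grad
        history.append(cross_entropy(softmax(X.dot(W)), Y))
    return W, history


# Synthetic 3-class problem standing in for the image data.
rng = np.random.RandomState(0)
labels = rng.randint(0, 3, size=300)
X = 3.0 * np.eye(3)[labels] + rng.randn(300, 3)
Y = np.eye(3)[labels]

W, history = fit_sgd(X, Y)
accuracy = float(np.mean(softmax(X.dot(W)).argmax(axis=1) == labels))
print(history)
print(accuracy)
```

Only one batch of inputs is held in the forward and backward computation at a time, which is why a larger batch size translates into more memory use during training.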

Saving the Model

Finally, we save our trained model to files so that we can later re-use it:

Keras distinguishes between saving the model architecture (in our case, what is output from make_network()) and the trained weights. The weights are saved in HDF5 format.
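
A minimal sketch of this two-file approach is shown below. It relies only on the `to_json()` and `save_weights()` methods of a Keras model; the helper name `save_model` and the file names in the usage note are assumptions for illustration, and details such as accepted file extensions may differ between Keras versions.

```python
def save_model(model, arch_path, weights_path):
    """Write the architecture as JSON and the trained weights as HDF5."""
    # The architecture is a JSON string describing the layers.
    with open(arch_path, "w") as f:
        f.write(model.to_json())
    # The weights go into a separate HDF5 file.
    model.save_weights(weights_path, overwrite=True)
```

With a trained model in hand, a call such as `save_model(model, "cifar10_cnn.json", "cifar10_cnn.h5")` would produce the two files that a later job needs in order to rebuild the network.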

Note that Keras does not guarantee that the saved model is compatible across different versions of Keras and Theano. We recommend you try to load saved models with the same version of Keras and Theano if possible.

Rescale Training Job

Now that we have explained the contents of the cifar10_cnn.py training script we will be running, we will create a Rescale job to train on a GPU node which is already optimized to run on NVIDIA GPUs. This job is publicly available on Rescale. First, we upload the training script and CIFAR10 dataset:


Here we are uploading the pre-processed version of the CIFAR10 images as downloaded by Keras to avoid re-downloading the dataset from the CIFAR site every time the job is run. This step is optional and we could instead just upload the cifar10_cnn.py script.

Next, we select Keras and specify the command line:


We select Keras from the software picker and then the Theano-backed K520 GPU version of Keras. The command line re-packs the dataset we uploaded and then moves the archive into the default Keras dataset location at ~/.keras/datasets. Then it calls the training script. If we opted to not upload the CIFAR10 set ourselves, we could omit all the archive manipulation commands and just run the training script. The dataset would then be automatically downloaded to the job cluster.
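
The archive manipulation can be sketched in Python as well. This is a stand-in for the shell commands on the job's command line, not a copy of them: the helper name `install_dataset` is hypothetical, and the directory and archive names in the usage note (`cifar-10-batches-py` and `cifar-10-batches-py.tar.gz`) are assumptions about how the uploaded dataset is laid out.

```python
import os
import shutil
import tarfile


def install_dataset(extracted_dir, archive_path, datasets_dir=None):
    """Re-pack an extracted dataset directory and move it where Keras looks for it."""
    if datasets_dir is None:
        datasets_dir = os.path.join(os.path.expanduser("~"), ".keras", "datasets")
    os.makedirs(datasets_dir, exist_ok=True)
    # Pack the directory back into a gzip-compressed tarball.
    top_level = os.path.basename(os.path.normpath(extracted_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(extracted_dir, arcname=top_level)
    # Move the archive into the dataset cache directory.
    target = os.path.join(datasets_dir, os.path.basename(archive_path))
    shutil.move(archive_path, target)
    return target
```

A call such as `install_dataset("cifar-10-batches-py", "cifar-10-batches-py.tar.gz")` before the training script starts would then place the archive under ~/.keras/datasets.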

In the last step, we select the GPU hardware we want to run on:


Here we have selected the Jade core type and the minimum 4 cores for this type. Finally, submit the job.

It will take about 15 minutes to provision the cluster and compile the network before training begins. Once it starts, you can view the progress by selecting process_output.log.


Once the job completes, you can use your trained model files. You can either download them from the job results page or use them in a new job, as we will show now.

Classifying New Images

We used a pre-processed numpy-formatted dataset for our training job, so what do we do if we want to take some real images off the internet and classify those? Since dog and cat are 2 of the 10 classes of images represented in CIFAR10, we pick a dog and cat image off the internet and try to classify them:

[Images: standing-cat, dog-face]

We start by loading and scaling down the images:

We use scipy’s imread to load the JPGs and then resize the images to 32×32 pixels. The resulting image tensor has dimensions of (32, 32, 3) and we want the color dimension to be first instead of last, so we take the transpose. Finally, we combine the list of image tensors into a single tensor and normalize the levels to be between 0 and 1.0, as we did before. After processing, the images are smaller:
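
The array manipulation in this step can be sketched with NumPy alone. To keep the example self-contained, random arrays stand in for the decoded JPGs, and a simple nearest-neighbor index lookup stands in for scipy's image resizing; the helper names `resize_nearest` and `to_model_input` are introduced here for illustration.

```python
import numpy as np


def resize_nearest(img, size=(32, 32)):
    """Nearest-neighbor resize of an (h, w, 3) image, ignoring aspect ratio."""
    h, w = img.shape[:2]
    rows = np.arange(size[0]) * h // size[0]  # source row for each output row
    cols = np.arange(size[1]) * w // size[1]  # source column for each output column
    return img[rows][:, cols]


def to_model_input(images, size=(32, 32)):
    """Resize, move the color axis first, stack, and scale to 0-1.0."""
    resized = [resize_nearest(im, size).transpose(2, 0, 1) for im in images]
    return np.array(resized).astype("float32") / 255.0


# Random stand-ins for two decoded JPGs of different sizes.
rng = np.random.RandomState(0)
cat = rng.randint(0, 256, size=(480, 640, 3)).astype("uint8")
dog = rng.randint(0, 256, size=(300, 200, 3)).astype("uint8")

batch = to_model_input([cat, dog])
print(batch.shape)  # (2, 3, 32, 32)
```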

[Images: standing-cat-small, dog-face-small]

Note that we performed the simplest possible resizing here, which does not even preserve the aspect ratio of the original image. If we had done any normalization on the training images, we would want to apply the same transformations to these images as well.
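
As a hypothetical illustration of that last point, the sketch below computes per-channel mean and standard deviation on a training array and reuses exactly those statistics on new images. No such normalization is performed in this tutorial; the function names and the choice of per-channel standardization are assumptions made for the example, and random arrays stand in for real data.

```python
import numpy as np


def fit_channel_stats(X_train):
    """Per-channel mean and std over an (n, 3, h, w) training array."""
    mean = X_train.mean(axis=(0, 2, 3), keepdims=True)
    std = X_train.std(axis=(0, 2, 3), keepdims=True)
    return mean, std


def apply_channel_stats(X, mean, std, eps=1e-7):
    """Standardize X with statistics computed on the training set."""
    return (X - mean) / (std + eps)


rng = np.random.RandomState(0)
X_train = rng.rand(20, 3, 32, 32).astype("float32")  # stand-in training images
X_new = rng.rand(2, 3, 32, 32).astype("float32")     # stand-in new images

mean, std = fit_channel_stats(X_train)
X_train_norm = apply_channel_stats(X_train, mean, std)
# New images reuse the training statistics; they are never re-estimated.
X_new_norm = apply_channel_stats(X_new, mean, std)
print(mean.shape, X_new_norm.shape)
```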

Loading the Model and Labeling

Assembling the saved model is a two-step process, shown here:

Putting it together, we take the model we loaded and call predict_classes to get the class ordinal values for our 2 images.
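
A minimal sketch of the two loading steps and the labeling call follows. It assumes the architecture and weights were saved as separate JSON and HDF5 files; the helper names are introduced here for illustration. Instead of `predict_classes`, which is not available in every Keras release, the sketch takes the argmax of the predicted probabilities, which gives the same class ordinals. Depending on the Keras version, the model may also need to be compiled before predicting.

```python
import numpy as np


def load_model(arch_path, weights_path):
    """Step 1: rebuild the architecture from JSON. Step 2: load the weights."""
    from keras.models import model_from_json  # imported lazily; requires Keras
    with open(arch_path) as f:
        model = model_from_json(f.read())
    model.load_weights(weights_path)
    return model


def classes_from_probabilities(probs):
    """Pick the most probable class ordinal for each row of softmax output."""
    return np.argmax(probs, axis=-1).tolist()


# Stand-in softmax outputs for two images (each row sums to 1).
example_probs = np.array([
    [0.05, 0.05, 0.05, 0.60, 0.05, 0.10, 0.02, 0.03, 0.03, 0.02],
    [0.02, 0.03, 0.05, 0.20, 0.05, 0.55, 0.02, 0.03, 0.03, 0.02],
])
print(classes_from_probabilities(example_probs))  # [3, 5]
```

With a real model, the stand-in array would be replaced by the output of `model.predict(batch)` on the preprocessed image tensor.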

Rescale Labeling Job

Now we put our labeling script in a job and label our example images. This job is publicly available on Rescale. We start by selecting the trained models we created: choose “Use files from cloud storage” and then select the JSON and HDF5 model files created by the training job:


Then upload our new labeling script dog_cat.py and the dog and cat images.


Select the Keras GPU software and run the labeling script. In this case, the dog and cat images are loaded from the directory the job runs in, so no files need to be moved around.


The labels will then be shown in process_output.log when the job completes.



The output is [3, 5], which corresponds to cat and dog in our image class table above.
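
Turning the ordinals back into names is a simple lookup. The sketch below encodes the class table from earlier in this article as a list; the names `CIFAR10_CLASSES` and `label_names` are introduced here for illustration.

```python
# Index in the list equals the class ordinal from the table above.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def label_names(ordinals):
    """Map class ordinals to their CIFAR10 class names."""
    return [CIFAR10_CLASSES[i] for i in ordinals]


print(label_names([3, 5]))  # ['cat', 'dog']
```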

That wraps up this tutorial. We successfully trained an image recognition convolutional neural network on Rescale and then used that network to label additional images. Coming soon in another post, we will talk about using more complex Rescale workflows to optimize network training.

This article was written by Mark Whitney.