One of the things we touched upon in an earlier blog post is the relative difficulty of setting up a Microsoft MPI cluster on Windows Azure, because the official documentation recommends a full HPC Pack installation. As we discovered, it is possible to manually install and configure a Microsoft MPI cluster without HPC Pack, but the process is not well documented.

Today, we are happy to release a self-contained Cloud Service package that installs and configures an InfiniBand Microsoft MPI cluster for Azure PaaS. We feel that this “MPI-in-a-box” functionality makes it much easier to spin up a cluster for one-off purposes without needing to make the investment in installing and maintaining a full HPC Pack deployment.

To spin up a connected cluster of A9 instances, you can simply download and deploy the pre-built package here and tweak a few settings in the accompanying .cscfg file. The package contains a startup script that installs and configures Microsoft MPI on the nodes. The script handles the details of opening firewall ports for inter-node communication and builds a basic machinefile that can be used in mpiexec calls. In addition, it installs a Cygwin OpenSSH server on every role instance so that you can access the cluster remotely.

You will also need to configure a few values in the .cscfg file:

First, make sure to specify the number of A9 instances to launch in the Instances element. Next, at a minimum, you’ll need to provide values for the adminuser.publickey and jobuser.publickey settings so you can log in to the machines after they boot. The different ConfigurationSettings are listed below:

adminuser: The name of the user that will be created in the Administrators group.
adminuser.publickey: The SSH public key for the adminuser. Added to the ~/.ssh/authorized_keys list.
jobuser: The name of the less-privileged user that will be created in the Users group. This is the user that should run mpiexec.
jobuser.publickey: The SSH public key for the jobuser. Added to the ~/.ssh/authorized_keys list.
blob.storageurl: The startup script downloads programs from this location when booting up. The MS-MPI and Cygwin distributables are located here. Rescale hosts the necessary files, so you shouldn’t need to modify this.
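Putting it together, the relevant portion of a .cscfg might look like the following sketch (the service and role names, instance count, and key values are illustrative placeholders):

```xml
<ServiceConfiguration serviceName="MpiCluster">
  <Role name="MpiNode">
    <!-- Number of A9 instances to launch -->
    <Instances count="4" />
    <ConfigurationSettings>
      <Setting name="adminuser" value="clusteradmin" />
      <Setting name="adminuser.publickey" value="ssh-rsa AAAA... admin@example.com" />
      <Setting name="jobuser" value="mpiuser" />
      <Setting name="jobuser.publickey" value="ssh-rsa AAAA... mpi@example.com" />
      <!-- Leave the Rescale-hosted default unless you host the installers yourself -->
      <Setting name="blob.storageurl" value="..." />
    </ConfigurationSettings>
  </Role>
</ServiceConfiguration>
```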

Once you’ve filled out the values in the .cscfg file, you can deploy the service through the Azure Management web page or script it out with the Management API.

After the instances are up and running, you can use SSH to connect to each of the role instances. The Cloud Service is set up to use instance input endpoints, which allow clients to connect to individual role instances through the load balancer. The OpenSSH server running on port 22 of the first role instance is mapped to external port 10106, the second to 10107, the third to 10108, and so on.
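For reference, this per-instance port forwarding is declared in the service definition (.csdef) with an InstanceInputEndpoint element; a sketch of the endpoint (the endpoint name and port range shown here are illustrative):

```xml
<InstanceInputEndpoint name="SSH" protocol="tcp" localPort="22">
  <AllocatePublicPortFrom>
    <FixedPortRange min="10106" max="10135" />
  </AllocatePublicPortFrom>
</InstanceInputEndpoint>
```

Each role instance is assigned its own public port from the range, in instance order.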

So, if you deployed a cloud service called [cloud-service-name], you would log in to the first role instance in your cluster with a command like:

ssh -i [jobuser-private-key-file] -p 10106 jobuser@[cloud-service-name].cloudapp.net

SCP can be used to transfer files into the cluster (though note that you’ll need to use -P instead of -p to specify the custom SSH port).

The startup script will launch the SMPD process on all of the machines as the user that is specified in the jobuser setting. This means that you will need to make sure to log in as this user in order to run mpiexec.

A machinefile is written out to the jobuser’s home directory, which can be used in the mpiexec call. For example, after SSHing into the first role instance as the jobuser, the following command will dump the hostnames of each machine in the cluster:

$ mpiexec -machinefile machinefile hostname
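The machinefile itself is just a plain list of the cluster’s hosts, one per line (the addresses below are hypothetical):

```
10.32.0.4
10.32.0.5
10.32.0.6
10.32.0.7
```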

Finally, the startup script also configures a basic Windows SMB file share across all the nodes in the cluster. The jobuser can access this folder at the ~/work/shared path. This is an easy way to distribute files among the nodes in the cluster. Note, however, that you will likely see better performance if you use something like AzCopy to have each node download its input files from blob storage instead.

The source code for the Cloud Service is available on Github. Feel free to fork and contribute.

This article was written by Ryan Kaneshiro.

Rescale makes it easy to run a design of experiments, as previously discussed in this post. Here we will provide some quick tips on template generation so your variables are formatted to your liking.

We support two methods for specifying a range of variables:
1. Uploading a comma-separated values (CSV) file, where each row is a case (also known as a “run” in Rescale parlance)
2. Specifying the variable ranges directly in your browser

Once your variables are defined, you can then create templates, where the basic placeholder for a variable, such as “x”, looks like this:

${x}

If you use this syntax and specify your variables in a CSV file, we will replace the placeholder with the value from your CSV without any modification. This can be useful if you would like to include non-numeric data in a file specific to that case. For example, you might include a description of each case as a comment in an input file, which may be more meaningful than the identifier we automatically assign to that case.
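For instance (the variable values and file contents here are hypothetical), suppose your CSV looks like this:

```
x,description
1.5,baseline case
2.0,increased inlet velocity
```

and your input file template looks like this:

```
# ${description}
velocity = ${x}
```

The processed template for the first case would then look like this:

```
# baseline case
velocity = 1.5
```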

You may prefer to have a specific, consistent number format in the file, regardless of how the variable was specified in the CSV or in the browser. For this situation, you can supply an additional format instruction, using a “0” or “#” for a digit; trailing zeros are omitted for digits matched by “#”. Here are some examples:

x         ${x?string("0")}    ${x?string("0.0")}    ${x?string("0.00##")}
0.9       1                   0.9                   0.90
1.49      1                   1.5                   1.49
-55.123   -55                 -55.1                 -55.123
9810      9810                9810.0                9810.00

Notice that when the format string has fewer digits to the right of the decimal point than the value, the value is rounded accordingly. We also support scientific notation:

x         ${x?string("0E0")}    ${x?string("00E00")}    ${x?string("0.0##E0")}
0.9       9E-1                  90E-02                  9.0E-1
1.49      1E0                   15E-01                  1.49E0
-55.123   -6E1                  -55E00                  -5.512E1
9810      1E4                   98E02                   9.81E3
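These format instructions follow java.text.DecimalFormat pattern syntax (which FreeMarker-style ?string formatting is built on), so the rows above can be sanity-checked with a short Java sketch (the class name is ours; a US-English locale is assumed for the decimal separator):

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class FormatDemo {
    public static void main(String[] args) {
        // Pin the symbols to a US locale so the decimal separator is ".".
        DecimalFormatSymbols sym = DecimalFormatSymbols.getInstance(Locale.US);

        System.out.println(new DecimalFormat("0", sym).format(0.9));           // 1
        System.out.println(new DecimalFormat("0.0", sym).format(1.49));        // 1.5
        System.out.println(new DecimalFormat("0.00##", sym).format(0.9));      // 0.90
        System.out.println(new DecimalFormat("0E0", sym).format(9810));        // 1E4
        System.out.println(new DecimalFormat("0.0##E0", sym).format(-55.123)); // -5.512E1
    }
}
```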

We hope this system makes it easier for you to run design of experiments on Rescale.

This article was written by Adam McKenzie.



Computational Fluid Dynamics (CFD) has undergone immense development as a discipline over the past several decades and is used routinely to complement empirical testing in product design across the aircraft, automotive, microelectronics, and several other industries. The vast majority of commercially available fluid flow solvers in use today exploit finite difference, finite volume, or finite element schemes to achieve second-order spatial accuracy. These low-order schemes have become both robust and affordable due to considerable efforts on the part of their original developers, while offering suitable accuracy for many flow problems.

While second-order methods have become widespread throughout both industry and academia, there exist several important flow problems, requiring very low numerical dissipation, for which they are not well suited. Most of these involve vortex-dominated flows or problems in aeroacoustics, and solutions are often intractable without the aid of unstructured high-order methods.

Other situations arise where second-order accuracy may not lead to an acceptable overall solution. For example, a suitable solution error in one variable (e.g., lift or pressure drag) may accompany an unacceptable error in another (e.g., shear stress). In short, there exist several fluid flow problems today where it may be advantageous to use a high-order spatial discretization, which can offer increased accuracy for comparable computational cost.

Analysis Description

To help demonstrate running unstructured high-order simulations across multiple GPU co-processors using Rescale’s cloud-based HPC infrastructure, viscous subsonic flow over a NACA 0012 airfoil is simulated with PyFR [1]. While the simulation of laminar flow over a 2D airfoil is by no means novel, discretizing the computational domain with curved mesh elements and solving via Huynh’s Flux Reconstruction framework [2], together with its extensions to three dimensions and sub-cell shock-capturing schemes [3], represents the current state of the art in CFD.

The governing equations are the compressible Navier-Stokes equations with a constant ratio of specific heats of 1.4 and a Prandtl number of 0.72. The viscosity coefficient is computed via Sutherland’s law. Only a single flow condition is considered here, with M∞ = 0.5 and α = 1°. The Reynolds number, Re = 5,000, is based on the airfoil’s chord length. The NACA 0012 airfoil is defined in Eq. (1) as:

y = ±0.6 (0.2969 √x − 0.1260 x − 0.3516 x² + 0.2843 x³ − 0.1015 x⁴)   (1)

where x ∈ [0, 1]. The airfoil defined by this equation has a finite trailing-edge thickness of 0.252% chord. Various ways exist in the literature to modify this definition so that the trailing edge has zero thickness. Here, one that modifies the x⁴ coefficient is adopted:

y = ±0.6 (0.2969 √x − 0.1260 x − 0.3516 x² + 0.2843 x³ − 0.1036 x⁴)
The airfoil shape is depicted in Fig. (1) below.

Figure 1: NACA 0012 airfoil section
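As a quick numerical check of the two definitions above (the class and method names here are our own), evaluating each polynomial at x = 1 recovers the quoted trailing-edge thickness:

```java
public class Naca0012 {
    // Half-thickness distribution from Eq. (1): the standard NACA 0012
    // definition with a finite (open) trailing edge.
    static double yOpen(double x) {
        return 0.6 * (0.2969 * Math.sqrt(x) - 0.1260 * x
                - 0.3516 * x * x + 0.2843 * x * x * x
                - 0.1015 * x * x * x * x);
    }

    // Same polynomial with the x^4 coefficient changed to 0.1036,
    // which closes the trailing edge.
    static double yClosed(double x) {
        return 0.6 * (0.2969 * Math.sqrt(x) - 0.1260 * x
                - 0.3516 * x * x + 0.2843 * x * x * x
                - 0.1036 * x * x * x * x);
    }

    public static void main(String[] args) {
        // Total (upper + lower surface) trailing-edge thickness, in % chord:
        System.out.printf("open:   %.3f%% chord%n", 200 * yOpen(1.0));             // 0.252% chord
        System.out.printf("closed: %.3f%% chord%n", Math.abs(200 * yClosed(1.0))); // 0.000% chord
    }
}
```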

The farfield boundary conditions are set for subsonic inflow and outflow, and the airfoil surface is set as a no-slip adiabatic wall.

A mesh consisting of 8,960 quadrilateral elements is used to define the fluid domain. Third-order curvilinear elements are generated using Gmsh [4], an open-source three-dimensional finite element meshing package developed by Christophe Geuzaine and Jean-François Remacle. The farfield boundary is a circle centered at the airfoil mid-chord with a radius of more than 1,000 chord lengths, in order to minimize the effect of the farfield boundary on the lift and drag coefficients, as illustrated in Fig. (2).


Figure 2: Computational mesh consisting of 8,960 3rd order curvilinear quad elements.

Simulation Solution

PyFR is one of a select few open-source projects that implement an efficient high-order advection-diffusion framework for solving a range of governing systems on mixed unstructured grids containing various element types. PyFR is under active development by a team of researchers at Imperial College London promoting Huynh’s Flux Reconstruction approach. It leverages CUDA and OpenCL libraries to run on GPU clusters and other streaming architectures in addition to more conventional HPC clusters.

Rescale has recently introduced our GPU “Core Type,” which allows end users to configure their own GPU clusters and run simulations on demand across multiple NVIDIA Tesla co-processor cards. This enables users to decompose large discretized computational domains into smaller sub-domains, each running concurrently on its own dedicated Tesla co-processor.

The discretized computational domain shown in Fig. (2) was decomposed into four parts, and the simulation was run on Rescale using a GPU cluster consisting of two nodes and four NVIDIA Tesla co-processors. Distributing the simulation in this manner is purely illustrative, as it requires only 65 MB of memory and could run entirely on a single GPU. A fourth-order (p4) spatial discretization was advanced in time via an explicit Runge-Kutta scheme for 20 seconds using a time step of 5.0e-05 seconds (i.e., 400,000 total steps).

Figures (3) and (4) show the Mach contours and the pressure coefficient (Cp) distribution around the airfoil surface resulting from the simulation, respectively.



Figure 3: Mach contours for viscous subsonic flow around a NACA 0012 airfoil at t=20 seconds.


Figure 4: Pressure coefficient, Cp for NACA 0012 at α = 1 and t=20 seconds.

The coefficients of lift and drag were computed from the simulation’s results, with corresponding errors of 9.3876e-06 and 5.9600e-08, respectively. Here the error is calculated against a reference solution run using 143,360 quad elements.


It has been shown [5] that the Flux Reconstruction algorithm used in PyFR can recover other well-known high-order schemes. As a result, it provides a unifying approach, or framework, for unstructured high-order CFD, one that is also particularly well suited to GPUs and other streaming architectures. As current advances in CFD become more mainstream, we may see a shift in the types of computing hardware on which these simulations are run.

Rescale has positioned itself at the forefront of these advancements by enabling users to provision their own custom GPU clusters and run a variety of scientific and engineering software tools that leverage these architectures, all with a few clicks in an easy-to-use, web-based interface.

Click here to download a PDF copy of this article. Give PyFR a try and run your own simulation on Rescale today.

[1] Witherden, F. D., Farrington, A. M., and Vincent, P. E. “PyFR: An Open Source Framework for Solving Advection-Diffusion Type Problems on Streaming Architectures using the Flux Reconstruction Approach.” arXiv:1312.1638. Web. 7 May 2014.

[2] Huynh, H. T. “A Reconstruction Approach to High-Order Schemes Including Discontinuous Galerkin for Diffusion.” AIAA Paper 2009-403. Print.

[3] Persson, P.-O., and Peraire, J. “Sub-Cell Shock Capturing for Discontinuous Galerkin Methods.” AIAA Paper 2006-112. Print.

[4] Geuzaine, C., and Remacle, J.-F. “Gmsh: A Three-Dimensional Finite Element Mesh Generator with Built-in Pre- and Post-Processing Facilities.” International Journal for Numerical Methods in Engineering 79 (2009): 1309-1331. Print.

[5] Vincent, P. E., Castonguay, P., and Jameson, A. “A New Class of High-Order Energy Stable Flux Reconstruction Schemes.” Journal of Scientific Computing. Web. 5 Sep. 2010.

This article was written by Rescale.