We have published several blog posts over the years in which we ran MPI microbenchmarks against the offerings from the major public cloud providers. All of these providers have made networking improvements during this time, so we thought it would be useful to rerun these microbenchmarks against the latest generation of VMs. In particular, AWS has released a new version of “Enhanced Networking” that supports up to 20 Gbps, and Azure has released the H-series family of VMs, which offers virtualized FDR InfiniBand.

My colleague Irwen recently ran the point-to-point latency (osu_latency) and bidirectional bandwidth (osu_bibw) tests from the OSU Microbenchmarks library (version 5.3.2) against a number of different VM types from Google Compute Engine. For consistency, we’ll use the same library here with Azure and AWS. The table below includes the best-performing machine from Irwen’s post: the n1-highmem-32. The c4.8xlarge represents an AWS VM type from the previous Enhanced Networking generation, and the m4.32xlarge runs the newer version of Enhanced Networking.

The table lists the results averaged over 3 trials. A new pair of VMs was provisioned from scratch for each trial:

VM type                0-byte latency (us)    1 MB bidirectional bandwidth (MB/s)
GCE (n1-highmem-32)    41.04                  1076
AWS (c4.8xlarge)       37.07                  1176
AWS (m4.32xlarge)      32.43                  1152
Azure (H16r)           2.63                   10807

As you might expect, the Azure H-series VMs seriously outpace the non-InfiniBand-equipped competition in these tests. One of the frequent criticisms leveled against using the public cloud for HPC is that networking performance is not up to the task of running a tightly coupled workload. Microsoft’s Azure has shown that it is possible to run a virtualized high-performance networking fabric at hyperscale.
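To put the gap in concrete terms, the ratios can be worked out directly from the table. The short Python sketch below does only that arithmetic; the figures are copied from the table, and compare_to is just an illustrative helper name.

```python
# Figures copied from the table: (0-byte latency in us, 1 MB bandwidth in MB/s).
results = {
    "GCE (n1-highmem-32)": (41.04, 1076),
    "AWS (c4.8xlarge)": (37.07, 1176),
    "AWS (m4.32xlarge)": (32.43, 1152),
    "Azure (H16r)": (2.63, 10807),
}


def compare_to(baseline, results):
    """Return {vm: (latency_ratio, bandwidth_ratio)} relative to `baseline`.

    latency_ratio   = vm latency / baseline latency
    bandwidth_ratio = baseline bandwidth / vm bandwidth
    """
    base_lat, base_bw = results[baseline]
    return {
        vm: (round(lat / base_lat, 1), round(base_bw / bw, 1))
        for vm, (lat, bw) in results.items()
        if vm != baseline
    }


ratios = compare_to("Azure (H16r)", results)
for vm, (lat_x, bw_x) in ratios.items():
    print(f"{vm}: {lat_x}x the H16r latency; H16r bandwidth is {bw_x}x higher")
```

With these figures, the m4.32xlarge shows roughly 12.3x the latency of the H16r, and the H16r delivers about 9.4x its bandwidth.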

That said, while this is interesting from a raw networking performance perspective, it is important to avoid putting too much stock in synthetic benchmarks like this. Application benchmarks are generally a much better representation of real-world performance. It is certainly possible to achieve strong scaling with some CFD solvers on virtualized 10GigE. AWS has published STAR-CCM+ benchmarks showing close to linear scaling on a 16M cell model on runs of up to 700 MPI processes. Microsoft has also published STAR-CCM+ benchmarks showing close to linear scaling on up to 1,024 MPI processes with an older generation of InfiniBand-equipped VMs (note that this is not an apples-to-apples comparison because Microsoft used a larger 100M cell model in their tests).

It’s also important to highlight that specialized networking fabric typically comes at a higher price point. Additionally, keep in mind that network speed is just one dimension of performance. Disk IO, RAM, CPU core count and generation, as well as the type of simulation and model size, all need to be taken into consideration when deciding which hardware profiles to use. One of the advantages of a multi-cloud platform like Rescale’s ScaleX Platform is that it makes it easy to run benchmarks and, moreover, enterprise HPC workloads across a variety of hardware configurations by simply changing the core type in your job submission request.

Finally, it is impressive to note how far things have come from the original Magellan report. There is a fierce battle going on right now between the public cloud heavyweights and we are starting to see hardware refresh cycles including not only high-performance interconnect but also modern CPU generations (Skylake) as well as GPU and FPGA availability at large scale. The “commodity” public cloud is increasingly viable for a growing number of HPC workloads.

This article was written by Ryan Kaneshiro.

The Rescale platform provides end-to-end file management backed by storage offerings from the major public cloud vendors. This includes optimized client-side transfer tools as well as in-transit and at-rest encryption. In this model, Rescale controls the object store layout and encryption key management. To retrieve the decrypted file content, users must use Rescale tooling.

While this can be convenient if you are starting from scratch and looking for a totally managed, secure solution, one scenario that comes up is how to use the platform with input data that has already been uploaded to the cloud. Another use case that we see is integrating with an existing data pipeline that operates directly on simulation output files sitting in a customer-controlled storage location. For cost and performance reasons, it is important to try to keep your compute as close to the storage as possible. One of the benefits of Rescale’s platform is that we support a number of different cloud providers and can bring the compute to any cloud storage accounts that you might already be using.

In this post, we will show how customers can transfer input and output files to and from a user-specified location instead of using the default Rescale-managed storage. For this example, we’ll focus on Amazon S3; however, a similar approach can be used with any provider. In the following, we will go through the setup of a design of experiments (DOE) job where the input and output files reside in a customer-controlled bucket. Let’s assume that the bucket is called “my-simulation-data”, the input files are all prefixed “input”, and all output files generated by the parameter sweep should be uploaded to a path prefixed by “output”.

This DOE will run the HSDI and Pintle Injector examples for CONVERGE CFD, found on our support page, in parallel. Normally, the DOE framework is used to change specific numerical values within an input file, but here we will use it to select a completely different input zip for each run.

First, upload the two CONVERGE input zips to the s3://my-simulation-data/input/ directory in S3.

Next, create a file locally called inputs.csv that looks like the following:
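As a rough sketch, assuming the run definition file is a plain CSV whose header row names the template variables (s3_input and s3_output) and whose remaining rows each describe one run, the file could be generated like this. The zip names hsdi.zip and pintle.zip are placeholders, not the actual names of the example archives, and build_run_definitions is a hypothetical helper.

```python
import csv
import io

BUCKET = "my-simulation-data"  # bucket name used throughout this post


def build_run_definitions(cases):
    """Return CSV text with a header row and one row per run.

    `cases` maps a run label to the (placeholder) name of its input zip.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["s3_input", "s3_output"])
    for label, zip_name in cases.items():
        writer.writerow([
            f"s3://{BUCKET}/input/{zip_name}",   # where the run reads its input
            f"s3://{BUCKET}/output/{label}/",    # where the run uploads results
        ])
    return buf.getvalue()


# Placeholder zip names; substitute the files you actually uploaded.
csv_text = build_run_definitions({"hsdi": "hsdi.zip", "pintle": "pintle.zip"})
print(csv_text)
```

Saving the printed text as inputs.csv gives one row per run, with each run writing to its own output prefix.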

To give Rescale compute nodes access to the bucket, an IAM policy needs to be created that provides read access to the input directory and full access to the output directory:
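One way such a policy might look is sketched below, built in Python and printed as JSON. This is an assumption about a reasonable policy rather than the exact one used for this job: it allows listing under the two prefixes, read-only access to objects under input/, and all S3 actions on objects under output/. The helper name build_policy is hypothetical.

```python
import json


def build_policy(bucket, input_prefix="input", output_prefix="output"):
    """Build an IAM policy document: read-only on input, full access on output."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing the bucket, limited to the two prefixes.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
                "Condition": {
                    "StringLike": {
                        "s3:prefix": [f"{input_prefix}/*", f"{output_prefix}/*"]
                    }
                },
            },
            {
                # Read-only access to the input objects.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/{input_prefix}/*"],
            },
            {
                # Full object-level access under the output prefix.
                "Effect": "Allow",
                "Action": ["s3:*"],
                "Resource": [f"{bucket_arn}/{output_prefix}/*"],
            },
        ],
    }


policy = build_policy("my-simulation-data")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as a custom policy; tighten the output statement if full access is broader than you need.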

Note that another way to accomplish this is to set up cross-account access. This is a preferable way to configure access if all compute nodes will run in AWS. However, the above approach will work regardless of where the compute nodes are executing.

Now, attach this policy to an IAM user and generate an access key and secret key. This access key and secret key should then be placed into an AWS config file that you save locally:
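A minimal sketch of such a config file follows, assuming the standard AWS CLI format with a single [default] profile. The key values are placeholders, and render_aws_config is a hypothetical helper that simply renders the file text.

```python
import configparser
import io


def render_aws_config(access_key, secret_key, region=None):
    """Render AWS CLI config text with a single [default] profile."""
    # interpolation=None so a '%' in a value is never treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser["default"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }
    if region:
        parser["default"]["region"] = region
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


# Placeholder values; use the key pair generated for the IAM user.
config_text = render_aws_config("YOUR_ACCESS_KEY_ID", "YOUR_SECRET_ACCESS_KEY")
print(config_text)
```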

Save the above to a file called config.

The last file that needs to be created locally is the run script template. We will reference the s3_input and s3_output variables from the inputs.csv created above in a shell script template file that will be executed for each run. Create a template file that looks like:

There are a couple of things to point out in the above script. Normally, the Rescale platform will automatically unarchive zip files; however, in this case we need to handle that ourselves since we are bypassing Rescale storage for our inputs. The rm -rf * at the end of the script deletes all of the output files after uploading them to the user-specified S3 location. If we omit this step, the output files will also be uploaded to Rescale storage after the script exits.
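The sequence of steps described above can be sketched as follows. This is not the actual template: it is a Python helper (run_commands, a hypothetical name) that lists the shell commands one run would execute. It assumes the AWS CLI is available on the compute node, that the uploaded config file sits in the working directory, and that solver_command is whatever command you configured for the job.

```python
import posixpath
import shlex


def run_commands(s3_input, s3_output, solver_command):
    """Return the shell commands for one DOE run, in execution order."""
    zip_name = posixpath.basename(s3_input)
    return [
        # Point the AWS CLI at the uploaded credentials (assumed location).
        'export AWS_CONFIG_FILE="$PWD/config"',
        # Fetch this run's input zip from the customer bucket.
        f"aws s3 cp {shlex.quote(s3_input)} .",
        # Unpack it ourselves, since Rescale storage is bypassed.
        f"unzip {shlex.quote(zip_name)}",
        # Run the simulation (placeholder command).
        solver_command,
        # Upload the working directory, leaving out the credentials file.
        f"aws s3 cp . {shlex.quote(s3_output)} --recursive --exclude config",
        # Remove everything so nothing is uploaded to Rescale storage.
        "rm -rf *",
    ]


cmds = run_commands(
    "s3://my-simulation-data/input/hsdi.zip",   # placeholder zip name
    "s3://my-simulation-data/output/hsdi/",
    "<solver command>",
)
print("\n".join(cmds))
```

Excluding the config file from the upload is a design choice made here so that the access keys do not end up in the output prefix.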

Now that the necessary files have been created locally, we can configure a new DOE job on the platform that references them. From the New Job page, change the Job Type to DOE and configure the job as follows:

  1. Input Files: Upload config
  2. Parallel Settings: Select “Use a run definition file” and upload inputs.csv
  3. Templates: Upload Use as the template name
  4. Software: Select converge 2.3.X and set the command to
  5. Hardware: Onyx, 8 cores per slot, 2 task slots

Submit the job. When the job completes, all of the output files can be found in the s3://my-simulation-data/output/hsdi/ and s3://my-simulation-data/output/pintle/ directories.

In this DOE setup, the ancillary setup data (e.g., the AWS config file, CSV file, and run script template) are encrypted and stored in Rescale-managed storage. The meat of the job, the input and output files, is stored in the user-specified buckets.

We do recognize that the above setup requires a little manual work to get configured. One of the things on our roadmap is to provide better integration with customer-provided storage accounts. Stay tuned for details!

This article was written by Ryan Kaneshiro.

The web is the preferred delivery mechanism for most applications these days but there are scenarios where you might want to build a CLI or desktop application for your customers to use. However, once you leave the cozy confines of the browser there are a whole slew of proxy configurations that your poor application will have to deal with if it needs to run within a typical corporate network.

For the purposes of this post, “typical corporate network” means your users are running some flavor of Windows and are sitting behind an authenticating HTTP proxy. While this does seem like a pretty common setup, a surprising number of applications will simply not work in this environment.

Thankfully, when writing a .NET application, the default settings get you most of the way there for free. The default web proxy will automatically use whatever proxy settings the user has configured in IE. If possible, this is what you should rely on. It is tempting to expose proxy hostname and port configuration values that the user can pass to the application; however, in some cases a corporate user may not have a single well-known proxy to use. WPAD and PAC files allow proxies to be configured dynamically. See this post for more gory details.

Unfortunately, the default settings do not handle authentication for you out of the box. Web requests will typically fail with a 407 ProxyAuthenticationRequired error. The next step is to examine the Proxy-Authenticate response header to see what type of authentication the proxy accepts. Typically this will be some combination of Basic, Digest, NTLM, or Negotiate. If the proxy supports either NTLM or Negotiate, then it is possible to automatically authenticate the signed-in user running your application by simply adding the useDefaultCredentials=true attribute to the defaultProxy element of your app.config as described here:

This is particularly nice because we don’t have to modify any of our application code or deal with the headaches of credential management. Alas, this won’t work if the proxy is configured to use Basic or Digest authentication. While this is an unusual setup, it is something that you will come across in the wild every so often. If this is the case, then you will need a way to read in a username and password and store them in the IWebProxy.Credentials property. As pointed out here, this setup is not typically used because it puts the burden on every application to manage proxy credentials.
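The decision between the two paths can be illustrated with a small, language-neutral sketch (written in Python rather than C#). It assumes each Proxy-Authenticate header line carries a single challenge that starts with the scheme name; a header that packs several comma-separated challenges into one line would need a fuller parser. The function names are hypothetical.

```python
KNOWN_SCHEMES = ("Basic", "Digest", "NTLM", "Negotiate")


def offered_schemes(header_values):
    """Extract the scheme names from a list of Proxy-Authenticate values.

    Each value is expected to start with the scheme, optionally followed
    by parameters, e.g. 'Basic realm="proxy"'.
    """
    schemes = []
    for value in header_values:
        parts = value.strip().split(" ", 1)
        token = parts[0].lower()
        for known in KNOWN_SCHEMES:
            if token == known.lower() and known not in schemes:
                schemes.append(known)
    return schemes


def can_use_default_credentials(header_values):
    """True if the proxy offers NTLM or Negotiate (integrated authentication)."""
    return any(s in ("NTLM", "Negotiate") for s in offered_schemes(header_values))


integrated = ["Negotiate", "NTLM", 'Basic realm="proxy"']
basic_only = ['Basic realm="proxy"', 'Digest realm="proxy", nonce="abc"']
print(offered_schemes(integrated), can_use_default_credentials(integrated))
print(offered_schemes(basic_only), can_use_default_credentials(basic_only))
```

When the second function returns False, the application has to collect a username and password itself.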

In C#, the default proxy settings configured in the app.config are reflected in the WebRequest.DefaultWebProxy static variable. Rather than directly modifying its Credentials, it is cleaner to create a decorator for the proxy that passes through the read requests but manages its own set of credentials without touching the underlying proxy:

Then, you can do something like the following to use the default proxy settings with custom credentials:
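As a language-neutral illustration of the decorator idea (written in Python rather than C#, so this is not the .NET code itself), the sketch below wraps a proxy object, forwards the lookup calls, and keeps a separate set of credentials. The method names loosely mirror the GetProxy, IsBypassed, and Credentials members of IWebProxy; the classes SystemProxy and CredentialedProxy, the proxy URL, and the username and password are all hypothetical.

```python
class SystemProxy:
    """Stand-in for the process-wide default proxy (hypothetical)."""

    def __init__(self, proxy_url, credentials=None):
        self._proxy_url = proxy_url
        self.credentials = credentials

    def get_proxy(self, destination):
        # A real implementation could pick a proxy per destination (PAC/WPAD).
        return self._proxy_url

    def is_bypassed(self, host):
        return host in ("localhost", "127.0.0.1")


class CredentialedProxy:
    """Decorator: forwards lookups to the wrapped proxy, owns its credentials."""

    def __init__(self, inner, credentials=None):
        self._inner = inner
        self.credentials = credentials  # never written back to `inner`

    def get_proxy(self, destination):
        return self._inner.get_proxy(destination)

    def is_bypassed(self, host):
        return self._inner.is_bypassed(host)


default_proxy = SystemProxy("http://proxy.example:8080")
custom = CredentialedProxy(default_proxy, credentials=("alice", "example-password"))

print(custom.get_proxy("https://example.com"))  # same answer as the default proxy
print(default_proxy.credentials)                # untouched by the decorator
```

Because the wrapped object is never modified, dropping the decorator restores the original behavior.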

This lets you easily switch back to the original credentials that were configured in the app.config or use a different set as needed.

Note that while all of this is pretty straightforward for people using .NET, it might not be as easy to support authenticating proxies (particularly ones that only use NTLM and Negotiate) in HTTP libraries used in other languages. In these scenarios, some people have had success using cntlm as a proxy for the authenticating proxy.

TL;DR: For people writing applications in .NET, you should simply set useDefaultCredentials=true in your app.config file and that should “just work” most of the time.

This article was written by Ryan Kaneshiro.


Microsoft’s announcement of Azure Linux RDMA support last year was great news for those looking to run tightly coupled HPC workloads in the cloud.  Unfortunately, there still isn’t a lot of documentation out there describing how to set it up.  This tutorial appears to be the main source of information for configuring Azure Linux RDMA.  However, there are a couple of omissions in there that can trip you up when setting up your cluster for the first time.  In this post, we’ll cover a few gotchas that you might encounter and some workarounds.

First, the tutorial uses the older ASM model for deploying virtual machines. Microsoft recommends that new projects use ARM for deployment. One big reason for switching is that ARM deployments will provision virtual machines in parallel, whereas ASM will deploy them serially. For larger clusters, this can make a big difference in startup time. This is a simple ARM template that can be used as a starting point that will launch a standalone MPI cluster with the recommended vanilla SLES 12 HPC VHD.

After the cluster launches, you will likely want to install some common packages like, say, git.


# zypper install git
Loading repository data...
Reading installed packages...
'git' not found in package names. Trying capabilities.
No provider of 'git' found.
Resolving package dependencies...

Nothing to do.

The reason for this is that the vanilla SLES VHD is missing a bunch of repos out of the box.  You can re-add them by running the following:

# cd /etc/zypp/repos.d
# mv sldp-msft.repo sldp-msft.repo.bak
# rm -f *.repo
# systemctl restart guestregister.service
# mv sldp-msft.repo.bak sldp-msft.repo
# zypper addrepo sldp-msft.repo
# zypper refresh

Now, you should have access to a much wider range of packages to install. As described in the tutorial guide, after you’ve installed any custom packages and also set up Intel MPI, you can capture your custom VHD and use that as the starting point for your MPI clusters instead.

Once you’ve launched a cluster with the custom VHD, you may need to install a VM extension that will update the RDMA drivers.  The tutorial states that you should not update the RDMA driver in the US West, West Europe, and Japan East regions.  However, this appears to be an out-of-date notice, because when we tried running the Intel MPI pingpong test in those regions, we ran into the same DAPL errors that are described here.  After updating the drivers, the pingpong test started working without error.

As far as installing the OSTC Extension goes, there is one small wrinkle that you will need to be aware of: if you ssh into the VM immediately after installing the extension, you will notice that your connection is dropped shortly after logging in.

azureadmin@n1:~> Connection to closed by remote host.
Connection to closed.

The reason for this is that the VM is rebooted about 2-3 minutes after the extension deployment completes.  It would be nicer if the VM was ready for use when the extension installation finishes, but unfortunately that doesn’t seem to be the case here.  This is something that you’ll need to take into account if you are trying to automate the cluster deployment.
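For automated deployments, one possible way to handle this is to wait until the SSH port goes away and then comes back before continuing. The Python sketch below assumes that a successful TCP connection to port 22 is a good enough signal that the VM is up; the function names, polling interval, and limits are illustrative choices, not part of any Azure tooling.

```python
import socket
import time


def ssh_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_reboot(probe, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Wait until probe() reports down at least once and then up again.

    Returns True once the VM is reachable after having gone down,
    or False if that does not happen within max_polls probes.
    """
    seen_down = False
    for _ in range(max_polls):
        if not probe():
            seen_down = True
        elif seen_down:
            return True
        sleep(poll_seconds)
    return False


# Example (hypothetical host name); run right after the extension deploys:
#   wait_for_reboot(lambda: ssh_port_open("n1"))
```

Note that this only detects a reboot that begins after polling starts; if the VM has already restarted, the function will time out and return False, so start polling as soon as the extension deployment completes.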

Hopefully, once Azure Linux RDMA support is added to the Azure Batch service, you won’t have to deal with any of the above. Of course, launching the cluster is just the starting point. You still need to install and tune your simulation software, set up a connection to your license server, and securely transfer your input and output files to and from the cluster. Rescale’s support team is ready to work with you to accomplish this on Azure using our web, API, or CLI tools.

This article was written by Ryan Kaneshiro.