Servers image

We have made a number of blog posts over the years where we have run some MPI microbenchmarks against the offerings from the major public cloud providers. All of these providers have made a number of networking improvements during this time so we thought it would be useful to rerun these microbenchmarks against the latest generation of VMs. In particular, AWS has released a new version of “Enhanced Networking” that supports up to 20Gbps, and Azure has released the H-series family of VMs which offers virtualized FDR InfiniBand.

My colleague Irwen recently ran the point-to-point latency (osu_latency) and bisection bandwidth (osu_bibw) tests from the OSU Microbenchmarks library (version 5.3.2) against a number of different VM types from Google Compute Engine. For consistency, we’ll use the same library here with Azure and AWS.  The table below includes the best performing machine from Irwen’s post: the n1-highmem-32. The c4.8xlarge represents an AWS VM type from the previous Enhanced Networking generation and the newer m4.32xlarge VM is running the newer version of Enhanced Networking.

In the table below, we list the averaged results over 3 trials. A new pair of VMs were  provisioned from scratch for each trial:

0-byte Latency (us) 1MB bisection bandwidth (MB/s)
GCE (n1-highmem-32) 41.04 1076
AWS (c4.8xlarge) 37.07 1176
AWS (m4.32xlarge) 32.43 1152
Azure (H16r) 2.63 10807

As you might expect, the Azure H-series VMs seriously outpace the non-InfiniBand equipped competition in these tests. One of the frequent criticisms levied against using the public cloud for HPC is that networking performance is not up to the task of running a tightly-coupled workload. Microsoft’s Azure has shown that it is possible to run a virtualized high-performance networking fabric at hyperscale.

That said, while this is interesting from a raw networking performance perspective, it is important to avoid putting too much stock into synthetic benchmarks like this. Application benchmarks are generally a much better representation of real-world performance. It is certainly possible to achieve strong scaling with some CFD solvers with virtualized 10GigE. AWS has published STAR-CCM+ benchmarks showing close to linear scaling on a 16M cell model on runs up to 700 MPI processes. Microsoft has also published some STAR-CCM+ benchmarks showing close to linear scaling on up to 1,024 MPI processes with an older generation of InfiniBand equipped VMs (note that this is not an apples-to-apples comparison because Microsoft used a larger 100M cell model in their tests). It’s also important to highlight that specialized networking fabric typically comes at a higher price point. Additionally, keep in mind is that network speed is just one dimension of performance. Disk IO, RAM, CPU core count and generation, as well as the type of simulation and model size all need to be taken into consideration when making a decision about what hardware profiles to use. One of the advantages of using a multi-cloud platform like Rescale’s ScaleX Platform is that it makes it easy to run benchmarks and moreover, enterprise HPC workloads, across a variety of hardware configurations by simply changing the core type in your job submission request.

Finally, it is impressive to note how far things have come from the original Magellan report. There is a fierce battle going on right now between the public cloud heavyweights and we are starting to see hardware refresh cycles including not only high-performance interconnect but also modern CPU generations (Skylake) as well as GPU and FPGA availability at large scale. The “commodity” public cloud is increasingly viable for a growing number of HPC workloads.

This article was written by Ryan Kaneshiro.

As a green, but eager developer at a fast-paced startup, it can feel like staring straight up at El Capitan when tasked with simultaneously learning the code base and writing production-ready maintainable code.  At Rescale, developers are given wide latitude in making software design choices to implement a given feature.  Recently, we implemented the ability to create persistent clusters which can accept multiple jobs.  Choosing the optimal design pattern that is scalable, maintainable, and aligns with the architecture of the application requires a measure of thoughtfulness. I will share my experience in working on this awesome feature.

As a quick recap, a ‘job’ in Rescale parlance is the combination of software, for example Tensorflow for machine learning or BLAST+ for genetic alignment query (amongst numerous other software packages that Rescale offers), and a hardware cluster (e.g., a C4 instance from AWS) and a job-specific command.  Prior to the persistent cluster feature, in the typical “basic” use case, the cluster would shut-down after the job ran its course.  Now with the persistent cluster feature a user can build a cluster according to their exact requirements, run multiple jobs throughout the day, and tear it down when done.   

This feature comprised of new API endpoints, which was done with the Python Django framework, and several view (UI) changes done with the Angular framework.  One of the challenges with the UI was how to reuse the cluster list view (referred to as “clusterList View”), which is encapsulated as an Angular directive (referred to as the “clusterList Directive”) and is shown below.

A corresponding controller and template for this directive completes the encapsulation.  Angular directives provide reusability through encapsulation of the HTML template, view logic, and business logic.  The core responsibility of this directive is to render a list of clusters, and it also includes functionality to delete clusters.  The controller for this view includes functionality for pagination and for navigating to a specific cluster for detailed information.

Below is the clusterList Directive reused in the hardware settings view (which itself is its own encapsulated directive —  referred to as the “hardwareSettings Directive”).  In the hardware settings view, a user can select either a persistent cluster (which a user previously spun up) or create a new cluster (which was the previous workflow).  

This view was composed of the following directives (attributes removed for brevity).

The software design decision centered on how to communicate that a persistent cluster was selected from the clusterList Directive to the controller responsible for the hardwareSettings view.  The hardwareSettings controller includes logic to set the hardware settings on the Job object.  With the addition of the persistent clusters feature, these hardware settings either come from a persistent cluster or from a newly generated cluster.

Publish Subscribe(‘pub-sub’) vs Dependency Injection
With Angular, the $emit and $on functions on $rootscope can be used to create an application wide pub-sub architecture.

The above code will transmit a message with the topic ‘clusterSelected’ to the event bus within $rootscope, along with the cluster object.  Other parts of the application can tap into this event bus by injecting in $rootscope, and subscribing to the topic with $on.

The pub-sub design pattern is an elegant way to communicate across boundaries (in this case, across directive boundaries), and is an often used architecture to communicate across network boundaries.  The ease of implementation in Angular, along with a ‘just make this work’ mindset when working under a timeline, sealed the deal for me as the appropriate design pattern.    

Another design choice I made was to reuse the cluster list controller (“clusterListCtrl”) in the hardware settings view:

With this choice I expanded clusterListCtrl with extra functions to achieve the behavior required in the hardwareSettings View, adding unnecessary bloat to the controller and further coupling the controller to the directive.  I presented my design choices to Alex Kudlick, co-developer on this feature, and a seasoned developer with a penchant for dispensing software development wisdom.  We discussed the merits of pub-sub vs dependency injection.

Dependency Injection
Dependency injection, which is closely related to inversion of control, is a pattern wherein the dependency flow is inverted.  In contrast, with the pub-sub example, a dependency is created by all subscribers listening to the topic ‘clusterSelected’, which is published by the clusterList Directive when a click is detected — the dependency flow is from the source flowing outwards.

With dependency injection a click handler, ‘on-cluster-click’,  is injected, or passed in, as an attribute into the clusterList Directive as shown below.

This attribute is now a property on the isolate scope of the clusterList Directive.  The controller for the hardwareSettings view, includes a function on $scope called toggleClusterSelected which contains logic to handle when a persistent cluster is selected.  When a click is detected by the clusterList Directive, the toggleClusterSelected function is invoked.

The reusability of the clusterList Directive with dependency injection is now illustrated with how the clusterList Directive handles clicks in the clusterList View.

The controller for the clusterList View, includes the function, goToCluster, which is invoked when the click handler is executed in this view.   The above illustrates the use of dependency injection, and by binding different expressions (the toggleClusterSelected, and goToCluster methods), to the click handler, the clusterList Directive can possess different behavior depending on the context in which it is used.  

With the pub-sub architecture the ability to transmit an event, application wide to communicate across boundaries is elegant and simple to accomplish in Angular, — it also satisfies the ideal of loose coupling.  In using $rootScope.$emit messages were transmitted via the $rootScope, the parent scope in an Angular application.  This presented a problem, in that listeners in different parts of the application (there were subscribers for ‘clusterSelected’ in the hardwareSettings controller and also in the clusterList controller), would interfere with each other.  For example while in the hardwareSettings View a click on the cluster list would trigger goToCluster (not desired), and toggleClusterSelected (desired behavior).  However, goToCluster won out, and thus the correct behavior was not achieved.  The work-around for this issue was to use $scope.$emit and $scope.$on which limited messaging to just the relevant scope rather than the application wide $rootScope.  However, potential for side effects and interference still exists when multiple subscribers are acting on the same message.  

Another downside was that components that subscribed to ‘clusterSelected’ message, the hardwareSettings controller and clusterList controller, formed an implicit dependency to the clusterList Directive.  A future developer would need to examine the body of the controllers to determine that there is a dependency to the ‘clusterSelected’ message.  With two subscribers, this may not be a big deal, but as more subscribers are added, this approach can become burdensome to maintain and can lead to brittle code.

With dependency injection, the dependencies are explicit, a developer can either scan the attributes in the directive markup or scan the isolate scope in the directive to get a complete list of dependencies.    

Furthermore, the directive is reusable in different contexts by specifying the appropriate dependencies and expressions bindings.  I learned that as a software engineer time must be allocated to think through the consequences of a given design choice — with an eye towards maintainability and scalability, and ultimately explicit is better than implicit.

This article was written by Nam Dao.

As a long-time GNU Emacs user, I have come to take C-x C-e (eval-last-sexp) for granted. That is, in almost any file that I’m editing, I can insert a parenthesis expression wherever I am, hit C-x C-e, and get a result displayed in the minibuffer (or if programming in a lisp, in the REPL).

Many other languages provide console interfaces or REPLs, but they usually come as separate features of the language, rather than an assumed mode of interaction. I’ll admit that this is a superficial observation, but one could reframe this terms of a “script-centric” or a “REPL-centric” approach to user interaction. Imagine yourself learning a new programming language. A “script-centric” approach would quickly introduce you how to run a hello world program from a terminal (or IDE), whereas a “REPL-centric” approach would quickly introduce you how to drop into a REPL and evaluate the requisite expressions to achieve the hello world effect.

In other words, a “script-centric” approach conceptually separates the editing and evaluation cycles, whereas the “REPL-centric” approach is designed around a single flow of expression, evaluation, and feedback. Each method has its strengths and weaknesses, but for prototyping and experimentation, it’s hard to beat the feeling of “being with the code” from evaluating code right at the cursor. In Emacs, for languages in the lisp family, this is generally the expected method of interaction.

How might we demonstrate this? Well, my colleague Mark wrote a tutorial on setting up a Keras training job back in February. The software has been updated since then, so we’ll create an updated version of that job; but instead of raw python, we’ll use hylang, which is basically a lisp for the Python interpreter.


Assuming you have a recent version of Emacs running (say, version 24 and up) with the Emacs package manager set up, you would want to install hy-mode (M-x package-install RET hy-mode RET), and one of the many lispy paren packages, e.g. paredit, smartparens, lispy.

In our new file, cifar10_cnn.hy, make sure you are on Hy mode. Then use M-x inferior-lisp to start a hy repl! Scroll to the bottom for a video demonstration of Emacs interaction.

1 First let’s take care of the imports 

1.1 Python

1.2 Rewriting this in hy:

51 words to 39. Not bad for succinctness!

2 Then let’s rewrite the constants

2.1 Python

2.2 Hy

Hylang takes inspiration from Clojure, but unlike Clojure, allows multiple variable def-s by default. In this case it’s quite handy.

3 The functions

Note that the input_shape argument ordering has changed since the previous version from (channels, rows, columns) to (rows, columns, channels)

3.1 Python

3.2 Hy

In the code below, we could have written the variable destructuring for TRAIN and TEST with a single step like so:

(let [ [[X-train y-train] [X-test y-test]] (cifar10.load_data) ] )

Notice how using doto here really simplifies repetitive assignment. Since expressions evaluate to the final form inside them, which gets passed through by doto, our model setup in make-network doesn’t even require a variable name.

Hylang also supports the handy threading macros (->) a la clojure, which makes it easy to chain statements together; in this example we just thread the format string through the .format function, then print it.

4 The main runner block

4.1 Python

4.2 Hy

5 A quick video

In this short video you can see how one might load the resulting hylang file into emacs and run it through the hy REPL, evaluating the statement on point (which show up as ^X ^E in the key overlay), and otherwise switching back and forth between the REPL and the edit buffer.

Towards the end, you can see how I missed the numpy import, added the import to the (import ...) form, and re-evaluated the entire import form. I moved the cursor to the end of the form and used eval-last-sexp for evaluate the expression, but could have also used Ctrl Alt x or lisp-eval-defun, which would evaluate the top level expression surround the cursor.

Since load-dataset and make-network take no arguments, it is convenient to wrap the function bodies in a let or doto expression (or form), repeatedly evaluating the block at-point, checking the REPL output, and when satisfied, wrap the let block into a defn form.

6 Running it on the platform

The video stops right after I start the actual model training, because my computer doesn’t have a good GPU, and the training would take a long time. So instead we’ll put it back on the Rescale platform to complete, first installing the hy compiler by running

then simply calling hy cifar10_cnn.hy to replicate Mark’s previous example output.

(note that in this example, we’re running the latest version (version 0.11.0, commit 0abc218) of hylang directly from github). If you installed hy using pip install hy it may be using an older assignment let/with syntax.

A ready-to-run example can be found here. In addition, a tanglable version of this post can be found on github.

7 Pardon the pun

But is it hy time? As you can see, the REPL integration into Emacs has its quirks. Debugging errors from deep stack traces is also a bit more challenging due to less support from mature debugging tools. But if you enjoy working with lisps and want or need to work with the python ecosystem, hy is a functional, powerful, and enjoyable tool to use, for web development, machine learning, or plain old text parsing. Did I mention you can even mix and match hy files and py files?

This article was written by Alex Huang.


Sandi Metz recently wrote an article proclaiming that duplication is cheaper than the wrong abstraction. This article raises valuable points about the costs of speculative generalization, but it’s part of a long line of articles detailing and railing against those costs. By now it should be old hat to hear someone criticize abstraction, and yet the meme persists.

The aspect of Sandi Metz’s article that I’d like to respond to in this post is the mindset it promotes, or at least the mindset that has responded to it the most. This mindset is very common to see in comments – just get the task done, nothing more.  Sometimes that’s appropriate and the right approach to take, but the problem here is that the costs of abstraction, especially when it’s gotten wrong, are obvious, and the costs of the “simplicity first” mindset aren’t as obvious. I won’t talk about the specific costs of duplicated code, as those are already well known. I will talk about the opportunity costs – the missed learning opportunities.

Good developers should be constantly learning, constantly honing their skills. There’s always room to improve. The skill that’s most important for developers to practice is recognizing profitable abstractions, because doing so correctly relies on honed intuition. It takes seeing costs manifest over the long term, and it takes making mistakes. Developers should be constantly evaluating their past decisions and taking risks on new ones.

Opportunity cost is an often overlooked aspect of technical debt. The reason accumulating technical debt is the cheaper choice in the moment is that it takes a path for which the solution is already known. There’s nothing to learn, just implement the hack. That’s fine in small doses, but it forgoes the opportunity to learn things about the codebase, to discover missing abstractions and create conceptual tools that can help solve the problem.

So what the developers in Sandi Metz’s example should have done is noticed that this particular abstraction was costing them more than it was benefiting them. That’s a good thing to notice – it’s a valuable learning experience. What specific aspects of the abstraction were slowing down development? Which parts confused new developers and led them to make it worse? These are questions the developers should have asked themselves in order to learn from the experience.

Our development team has a weekly practice that we call “Tech Talks,” in which a developer talks about something they learned that week, some part of the codebase that was thornier than it should have been, and so on. This practice is invaluable for promoting a growth mindset, and the situation from Mz. Metz’s article would have been a perfect example to bring up.

Developers shouldn’t focus on just cranking out code. Those who limit their attention in such a way aren’t growing and will soon be surpassed by better tools. Instead, we should recognize that the job of a developer is to understand which abstractions will prove valuable for the codebase. The only way to learn that is through experience.

This article was written by Alex Kudlick.