Google has officially thrown its hat into the IaaS cloud computing ring by opening up access to the Google Compute Engine (GCE) service to the general public. One of the differentiating features touted by Google is the performance of its networking infrastructure.

We decided to take the service for a quick spin to see what the interconnect performance was like within the context of the HPC application domain. In particular, we were interested in measuring the latency between two machines in an MPI cluster.

For our test, we spun up two instances, set up an Open MPI cluster, and then ran the osu_latency benchmark from the OSU Micro-Benchmarks test suite to measure the time it takes to send a 0-byte message between nodes in a ping-pong fashion (a minimal sketch of this pattern appears after the results table). The numbers reported below are the one-way latency numbers averaged over 3 trials. A new pair of machines was launched for each trial.

Instance Type    Trial #1    Trial #2    Trial #3    Average
n1-standard-1    183.12      172.57      169.90      175.20
n1-standard-2    192.27      202.51      196.20      196.99
n1-standard-4    169.97      170.96      177.03      172.65
n1-highcpu-2     176.34      210.81      192.04      193.06
n1-highcpu-4     205.00      176.11      159.95      180.35
n1-highmem-2     176.80      177.73      189.72      181.42
n1-highmem-4     173.78      175.94      185.85      178.52

* All latency numbers are measured in microseconds.
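For illustration only, the ping-pong pattern described above can be sketched with mpi4py as follows. This is not the OSU osu_latency benchmark itself, and the iteration count and launch command are assumptions; it simply shows the shape of the measurement.

```python
# Minimal MPI ping-pong latency sketch (illustrative; the numbers above come
# from the compiled OSU osu_latency benchmark, not from this script).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

ITERATIONS = 10000     # assumed repetition count
buf = bytearray(0)     # 0-byte message payload

comm.Barrier()
start = MPI.Wtime()
for _ in range(ITERATIONS):
    if rank == 0:
        comm.Send([buf, MPI.BYTE], dest=1, tag=0)    # ping
        comm.Recv([buf, MPI.BYTE], source=1, tag=0)  # pong
    elif rank == 1:
        comm.Recv([buf, MPI.BYTE], source=0, tag=0)
        comm.Send([buf, MPI.BYTE], dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # One-way latency is half the average round-trip time, reported in microseconds.
    print("one-way latency: %.2f us" % (elapsed / ITERATIONS / 2 * 1e6))
```

Launched with two ranks placed on two different instances (for example, mpirun -np 2 --hostfile hosts python pingpong.py), this reports a number comparable in spirit to the osu_latency result.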

The reported latency numbers are roughly the same for all of the instance types we tested. The variance between tests is likely due to contention from other tenants on the machine. Benchmarking cloud compute instances is a notoriously tricky problem. In the future, we’ll look at running a more exhaustive test across more instances and over different time periods.

As a point of comparison, we see latencies between 70 and 90 microseconds when running the same test with Amazon EC2 instances. It is important to point out that this is not a true apples-to-apples comparison: Amazon offers special cluster compute instance types as well as placement groups, the latter of which allow for better bandwidth and reduced latencies between machines in the same group. The GCE latency numbers appear to be closer to what Edward Walker reported for non-cluster compute instances on EC2. It appears likely that Google is focusing on the more typical workload of hosting web services for now and will eventually turn its focus towards tuning its infrastructure for other domains such as HPC. At the moment, GCE seems better suited for workloads that are more “embarrassingly parallel” in nature.

It should be noted that these types of microbenchmarks do not necessarily represent the performance that will be seen when running real-world applications. We encourage users to perform macro-level, application-specific testing to get a true sense of the expected performance. There are several ways to mitigate latency penalties:

  • For certain classes of simulation problems, it may be possible to decompose models into separate pieces that can then be evaluated in parallel. A shift in thinking is required with the advent of the public cloud: rather than having a single on-premises cluster, it is possible to launch many smaller clusters that operate over the decomposed pieces at the same time (see the sketch after this list).
  • Leverage hybrid OpenMP/MPI applications when possible. Reducing the amount of chattiness between cluster nodes is an excellent way to avoid latency costs altogether.
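To make the first point a little more concrete, here is a minimal sketch of the "many small clusters" idea: each decomposed piece is submitted as an independent job, so the pieces never pay inter-node latency to talk to each other. The submit_piece function is a hypothetical stand-in for whatever call launches a cluster and runs a solver on one piece.

```python
# Sketch of evaluating decomposed model pieces in parallel. `submit_piece` is a
# hypothetical placeholder for an API call that launches a small cluster and
# runs the solver on a single piece of the decomposed model.
from concurrent.futures import ThreadPoolExecutor

def submit_piece(piece_id):
    # Hypothetical: provision a cluster, run the solver on this piece, and
    # return a summary of the result.
    return {"piece": piece_id, "status": "completed"}

pieces = range(8)  # eight independent sub-models produced by the decomposition

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(submit_piece, pieces))

print(results)
```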

We look forward to seeing the continued arms race amongst the various cloud providers, and we expect that HPC performance will continue to improve. As an example, Microsoft has recently announced a new HPC offering for Azure that promises InfiniBand connectivity between instances. As is usually the case, competition between large cloud computing providers is very good for the end customer. At Rescale, we are excited about the opportunities to continue providing our customers with the best possible performance.

This article was written by Ryan Kaneshiro.

[Image: Visualization of air velocity around aircraft landing gear]

What is ‘live tailing’? Why did you build it?

The solvers in modern simulation codes for applications such as CFD, FEA, and molecular dynamics are becoming more sophisticated by the day. While these solvers take advantage of (i) new hardware technologies and (ii) advances in numerical methods, many of them require close monitoring to ensure they converge to a useful and correct solution. It is important to know when a simulation reaches an undesired state so it can be stopped and the problem can be diagnosed.

At Rescale, we heard consistent feedback from our customers that they wanted to track the status of their jobs in real time. In response, we have recently added a powerful new feature to the platform that enables comprehensive monitoring in an easy and efficient way.

We call this feature ‘live tailing’.

Live tailing allows Rescale customers to monitor any output file for jobs running on the cluster with just one click. This feature replaces the painful process of dealing with SSH keys, logging into the cluster, and/or deciphering where the relevant files are located on the server. Rescale’s live tailing is intuitive, user-friendly, highly secure, and much more efficient than traditional monitoring.

How does it work?

Once a customer submits a job, they can go to the Status page, where a list of active runs is displayed. Clicking on one of these runs will display all the files related to that particular job. Customers can scroll through the list or even text-search for a specific file. Clicking on the name of the desired file will display the user-specified number of lines for that particular file.
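As a rough illustration of the basic idea, and not a description of Rescale's actual implementation, showing the last N lines of a growing output file can be sketched like this (the file path and line count are assumptions):

```python
# Illustrative sketch of "show the last n lines of an output file"; not
# Rescale's actual implementation.
import os

def tail(path, n=50, block=4096):
    """Return the last n lines of the file at `path`."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        data = b""
        # Read backwards in fixed-size blocks until at least n lines are buffered.
        while end > 0 and data.count(b"\n") <= n:
            start = max(0, end - block)
            f.seek(start)
            data = f.read(end - start) + data
            end = start
        return [line.decode(errors="replace") for line in data.splitlines()[-n:]]

# Hypothetical usage: print the last 20 lines of a solver log.
for line in tail("solver.log", n=20):
    print(line)
```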

[Screenshot: Live tailing section in relation to the Status page]

Why is it useful?

As engineers, we recognize how important it is to track the status of any analysis at any time. Here are some examples of useful applications for live tailing:

  • Monitor progress of a simulation, either to extrapolate the total expected runtime or to ensure that the simulation doesn’t enter an undesired state.
  • View output plots to quickly analyze important trends and metrics of the simulation.
  • Monitor load balancing for parallelized simulations to diagnose inefficient behavior and to help the customer choose the correct number of processors.
  • Monitor time step conditions such as CFL or adaptive grid conditions to ensure that the simulation doesn’t “blow up.” Simulations that creep along and blow up in time or size can now be stopped quickly (a small sketch of this kind of check appears after this list).
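As one hedged example of what such a check might look like, the snippet below scans tailed log lines for a CFL number above a threshold; the log format and the limit are assumptions, since every solver reports these quantities differently.

```python
# Sketch of an automated check on tailed solver output: flag the run if the
# reported CFL number exceeds a threshold. The "CFL = <value>" log format and
# the limit of 1.0 are assumptions; real solvers each report this differently.
import re

CFL_PATTERN = re.compile(r"CFL\s*=\s*([0-9.eE+-]+)")

def looks_divergent(lines, limit=1.0):
    """Return True if any line reports a CFL number above `limit`."""
    for line in lines:
        match = CFL_PATTERN.search(line)
        if match and float(match.group(1)) > limit:
            return True
    return False

sample = ["step 100: CFL = 0.45", "step 101: CFL = 1.73"]
print(looks_divergent(sample))  # True -> time to stop and diagnose the run
```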

Does live tailing work with image files as well?

Yes. Some simulation codes are able to generate image files such as meshes, graphs, or surface plots. These files can be live tailed as well: clicking on a file that is a JPG, PNG, or GIF will display the image right inside the browser. Check out this aircraft landing gear example using Gerris (http://gfs.sourceforge.net/wiki), an open-source CFD code, with data provided by the AIAA.

[Screenshot: Live tailing displaying analysis-generated images]

How can I try it?

Contact us at support@rescale.com – we can share existing jobs with you so you can see how it works.

This article was written by Mulyanto Poort.