Google has officially thrown its gauntlet into the IaaS cloud computing ring by opening up access to the Google Compute Engine (GCE) service to the general public. One of the differentiating features touted by Google is the performance of its networking infrastructure.

We decided to take the service for a quick spin to see what the interconnect performance was like within the context of the HPC application domain. In particular, we were interested in measuring the latency between two machines in an MPI cluster.

For our test, we spun up two instances, set up an Open MPI cluster, and ran the osu_latency benchmark from the OSU Micro-Benchmarks suite, which measures the time it takes to send a 0-byte message between nodes in a ping-pong fashion. The numbers reported below are one-way latencies averaged over three trials. A new pair of machines was launched for each trial.
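If you want a feel for what this benchmark measures without building the OSU suite, here is a minimal sketch of the same ping-pong pattern using mpi4py (an assumption on our part; our actual measurements used the compiled osu_latency binary). Run with two ranks spread across two machines, it reports the average one-way latency of 0-byte messages.

```python
# Minimal ping-pong latency sketch using mpi4py (not the OSU suite itself).
# Run with, for example: mpirun -np 2 -host node1,node2 python pingpong.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

ITERATIONS = 10000
msg = bytearray(0)   # 0-byte payload, matching osu_latency's smallest message size
buf = bytearray(0)

comm.Barrier()
start = MPI.Wtime()
for _ in range(ITERATIONS):
    if rank == 0:
        comm.Send(msg, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(msg, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # One iteration is a full round trip (two sends), so divide by 2 for one-way latency.
    one_way_us = elapsed / ITERATIONS / 2 * 1e6
    print(f"Average one-way latency: {one_way_us:.2f} us")
```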

Instance Type    Trial #1    Trial #2    Trial #3    Average
n1-standard-1    183.12      172.57      169.90      175.20
n1-standard-2    192.27      202.51      196.20      196.99
n1-standard-4    169.97      170.96      177.03      172.65
n1-highcpu-2     176.34      210.81      192.04      193.06
n1-highcpu-4     205.00      176.11      159.95      180.35
n1-highmem-2     176.80      177.73      189.72      181.42
n1-highmem-4     173.78      175.94      185.85      178.52

All latencies are reported in microseconds.

The reported latency numbers are roughly the same for all of the instance types we tested. The variance between tests is likely due to contention from other tenants on the machine. Benchmarking cloud compute instances is a notoriously tricky problem. In the future, we’ll look at running a more exhaustive test across more instances and over different time periods.

As a point of comparison, we see latencies between 70 and 90 microseconds when running the same test on Amazon EC2 instances. It is important to point out that this is not a true apples-to-apples comparison: Amazon offers special cluster compute instance types as well as placement groups, and the latter allow for better bandwidth and reduced latency between machines in the same group. The GCE latency numbers appear to be closer to what Edward Walker reported for non-cluster-compute instances on EC2. It seems likely that Google is focusing on the more typical workload of hosting web services for now and will eventually turn its attention to tuning its infrastructure for other domains such as HPC. At the moment, GCE appears better suited for workloads that are more “embarrassingly parallel” in nature.

It should be noted that these types of micro benchmarks do not necessarily represent the performance that will be seen when running real-world applications.  We encourage users to perform macro-level, application-specific testing to get a true sense of the expected performance. There are several ways to mitigate latency penalties:

  • For certain classes of simulation problems, it may be possible to decompose models into separate pieces that can then be evaluated in parallel (see the sketch after this list). The public cloud invites a shift in thinking: rather than relying on a single on-premise cluster, it is possible to launch many smaller clusters that operate on the decomposed pieces at the same time.
  • Leverage hybrid OpenMP/MPI applications when possible. Reducing the amount of chattiness between cluster nodes is an excellent way to avoid latency costs altogether.
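To illustrate the first point, here is a minimal, hypothetical sketch of the decompose-and-run-in-parallel pattern using Python’s concurrent.futures; the evaluate_piece function and its parameters are placeholders for whatever solver invocation a real workflow would use.

```python
# Hypothetical sketch of the "decompose and run in parallel" pattern described above.
# evaluate_piece stands in for launching one independent sub-model (e.g. one load
# case or sub-domain) on its own small cluster; it does no real solving here.
from concurrent.futures import ProcessPoolExecutor

def evaluate_piece(piece_id, parameters):
    # ... run the solver for this piece and return a summary metric ...
    return {"piece": piece_id, "params": parameters, "result": None}

def main():
    pieces = [(i, {"load_case": i}) for i in range(8)]  # 8 independent sub-problems
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(evaluate_piece, pid, params) for pid, params in pieces]
        results = [f.result() for f in futures]
    print(f"Completed {len(results)} independent pieces")

if __name__ == "__main__":
    main()
```

Because each piece runs independently, no piece ever waits on a message from another, so interconnect latency stops being the limiting factor.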

We look forward to seeing the continued arms race amongst the various cloud providers, and expect that HPC performance will continue to improve. As an example, Microsoft has recently announced a new HPC offering for Azure that promises InfiniBand connectivity between instances. As is usually the case, competition among large cloud computing providers is very good for the end customer. At Rescale, we are excited about the opportunities to continue providing our customers with the best possible performance.

This article was written by Ryan Kaneshiro.


Visualization of air velocity around aircraft landing gear

What is ‘live tailing’? Why did you build it?

The solvers in modern simulation codes for applications such as CFD, FEA, and molecular dynamics are becoming more sophisticated by the day. Even as they take advantage of new hardware technologies and advances in numerical methods, many of these solvers require close monitoring to ensure they converge to a useful and correct solution. It is important to know when a simulation reaches an undesired state so that it can be stopped and the problem diagnosed.

At Rescale, we heard consistent feedback from our customers that they wanted to track the status of their jobs in real time. In response, we recently added a powerful new feature to the platform that enables comprehensive monitoring in an easy and efficient way.

We call this feature ‘live tailing’.

Live tailing allows Rescale customers to monitor any output file for jobs running on the cluster with just one click. This feature replaces the currently painful process of dealing with SSH keys, logging into the cluster, and/or deciphering where the relevant files are located on the server. Rescale’s live tailing is intuitive, user-friendly, highly secure, and much more efficient than traditional monitoring.

How does it work?

Once a customer submits a job, they can go to the Status page, where a list of active runs is displayed. Clicking on one of these runs displays all the files related to that particular job. Customers can scroll through the list or text-search for a specific file. Clicking on the name of the desired file displays a user-specified number of lines from that file.
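To make the idea concrete, here is a conceptual sketch of what “tailing” a growing solver log means: show the last few lines, then stream new lines as they are appended, much like the Unix tail -f command. This is an illustration only, not Rescale’s implementation, and the file name is a hypothetical example.

```python
# Conceptual illustration of "tailing" a file: print the last N existing lines,
# then keep printing new lines as the solver appends them.
import time
from collections import deque

def tail(path, last_n=20, poll_seconds=1.0):
    with open(path, "r") as f:
        # Show the last N lines that already exist.
        for line in deque(f, maxlen=last_n):
            print(line, end="")
        # Then follow the file as new output arrives.
        while True:
            line = f.readline()
            if line:
                print(line, end="")
            else:
                time.sleep(poll_seconds)

# Example usage (hypothetical file name):
# tail("solver_output.log", last_n=50)
```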


Live tailing section in relation to the Status Page

Why is it useful?

As engineers, we recognize how important it is to track the status of any analysis at any time. Here are some examples of useful applications for live tailing:

  • Monitor the progress of a simulation, either to extrapolate the total expected runtime (see the sketch after this list) or to ensure that the simulation doesn’t enter an undesired state.
  • View output plots to quickly analyze important trends and metrics of the simulation.
  • Monitor load balancing for parallelized simulations to diagnose inefficient behavior and to help the customer choose the correct number of processors.
  • Monitor time step conditions such as CFL or adaptive grid conditions to ensure that the simulation doesn’t “blow up.” Simulations that creep along and blow up in time or size can now be stopped quickly.
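As an illustration of the first point, here is a hypothetical sketch that extrapolates the remaining runtime from an iteration counter in a solver log. The log format shown is an assumption; every solver writes its own progress lines, so the regular expression would need to match your solver’s output.

```python
# Hypothetical sketch: estimate remaining runtime by extrapolating from an
# "Iteration <n> of <total>" counter found while tailing a solver log.
import re

PROGRESS = re.compile(r"Iteration\s+(\d+)\s+of\s+(\d+)")

def estimate_remaining(log_lines, elapsed_seconds):
    current, total = 0, None
    for line in log_lines:
        m = PROGRESS.search(line)
        if m:
            current, total = int(m.group(1)), int(m.group(2))
    if not total or current == 0:
        return None
    seconds_per_iteration = elapsed_seconds / current
    return (total - current) * seconds_per_iteration

# Example with a made-up log line: 250 of 1000 iterations done after 10 minutes.
log = ["Iteration 250 of 1000  residual=3.2e-5"]
remaining = estimate_remaining(log, elapsed_seconds=600.0)
print(f"Estimated time remaining: {remaining / 60:.1f} minutes")  # ~30.0 minutes
```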

Does live tailing work with image files as well?

Yes. Some simulation codes generate image files such as meshes, graphs, or surface plots, and these files can be live-tailed as well. Clicking on a JPG, PNG, or GIF file will display the image right inside the browser. Check out this aircraft landing gear example using Gerris (http://gfs.sourceforge.net/wiki), an open-source CFD code, with data provided by the AIAA.


Live tailing allows displaying analysis-generated images

How can I try it?

Contact us at support@rescale.com – we can share existing jobs with you so you can see how it works.

This article was written by Mulyanto Poort.


Aerospace manufacturers operate at the leading edge of technology in materials, solid mechanics, fluid dynamics, electronics, and several other engineering disciplines.  In an industry where 100% product reliability is expected and achieved on a regular basis, the product development process is at the core of company performance.

Across the aerospace industry, whether in spacecraft, aircraft, jet engines, or other systems, advanced engineering simulations allow engineers to cycle through various design iterations early in the design process, without investing in hardware. However, the complexity of simulation models is quickly increasing. For example, a 3D model of a high-pressure turbine rotor can contain millions of elements for a finite element analysis (FEA). Running these simulations can take hours or days, depending on the speed and age of the available high-performance computing (HPC) hardware. In addition, running design iterations in parallel to fully explore the design space remains extremely difficult. Engineers quickly run into HPC capacity constraints and have to wait in a virtual queue to run their jobs. This can lead to sub-optimal designs as engineers don’t have the tools or the time to fully optimize their designs.

To address these concerns on a new development program, an aerospace OEM turned to Rescale. With Rescale, this customer gained access to a comprehensive suite of simulation software tools, along with the capabilities of a fully secure, large commercial cluster on demand and at a fraction of the cost. Our customer used Rescale to perform multiple-parameter sweeps and designs of experiments (DOEs) to improve the designs of various sub-systems and critical individual components. The customer ran analyses on several models and used commercially available computational fluid dynamics (CFD) solvers.

Computing large-scale parallel simulations in a short time requires hundreds of clustered processors, which translates into significant upfront and maintenance costs for IT infrastructure. As a result, this type of simulation has largely remained out of reach for all but the largest aerospace companies. Using the Rescale platform, the user was able to set up a full parameter sweep in a matter of minutes (a conceptual sketch of such a sweep follows the list below). For a typical job, upon submission by the user, the Rescale platform performed the following steps:

  • Hundreds of processors were dynamically provisioned within five minutes of job submission.
  • The user-chosen solver was invoked across the entire multi-processor cluster, simulating thousands of time steps.
  • Results were delivered to local servers for post-processing and analysis.
  • All computing instances across the cluster were deleted upon completion.
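For illustration, here is a minimal sketch of how a full-factorial parameter sweep can be generated. The parameter names and values are hypothetical examples, not the customer’s actual model; the point is that every combination becomes an independent run that can execute in parallel on dynamically provisioned nodes.

```python
# Minimal sketch of generating a full-factorial parameter sweep (DOE).
# Parameter names and values are hypothetical examples.
from itertools import product

parameters = {
    "inlet_velocity_m_s": [50, 75, 100],
    "blade_angle_deg": [20, 25, 30, 35],
    "tip_clearance_mm": [0.5, 1.0],
}

names = list(parameters)
design_points = [dict(zip(names, values)) for values in product(*parameters.values())]

print(f"{len(design_points)} design points")  # 3 * 4 * 2 = 24 independent runs
for point in design_points[:3]:
    print(point)
```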

Security concerns were fully addressed: Rescale jobs run on SOC 2, ISO 27001, and ITAR-compliant infrastructure, and all customer data was transferred in an end-to-end encrypted environment. Dedicated cluster instances were provisioned to ensure that customer data was never commingled with that of other tenants. Finally, all data was purged upon completion.

In this case, the Rescale customer estimated that running these analyses on Rescale reduced runtime by >95% and saved hundreds of engineer-hours when compared to the alternative of running locally. Additionally, the ability to run much broader simulations revealed critical insights to the engineers that would have remained hidden due to time constraints had they run these jobs on local machines.

To learn more, contact us at sales@rescale.com.

This article was written by Rescale.