Engineers face daily operational inefficiencies that slow their time-to-solution. Every day we work with engineers to solve computing resource limitations and the challenges of managing HPC. Specifically, we excel at using our platform to accelerate HPC engineering simulations. The impact is real: Rescale users have seen time-to-solution accelerate by 23%, allowing engineering teams to be 12% more productive overall.

In this article, we hope to give you exactly what you need to better plan for HPC in 2019.

(Your) 2019 Engineering Objectives: Measurably Improve Engineering Team Productivity

1. Shorten the turnaround time of your engineering services

2. Eliminate engineering hours spent in HPC queues

3. Increase the individual productivity of your engineers

4. Develop best practices for HPC usage by workflow

Some key issues engineers face when developing a product are simulation constraints caused by queue times from a lack of computing resources, limited software availability, architecture diversity, and departmental management overhead. The shortage of these vital resources and tools results in longer development cycles for the products that generate revenue.

1. Shorten the turnaround time of your engineering services

By eliminating queue time and equipping engineers with the best HPC hardware and software, you can optimize your research pipeline and push innovations to market sooner.

The Proof:

Dinex, an automotive exhaust supplier, reduced its time-to-market by 25% using the Rescale platform. With abundant computing resources available through our public cloud partners, you can eliminate queue time by securing resources the moment you need them. This hardware and software diversity lets engineers run simulations that on-premise systems previously could not support, whether because of intolerable queue times or unmet software and hardware demands. The availability of software and computing resources, the ability to innovate in design of experiments, and the elimination of queue time let engineers work more efficiently and deliver products to market faster.

2. Eliminate engineering hours spent in HPC queues

Stop waiting to run your simulations because of limited HPC resources and/or low priority. Empower every engineer with the resources to run simulations immediately using our AWS, Azure, and IBM cloud resources.

The Proof:

Queues for running simulations can halt the research pipeline and waste valuable engineering time. A queue directly delays time-to-solution, which can be critical to the progression of research. Days spent without answers can cost a company millions of dollars in engineer idle time. The ability to secure hardware as needed allows engineers to be agile with their computing resources and break the constraints of a static on-premise HPC system that limits their simulation volume and fidelity. These inefficiencies directly impact the company's objective to bring innovations to market and generate revenue, so the ramifications of research inefficiencies reverberate throughout the organization and beyond. With Rescale, you can run a single simulation on 10,000 cores, or run 10,000 simulations on 10 cores each: the availability of resources means there is no reason not to run a simulation immediately.

3. Increase the individual productivity of your engineers

Remove the constraints of static on-premise HPC systems and engage a dynamic environment with the latest HPC hardware and simulation software. Explore new designs of experiments (DOEs) and optimize your research pipeline to achieve the fastest time-to-solution.

The Proof:

Rescale has over 300 ported and tuned software applications incorporated into our platform, many on a pay-as-you-go model, including ANSYS, Siemens, CONVERGE, and LS-DYNA. Access to diverse, on-demand computing resources lets engineers run the best software on the best hardware, always. This coupling delivers the best results available, quickly, and exposes engineers to software and computing resources that were previously out of reach. Some Rescale customers have seen reductions in time-to-answer as high as 80%. The freedom of architecture choice allows for the exploration of new processes in your design of experiments, which can create faster research pipelines with higher fidelity. Enabling researchers with the best HPC tools produces quicker results and increases productivity.

4. Develop best practices for HPC usage by workflow

Gain real-time insight into your engineers' activities and use that information to optimize your engineering department's operations and finances.

The Proof:

ScaleX Enterprise allows you to fully manage your engineering organization by tracking expenses, allocating resources, and budgeting teams. With control of computing and software resources, budgets, projects, and access, you can fully manage how your engineering teams use cloud computing. In addition, billing summaries and real-time spending dashboards let you monitor your computing expenses. Rescale doesn't just solve engineering inefficiencies; it gives management the insight to innovate their own research pipeline.

Rescale is a turn-key platform that provides access to virtually limitless computing resources and over 300 ported and tuned software applications. With ScaleX Enterprise's management dashboard, engineering departments can fully manage and report on their HPC usage. Rescale has had a significant impact on many of our customers, but to understand the true impact Rescale can have on your organization, it is best to reach out to us. With our confidential tools and industry-leading knowledge, we can quantify the impact of Rescale on your engineering operations.

If you have any questions or interest in seeing how Rescale can improve your engineering department, please reach out to our specialists today.

This article was written by Thomas Helmonds.

Total Cost of Ownership (TCO) is a powerful financial tool for understanding the direct and indirect expenses related to an asset, such as your HPC system. Calculating the TCO for an on-premise HPC system is straightforward: add up all expenses related to the system and its management for the entirety of its deployment. But what happens when you're interested in switching to cloud-enabled HPC? Can you confidently compare a cloud-enabled HPC system's TCO with an on-premise HPC system's TCO?

This question has been addressed by many different institutions.

Our view is simple: TCO is a poor financial tool for evaluating the value of cloud-enabled HPC. Comparing a system with a static environment against a dynamic environment creates an unreliable and misleading analysis. It is an apples to oranges comparison, and using TCO to assess cloud-enabled HPC attempts to make apple juice from oranges.

What is a static environment and how does it apply to my TCO analysis?

A static environment for TCO applies when you have a set expense for a set return. For an on-premise system, you get X amount of computing power for Y dollars. The same relationship holds for most expenses in the cost analysis of an on-premise HPC system until you reach a comprehensive TCO. There are some variable costs involved (fluctuations in software pricing, staffing, energy, unpredicted errors, etc.); however, margins can be used to bound their influence on the TCO. Essentially, you end up with the general TCO analysis of X computing power = Y expenses ± margin of change. This is a great tool for comparing systems with small expense variations and known rewards that create a near-linear relationship. But what happens when the computing power is nearly infinite and the expenses are reactive, as is the case for cloud computing?
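The static analysis above can be sketched in a few lines. This is a hypothetical illustration only: the function name, cost categories, and all dollar figures are assumptions made for the example, not real pricing.

```python
# Illustrative static TCO: fixed capex plus recurring opex, with a
# +/- margin to bound the variable costs (software, staffing, energy).
# All figures below are assumed for the sketch.

def on_premise_tco(hardware, software_per_year, staff_per_year,
                   energy_per_year, years, margin=0.10):
    """Return (low, high) TCO bounds for a fixed deployment period."""
    opex = (software_per_year + staff_per_year + energy_per_year) * years
    base = hardware + opex
    return base * (1 - margin), base * (1 + margin)

low, high = on_premise_tco(
    hardware=1_000_000,        # initial capital expenditure (assumed)
    software_per_year=200_000,
    staff_per_year=300_000,
    energy_per_year=50_000,
    years=3,
)
print(f"3-year TCO: ${low:,.0f} - ${high:,.0f}")
```

The point of the sketch is the shape of the formula: every input maps to a fixed dollar amount, so the output is a narrow band around a known number.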

What is a dynamic environment and how does it apply to my TCO analysis?

A dynamic environment for a TCO analysis is one in which expenses and rewards are not directly correlated, making them difficult to define and compare. In a cloud-enabled HPC system, you pay for computing power when you need it; compared to an on-premise HPC system, little initial capital expenditure is required. In this environment, your HPC expenses become less predictable and more reactive because they are driven by your computing demand. In addition, you are no longer constrained by a set limit of computing power, so your reward varies widely depending on how much you use HPC. This scalability can heavily influence your HPC usage, especially if your current system is inhibiting your peak performance and potential design of experiments (DOE). The rewards of cloud computing raise the question: if you had fewer restrictions on HPC, would you use it differently?

What happens when you use TCO to compare on-premise vs cloud-enabled HPC systems?

TCO is a helpful tool for static environments, but applying the same static tool to a highly dynamic environment is misleading. For example, suppose you want to calculate the TCO of an on-premise HPC system. First, you must predict peak usage and utilization for a system that will be in service for approximately three years. To meet all of an organization's requirements, trade-offs are made between peak usage and maintaining high utilization. Then you must pay the massive initial capital expenditure to purchase all the hardware, software, and staffing required to assemble and operate the system. Total all these expenses and you have the TCO of a system that rewards you with a limited amount of computing power.

Now, try the same analysis for a cloud-enabled HPC system. Most people take the projected peak computing power and average utilization and multiply them by the compute prices of their prospective cloud service provider. Here is the first problem: you are already treating both systems as if their rewards and expenses were equal. With cloud-enabled HPC, you have instant access to the latest hardware and software, which means you are always using the best infrastructure for your applications. In addition, your computing power becomes near-infinite, meaning there is no reason to queue simulations, which increases your productivity. These innovations in the research and design process are essential to getting better products to market before competitors, and the inability to easily scale and upgrade an on-premise HPC system can severely inhibit your ability to compete. These differences in rewards make it hard to quantify how an aging on-premise system's limitations affect the new workflows that could help you outcompete your competition.

When comparing the TCO of HPC solutions, you must account for the rewards each solution provides, because the absence of a reward should appear as an expense in the competitor's TCO. For example, if your cloud computing solution provides zero queue time, better computing performance, and new DOEs, but your on-premise solution does not, then you must calculate the cost of the inefficiencies corresponding to the rewards the on-premise system lacks. That is the only way to level the TCO against the corresponding rewards, but defining exact numbers for each reward proves extremely difficult, making TCO a misleading and inaccurate tool. Comparing the TCO and rewards of cloud-enabled and on-premise HPC systems is pointless because the tool does not address the reality of each system: one is static and requires massive investment for limited computing power, and the other is agile and charges pay-as-you-go prices for virtually limitless computing power.
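To make the "missing reward as an expense" idea concrete, here is a minimal sketch of pricing one such gap, queue-time idling, into an on-premise TCO. Every number and name here is an assumption for illustration; the article's point is precisely that real values for these rewards are hard to pin down.

```python
# Sketch: charge the on-premise system for a reward it lacks (no queues).
# Rates and hours below are assumed, not measured.

ENGINEER_COST_PER_HOUR = 75           # fully loaded hourly rate (assumed)
QUEUE_HOURS_PER_ENGINEER_YEAR = 200   # hours idled in HPC queues per year (assumed)

def leveled_on_premise_tco(base_tco, engineers, years):
    """Add the estimated cost of engineer queue-idle time to the raw TCO."""
    queue_cost = (ENGINEER_COST_PER_HOUR * QUEUE_HOURS_PER_ENGINEER_YEAR
                  * engineers * years)
    return base_tco + queue_cost

leveled = leveled_on_premise_tco(base_tco=2_650_000, engineers=20, years=3)
print(f"Leveled on-premise TCO: ${leveled:,}")
```

Even this single correction shifts the total materially, and it covers only one of the rewards (queue time, performance, new DOEs) that would need to be priced in.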

Determining the financial implications of incorporating cloud-enabled HPC into your HPC strategy can be difficult. Thankfully, Rescale has specialists and confidential tools to help quantify the benefit of cloud-enabled HPC for your organization.

Come talk to us today.

This article was written by Thomas Helmonds.

Google has officially thrown its hat into the IaaS cloud computing ring by opening up access to the Google Compute Engine (GCE) service to the general public. One of the differentiating features touted by Google is the performance of its networking infrastructure.

We decided to take the service for a quick spin to see what the interconnect performance was like within the context of the HPC application domain. In particular, we were interested in measuring the latency between two machines in an MPI cluster.

For our test, we spun up two instances, set up an OpenMPI cluster, and then ran the osu_latency benchmark from the OSU Micro-Benchmarks suite to measure the time it takes to send a 0-byte message between nodes in a ping-pong fashion. The numbers reported below are one-way latencies averaged over 3 trials. A new pair of machines was launched for each trial.

Instance Type Trial #1 Trial #2 Trial #3 Average
n1-standard-1 183.12 172.57 169.90 175.20
n1-standard-2 192.27 202.51 196.20 196.99
n1-standard-4 169.97 170.96 177.03 172.65
n1-highcpu-2 176.34 210.81 192.04 193.06
n1-highcpu-4 205.00 176.11 159.95 180.35
n1-highmem-2 176.80 177.73 189.72 181.42
n1-highmem-4 173.78 175.94 185.85 178.52

*all latency numbers measured in microseconds
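The averages in the table are simple arithmetic means of the three trials; a quick sketch reproduces a few of them from the raw numbers:

```python
# Reproduce the table's "Average" column from the per-trial latencies
# (all values in microseconds, taken directly from the table above).
trials = {
    "n1-standard-1": [183.12, 172.57, 169.90],
    "n1-standard-2": [192.27, 202.51, 196.20],
    "n1-highcpu-4":  [205.00, 176.11, 159.95],
}

averages = {name: sum(xs) / len(xs) for name, xs in trials.items()}
for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
```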

The reported latency numbers are roughly the same for all of the instance types we tested. The variance between tests is likely due to contention from other tenants on the machine. Benchmarking cloud compute instances is a notoriously tricky problem. In the future, we’ll look at running a more exhaustive test across more instances and over different time periods.

As a point of comparison, we see latencies between 70-90 microseconds when running the same test with Amazon EC2 instances. It is important to point out that this is not a true apples-to-apples comparison: Amazon offers special cluster compute instance types as well as placement groups. The latter allows for better bandwidth and reduced latencies between machines in the same group. The GCE latency numbers appear to be closer to what Edward Walker reported for non-cluster compute instances on EC2. It appears likely that Google is focusing on the more typical workload of hosting web services for now and will eventually turn their focus towards tuning their infrastructure for other domains such as HPC. At the moment, it seems like GCE is better suited for workloads that are more “embarrassingly parallel” in nature.

It should be noted that these types of micro benchmarks do not necessarily represent the performance that will be seen when running real-world applications.  We encourage users to perform macro-level, application-specific testing to get a true sense of the expected performance. There are several ways to mitigate latency penalties:

  • For certain classes of simulation problems, it may be possible to decompose models into separate pieces that can then be evaluated in parallel. A shift in thinking is required with the advent of the public cloud. Rather than having a single on-premise cluster, it is possible to launch many smaller clusters that can operate over the decomposed pieces at the same time.
  • Leverage hybrid OpenMP/MPI applications when possible. Reducing the amount of chattiness between cluster nodes is an excellent way to avoid latency costs altogether.

We look forward to seeing the continued arms race amongst the various cloud providers, and expect that HPC performance will continue to improve.  As an example, Microsoft has recently announced a new HPC offering for Azure that promises Infiniband connectivity between instances. As in most cases, competition between large cloud computing providers is very good for the end customer. At Rescale, we are excited about the opportunities to continue providing our customers with the best possible performance.

This article was written by Ryan Kaneshiro.


Visualization of air velocity around aircraft landing gear

What is ‘live tailing’? Why did you build it?

The solvers in modern simulation codes for applications such as CFD, FEA, and molecular dynamics are becoming more sophisticated by the day. While taking advantage of (i) new hardware technologies and (ii) advances in numerical methods, many of these solvers require close monitoring to ensure they converge to a useful and correct solution. It is important to know when a simulation reaches an undesired state so it can be stopped and the problem diagnosed.

At Rescale, we heard consistent feedback from our customers that they wanted to track the status of their jobs in real time. In response, we recently added a powerful new feature to the platform that enables comprehensive monitoring in an easy and efficient way.

We call this feature ‘live tailing’.

Live tailing allows Rescale customers to monitor any output file for jobs running on the cluster with just one click. This feature replaces the painful process of dealing with SSH keys, logging into the cluster, and deciphering where the relevant files are located on the server. Rescale's live tailing is intuitive, user-friendly, highly secure, and much more efficient than traditional monitoring.

How does it work?

Once a customer submits a job, they can go to the Status page, where a list of active runs is displayed. Clicking on one of these runs will display all the files related to that particular job. Customers can scroll through the list or even text-search for a specific file. Clicking on the name of the desired file will display the user-specified number of lines for that particular file.


Live tailing section in relation to the Status Page

Why is it useful?

As engineers, we recognize how important it is to track the status of any analysis at any time. Here are some examples of useful applications for live tailing:

  • Monitor progress of a simulation, either to extrapolate total expected runtime or to ensure that the simulation doesn’t enter a negative state.
  • View output plots to quickly analyze important trends and metrics of the simulation.
  • Monitor load balancing for parallelized simulations to diagnose inefficient behavior and to help the customer choose the correct number of processors.
  • Monitor time step conditions such as CFL or adaptive grid conditions to ensure that the simulation doesn’t “blow up.” Simulations that creep along and blow up in time or size can now be stopped quickly.

Does live tailing work with image files as well?

Yes. Some simulation codes can generate image files such as meshes, graphs, or surface plots. These files can be live tailed as well. Clicking on a JPG, PNG, or GIF file will display the image right in the browser. Check out this aircraft landing gear example using Gerris (http://gfs.sourceforge.net/wiki), an open-source CFD code, with data provided by the AIAA.


Live tailing allows displaying analysis-generated images

How can I try it?

Contact us at support@rescale.com – we can share existing jobs with you so you can see how it works.

This article was written by Mulyanto Poort.