HPC Disruption

In 1991, I joined Cray and had the opportunity to work on the machines Seymour Cray designed. I was working on the operating system and would often have to work alone on it at night, but the excitement of working on such unique systems kept me going. The Cray-1, X-MP, and Y-MP represented a family of machines where a differentiated architecture and design allowed you to solve problems that you just couldn’t solve with a regular computer.

When I joined, Cray was considering building a new type of parallel machine we called MPPs (massively parallel processors). I worked on the design and implementation of the operating system for the Cray T3E, a system with 2048 individual nodes, each with standard CPU chips and memory, connected by a proprietary high-speed interconnect. Ahead of its time, Cray was building what we today call HPC clusters. Besides being a fantastic engineering project, it was the beginning of a disruption: going from proprietary Cray architectures to clusters of nodes with commodity parts.


This article was written by Gabriel Broner.

Why high utilization doesn’t work for TSA and why it doesn’t work for HPC

© CC-BY-SA 2.0 2010 “Security at Denver International Airport” by oddharmonic via Flickr

Executive Summary:

  • In the case of big compute power, the purchase of large capital assets can create an organizational misalignment of incentives that places the needs of the end user last
  • Achieving high utilization rates of on-premise computing is a pyrrhic victory; it creates winners and losers and puts a governor on the pace of innovation
  • Information technology leaders with high utilization rates of on-premise compute should establish a cloud bypass for work to encourage a culture of agility, innovation, and “outside-the-stacks” thinking
  • When calculating total cost of ownership (TCO) of on-premise computing, user experience, workflow cycle times, responsiveness to new requirements, and other factors must be considered

Airport Travelers and HPC Users Have the Same Complaints
While I was standing in line at airport security at LAX recently, travelers behind me began engaging in a familiar sport: wondering if there were better alternatives to the US airport security screening process. As some lines proved to be faster than others, the complaints ranged from line choice to the efficacy of the entire system. I had recently returned from several meetings with future users of cloud computing, and their complaints were similar: wait times, capacity limitations, and perceived unfairness in the system.

High utilization rates of on-premise computing assets are often cited in a cost-based defense of maintaining a pure on-premise strategy for big compute (HPC) workloads. The argument goes like this: the higher the utilization rate of an on-premise system, the more costly it is to lift and shift those workloads to the cloud. This frequently is a result of a total cost of ownership (TCO) study comparing an incomplete set of variables:

The above TCO comparison is woefully incomplete, but the missing pieces aside, even more glaring is the key assumption applied to cloud computing: 100% utilization. The use of the assumption is understandable. Capital investments require financial justification and, depending on their scale, often detailed NPV analysis. Unfortunately, it is difficult to compare a fixed and capitalized expenditure to a variable and operational expenditure for these analyses. Forecasting opex requires detailed logging of compute usage and the assumption that past behavior can predict future requirements. It is simpler to assume 100% utilization of cloud computing and move on. However, the organizational implications of 100% utilization of cloud computing versus 100% utilization of on-premise assets are very different. 100% utilization of a constrained on-premise compute asset implies queue times, a constant reevaluation of internal resource priorities, and slow reaction times to new requirements. 100% utilization of a certain portion of the immense cloud has none of these disadvantages.

This brings us back to our TSA story.

A TSA Nightmare
Imagine that one day the TSA agents at a particular airport received a peculiar directive: the taxpayers are extremely sensitive to the purchase of capitalized assets, so it is now an agency priority to achieve 95% or greater capacity utilization of the newly installed scanners. What would be the consequences?

First, 95% utilization would require passenger processing through the line at all hours of the night, regardless of the fact that airplanes were only leaving and arriving between 6 AM and midnight. Second, 19 out of every 20 passengers who arrived at the security line should expect a queue, regardless of the time they arrived. Third, during peak travel periods, wait times would climb steeply. Fourth, in the long run, to achieve the targets, the TSA agents would be incentivized to shut down additional security lines and laterally transfer “excess” scanners to other airports. Somewhere in the aftermath is the passenger whose needs have been subordinated to the quest for high utilization rates. The psychology of the passenger also changes. The passenger begins planning for long queue times, devoting otherwise productive time to gaming a system with limited predictability.
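The link between utilization and waiting can be illustrated with a textbook single-server queue (M/M/1). This model is an assumption made purely for illustration and is not specified by the scenario; `mm1_metrics` and `service_minutes` are hypothetical names. Under that assumption, the chance that an arrival has to wait equals the utilization, and the mean wait grows in proportion to utilization / (1 - utilization).

```python
def mm1_metrics(utilization, service_minutes=1.0):
    """Return (probability an arrival waits, mean wait in queue) for an M/M/1 queue.

    Assumes random (Poisson) arrivals and a single scanner with
    exponentially distributed service times; real checkpoints differ.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    # An arrival has to queue whenever the single server is busy.
    p_wait = utilization
    # Standard M/M/1 result: Wq = rho / (1 - rho) * mean service time.
    mean_wait = utilization / (1 - utilization) * service_minutes
    return p_wait, mean_wait


for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    p_wait, wait = mm1_metrics(rho)
    print(f"utilization {rho:.0%}: P(wait) = {p_wait:.0%}, "
          f"mean wait = {wait:.1f} service times")
```

In this simplified model, a passenger at 50% utilization waits on average one service time, while at 95% the average wait is about 19 service times, which is why pushing a fixed asset toward full utilization penalizes the people using it.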

In the case of the purchase of a large, fixed-capacity compute system, the misalignment of incentives begins almost immediately after the purchase of the asset. Finance wants to optimize the return on the asset, putting pressure on Information Technology leaders to use the smallest possible asset at the highest levels of utilization for the longest amount of time. Meanwhile, hardware requirements continue to diverge and evolve outside the walls of the company, artificially constraining the company to decisions made years prior, when business conditions were unlikely to resemble those of the present day. The very nature of a fixed asset creates winners and losers as workloads from some portions of the company are prioritized over others. Unlike airline travelers, however, engineers, researchers, and data scientists can be given options to bypass the system.

The cloud has inherent advantages relative to its on-premise counterpart. As a result, cloud big compute has earned its seat at the table in any organization that values agility, fast innovation cycles, and new approaches to problems. On-premise resources are inherently capacity-constrained and over time can place psychological governors on how employees think about finding solutions to problems. For example, an engineer may simply assume she has no other option and over-design a part rather than run a design study to understand sensitivity to key parameters. The cloud is not a panacea for all problems that need big compute. However, Information Technology leaders can do their part to encourage a culture of innovation by merely having a capable cloud strategy.

The cloud is more than TSA PreCheck: it is driving up on the tarmac and getting on the plane.

Learn more about the advantages of moving HPC to the cloud by downloading our free white paper: Motivations and IT Roadmap for Cloud HPC

This article was written by Matt McKee.

The growth of CAE tools has followed the industry through a familiar progression of technologies, starting out on mainframes in the 1980s. Then pre- and post-processing migrated to the desktop in the 1990s, while solving continued on HPC systems, especially for compute-intensive analysis. The 2000s have seen a continuous shift to more solving on the desktop, with bigger, faster, parallel systems being used for the biggest problems.

CAE tools are now being used by mainstream engineers rather than only CAE specialists, so more engineers are able to run larger and more complex models. Smaller models may still be run on the desktop, but HPC systems are still required for larger, more accurate models. Another driver for increased compute power is automated workflows. Parameter sweeps for geometric changes or boundary conditions can mean that 10 or 20+ models may need to be computed. With a matrix of parameters, run counts can get even higher.

Automated workflows enable optimization, with systems searching through a complex design space and often running hundreds of simulation models in the process. Multidisciplinary optimization (MDO) pushes the envelope on this technology, linking multiple analysis tools. HEEDS, modeFRONTIER, and ANSYS DesignXplorer are examples of tools in which runtimes can be drastically reduced by deploying HPC via the Rescale platform.

Water Pump Efficiency Optimization Process
A CAD tool can control many geometric parameters to automatically create new solid models for CFD analysis. With a lot of variables, design spaces can become huge, with even the best MDO search engines still needing to run perhaps 100 different CFD models to find a better design. If each water pump model is 10 million cells, it may converge in 24 hours on a 64-core on-premise HPC system. Run one after another, that’s still 100 days! (Too much for a real-world engineering problem.) There are two ways to accelerate this with the Rescale ScaleX platform:

  1. Increase the core count per job (Running each on 256 cores may reduce runtime per point to 6 hours), or
  2. Run multiple jobs concurrently. Technically you could run all 100 points at once on the Rescale ScaleX platform, but most optimization search engines use the results from previous runs to guide the direction of new runs. Typically only 10 (or fewer) of the 100 points can be run concurrently. Combining both approaches, 10 concurrent jobs at 6 hours each yield 40 runs per day, so the Rescale platform could still condense the timeline from 100 days to 2.5 days!
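The timeline arithmetic above can be checked with a short sketch. It uses only the figures quoted in the text (100 design points, 24 hours per run on 64 cores, 6 hours per run on 256 cores, 10 concurrent jobs). The function name `campaign_days` is hypothetical, and the model assumes jobs run in lockstep batches with no queueing or setup overhead.

```python
import math


def campaign_days(n_points, hours_per_run, concurrent_jobs=1):
    """Wall-clock days to finish n_points runs, concurrent_jobs at a time.

    Simplification: runs proceed in equal-length batches, and each batch
    must finish before the next one starts.
    """
    batches = math.ceil(n_points / concurrent_jobs)
    return batches * hours_per_run / 24


baseline = campaign_days(100, 24)                         # one run at a time on 64 cores
more_cores = campaign_days(100, 6)                        # option 1 alone: 256 cores per job
concurrent = campaign_days(100, 24, concurrent_jobs=10)   # option 2 alone: 10 jobs at once
combined = campaign_days(100, 6, concurrent_jobs=10)      # both options together

print(baseline, more_cores, concurrent, combined)  # prints 100.0 25.0 10.0 2.5
```

Under these assumptions, either option alone shortens the campaign to 25 or 10 days; the 2.5-day figure comes from applying both at once.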

Bicycle Aerodynamics Analysis
Many bicycle magazines from the last two years contain reviews of a series of bicycle brands being compared in a wind tunnel. Companies will also try to emulate the process with CFD.

The manual workflow for this process:

  • Build base CAD model
  • Identify key geometric parameters
  • Apply parametric variables to CAD model
  • Export CAD model
  • Create CFD mesh
  • Solve for aerodynamics
  • Post-process for drag

Then repeat 20 times for different yaw angles. Repetitive and exhausting.

Imagine an automated process where parametric CAD geometry is used to create new geometries, followed by CFD to evaluate the aerodynamic performance of the frame. The numerical computation becomes substantial, and even though the actual man-hours are reduced, the elapsed time is still a problem.

Using the DOE functionality within the Rescale platform, all 20 models above could be run concurrently. It is a simple way for any company, including SMBs, to get instant scalability and power. Instead of a bicycle aero CFD model taking 24 hours on 16 desktop cores, you can use 196 fast cores on the Rescale platform to turn it around in 2 hours. In addition, you can run 20 models at once. The Gantt chart looks very different when the whole virtual wind tunnel CFD test can be completed in 24 hours instead of 20 days!

The need for faster, easily accessible compute power is now greater than ever. How can companies really get the best from CAE tools when they are continuously stymied by limited resources? Project timelines planned using a Gantt chart often show CFD runtimes of days or weeks, a dominant part of the full project timeline.
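The reason a yaw sweep parallelizes so well is that each case is independent of the others. The pattern can be sketched as follows. This is not the Rescale DOE interface: `run_cfd_case` is a hypothetical placeholder for one mesh, solve, and post-process job, and the 0 to 19 degree yaw range is an assumed example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_cfd_case(yaw_deg):
    """Hypothetical stand-in for one mesh + solve + post-process job.

    A real workflow would hand the case to a solver or a cloud platform
    here; this stub only returns a labelled placeholder record.
    """
    return {"yaw_deg": yaw_deg, "status": "submitted"}


# Assumed sweep: 20 yaw angles, 0 to 19 degrees in 1-degree steps.
yaw_angles = list(range(20))

# No case depends on another, so all 20 can be in flight at once.
with ThreadPoolExecutor(max_workers=len(yaw_angles)) as pool:
    results = list(pool.map(run_cfd_case, yaw_angles))

print(len(results), "cases dispatched")
```

Because `pool.map` returns results in input order, each record can be matched back to its yaw angle when drag values are later collected.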

This has been the accepted norm for many years. But now we can show you how the Rescale cloud platform can remove those roadblocks and condense project timelines!

Level the Playing Field! 
Many engineering companies or consultants are cautious about accepting work where a large investment in computer hardware is required. Now, with Rescale’s ScaleX cloud HPC platform, any company can deploy adequate resources for the most complex CAE calculations. This is a market-changing dynamic. In the past, the most successful engineering companies would deploy large on-premise HPC systems. Much of their value to their clients was the availability of those systems. Now any group can deploy large HPC resources on-demand using the Rescale platform!

Sign up for a free trial.

This article was written by Adam Green.

By Jerry Gutierrez, Global HPC Solution Leader, Bluemix Infrastructure (SoftLayer), IBM Cloud & Tyler Smith, Head of Partnerships, Rescale

IBM Bluemix offers customers all over the world an impressive array of leading high-performance computing (HPC) infrastructure, including bare metal servers and the latest GPUs, on an hourly basis. But as HPC technology gets more advanced, it takes more knowledge to leverage effectively, and we recognize that many times the people who need HPC aren’t experts in HPC; they’re experts in something else. They’re data scientists, automotive and aerospace manufacturers, and bioinformaticians. With challenging, high-stakes problems to solve, the intricacies of HPC implementation are just a hurdle to overcome on their quests.

Rescale’s web-based platform for running HPC on the cloud delivers the performance these engineers and scientists need in a turnkey, user-friendly experience. Rescale’s ScaleX platform gives users control over their data, hardware, and software while automating the complex tasks of HPC configuration and optimization. Running on IBM Bluemix infrastructure, Rescale makes sophisticated compute capabilities accessible to the users who need them most.

Here are just a few ways that Rescale simplifies HPC for all those trailblazers, helping put IBM Bluemix’s world-class HPC infrastructure network to work curing cancer, finding life on Mars, and predicting the future:

  1. Automated cluster configuration
    Building out your HPC cluster each time you want to run a job can be time-consuming and complicated, especially for the HPC novice. Doing hours of legwork just to get your computations started can undermine the value of being able to burst to the cloud exactly when you need to. Rescale automates cluster configuration, making the user’s task as simple and quick as choosing their hardware and cluster size and clicking “Submit.” It just takes a few clicks to spin up a cluster.
  2. Broad portfolio of pre-tuned and optimized software applications
    When a customer comes to IBM Bluemix, they install their own software based on their specific needs. Then they run benchmarks and tune their software to run on IBM Bluemix infrastructure. Again, this takes wizard-level HPC knowledge to do, and if you skip this step your performance will degrade or your problem might not even be solvable in the cloud. Rescale has a team of HPC, CAE, and deep learning experts who have automated and productized the tuning and optimization process for a broad portfolio of software applications. That saves a lot of time for our shared customers, and ensures they effortlessly get the maximum performance out of our hardware. Plus, Rescale offers hourly on-demand licenses, cloud license-hosting, and license proxy tooling to simplify the tangle of cloud licensing models that software users must navigate if they want to leverage the cloud.
  3. Cloud management features
    Using the cloud to infinitely scale out your computations is game-changing, but it presents its own set of challenges at enterprise scale. Rescale has an administrative portal for IT teams to manage budgets and permissions for large, multi-disciplinary teams and projects. It also integrates easily into existing private infrastructure hardware and schedulers for seamless hybrid cloud deployment. Collaboration features allow users to share jobs with their colleagues in real time, without having to download or transfer large files. These features and others give enterprise employees the raw power of HPC while ensuring the organization is productive, cost-effective, and secure.

In short, Rescale’s synergies with IBM Bluemix open the doors to world-class HPC, increase utilization of our infrastructure, and make Rescale a valued partner. We’ve got some big stuff in the pipeline for 2017 and beyond, and we’re excited to bring it to market with Rescale at our side.

To learn more, watch Rescale’s Head of Partnerships, Tyler Smith, present on “How to Leverage New HPC and AI Capabilities Via the IBM Cloud” at IBM InterConnect on Thursday, March 23, 2017 at 11:30 am PDT in Mandalay Bay North, Level 0, South Pacific D.

See this post on the IBM Bluemix blog.

This article was written by Tyler Smith.