Why high utilization doesn’t work for the TSA, and why it doesn’t work for HPC

© CC-BY-SA 2.0 2010 “Security at Denver International Airport” by oddharmonic via Flickr

Executive Summary:

  • In the case of big compute, the purchase of large capital assets can create an organizational misalignment of incentives that places the needs of the end user last
  • Achieving high utilization rates of on-premise computing is a Pyrrhic victory; it creates winners and losers and puts a governor on the pace of innovation
  • Information technology leaders with high utilization rates of on-premise compute should establish a cloud bypass for work to encourage a culture of agility, innovation, and “outside-the-stacks” thinking
  • When calculating total cost of ownership (TCO) of on-premise computing, user experience, workflow cycle times, responsiveness to new requirements, and other factors must be considered

Airport Travelers and HPC Users Have the Same Complaints
While I was standing in the airport security line at LAX recently, the travelers behind me began engaging in a familiar sport: wondering whether there were better alternatives to the US airport security screening process. As some lines proved faster than others, the complaints ranged from line choice to the efficacy of the entire system. Having recently returned from several meetings with future users of cloud computing, I noticed the complaints were similar: wait times, capacity limitations, and perceived unfairness in the system.

High utilization rates of on-premise computing assets are often cited in a cost-based defense of maintaining a pure on-premise strategy for big compute (HPC) workloads. The argument goes like this: the higher the utilization rate of an on-premise system, the more costly it is to lift and shift those workloads to the cloud. This argument is frequently the result of a total cost of ownership (TCO) study that compares an incomplete set of variables.

Such a TCO comparison is woefully incomplete, but the missing pieces aside, even more apparent is the key assumption these studies make about cloud computing: 100% utilization. The use of the assumption is understandable. Capital investments require financial justification and, depending on their scale, often a detailed NPV analysis. Unfortunately, it is difficult to compare a fixed, capitalized expenditure to a variable, operational expenditure in these analyses. Forecasting opex requires detailed logging of compute usage and the assumption that past behavior can predict future requirements. For simplicity, it is easier to assume 100% utilization of cloud computing and move on. However, the organizational implications of 100% utilization of cloud computing versus 100% utilization of on-premise assets are very different. 100% utilization of a constrained on-premise compute asset implies queue times, a constant reevaluation of internal resource priorities, and slow reaction times to new requirements. 100% utilization of some portion of the immense cloud has none of these disadvantages.
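To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is hypothetical (cluster size, prices, lifetime), chosen only to show the shape of the comparison: the effective cost per useful core-hour of a fixed asset depends entirely on the utilization actually achieved, while cloud cost accrues only for the hours consumed.

```python
# Illustrative sketch only: all figures below are hypothetical, not from a real TCO study.

def on_prem_cost_per_core_hour(capex, annual_opex, years, cores, utilization):
    """Effective cost per *useful* core-hour of a fixed on-premise system."""
    total_cost = capex + annual_opex * years
    available_core_hours = cores * 24 * 365 * years
    return total_cost / (available_core_hours * utilization)

# Hypothetical 1,000-core cluster: $3M capex, $300k/year opex, 4-year life.
for utilization in (1.0, 0.7, 0.4):
    cost = on_prem_cost_per_core_hour(3_000_000, 300_000, 4, 1_000, utilization)
    print(f"on-premise @ {utilization:.0%} utilization: ${cost:.3f} per useful core-hour")

# A cloud on-demand rate (also hypothetical) is paid only per core-hour consumed,
# so its effective cost does not depend on keeping a cluster busy.
print("cloud on-demand (assumed rate): $0.050 per core-hour consumed")
```

The point is not the specific numbers: the on-premise figure only looks attractive when high utilization is assumed, and achieving that utilization is exactly what creates the queues described next.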

This brings us back to our TSA story.

A TSA Nightmare
Imagine that one day the TSA agents at a particular airport receive a peculiar directive: taxpayers are extremely sensitive to the purchase of capitalized assets, so it is now an agency priority to achieve 95% or greater capacity utilization of the newly installed scanners. What would the consequences be?

First, 95% utilization would require processing passengers through the line at all hours of the night, even though airplanes only arrive and depart between 6AM and midnight. Second, 19 out of every 20 passengers arriving at the security line should expect a queue, regardless of when they arrive. Third, during peak travel periods, wait times would grow dramatically. Fourth, in the long run, to hit the targets, the TSA agents would be incentivized to shut down additional security lines and laterally transfer “excess” scanners to other airports. Somewhere in the aftermath is the passenger, whose needs have been subordinated to the quest for high utilization rates. The psychology of the passenger changes, too: the passenger begins planning for long queue times, devoting otherwise productive time to gaming a system with limited predictability.
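This is not just intuition; basic queueing theory gives the same answer. A minimal sketch using the textbook M/M/1 waiting-time formula, with a hypothetical scanner throughput, shows how average waits explode as utilization approaches 100%:

```python
# Average wait in a single-server M/M/1 queue: Wq = rho / (mu * (1 - rho)),
# where mu is the service rate and rho is utilization. Numbers are hypothetical.

def average_wait_minutes(service_rate_per_hour, utilization):
    wq_hours = utilization / (service_rate_per_hour * (1 - utilization))
    return wq_hours * 60

SCANNER_THROUGHPUT = 120  # hypothetical passengers per hour through one scanner

for rho in (0.50, 0.80, 0.95, 0.99):
    wait = average_wait_minutes(SCANNER_THROUGHPUT, rho)
    print(f"utilization {rho:.0%}: average wait ~{wait:.1f} minutes")
```

The same curve applies to jobs waiting on a fully booked on-premise cluster: the last few points of utilization are purchased with disproportionately long queues.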

In the case of the purchase of a large, fixed-capacity compute system, the misalignment of incentives begins almost immediately after the purchase of the asset. Finance wants to optimize the return on the asset, putting pressure on Information Technology leaders to use the smallest possible asset at the highest level of utilization for the longest amount of time. Meanwhile, hardware requirements continue to diverge and evolve outside the walls of the company, artificially constraining the company to decisions made years earlier, when business conditions were unlikely to resemble those of the present day. The very nature of a fixed asset creates winners and losers as workloads from some portions of the company are prioritized over others. Unlike airline travelers, however, engineers, researchers, and data scientists can be given options to bypass the system.

The cloud has inherent advantages relative to its on-premise counterpart. As a result, cloud big compute has earned its seat at the table in any organization that values agility, fast innovation cycles, and new approaches to problems. On-premise resources are inherently capacity-constrained and, over time, can place psychological governors on how employees think about finding solutions to problems. For example, an engineer may simply assume she has no other option and over-design a part rather than run a design study to understand sensitivity to key parameters. The cloud is not a panacea for every problem that needs big compute. However, Information Technology leaders can do their part to encourage a culture of innovation simply by having a capable cloud strategy.

The cloud is more than TSA PreCheck; it is driving up on the tarmac and getting on the plane.

Learn more about the advantages of moving HPC to the cloud by downloading our free white paper: Motivations and IT Roadmap for Cloud HPC

This article was written by Matt McKee.

The growth of CAE tools has followed the industry through a familiar progression of technologies, starting out on mainframes in the 1980s. Pre- and post-processing then migrated to the desktop in the 1990s, while solving continued on HPC systems, especially for compute-intensive analysis. The 2000s have seen a continuous shift toward more solving on the desktop, with bigger, faster, parallel systems being used for the biggest problems.

CAE tools are now used by mainstream engineers rather than only CAE specialists, so more engineers are able to run larger and more complex models. Smaller models may still be run on the desktop, but HPC systems are still required for larger, more accurate models. Another driver of increased compute power is automated workflows: parameter sweeps over geometric changes or boundary conditions can mean that 10, 20, or more models need to be computed, and with a full matrix of parameters, run counts climb even higher, as the sketch below illustrates.
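As a rough illustration of how quickly a parameter matrix multiplies the run count, here is a minimal Python sketch; the parameter names and values are purely hypothetical:

```python
from itertools import product

# Hypothetical parameter matrix for a sweep; names and values are illustrative only.
parameters = {
    "inlet_velocity_m_s": [5, 10, 15, 20],
    "blade_angle_deg": [20, 25, 30],
    "housing_clearance_mm": [0.5, 1.0, 1.5],
}

# Every combination of values is one simulation run in a full-factorial sweep.
runs = list(product(*parameters.values()))
print(f"full-factorial sweep: {len(runs)} simulation runs")  # 4 x 3 x 3 = 36
```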

Automated workflows enable optimization, with systems searching through a complex design space, often running hundreds of simulation models in the process. Multi-Disciplinary Optimization (MDO) pushes the envelope on this technology by linking multiple analysis tools. HEEDS, modeFRONTIER, and ANSYS DesignXplorer are examples of tools whose runtimes can be drastically reduced by deploying HPC via the Rescale platform.

Water Pump Efficiency Optimization Process
A CAD tool can control many geometric parameters to automatically create new solid models for CFD analysis. With many variables, the design space becomes huge, and even the best MDO search engines may still need to run perhaps 100 different CFD models to find a better design. If each water pump model is 10 million cells, it may take 24 hours to converge on a 64-core on-premise HPC system. That’s still 100 days! (Too much for a real-world engineering problem.) There are two ways to accelerate this with the Rescale ScaleX platform (the timeline arithmetic is sketched after the list below):

  1. Increase the core count per job (running each point on 256 cores may reduce the runtime per point to 6 hours), or
  2. Run multiple jobs concurrently. Technically you could run all 100 points at once on the Rescale ScaleX platform, but most optimization search engines use the results from previous runs to guide the direction of new runs, so typically only 10 (or fewer) of the 100 points can run concurrently. At 40 runs per day, the Rescale platform could still condense the timeline from 100 days to 2.5 days!
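Here is a minimal sketch of that timeline arithmetic, using only the figures quoted in the scenario above (treat them as illustrative assumptions rather than benchmark data):

```python
# Timeline arithmetic for the water pump optimization scenario above.
RUNS = 100                     # CFD design points the optimizer needs
HOURS_PER_RUN_ON_PREM = 24     # per run on the 64-core on-premise system
HOURS_PER_RUN_256_CORES = 6    # assumed per-run time at the higher core count
CONCURRENT_JOBS = 10           # points the search engine can evaluate in parallel

sequential_days = RUNS * HOURS_PER_RUN_ON_PREM / 24
accelerated_days = RUNS * HOURS_PER_RUN_256_CORES / (CONCURRENT_JOBS * 24)

print(f"sequential, on-premise: {sequential_days:.0f} days")           # 100 days
print(f"256 cores x 10 concurrent jobs: {accelerated_days:.1f} days")  # 2.5 days
```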

Bicycle Aerodynamics Analysis
Many bicycle magazines from the last two years contain reviews comparing a series of bicycle brands in a wind tunnel. Bicycle companies also try to emulate the process with CFD.

The manual workflow for this process:

  • Build base CAD model
  • Identify key geometric parameters
  • Apply parametric variables to CAD model
  • Export CAD model
  • Create CFD mesh
  • Solve for aerodynamics
  • Post-process for drag

Then repeat 20 times for different yaw angles. Repetitive and exhausting.

Imagine an automated process in which parametric CAD geometry is used to create new geometries, followed by CFD to evaluate the aerodynamic performance of the frame. The numerical computation becomes substantial, and even though the actual man-hours are reduced, the elapsed time is still a problem.
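In sketch form, the automated sweep might look like the loop below. The helper functions are hypothetical placeholders for the CAD export, meshing, solving, and post-processing steps; they do not correspond to any specific vendor API:

```python
# Hypothetical automation sketch: each helper stands in for a real CAD, meshing,
# solver, or post-processing tool and returns dummy data to keep this runnable.

def regenerate_geometry(yaw_angle_deg):
    return f"frame_yaw_{yaw_angle_deg}.step"   # parametric CAD export

def run_cfd_for_drag(geometry_file):
    # In practice: mesh the geometry, solve the aerodynamics, post-process for drag.
    return 0.0

yaw_angles = range(0, 20)   # the 20 yaw angles from the manual workflow above

drag_by_yaw = {}
for yaw in yaw_angles:
    geometry = regenerate_geometry(yaw)
    drag_by_yaw[yaw] = run_cfd_for_drag(geometry)

print(drag_by_yaw)
```

Run as a plain loop, the sweep still takes roughly 20 times the wall-clock time of a single solve; submitted as a design of experiments (DOE), all 20 points can execute at once, which is exactly the concurrency described next.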

Using the DOE functionality within the Rescale platform, all 20 models above could be run concurrently. It is a simple way for any company, including SMBs, to get instant scalability and power. Instead of a bicycle aero CFD model taking 24 hours on 16 desktop cores, you can use 196 fast cores on the Rescale platform to turn it around in 2 hours, and you can run all 20 models at once. The Gantt chart looks very different when the whole virtual wind tunnel CFD test can be completed in 24 hours instead of 20 days!

The need for faster, easily accessible compute power is now greater than ever. How can companies really get the best from CAE tools when they are continuously stymied by limited resources? Project timelines planned with a Gantt chart often show CFD runtimes of days or weeks, a dominant part of the full project timeline.

This has been the accepted norm for many years. But now we can show you how the Rescale cloud platform can remove those roadblocks and condense project timelines!

Level the Playing Field! 
Many engineering companies and consultants are cautious about accepting work that requires a large investment in computer hardware. Now, with Rescale’s ScaleX cloud HPC platform, any company can deploy adequate resources for the most complex CAE calculations. This is a market-changing dynamic. In the past, the most successful engineering companies deployed large on-premise HPC systems, and much of their value to clients was the availability of those systems. Now any group can deploy large HPC resources on-demand using the Rescale platform!

Sign up for a free trial.

This article was written by Adam Green.