In this Big Compute podcast episode, host Gabriel Broner interviews Mike Woodacre, HPE Fellow, to discuss the shift from CPUs to an emerging diversity of architectures. Hear about the evolution of CPUs, the advent of GPUs with increasing data parallelism, memory driven computing, and the potential benefits of a cloud environment with access to multiple architectures.


Overview and Key Comments

Emerging Architectures

After many years of CPU architectures doubling in performance every eighteen months, we have reached a plateau due to the constraints of miniaturization and power. This has led to a new era of diversity, with systems that combine CPUs, GPUs, FPGAs, and other accelerators. Woodacre welcomes the new diversity after years of consolidation and commoditization:

“It is a really exciting time in the industry. We are going back to the variety we used to have. Because of volumes and commoditization, we gravitated to a few variants. As we started hitting the limits of our ability to scale, we are seeing a new era of specialization. At a time when data continues to grow exponentially, there are new business reasons to develop unique architectures. The challenge is to pick the right tool for the job.”

The Evolution of CPUs

CPUs have continued to evolve over the last few generations, increasing both their core counts and their memory bandwidth.

“The biggest change has been the continually growing number of cores. The challenge is how you feed them. You need more memory bandwidth and I/O.”

“When you analyze HPC applications, pretty soon memory bandwidth is the limiting factor. Skylake can go from 4 cores to 28 cores, but in some benchmarks you may top out at 12 cores because of memory bandwidth.”
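To make that concrete, here is a minimal back-of-the-envelope sketch of the effect Woodacre describes. The bandwidth and per-core figures are purely illustrative assumptions, not measured Skylake specifications:

```python
# A minimal, illustrative model (assumed numbers, not vendor specs) of why a
# memory-bound code can "top out" well below the socket's core count.

socket_bandwidth_gbs = 120.0   # assumed deliverable memory bandwidth per socket, GB/s
per_core_demand_gbs = 10.0     # assumed bandwidth one busy core consumes, GB/s

# Once aggregate demand exceeds what the memory system can deliver,
# extra cores mostly wait on memory instead of computing.
saturating_cores = socket_bandwidth_gbs / per_core_demand_gbs

for cores in (4, 8, 12, 16, 28):
    effective_speedup = min(cores, saturating_cores)
    print(f"{cores:2d} cores -> ~{effective_speedup:.0f}x single-core throughput")
```

With these assumed numbers, scaling flattens around 12 cores even though the socket has 28, which is the pattern the benchmarks in the quote show.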

GPUs

GPUs, which began as graphics processing units, are, according to Woodacre, “the ultimate data parallel device,” and they have found success in AI, machine learning, and deep learning:

“Deep learning at its core is a matrix multiplication, and you can take advantage of the multiple multiplication units to perform training in hours, not weeks.”
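As a rough illustration of that point, the sketch below (with arbitrary toy dimensions) shows that the forward pass of a dense layer is a single matrix multiplication over an entire batch, which is exactly the kind of data-parallel work GPUs accelerate:

```python
import numpy as np

# The core of a dense (fully connected) layer is one matrix multiplication:
# every example in the batch is processed by the same weights in parallel.

batch, inputs, outputs = 64, 1024, 512                     # arbitrary toy sizes
x = np.random.randn(batch, inputs).astype(np.float32)      # input activations
w = np.random.randn(inputs, outputs).astype(np.float32)    # learned weights

y = x @ w   # (64, 1024) @ (1024, 512) -> (64, 512), one batched matrix multiply
print(y.shape)
```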

Memory Driven Architectures

Emerging memory-driven architectures hold all data in memory; they can accelerate insights and make us rethink workflows:

“Data continues to grow exponentially and traditional architectures cannot cope. By having data in memory you remove the storage bottlenecks and speed up the analysis work. You can take the current pipeline of preprocessing, simulation, analysis, and by keeping data in memory achieve tremendous speedups.”

Cloud as an Enabler

Cloud helps enable this new diversity of architectures. As each application and workflow benefits from a different architecture, cloud allows different applications to run on different architectures, effectively using the right tool for the job.

“Security of the cloud used to be a concern, but it’s now pretty much in the background. The public cloud provides an entry point to explore all these architectural options. If the software applications and licensing are provided, that helps people with the on-ramp. In the end you have to decide what makes the most business sense for you.”

Mike Woodacre

Mike Woodacre is a Fellow at HPE. Over the years Mike has been a systems architect and Chief Engineer at MIPS, SGI and HPE, where he set direction for hardware architectures.

Gabriel Broner

Gabriel Broner is VP & GM of HPC at Rescale. Prior to joining Rescale in July 2017, Gabriel spent 25 years in the industry as OS architect at Cray, GM at Microsoft, head of innovation at Ericsson, and VP & GM of HPC at SGI/HPE.

This article was written by Gabriel Broner.

The options for High Performance Computing (HPC) systems can be overwhelming, given the different expenses and benefits associated with each. The systems currently available fall into the following categories: on-premise, cloud-enabled (full or hybrid), and bare-metal cloud.

Depending on your current and future HPC and organizational demands, each system offers benefits and limitations that need to be defined and compared. One of the main comparisons between systems is usually the Total Cost of Ownership (TCO). As I mentioned in a previous blog post, TCO is not exactly a good fit for making buying decisions between fundamentally dissimilar alternatives. The TCO of on-premise HPC systems has been discussed for more than 30 years, including by our VP of Sales in his post “The Real Cost of High Performance Computing.” For people who are considering buying on-premise HPC systems, there are hidden expenses that are often overlooked when calculating the TCO of an on-premise HPC system.

In this post, I intend to break down the TCO of an on-premise system and expose some expenses that may be overlooked.

A quick review of TCO

The broad definition of an on-premise HPC system’s TCO is the sum of all direct and indirect expenses associated with your prospective system. The more obvious expenses are hardware, software, staffing, and power. For hardware, you need servers, wiring, ToR switches, aggregation switches, server racks, power distribution units, and so on. Then you must buy the software that coordinates communication between nodes to solve complex problems, along with licenses for the applications you plan to use. A resource that can be extremely variable and hard to estimate is the staffing required to develop, deploy, and maintain the on-premise HPC system. Finally, on-premise HPC systems require a lot of power and cooling capacity, so it is essential to calculate your energy consumption and how it will affect your operational expenses. Take the sum of the expenses above and you have the basic TCO for your on-premise HPC system; however, there are some hidden costs that can heavily affect that figure.
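For illustration, here is a minimal sketch of that summation. Every figure and line item is a placeholder assumption that you would replace with your own quotes, rates, and deployment lifetime:

```python
# A minimal sketch of the "sum of direct and indirect expenses" view of TCO.
# All figures are placeholders; substitute your own quotes and rates.

lifetime_years = 4

capex = {
    "servers": 1_200_000,
    "network (ToR + aggregation switches, wiring)": 150_000,
    "racks and power distribution units": 60_000,
}

annual_opex = {
    "software licenses": 250_000,
    "staffing (ops / IT)": 300_000,
    "power and cooling": 180_000,
    "facility / hosting": 90_000,
}

tco = sum(capex.values()) + lifetime_years * sum(annual_opex.values())
print(f"Basic {lifetime_years}-year TCO: ${tco:,.0f}")
```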

Real-world, Hidden Costs

#1 The facilities hosting your HPC systems have cost dependencies that reach further than they appear at first glance. Ensuring your facility has the proper cooling and power provisions to support the current system and its potential growth can save a lot of expense down the road. Power is a major cost and can heavily affect your overall operating expenses. Depending on cluster location and utilization, your power costs can vary greatly, and highly variable power prices in some regions will shape how you operate your HPC system to minimize expenses. In some cases, power can exceed a third of your operating expenses. Facilities and energy are important to consider when calculating your TCO and, for a large facility, should be a primary concern.
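As a rough, hypothetical example of how quickly power adds up (every input below is an assumption, not a measurement):

```python
# Rough annual energy cost estimate for a cluster (all inputs are assumptions).

avg_it_load_kw = 200      # average power drawn by the IT equipment, kW
pue = 1.5                 # facility overhead multiplier (cooling, distribution)
price_per_kwh = 0.12      # local electricity rate, $/kWh
hours_per_year = 24 * 365

annual_energy_cost = avg_it_load_kw * pue * hours_per_year * price_per_kwh
print(f"Annual power + cooling: ${annual_energy_cost:,.0f}")

# Share of a hypothetical $1M/year operating budget:
annual_opex = 1_000_000
print(f"Power share of opex: {annual_energy_cost / annual_opex:.0%}")
```

With these assumed inputs, power and cooling land at roughly $315,000 per year, close to a third of the hypothetical operating budget.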

#2 Staffing will cost more and vary more than you think, with performance and uptime suffering if it is neglected. One of the most variable and elusive expenses to pin down is the staffing for on-premise HPC systems. It can be very difficult to find, hire, and train good operations and IT managers who can handle the development, deployment, and maintenance of an HPC system. Designing an HPC system requires expensive specialists to match the best hardware and software to your computing demands. The procurement of the system alone can cost as much as 5% of the total HPC system and take at least six months. During this time, you must continue paying specialists to assemble the cluster while receiving no return from the HPC system. Once deployed, the system requires very specific IT staffing to ensure its maintenance and operation. These employees need specialized skills to test and protect your HPC system’s longevity and performance. Finding the right employees to perform these functions can be cumbersome and costly, but it is a priority when considering an on-premise HPC system.

#3 Underutilization costs more than just the idle time; the associated overhead is substantial as well. An idle HPC system not only lowers your ROI but can have devastating impacts on your product development cycle. Back-up systems can be overlooked because they are not considered necessary for an operating HPC system; however, the consequences of not having them can be dire. Generators, switches, fuel, and maintenance of your backup energy system are all necessary to ensure that your systems are protected from power outages. Comparable to back-up energy provisions, back-up hardware is extremely important for avoiding an idle HPC system. Spare hardware is important to have on hand in case there is an issue; without it, your system can sit idle while a part is repaired or replaced. If you fail to plan, you should plan to fail; this is especially true for running an on-premise HPC system.

#4 Finally, on-premise technology is a constant uphill (and usually losing) battle. This is the harm caused by not using the best technology and having to spend enormous effort and capital racing to keep up. When comparing HPC systems, you have to acknowledge the costs and rewards, and their effect on each other. Not using the best technology creates expenses that stem from forfeiting the rewards the best system would provide. The expenses associated with not using the best HPC solution include lost productivity, missed innovation, longer time-to-solution, technology refresh costs, IT risk management, and increased IT debt and commitment. The most harmful forfeited reward is efficiency in the research pipeline; losing it creates a plethora of expenses tied to longer time-to-market, delayed innovation, and increased researcher idle time. Falling behind on HPC technology can have lasting consequences, such as being unable to tackle larger problems and missing innovations, which can leave your organization uncompetitive. These expenses are often difficult to calculate because you have to estimate how much more efficient your team would be with a better HPC solution and then work backwards to the cost of that inefficiency.

In summary, finding the true TCO of an on-premise HPC system can prove very difficult once you consider all the hidden costs: staffing, facilities, power consumption, backup provisions, and forfeited rewards. I argue that one of the most important expenses to consider when comparing HPC systems is the cost of forfeited rewards; however, these prove to be the most difficult to calculate and predict. The topic of TCO comparisons between cloud-enabled and on-premise HPC systems has been discussed regularly and is still not clearly settled. It is a comparison that we are working to improve, so if you have any comments or questions on this blog post or on TCO, we would love to hear what you think.

Sara Jeanes. (2017, June 19). Cloud vs. Datacenter Costs for High Performance Computing (HPC): A Real World Example. Retrieved from: https://www.internet2.edu/blogs/detail/14114

Tony Spagnuolo. (2015, January). The Real Cost of High Performance Computing. Retrieved from: https://blog.rescale.com/the-real-cost-of-high-performance-computing/

Wolfgang Gentzsch. (2016, March 6). A Total Cost Analysis for Manufacturers of In-house Computing Resources and Cloud Computing. Retrieved from: https://community.theubercloud.com/wp-content/uploads/2016/04/TCO-Study-UberCloud.pdf

This article was written by Thomas Helmonds.

Engineers face many daily operational inefficiencies that inhibit their time-to-solution. Every day we work with engineers to provide solutions to computing resource limitations and the management of HPC. Specifically, we excel at using our platform to accelerate HPC engineering simulations. The impact is real: Rescale users have seen time-to-solution accelerate by 23%, allowing engineering teams to be 12% more productive overall.

In this article, we hope to give you exactly what you need to better plan for HPC in 2019.

(Your) 2019 Engineering Objectives: Measurably Improve Engineering Team Productivity

1. Shorten the turnaround time of your engineering services

2. Eliminate engineering hours spent in HPC queues

3. Increase the individual productivity of your engineers

4. Develop best practices for HPC usage by workflow

Some key issues engineers face when developing a product are simulation constraints caused by queue times from a lack of computing resources, limited software availability, limited architecture diversity, and departmental management overhead. The shortage of these vital resources and tools results in longer development cycles for the products that generate revenue.

1. Shorten the turnaround time of your engineering services

By eliminating queue time and enabling engineers with the best HPC hardware and software, you can optimize your research pipeline and push innovations to market sooner.

The Proof:

Dinex, an automotive exhaust supplier, saw a 25% reduction in time-to-market by using the Rescale platform. With abundant computing resources available through our public cloud partners, you gain the ability to eliminate queue time by securing resources the moment you need them. The breadth of hardware and software also lets engineers run simulations that were previously impractical on on-premise systems (whether because of intolerable queue times or software and hardware resource demands). The availability of software and computing resources, the ability to innovate in designs of experiments, and the elimination of queue time allow engineers to be more efficient and deliver products to market faster.

2. Eliminate engineering hours spent in HPC queues

Stop waiting to run your simulations because of limited HPC resources and/or low priority. Empower every engineer with the resources to run simulations immediately using our AWS, Azure, and IBM cloud resources.

The Proof:

Queues for running simulations can halt the research pipeline and waste valuable engineering time. A queue directly delays time-to-solution, which can be critical to the progression of research. The days spent without answers can cost a company millions of dollars in engineer idle time. The ability to secure hardware as needed lets engineers be agile with their computing resources and break the constraints of a static on-premise HPC system that limit simulation volume and fidelity. These inefficiencies directly impact the company’s objective to bring innovations to market and generate revenue, so the ramifications of research inefficiencies reverberate throughout the organization and beyond. With Rescale, you can run a single simulation on 10,000 cores or run 10,000 simulations on 10 cores each: the availability of resources means there is no reason not to run a simulation immediately.

3. Increase the individual productivity of your engineers

Remove the constraints of static on-premise HPC systems and engage a dynamic environment with the latest HPC hardware and simulation software. Explore new designs of experiments (DOEs) and optimize your research pipeline to achieve the fastest time-to-solution.

The Proof:

Rescale has over 300 ported and tuned software applications incorporated into our platform, many on a pay-as-you-use model, including ANSYS, Siemens, CONVERGE, and LS-DYNA. Access to this deep, diverse pool of computing resources allows engineers to use the best software on the best hardware, always. Coupling the best software and hardware gives engineers the best results available, quickly. In addition, engineers are exposed to software and computing resources that were previously unavailable to them. Some Rescale customers have seen reductions in time-to-answer as high as 80%. The freedom to choose architectures allows for exploration of new processes in your design of experiments, which can create faster research pipelines with higher fidelity. Enabling researchers with the best HPC tools produces quicker results and increases productivity.

4. Develop best practices for HPC usage by workflow

Gain real-time insight into your engineers’ activities and use that information to optimize your engineering department’s operations and finances.

The Proof:

ScaleX Enterprise allows you to fully manage your engineering organization by tracking expenses, allocating resources, and budgeting teams. With control of computing and software resources, budgets, projects, and access, you can fully manage how your engineering teams use cloud computing. In addition, billing summaries and real-time spending dashboards let you monitor your computing expenses. Rescale doesn’t just provide a solution to engineering inefficiencies; it gives management the insight to innovate in their own research pipeline.

Rescale is a turnkey platform that provides access to virtually limitless computing resources and over 300 ported and tuned software applications. With ScaleX Enterprise’s management dashboard, engineering departments can fully manage and report on their HPC usage. Rescale has had a significant impact on many of our customers, but to understand the true impact Rescale can have on your organization, it is best to reach out to us. With our confidential tools and industry-leading knowledge, we can quantify the impact of Rescale on your engineering operations.

If you have any questions or interest in seeing how Rescale can improve your engineering department, please reach out to our specialists today.

This article was written by Thomas Helmonds.

Total Cost of Ownership (TCO) is a powerful financial tool that allows you to understand the direct and indirect expenses related to an asset, such as your HPC system. Calculating the TCO for an on-premise HPC system is straightforward: add up all expenses related to your system and its management over the entirety of its deployment. But what happens when you’re interested in switching to a cloud-enabled HPC environment? Can you confidently compare the cloud-enabled HPC system’s TCO with an on-premise HPC system’s TCO?

This question has been addressed by many different institutions.

Our view is simple: TCO is a poor financial tool for evaluating the value of cloud-enabled HPC. Comparing a system with a static environment against a dynamic environment creates an unreliable and misleading analysis. It is an apples to oranges comparison, and using TCO to assess cloud-enabled HPC attempts to make apple juice from oranges.

What is a static environment and how does it apply to my TCO analysis?

A static environment for TCO applies when you pay a set expense for a set return. For an on-premise system, you get X amount of computing power for Y amount of dollars. This same relationship holds for most expenses in the cost analysis of an on-premise HPC system, until you reach a comprehensive TCO. There are some variable costs involved (fluctuations in software pricing, staffing, energy, unpredicted errors, etc.); however, margins can be used to bound their influence on the TCO. Essentially, you end up with the general TCO analysis of X computing power = Y expenses ± a margin of change. This is a great tool for comparing systems with little expense variation and known rewards, which create a near-linear relationship. But what happens when the computing power is nearly infinite and the expenses are reactive, as is the case for cloud computing?
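Here is a minimal sketch of that static model, using illustrative numbers only; note how the effective cost per core-hour also swings with how much of the fixed capacity you actually use:

```python
# Sketch of the static model: a fixed capacity is bought up front, so TCO is
# roughly a fixed sum plus a margin for variable costs. All numbers are illustrative.

fixed_capacity_core_hours = 1000 * 24 * 365 * 4   # 1,000 cores available for 4 years
base_tco = 4_000_000                              # capex plus predictable opex
variable_margin = 0.10                            # allowance for energy/staffing drift

tco_low = base_tco * (1 - variable_margin)
tco_high = base_tco * (1 + variable_margin)
print(f"TCO range: ${tco_low:,.0f} - ${tco_high:,.0f}")

# The effective cost of each core-hour you actually consume depends on utilization:
for utilization in (0.4, 0.7, 0.9):
    used = fixed_capacity_core_hours * utilization
    print(f"{utilization:.0%} utilization -> ${base_tco / used:.3f} per used core-hour")
```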

What is a dynamic environment and how does it apply to my TCO analysis?

A dynamic environment for a TCO analysis is one where the expenses and rewards are not directly correlated, making them difficult to define and compare. In a cloud-enabled HPC system, you pay for computing power when you need it; there is little initial capital expenditure required, compared to on-premise HPC systems. In this environment, your HPC expenses become less predictable and more reactive because they are generated by your computing demand. In addition, you are no longer constrained by a set capacity or architecture of computing resources, so your reward varies widely with how you use HPC. This scalability and agility can heavily influence your HPC usage, especially if your current system is limiting your simulation throughput and potential design of experiments (DOE). The rewards of cloud computing raise the question: if you had fewer restrictions on HPC, would you use it differently?
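By contrast, a sketch of the pay-as-you-go pattern, again with assumed rates and usage:

```python
# Sketch of the dynamic, pay-as-you-go model: spend tracks consumption rather
# than a provisioned peak. Rates and usage below are illustrative assumptions.

rate_per_core_hour = 0.05                                # assumed blended price, $/core-hour
monthly_core_hours = [20_000, 5_000, 150_000, 40_000]    # bursty, demand-driven usage

monthly_cost = [hours * rate_per_core_hour for hours in monthly_core_hours]
for hours, cost in zip(monthly_core_hours, monthly_cost):
    print(f"{hours:>7,} core-hours -> ${cost:,.0f}")

# There is no idle capacity to pay for: a month with zero usage costs roughly zero.
print(f"Total: ${sum(monthly_cost):,.0f}")
```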

What happens when you use TCO to compare on-premise vs cloud-enabled HPC systems?

TCO is a tool that works for static environments, but when you take the same static tool and apply it to a highly dynamic environment, it is misleading. For example, suppose you want to calculate the TCO of an on-premise HPC system. First, you must predict your peak usage and utilization for a system that will be in service for approximately 3-5 years. To satisfy all of an organization’s requirements, trade-offs are made between peak capacity and the cost of obsolescence. Then you must pay the massive initial capital expenditure to purchase all the hardware, software, and staffing required to assemble and operate the system. Calculate all these expenses and you arrive at the TCO of a system that gives you a fixed, limited amount of computing resources.

Now try to apply the same analysis to a cloud-enabled HPC system. Most people take the projected peak computing power and average utilization and multiply them by the price of compute at their prospective cloud service provider. This is the first problem: you’re already treating both systems as if their rewards and expenses were equal. With a cloud-enabled HPC system, you have instant access to the latest hardware and applications, which means you are always using the best infrastructure for your workflow. With cloud resources, your computing power becomes near-infinite, meaning there is no reason to queue simulations, which increases your productivity. Limitless and diverse computing resources allow for innovations in the research and design process that are essential to getting better products to market before competitors. The inability to easily scale and upgrade an on-premise HPC system can severely inhibit your ability to compete. These differences in rewards make it hard to quantify the cost of an aging on-premise HPC system’s effect on the new workflows that could help you outpace your competition.

When comparing the TCO of HPC solutions, you must acknowledge the rewards provided by each solution, because the lack of a reward should be reflected as an expense in the competing solution’s TCO. For example, if your cloud computing solution provides zero queue time, better computing performance, and new DOEs, but your on-premise solution does not, then you must calculate the inefficiency expenses that correspond to the rewards the on-premise system forfeits. That is the only way to level the TCO against the corresponding rewards, but it proves extremely difficult to put exact numbers on each reward, which makes TCO a misleading and inaccurate tool. Comparing the TCO and rewards of cloud-enabled and on-premise HPC systems head to head misses the reality of each system: one is static and requires massive investment to deliver limited computing power, and the other is agile and charges pay-as-you-go prices for virtually limitless computing power.
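For what it is worth, here is a minimal sketch of that leveling step; every line item and dollar figure below is a hypothetical assumption, which is precisely why the resulting comparison is hard to trust:

```python
# Sketch of "leveling" two TCOs by pricing in forfeited rewards (all figures assumed):
# if one option lacks a benefit the other provides, add the estimated cost of that
# gap to the option that lacks it before comparing.

onprem_tco = 5_000_000
cloud_tco = 4_200_000    # hypothetical pay-as-you-go spend over the same period
years = 4

# Estimated annual cost of benefits the on-premise system forfeits:
forfeited_rewards_per_year = {
    "engineer idle time in queues": 180_000,
    "delayed time-to-market": 300_000,
    "technology refresh / obsolescence": 120_000,
}

leveled_onprem = onprem_tco + years * sum(forfeited_rewards_per_year.values())
print(f"On-premise TCO, leveled for forfeited rewards: ${leveled_onprem:,.0f}")
print(f"Cloud-enabled TCO: ${cloud_tco:,.0f}")
```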

Determining the financial implications of incorporating cloud-enabled HPC into your HPC strategy can be difficult. Thankfully, Rescale has many specialists and confidential tools to help define the benefit of cloud-enabled HPC for your organization.

Come talk to us today.

This article was written by Thomas Helmonds.