
Sometimes I feel good after fixing a bug. More often, though, I feel like I’ve made things worse. Fixing bugs often makes the code a little harder to read and a little more difficult to understand. Worse, a fix may accidentally introduce even more bugs.

Most of the time, bugs occur because programmers can’t envision all possible runtime behaviors of a program. These unhandled behaviors are sometimes called edge cases. Usually, edge cases can be addressed easily with a simple if statement: if we encounter this case, do something else. However, doing so can make programs more difficult to comprehend because the reader now has to visualize multiple code paths in their head. It gets worse when there are multiple edge cases and we pile on an if statement for each. When it’s time to refactor related code, these ifs have to be carried through the refactoring, which increases the likelihood of a regression.

When I find myself piling on if statements to fix bugs, I ask myself whether there is a better way to address the issue without making the program more difficult to understand and without the risk of introducing more bugs. When there is, it usually involves rethinking how the data is modeled and handled. Below is an account of one of those times.

At Rescale, users can launch desktop instances in the cloud. These desktops can be in the not_started, starting, running, stopping, or stopped state. The desktops and their latest known state are returned from an API endpoint that we polled when displaying the desktops to the user.

There was a bug regarding the local UI state of the desktop. The state of the desktop is optimistically set to stopping when the user requests a desktop to be stopped. This is optimistic because it is set regardless of whether the stop request, which sends a message to queue a task for the desktop to be stopped, is successful.

There was a window of time during which, if the user requested the list of desktops again, the API would return running for the desktop that had just been requested to stop, because the task for stopping the desktop was still queued and hadn’t run yet. The UI would update with the latest status and the user would see the desktop go from stopping to running. When the stop task finally ran, the user would see a transition from running back to stopping. Seeing stopping, then running, then stopping is a jarring experience for the user, so we needed to fix this.

An approach to fixing this would be to check whether the desktop is in the stopping state locally, and if so, skip updating its status to running. This is the “pile on an if statement” approach.
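
In code, that quick fix might look something like the following minimal Python sketch (the actual Rescale frontend code is not shown in this article, so the function and variable names here are purely illustrative):

    # Hypothetical sketch of the "pile on an if statement" fix; names are illustrative.
    def next_displayed_status(local_status, api_status):
        # Skip the confusing stopping -> running update caused by a stale API response.
        if local_status == "stopping" and api_status == "running":
            return local_status
        return api_status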

Instead, I decided to hold a set of statuses for each desktop. Then, whenever the desktop list API response came back, I added the latest status of each desktop to its respective status set. A separate function takes a set of statuses and displays the appropriate status to the user. For example, if running and stopping are both in the set, it will display stopping; the function also has rules for handling starting and running.

This fixes the issue because the displayed status depends purely on what’s in the status set, not on the order in which statuses arrived. In other words, it is no longer possible to see stopping, then running, then stopping again.
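
As a rough illustration of the idea, here is a Python sketch with made-up names rather than the real UI code; the precedence rules are only an approximation, not Rescale’s exact ones:

    # Each desktop accumulates every status it has been seen in, whether it came
    # from an optimistic local update or from a polled API response.
    desktop_statuses = {}  # desktop id -> set of observed statuses

    def record_status(desktop_id, status):
        desktop_statuses.setdefault(desktop_id, set()).add(status)

    # Later lifecycle states take precedence, so the displayed status depends only
    # on the contents of the set, never on arrival order. (Illustrative rules only.)
    PRECEDENCE = ("stopped", "stopping", "running", "starting", "not_started")

    def displayed_status(desktop_id):
        statuses = desktop_statuses.get(desktop_id, set())
        for status in PRECEDENCE:
            if status in statuses:
                return status
        return "not_started"

    record_status("d1", "running")
    record_status("d1", "stopping")   # user clicks stop (optimistic local update)
    record_status("d1", "running")    # next poll: the stop task is still queued
    assert displayed_status("d1") == "stopping"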

What’s great about this is that it also fixed a similar issue I had forgotten about: when the user launches a desktop, the UI optimistically shows the desktop in the starting state, but the next API call may respond with not_started, so the user could see the desktop go from starting to not_started, then eventually back to starting. That issue was effectively fixed for free.

In conclusion, when tasked with fixing a bug, a simple solution may look fine at first, but we should ask whether it encourages further complexity down the road. In the example above, I could have solved both cases with simple if statements. But if an issue later arose with the status changing from stopped to stopping, another programmer might be tempted to pile on yet another if statement. Sometimes it’s worth spending a little more time solving not just the bug, but the class of bugs that the issue represents.

This article was written by Kenneth Chung.


San Francisco, CA – SPK Corporation of Seoul, South Korea (CEO: Sung Park) has formed a strategic partnership with Rescale, Inc. of San Francisco, California (co-founder and CEO: Joris Poort) to become a value-added reseller of Rescale’s cloud simulation platforms in South Korea. Rescale provides a unified environment that allows companies to accelerate their engineering and science simulations using customizable high-performance computing (HPC) resources. Under the partnership, SPK now provides marketing, sales, and technical support for the Rescale platform, including integration with companies’ existing HPC environments and operation and analysis support. With strong relationships with Korean conglomerates, including LG, Samsung, Hyundai, POSCO, SK, and KT, SPK will provide value-added HPC cloud services for those companies’ private and public cloud environments.

Computer simulations are very compute-intensive and an integral part of research and development in many industries, including aerospace, automotive, pharmaceutical, and energy. However, building large on-premise HPC clusters is very costly; therefore, many companies do not have sufficient capacity to run all their analyses. Insufficient simulation resources can lead to missed deadlines, suboptimal product designs, and foregone profits.

Rescale presents a solution to these pain points, giving its customers access to a powerful infrastructure network of more than 8 million servers built with the latest hardware technology. The platforms are available on a pay-per-use basis, priced by compute cluster size and runtime. In addition to providing IT hardware, the platform supports more than 180 simulation software packages, including those for structural analysis, fluid dynamics, and quantum chemical calculations. All of Rescale’s 36+ data centers around the world maintain the highest security standards in the industry. It is also possible to restrict data to a geographical region, such as data centers in South Korea, according to business security needs, by using one of Rescale’s five regional platforms.

With its secure, customizable simulation resources, Rescale enables companies to drastically reduce HPC procurement timelines, analysis runtimes, and product release schedules. Rescale’s platforms are accessible through an intuitive web-based interface or an application programming interface (API) for integration with a company’s existing HPC systems. Companies can focus attention, budgets, and resources on primary business activities and R&D. Many of Rescale’s customers have reduced their overall simulation runtime by over 40%.

This new strategic partnership with Rescale will allow SPK to expand its offering of HPC software such as ANSYS and LS-DYNA, as well as open-source machine learning software (e.g., TensorFlow), to large enterprises in Korea.

About SPK
SPK (www.spkr.co.kr) was founded in 2002, is headquartered in South Korea, and is dedicated to helping Korean enterprise customers transform from legacy to virtualized and cloud-based infrastructure. SPK has specialized in helping US-based start-ups achieve rapid market success in Korea. SPK counts the pillars of the Korean economy as its enterprise customers, including LG, Samsung, Hyundai, POSCO, SK, and KT.

About Rescale, Inc.
Rescale is the world’s leading cloud platform provider of simulation software and high performance computing (HPC) solutions. Rescale’s platform solutions are deployed securely and seamlessly to enterprises via a web-based application environment powered by preeminent simulation software providers and backed by the largest commercially available HPC infrastructure. Headquartered in San Francisco, CA, Rescale’s customers include global Fortune 500 companies in the aerospace, automotive, life sciences, marine, consumer products, and energy sectors.

This article was written by Rescale.

With the recent release of Rescale Deep Learning Cloud, we present an example that makes use of our new interactive notebook feature to develop deep neural networks. This feature enables an iterative workflow that alternates between interactive data preprocessing and analysis, and batch training of neural networks.


In this article we will start with an image classification data set (CIFAR10), try a few different neural network designs in our interactive notebook, and then launch a batch training cluster to train that network for more epochs.

Starting a Jupyter Notebook
To get started, you first need to start up a Rescale Linux Desktop with an NVIDIA K80 GPU:

[Screenshot: starting a Rescale Linux Desktop with a single NVIDIA K80 GPU]
Here we have chosen a desktop configuration with a single NVIDIA K80 GPU. While you wait for the desktop to finish booting, you can clone and save the job that holds the CIFAR10 image dataset and the notebook code you will run. Follow this link and then save the job it creates (you do not need to submit it to run; you will just use the job to stage the notebook and dataset input files): CIFAR10 TensorFlow notebook.

Once the desktop finishes booting, attach TensorFlow software and the job with the notebook code.

[Screenshots: attaching the TensorFlow software and the notebook job to the desktop]

Once the software and job are attached, open the notebook URL and enter the password when prompted:

[Screenshot: Jupyter notebook login prompt]
Next, navigate into the attach_jobs directory, then into the directory of the job you attached, and then to the .ipynb file.

[Screenshot: navigating to the notebook under the attach_jobs directory]
The code in this notebook was adapted from the TensorFlow CIFAR10 training example.
We have already added another inference function to the example: inference_3conv, which has a third convolutional layer. You can try training the two-convolutional-layer network by running all the cells as-is. To run the three-layer version, replace the call to inference_2conv with inference_3conv, restart the kernel (ESC-0-0), and then run all the cells again.

[Screenshot: the CIFAR10 inference code in the notebook]
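
For a sense of what the change involves, here is a rough sketch of a third convolutional block written with plain TensorFlow 1.x ops. It is not the code from the attached job (which uses the CIFAR10 example’s own variable helpers, weight decay, and normalization layers); the conv_block helper, scope names, and filter sizes below are illustrative only:

    import tensorflow as tf

    def conv_block(inputs, in_channels, out_channels, scope):
        # 5x5 convolution followed by ReLU; a simplified stand-in for the example's
        # helper-based layers.
        with tf.variable_scope(scope):
            kernel = tf.Variable(
                tf.truncated_normal([5, 5, in_channels, out_channels], stddev=5e-2),
                name="weights")
            biases = tf.Variable(tf.constant(0.0, shape=[out_channels]), name="biases")
            conv = tf.nn.conv2d(inputs, kernel, strides=[1, 1, 1, 1], padding="SAME")
            return tf.nn.relu(tf.nn.bias_add(conv, biases))

    def inference_3conv(images):
        # conv1 and conv2 mirror the stock two-layer network; conv3 is the added layer.
        conv1 = conv_block(images, 3, 64, "conv1")
        pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                               padding="SAME", name="pool1")
        conv2 = conv_block(pool1, 64, 64, "conv2")
        pool2 = tf.nn.max_pool(conv2, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                               padding="SAME", name="pool2")
        conv3 = conv_block(pool2, 64, 64, "conv3")
        pool3 = tf.nn.max_pool(conv3, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                               padding="SAME", name="pool3")
        # The example's existing fully connected and softmax layers would follow here.
        return pool3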

TensorBoard
You can also access TensorBoard, TensorFlow’s built-in GUI, on the desktop via an SSH tunnel. To configure your own SSH keys, follow the instructions here. Just download one of the connection scripts in the Node Access section of the Desktop panel:

[Screenshot: connection scripts in the Node Access section of the Desktop panel]

and take the username and IP address out of the script. Then forward port 6006 to your localhost and run TensorBoard:

ssh -L 6006:localhost:6006 <username>@<desktop-ip> tensorboard --logdir=/tmp/cifar10_train

You should now be able to access it from your local browser at http://localhost:6006. The particular training example we are using already sets /tmp/cifar10_train as the default location for training logs. Here are the two network graphs as they appear in TensorBoard. Two convolutional layers:

[TensorBoard graph: two convolutional layers]

Three convolutional layers:

[TensorBoard graph: three convolutional layers]

Batch Training
If you train the 2-layer and 3-layer convolutional networks on the notebook GPU for 10-20 epochs, you will see the loss does indeed drop faster for the 3-layer network. We would now like to see whether the deeper network yields better accuracy when trained longer or if it reaches the same accuracy in less training time.

You can launch a batch training job with your updated three-convolutional-layer code directly from the notebook. First, save your notebook (Ctrl-S). Then use the shell command shortcut, which automatically exports your notebook to a regular Python script and launches a job with all the files in the same directory as the notebook. For example:


[Screenshot: example invocation of the batch-launch shell command shortcut]

The syntax is as follows:


This can be run from the command line on the desktop or within the notebook with the IPython shell magic ! syntax.
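
We will not reproduce the Rescale launch shortcut itself here (it appears in the screenshot above), but the two building blocks are standard Jupyter features: any shell command can be run from a cell by prefixing it with !, and a notebook can be exported to a plain Python script with nbconvert. The notebook filename below is just a placeholder:

    # Run from a notebook cell; the filename is hypothetical.
    !jupyter nbconvert --to script cifar10_notebook.ipynb
    !ls *.py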

Some GPU core types you can choose from when launching from the notebook:

Jade: NVIDIA Kepler K520s
Obsidian: NVIDIA Tesla K80s

Once the job starts running, you can attach it to your desktop and the job files will be accessible on the notebook as part of a shared filesystem. First, the attach:

[Screenshot: attaching the running job to the desktop]
Then, on the desktop, in addition to opening and viewing files, you can also open a terminal:

[Screenshot: opening a terminal on the desktop]
From the terminal, you can tail files, etc.

[Screenshot: tailing job output files from the terminal]
Alternatively, you can navigate to the job in the Rescale web portal and live tail files in your browser. This allows you to shut down your Rescale desktop and still monitor training progress, or enables monitoring of your batch job on a mobile device while you are away from your workstation.


[Screenshot: live tailing job files in the Rescale web portal]

Iterative Development
You have now completed a single development iteration of our CIFAR10 training example, but you do not need to stop once the batch training is done. You can stop the batch training job at any time, review training logs in more depth from your notebook, and then submit new training jobs.

The advantage here is that you develop and test your code on similar hardware, the same software configuration, and the same training data as the batch training cluster. This eliminates the headache of bugs caused by differences in software or hardware configuration between your local workstation and the training cluster in the cloud.

Additionally, if you prefer to run more compute-heavy workloads directly in the notebook environment, we have Rescale Desktop configurations available with up to 8 K80 GPUs (4 K80 cards); email support@rescale.com for access to those.

To try out the workflow above, sign up here and start doing deep learning on Rescale today.

Edit (2016-10-31): Added link for setting user SSH keys.

This article was written by Mark Whitney.

[Screenshot: tweet referenced below]

The above tweet in my newsfeed caught my attention because it succinctly echoed thoughts several IT leaders have recently shared with me. Recent research, like this study from Accenture, reinforces Tim’s observation:

  • 95% of respondents have a five-year cloud strategy already in place
  • Four of five executives reported that less than half of their business functions currently operate in the public cloud, but noted increasing intent to move more of their operations to the cloud in the coming years
  • 89% of respondents agree that implementing cloud strategies is a competitive advantage which allows their companies to leverage innovation through agility
  • While half of respondents cite security as their biggest concern with the public model, more than 80% believe public cloud security is more robust and transparent than what they’re able to provide in-house

Complicating the “how” decisions for IT leaders is the speed at which the cloud market is expanding. The low barrier to entry is flooding the market with SaaS, IaaS, and PaaS technologies, but the overwhelming number of options is as likely to lead to paralysis as to a decision. Selecting an enterprise high-performance computing partner with the right strategy – the “how” – is critical. If you want more thoughts on the “why,” a link at the end of the article will share a perspective from one of our partners.

Here are five characteristics IT leaders should look for when selecting a cloud solution for high-performance computing:

The service has an enterprise strategy. This is seemingly obvious, but is a question worth asking when surveying a variety of available solutions. A service that has an enterprise strategy is a service that supports a diverse customer environment (variety of software vendors, software tools, workflows and/or hardware types) and replaces the burden of IT administration with IT management control. In our experience, enterprises commonly need:

1) Scalability
2) Flexibility
3) Compatibility with hybrid environments
4) Support for a diversity of workflows
5) Management tools
6) Integration strategies

Scalability, flexibility, and hybrid environments will be addressed below. The other three elements (workflows, management tools, and integration) are important, but largely a feature and function discussion that we’ll take on in another article.

The service is scalable for the enterprise. Product design, data research, and R&D cycles — and thus their corresponding infrastructure needs — are anything but smooth or predictable. Therefore, an enterprise solution needs to be able to scale rapidly from sixteen to thousands of cores. The engineering team that must undertake a massive redesign in the 11th hour should not be capped by limited capacity and queues. Without a service that offers wide-ranging access to multiple public clouds and on-demand pricing options, delivering a solution to this problem is simply expensive.

The solution supports hybrid environments. As the Accenture study showed, many organizations are taking a “cloud first” approach. However, the diversity of workloads in the enterprise environment, varying sizes of data sets, and legacy systems and operations will likely keep on-premise infrastructure around for the short term. CIOs may be “getting out of the data center business,” but as companies move along the cloud spectrum, some will still need a mixed model. Thus, an enterprise solution allows organizations to manage both cloud and on-premise resources from a single platform. Not only does this assist the IT team with management, monitoring, and control, but it also simplifies the experience for the end user.

The solution features the best of the cloud’s flexibility and agility. Much to the chagrin of IT leaders, hardware requirements for engineers, data scientists, and researchers are becoming increasingly diverse. As hardware is increasingly optimized for different purposes and software developers optimize their codes on platforms with divergent strategies, flexibility should be a concern of any IT leader. Enterprises need the agility to support the demands of an enterprise software tool suite – both today and tomorrow. One-size-fits-all solutions for HPC hardware will become obsolete in an enterprise with a sizable engineering staff (or even a small but diverse one). Enterprise flexibility and agility are delivered by a single platform that brings different clouds, hardware, software, and pricing models together in one place. Alignment of cost and demand is the promise of the cloud. With a wide variety of hardware available, enterprises should be able to make computing decisions that align with business drivers. In practice, this means being able to take advantage of everything from low-cost public cloud spot markets to instantly available, cutting-edge processors. It also means not being locked into pre-paid models that expire. Select a partner that can navigate the cloud environment on behalf of the enterprise’s evolving requirements.

The solution is secure. I included security because without it, invariably a reader would find this list woefully incomplete. However, I would contend that a vast majority of mature cloud HPC solutions can satisfy the requirements of 98% of companies. More often, security concerns are a red herring to disguise a cultural or organizational bias against a “XaaS” approach.

What about data transfer and large data sets? I often get this question, for good reason. There are several mitigation strategies for this issue. However, an entire article should be dedicated to this topic – we’ll follow up on that in a later post.

To conclude, as the market for cloud services rapidly expands and most organizations are satisfied with the answers to “why” cloud, enterprise IT faces a multitude of decisions about “how” to cloud. For high-performance computing, this article should give managers some critical factors to consider when evaluating solutions.

And one last thing, a “why” cloud link, as promised: 6 Advantages of Cloud Computing

This article was written by Matt McKee.