Following the recent release of Rescale Deep Learning Cloud, this article presents an example that uses our new interactive notebook feature to develop deep neural networks. This feature enables an iterative workflow that alternates between interactive data preprocessing and analysis, and batch training of neural networks.


In this article we will start with an image classification data set (CIFAR10), try a few different neural network designs in our interactive notebook, and then launch a batch training cluster to train that network for more epochs.

Starting a Notebook
To get started, you first need to start up a Rescale Linux Desktop with an NVIDIA K80 GPU:

Here we have chosen a desktop configuration with a single NVIDIA K80 GPU. While you wait for the desktop to finish booting, you can clone and save the job that holds the CIFAR10 image dataset and the notebook code you will run. Follow this link and then save the job it creates (you do not need to submit it; you will only use the job to stage the notebook and dataset input files): CIFAR10 TensorFlow notebook.

Once the desktop finishes booting, attach TensorFlow software and the job with the notebook code.



Once the software and job are attached, open the notebook URL and enter the password when prompted:

Next, navigate into the attach_jobs directory, then the directory of the job you attached, and then open the .ipynb file.

The code in this notebook was adapted from the TensorFlow CIFAR10 training example.
We have already added another inference function to the example: inference_3conv, which has a third convolutional layer. You can try training the two-convolutional-layer network by running all the cells as-is. To run the three-convolutional-layer version, replace the call to inference_2conv with inference_3conv, restart the kernel (Esc, 0, 0), and then run all the cells again.
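To get a feel for what a third convolutional layer changes, the sketch below tracks how the spatial size of the feature maps shrinks through each convolution and pooling stage. It is only an illustration under assumptions that are not stated in this article: 24x24 input crops, stride-1 SAME-padded convolutions, a SAME-padded stride-2 pooling layer after every convolution, and 64 output channels per layer. The actual layers inside inference_2conv and inference_3conv may differ.

```python
import math


def same_output_size(size, stride):
    """Spatial size after a SAME-padded conv or pool with the given stride."""
    return math.ceil(size / stride)


def feature_map_sizes(input_size, num_conv_blocks, pool_stride=2):
    """Return the spatial size after each conv + pool block.

    Assumption: each block is a stride-1 SAME convolution (size unchanged)
    followed by a SAME-padded pooling layer with stride `pool_stride`.
    """
    sizes = []
    size = input_size
    for _ in range(num_conv_blocks):
        size = same_output_size(size, 1)            # convolution keeps the size
        size = same_output_size(size, pool_stride)  # pooling shrinks it
        sizes.append(size)
    return sizes


def flattened_features(input_size, num_conv_blocks, channels=64):
    """Number of values fed to the first fully connected layer.

    Assumption: the last conv layer has `channels` output channels.
    """
    final = feature_map_sizes(input_size, num_conv_blocks)[-1]
    return final * final * channels


if __name__ == "__main__":
    for blocks in (2, 3):
        print(blocks, feature_map_sizes(24, blocks), flattened_features(24, blocks))
```

Under these assumptions, the deeper network hands a much smaller feature map to the fully connected layers, which is one reason an extra convolutional layer does not necessarily make each training step much slower.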


You can also access TensorBoard, TensorFlow’s built-in GUI, on the desktop via SSH tunnel. Just download one of the connection scripts in the Node Access section of the Desktop panel:


and take the username and IP address out of the script. Then forward port 6006 to your localhost and run TensorBoard:

ssh -L 6006:localhost:6006 <username>@<ip-address> tensorboard --logdir=/tmp/cifar10_train
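If the page does not load, it can help to confirm that the tunnel is actually listening on your local machine. The following is a minimal sketch using only the Python standard library; the helper name port_is_open is hypothetical and is not part of Rescale or TensorBoard.

```python
import socket


def port_is_open(host="localhost", port=6006, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Something is listening on localhost:6006; try the browser")
    else:
        print("Nothing is listening on localhost:6006; check the ssh command")
```

Note that this only checks that the forwarded port accepts connections, not that TensorBoard itself is serving pages on the remote end.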

You should now be able to access it from your local browser at http://localhost:6006. The particular training example we are using already sets /tmp/cifar10_train as the default location for training logs. Here are the two network graphs as they appear in TensorBoard. Two convolutional layers:


Three convolutional layers:


Batch Training
If you train the 2-layer and 3-layer convolutional networks on the notebook GPU for 10-20 epochs, you will see the loss does indeed drop faster for the 3-layer network. We would now like to see whether the deeper network yields better accuracy when trained longer or if it reaches the same accuracy in less training time.

You can launch a batch training job with your updated 3-convolutional-layer code directly from the notebook. First, save your notebook (Ctrl-S). Then use the shell command shortcut, which automatically exports your notebook to a regular Python script and launches a job with all the files in the same directory as the notebook. For example:


The syntax is as follows:

This can be run from the command line on the desktop or within the notebook with the IPython shell magic ! syntax.
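The internals of the Rescale shortcut are not shown in this article, but the export step it performs is similar in spirit to what `jupyter nbconvert --to script` does. As a rough sketch of the idea, the hypothetical helper below reads a notebook's JSON (nbformat 4 layout) and concatenates its code cells into one script. Commenting out `!` and `%` lines is a simplification of our own, since those lines are IPython syntax rather than plain Python.

```python
import json


def notebook_to_script(notebook_json):
    """Concatenate the code cells of a notebook (JSON text) into one script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # nbformat stores the source as a string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        lines = []
        for line in source.splitlines():
            # Shell (!) and magic (%) lines are not valid Python: comment them out
            if line.lstrip().startswith(("!", "%")):
                line = "# " + line
            lines.append(line)
        chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


if __name__ == "__main__":
    example = {
        "cells": [
            {"cell_type": "markdown", "source": ["# Title"]},
            {"cell_type": "code", "source": ["import os\n", "!ls\n"]},
            {"cell_type": "code", "source": "print(1)"},
        ]
    }
    print(notebook_to_script(json.dumps(example)))
```

One practical consequence: if you launch the batch job from a notebook cell with the `!` syntax, that cell is part of the saved notebook too, so an export step has to deal with it in some way.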

Some GPU core types you can choose from when launching from the notebook:

Jade: NVIDIA Kepler K520s
Obsidian: NVIDIA Tesla K80s

Once the job starts running, you can attach it to your desktop and the job files will be accessible on the notebook as part of a shared filesystem. First, the attach:

Then, on the desktop, in addition to opening and viewing files, you can also open a terminal:

From the terminal, you can tail files, etc.
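Because the attached job files appear on a shared filesystem, you can also peek at the end of a training log from a notebook cell instead of a terminal. Here is a minimal sketch; the function name tail is our own, and the path you pass depends on where your attached job writes its logs, which is not specified here.

```python
from collections import deque


def tail(path, num_lines=10):
    """Return the last `num_lines` lines of a text file, without newlines."""
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the most recent lines while reading
        return [line.rstrip("\n") for line in deque(handle, maxlen=num_lines)]
```

Unlike `tail -f` in the terminal, this reads the file once, so rerun the cell to see newer output.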

Alternatively, you can navigate to the job in the Rescale web portal and live tail files in your browser. This allows you to shut down your Rescale desktop and still monitor training progress, or enables monitoring of your batch job on a mobile device while you are away from your workstation.


Iterative Development
Above, you have just completed a single development iteration of our CIFAR10 training example, but you do not need to stop once the batch training is done. You can stop the batch training job anytime, review training logs in more depth from your notebook, then submit new training jobs.

The advantage here is that you can develop and test your code on similar hardware, the same software configuration, and the same training data as the batch training cluster we used. This eliminates the headache of bugs due to differences in software or hardware configuration between testing you might do on your local workstation and the training cluster in the cloud.

Additionally, if you prefer to run more compute-heavy workloads directly in the notebook environment, we have Rescale Desktop configurations available with up to 8 K80 GPUs (4 K80 cards). Email support@rescale.com for access to those.

To try out the workflow above, sign up here and start doing deep learning on Rescale today.

This article was written by Mark Whitney.


The above tweet in my newsfeed caught my attention because it succinctly echoed thoughts several IT leaders have recently shared with me. Recent research, like this study from Accenture, reinforces Tim’s observation:

  • 95% of respondents have a five-year cloud strategy already in place
  • Four of five executives reported that less than half of their business functions are currently operated in public cloud, but noted increasing intent on moving more of their operations to the cloud in the coming years
  • 89% of respondents agree that implementing cloud strategies is a competitive advantage which allows their companies to leverage innovation through agility
  • While half of respondents cite security as their biggest concern with the public model, more than 80% believe public cloud security is more robust and transparent than what they’re able to provide in-house

Complicating the “how” decisions facing IT leaders is the speed at which the cloud market is expanding. The low barrier to entry is flooding the market with SaaS, IaaS, and PaaS technologies, but the overwhelming number of options is as likely to lead to paralysis as to a decision. Selecting an enterprise high-performance computing partner with the right strategy – the “how” – is critical. If you want more thoughts on the “why,” a link at the end of the article will share a perspective from one of our partners.

Here are five characteristics IT leaders should look for when selecting a cloud solution for high-performance computing:

The service has an enterprise strategy. This is seemingly obvious, but is a question worth asking when surveying a variety of available solutions. A service that has an enterprise strategy is a service that supports a diverse customer environment (variety of software vendors, software tools, workflows and/or hardware types) and replaces the burden of IT administration with IT management control. In our experience, enterprises commonly need:

1) Scalability
2) Flexibility
3) Compatibility with hybrid environments
4) Support for a diversity of workflows
5) Management tools
6) Integration strategies

Scalability, flexibility, and hybrid environments will be addressed below. The other three elements (workflows, management tools, and integration) are important, but largely a feature and function discussion that we’ll take on in another article.

The service is scalable for the enterprise. Product design, data research, and R&D cycles — and thus their corresponding infrastructure needs — are anything but smooth or predictable. Therefore, an enterprise solution needs to be able to rapidly scale from sixteen to thousands of cores to meet the needs of the enterprise. The engineering team that must undertake a massive redesign in the 11th hour should not be capped by limited capacity and queues. Delivering a solution to this problem is simply expensive without a service that offers wide-ranging access to multiple public clouds with on-demand pricing options.

The solution supports hybrid environments. As the Accenture study showed, many organizations are taking a “cloud first” approach. However, the diversity of workloads in the enterprise environment, varying size of data sets, and legacy systems and operations will likely keep on-premise infrastructure around for the short term. CIOs may be “getting out of the data center business,” but as companies move along the cloud spectrum, some will need a mixed model. Thus, an enterprise solution allows organizations to manage both cloud and on-premise resources from a single platform. Not only does this assist the IT team with management, monitoring, and control, but it also simplifies the experience for the end user.

The solution features the best of the cloud’s flexibility and agility. Much to the chagrin of IT leaders, hardware requirements for engineers, data scientists, and researchers are becoming increasingly diverse. As hardware is increasingly optimized for different purposes and software developers are optimizing codes on platforms with divergent strategies, flexibility should be a concern of any IT leader. Enterprises need the agility to support the demands of an enterprise software tool suite – both today and tomorrow. One-size-fits-all solutions for HPC hardware will become obsolete in an enterprise with a sizable engineering staff (or even a small, but diverse staff). Enterprise flexibility and agility are delivered by a single platform/environment that brings different clouds, hardware, software, and pricing models together in one place. Alignment of cost and demand is the promise of the cloud. With a wide variety of hardware available, enterprises should be able to make decisions about computing that align with business drivers. What does this mean in practice? It means being able to take advantage of everything from low-cost public cloud spot markets to instantly available, cutting-edge processors. It also means not being locked in to pre-paid models that expire. Select a partner that can navigate the cloud environment on behalf of the enterprise’s evolving requirements.

The solution is secure. I included security because without it, invariably a reader would find this list woefully incomplete. However, I would contend that a vast majority of mature cloud HPC solutions can satisfy the requirements of 98% of companies. More often, security concerns are a red herring to disguise a cultural or organizational bias against a “XaaS” approach.

What about data transfer and large data sets? I often get this question, for good reason. There are several mitigation strategies for this issue. However, an entire article should be dedicated to this topic – we’ll follow up on that in a later post.

To conclude, as the market for cloud services rapidly expands and most organizations are satisfied with the answers to “why” cloud, enterprise IT is faced with a multitude of decisions on the “how” cloud. For high-performance computing, this article should give managers some critical factors to consider when evaluating solutions.

And one last thing, a “why” cloud link, as promised: 6 Advantages of Cloud Computing

This article was written by Matt McKee.