Interactive Neural Network Design on Rescale

Following the recent release of Rescale Deep Learning Cloud, this article presents an example that uses our new interactive notebook feature to develop deep neural networks. The feature enables an iterative workflow that alternates between interactive data preprocessing and analysis and batch training of neural networks.


In this article we will start with an image classification data set (CIFAR10), try a few different neural network designs in our interactive notebook, and then launch a batch training cluster to train that network for more epochs.

Starting a Jupyter Notebook
To get started, you first need to start up a Rescale Linux Desktop with an NVIDIA K80 GPU:

desktop-start
Here we have chosen a desktop configuration with a single NVIDIA K80 GPU. While you wait for the desktop to finish booting, you can clone and save the job that holds the CIFAR10 image dataset and the notebook code you will run. Follow this link and then save the job it creates (you do not need to submit it; you will only use the job to stage the notebook and dataset input files): CIFAR10 TensorFlow notebook.

Once the desktop finishes booting, attach the TensorFlow software and the job with the notebook code.

desktop-attach1

desktop-attach2

Once the software and job are attached, open the notebook URL and enter the password when prompted:

note
Next, navigate into the attach_jobs directory, then into the directory of the job you attached, and then open the .ipynb file.

note-attached-jobs
The code in this notebook was adapted from the TensorFlow CIFAR10 training example.
We have already added another inference function to the example, inference_3conv, which adds a third convolutional layer. You can try training the network with two convolutional layers by running all the cells as-is. To run the three-layer version, replace the call to inference_2conv with inference_3conv, restart the kernel (ESC-0-0), and then run all the cells again.

cifar10-inference
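To get a feel for what the extra convolutional layer changes, the following minimal sketch traces activation shapes through the convolutional stages. It does not call the notebook code. The input size, filter count, and pooling stride are assumptions loosely modeled on the TensorFlow CIFAR10 example, so adjust them to match inference_2conv and inference_3conv in your notebook.

```python
import math

# Assumed layer settings, loosely modeled on the TensorFlow CIFAR10 example:
# 24x24x3 cropped inputs, SAME-padded convolutions with 64 filters, each
# followed by a stride-2 SAME-padded max pool. Adjust to match your notebook.
INPUT_SHAPE = (24, 24, 3)
NUM_FILTERS = 64
POOL_STRIDE = 2


def conv_pool_output(shape, num_filters=NUM_FILTERS, stride=POOL_STRIDE):
    """Return (height, width, channels) after one conv + pool stage."""
    height, width, _ = shape
    # A SAME-padded, stride-1 convolution keeps height and width unchanged;
    # a SAME-padded pool with the given stride divides them, rounding up.
    return (math.ceil(height / stride), math.ceil(width / stride), num_filters)


def trace_shapes(num_conv_layers, input_shape=INPUT_SHAPE):
    """List the activation shape at the input and after each conv stage."""
    shapes = [input_shape]
    for _ in range(num_conv_layers):
        shapes.append(conv_pool_output(shapes[-1]))
    return shapes


def flattened_size(shape):
    """Number of features fed to the first fully connected layer."""
    height, width, channels = shape
    return height * width * channels


for depth in (2, 3):
    final_shape = trace_shapes(depth)[-1]
    print(depth, "conv stages ->", final_shape, flattened_size(final_shape))
```

Under these assumptions, the third stage halves the spatial size once more, so the first fully connected layer sees fewer input features than in the two-layer network.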

TensorBoard
You can also access TensorBoard, TensorFlow’s built-in GUI, on the desktop via an SSH tunnel. To configure your own SSH keys, follow the instructions here. Then download one of the connection scripts in the Node Access section of the Desktop panel:

desktop-connect

and take the username and IP address out of the script. Then forward port 6006 to your localhost and run TensorBoard:

ssh -L 6006:localhost:6006 <username>@<ip-address>
tensorboard --logdir=/tmp/cifar10_train

You should now be able to access it from your local browser at http://localhost:6006. The training example we are using already sets /tmp/cifar10_train as the default location for training logs.

Here are the two network graphs as they appear in TensorBoard. Two convolutional layers:

cifar_hacking2conv

Three convolutional layers:

cifar_hacking3conv
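If the TensorBoard page does not load in your browser, it can help to first check whether anything is listening on the forwarded local port. The sketch below uses only the Python standard library and runs on your local workstation. The helper name port_is_open is hypothetical, and the default port 6006 is the one forwarded in the SSH tunnel described in this section.

```python
import socket


def port_is_open(host="localhost", port=6006, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError if nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if port_is_open():
        print("Port 6006 is reachable; try http://localhost:6006")
    else:
        print("Nothing is listening on port 6006; check the SSH tunnel")
```

A successful connection only shows that the tunnel endpoint is accepting connections. TensorBoard must still be running on the desktop for the page to render.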

Batch Training
If you train the 2-layer and 3-layer convolutional networks on the notebook GPU for 10-20 epochs, you will see the loss does indeed drop faster for the 3-layer network. We would now like to see whether the deeper network yields better accuracy when trained longer or if it reaches the same accuracy in less training time.
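Training scripts are often configured in steps (batches) rather than epochs, so it can be handy to convert between the two. The sketch below does that arithmetic. The CIFAR10 training set has 50,000 images, while the batch size of 128 is an assumption, so substitute the value your notebook uses.

```python
import math

TRAIN_IMAGES = 50000  # size of the CIFAR10 training set
BATCH_SIZE = 128      # assumed batch size; use the value from your notebook


def steps_for_epochs(epochs, train_images=TRAIN_IMAGES, batch_size=BATCH_SIZE):
    """Number of training steps needed to cover the given number of epochs."""
    return math.ceil(epochs * train_images / batch_size)


for epochs in (10, 20):
    print(epochs, "epochs ->", steps_for_epochs(epochs), "steps")
```

With these values, 10 epochs correspond to 3907 steps and 20 epochs to 7813 steps.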

You can launch a batch training job with your updated 3-convolutional-layer code directly from the notebook. First, save your notebook (Ctrl-S). A shell command shortcut then automatically exports your notebook to a regular Python script and launches a job with all the files in the same directory as the notebook. For example:


rungpus

The syntax is as follows:


This can be run from the command line on the desktop or within the notebook with the IPython shell magic ! syntax.
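To illustrate what exporting a notebook to regular Python involves, here is a minimal standard-library sketch that pulls the code cells out of an .ipynb file. This is not the Rescale shortcut, and the function names are hypothetical. It assumes the nbformat 4 JSON layout, and in practice `jupyter nbconvert --to script` does this job more thoroughly.

```python
import json


def notebook_to_script(notebook):
    """Join the code cells of a notebook (nbformat 4 dict) into one script."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        lines = []
        for line in source.splitlines():
            # Shell escapes (!) and magics (%) are not valid plain Python,
            # so this sketch comments them out instead of translating them.
            if line.lstrip().startswith(("!", "%")):
                line = "# " + line
            lines.append(line)
        chunks.append("\n".join(lines))
    return "\n\n".join(chunks) + "\n"


def export_notebook(ipynb_path, script_path):
    """Read an .ipynb file and write its code cells to a .py file."""
    with open(ipynb_path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(script_path, "w", encoding="utf-8") as f:
        f.write(notebook_to_script(notebook))
```

Commenting out the `!` lines matters here: a shell command that launches a job from inside the notebook should not run again when the exported script executes on the batch cluster.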

Some GPU core types you can choose from when launching from the notebook:

Jade: NVIDIA Kepler K520s
Obsidian: NVIDIA Tesla K80s

Once the job starts running, you can attach it to your desktop, and the job files will be accessible from the notebook as part of a shared filesystem. First, attach the job:

attach-running
Then, on the desktop, in addition to opening and viewing files, you can also open a terminal:

terminal
From the terminal, you can tail log files and run other shell commands.

terminal-tail
Alternatively, you can navigate to the job in the Rescale web portal and live tail files in your browser. This allows you to shut down your Rescale desktop and still monitor training progress, or enables monitoring of your batch job on a mobile device while you are away from your workstation.


tail-running
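Because the attached job's files sit on the shared filesystem, a similar check can be done from a notebook cell. The following is a minimal sketch. The helper name tail_lines is hypothetical, and the path in the usage comment is a placeholder, so point it at whichever log file your job writes.

```python
from collections import deque


def tail_lines(path, count=10):
    """Return the last `count` lines of a text file, without newlines."""
    with open(path, encoding="utf-8", errors="replace") as f:
        # A bounded deque keeps only the most recent `count` lines in memory.
        return [line.rstrip("\n") for line in deque(f, maxlen=count)]


# Example usage (placeholder path, replace with your job's log file):
# for line in tail_lines("path/to/attached/job/process_output.log", 5):
#     print(line)
```

Unlike `tail -f`, this reads the file once, so rerun the cell to see newer output.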

Iterative Development
Above, you have just completed a single development iteration of our CIFAR10 training example, but you do not need to stop once the batch training is done. You can stop the batch training job anytime, review training logs in more depth from your notebook, then submit new training jobs.
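As one way to review training logs from the notebook, the sketch below extracts step and loss pairs from log lines so they can be compared across runs. The log line format is an assumption modeled on the output of the TensorFlow CIFAR10 example, so adjust the regular expression if your logs differ.

```python
import re

# Assumed log line format, modeled on the TensorFlow CIFAR10 example, e.g.
# "2016-10-20 12:00:00.000000: step 10, loss = 4.66 (128.0 examples/sec; ...)"
STEP_LOSS = re.compile(r"step (\d+), loss = ([0-9]*\.?[0-9]+(?:[eE][-+]?\d+)?)")


def parse_losses(lines):
    """Return a list of (step, loss) tuples found in the given log lines."""
    results = []
    for line in lines:
        match = STEP_LOSS.search(line)
        if match:
            results.append((int(match.group(1)), float(match.group(2))))
    return results


sample = [
    "2016-10-20 12:00:00.000000: step 0, loss = 4.68 (20.1 examples/sec; 6.368 sec/batch)",
    "Filling queue with 20000 CIFAR images before starting to train.",
    "2016-10-20 12:00:05.000000: step 10, loss = 4.61 (1203.4 examples/sec; 0.106 sec/batch)",
]
print(parse_losses(sample))
```

The resulting list can be passed to any plotting library to compare the loss curves of the two-layer and three-layer networks.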

The advantage here is that you can develop and test your code on similar hardware, the same software configuration, and the same training data as the batch training cluster we used. This eliminates the headache of bugs due to differences in software or hardware configuration between testing you might do on your local workstation and the training cluster in the cloud.

Additionally, if you prefer to run more compute-heavy workloads directly in the notebook environment, we have Rescale Desktop configurations available with up to 8 K80 GPUs (4 K80 cards). Email support@rescale.com for access to those.

To try out the workflow above, sign up here and start doing deep learning on Rescale today.

Edit (2016-10-31): Added link for setting user SSH keys.

This article was written by Mark Whitney.