Rescale released a new feature with its latest deploy: persistent clusters. When enabled, this feature allows users to submit multiple jobs to the same cluster through the Rescale workflow (web UI) without needing to launch and shut down multiple clusters. Previously, each job had to spin up its own cluster, which shut down automatically after the job completed, resulting in delays that could add up when running many small jobs. This new feature allows for faster iteration, which is particularly useful for testing or for multiple jobs that require the same hardware configuration.

Saving time and money
Generally, it takes a few minutes for each cluster to spin up and shut down. By keeping a persistent cluster alive, you save time and money for each additional job that you submit to your cluster.

Why is that? A standard cluster shuts down automatically once its job is complete, and each subsequent job is spun up, shut down, and charged on a separate cluster. With persistent clusters, however, the cluster is instantly available for the next job submission, and you don’t waste time shutting down and spinning up another cluster between jobs. For customers who launch a multitude of similar jobs, the result is significant time and cost savings.

Persistent clusters are also useful as a testing environment: to test a new script you’ve set up, or to debug issues with your simulation. Normally, an error that causes the software to exit will mark the job as complete, resulting in a premature shutdown of the cluster. With persistent clusters, you can keep submitting jobs to the same cluster, modifying and iterating on your code as you go.

A beneficial byproduct of persistent clusters is the ability to queue jobs. By submitting multiple jobs to the same cluster, users are able to “queue” them. The Rescale backend will run the jobs in the order that they were submitted as the cluster frees up. This may be a useful workflow for some of our customers.

A few pro-tips
1. Attach all your software first: Since the attached software is installed onto the VM when the cluster is initialized, users cannot change the software configuration of a persistent cluster once it has been spun up. We therefore recommend attaching all the software you might need when first launching the cluster. Because the software only checks out licenses when the program runs, you will only be charged for software runtime, not while the cluster is idling.

2. Start your cluster with the max core count needed: For now, we recommend launching the persistent cluster with the maximum number of cores you will need. If you want the core count to vary from job to job, you can use command line flags (refer to the Software Examples/FAQs section on the Resources page) to limit the number of cores used for a particular job. However, note that you are charged for the entire cluster, regardless of whether the cores are utilized. The ability to grow and shrink clusters in real time is on the roadmap, so watch for future updates on the Rescale platform!
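As a sketch, for an MPI-based solver the per-job core count can typically be capped with a process-count flag. The solver name and input file below are placeholders, and the exact flag varies by software package; check the Software Examples/FAQs section for the command appropriate to your solver.

```
# Hypothetical example: run on only 16 of a larger cluster's cores.
# "my_solver" and "input.in" are placeholders, not real Rescale commands.
mpirun -np 16 my_solver input.in
```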

3. Don’t forget to shut down your cluster: Lastly, don’t forget to manually terminate the persistent cluster once you are done. You will be billed for usage until the cluster shuts down, even if the cluster is idle.

This article was written by Rahul Verghese.

The Rescale platform provides end-to-end file management backed by storage offerings from the major public cloud vendors. This includes optimized client-side transfer tools as well as in-transit and at-rest encryption. In this model, Rescale controls the object store layout and encryption key management. In order to retrieve the decrypted file content, users must use Rescale tooling. While this can be convenient if you are starting from scratch and looking for a totally managed secure solution, one scenario that comes up is how to use the platform with input data that has already been uploaded to the cloud. Another use case that we see is integrating with an existing data pipeline that operates directly on simulation output files sitting in a customer-controlled storage location. For cost and performance reasons it is important to try and keep your compute as close to the storage as possible. One of the benefits of Rescale’s platform is that we support a number of different cloud providers and can bring the compute to any cloud storage accounts that you might already be using.

In this post, we will show how customers can transfer input and output files from a user-specified location instead of using the default Rescale-managed storage. For this example, we’ll focus on Amazon S3, however a similar approach can be used with any provider. In the following, we will go through the setup of a design of experiments job where the input and output files reside in a customer-controlled bucket. Let’s assume that the bucket is called “my-simulation-data”, the input files are all prefixed “input”, and all output files generated by the parameter sweep should be uploaded to a path prefixed by “output”.

This DOE will run the HSDI and Pintle Injector examples for CONVERGE CFD (found on our support page) in parallel. Normally, the DOE framework is used to change specific numerical values within an input file, but here we will use it to select a completely different input zip for each run.

First, upload the CONVERGE input zips to the s3://my-simulation-data/input/ directory in S3.

Next, create a file locally called inputs.csv that looks like the following:
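A minimal sketch of what inputs.csv could look like, assuming the two input zips were uploaded as hsdi.zip and pintle.zip (the actual file names will match whatever you uploaded in the previous step); the s3_input and s3_output columns are the variables referenced later by the run script template:

```
s3_input,s3_output
input/hsdi.zip,output/hsdi
input/pintle.zip,output/pintle
```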

To give Rescale compute nodes access to the bucket, an IAM policy needs to be created that provides read access to the input directory and full access to the output directory:
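A policy along these lines would grant the required access (a sketch, assuming the bucket and prefixes described above; tighten the actions to your needs):

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-simulation-data"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-simulation-data/input/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::my-simulation-data/output/*"
    }
  ]
}
```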

Note that another way to accomplish this is to set up cross-account access, which is the preferable way to configure access if all compute nodes will run in AWS. However, the approach above works regardless of where the compute nodes are executing.

Now, attach this policy to an IAM user and generate an access key and secret key. This access key and secret key should then be placed into an AWS config file that you save locally:
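The file follows the standard AWS CLI config format; a sketch with placeholder values (substitute the key pair generated for the IAM user, and whichever region holds your bucket):

```
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
region = us-east-1
```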

Save the above to a file called config.

The last file that needs to be created locally is the run script template. We will reference the s3_input and s3_output variables from the inputs.csv created above in a shell script template that will be executed for each run. Create a template file that looks like the following:
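A sketch of such a template, assuming the platform substitutes DOE variables with `${...}` syntax and that the config file is uploaded alongside the job (the solver command itself is omitted here; use the command for your software):

```
#!/bin/bash
# Point the AWS CLI at the uploaded config file.
export AWS_CONFIG_FILE=$PWD/config

# Fetch the input zip selected for this run and unarchive it ourselves,
# since we are bypassing Rescale storage for inputs.
aws s3 cp "s3://my-simulation-data/${s3_input}" input.zip
unzip -o input.zip

# ... run the solver here ...

# Upload all output files to the run-specific output prefix, then delete
# them locally so they are not also copied back to Rescale storage.
aws s3 cp --recursive . "s3://my-simulation-data/${s3_output}/"
rm -rf *
```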

There are a couple of things to point out in the above script. Normally, the Rescale platform automatically unarchives zip files; here, however, we need to handle that ourselves since we are bypassing Rescale storage for our inputs. The rm -rf * at the end of the script deletes all of the output files after uploading them to the user-specified S3 location. If we omitted this step, the output files would also be uploaded to Rescale storage after the script exits.

Now that the necessary files have been created locally, we can configure a new DOE job on the platform that references them. From the New Job page, change the Job Type to DOE and configure the job as follows:

  1. Input Files: Upload the config file created above
  2. Parallel Settings: Select “Use a run definition file” and upload inputs.csv
  3. Templates: Upload the run script template and use its file name as the template name
  4. Software: Select Converge 2.3.X and set the command to execute the run script
  5. Hardware: Onyx, 8 cores per slot, 2 task slots

Submit the job. When the job completes, all of the output files can be found in the s3://my-simulation-data/output/hsdi/ and s3://my-simulation-data/output/pintle/ directories.

In this DOE setup, the ancillary setup data (e.g., the AWS config file, the CSV file, and the run script template) is encrypted and stored in Rescale-managed storage. The meat of the job, the input and output files, is stored in the user-specified bucket.

We do recognize that the above setup requires a little manual work to configure. One of the items on our roadmap is better integration with customer-provided storage accounts. Stay tuned for details!

This article was written by Ryan Kaneshiro.