On November 11th, Norway’s Magnus Carlsen will defend his chess world championship against Sergey Karjakin of Russia. The unified championship returns to New York, and to American soil, for the first time since 1990, when two chess legends, Kasparov and Karpov, met for the last time in a world championship match. Since then, chess’s popularity in the United States has slowly increased, as has the strength of its players. Just two months ago, the United States men’s team won the 42nd Chess Olympiad, its first Olympiad victory in 40 years. The team is led by top-10 players Caruana, So, and Nakamura.

Nevertheless, World Champion Magnus Carlsen has dominated chess for the last five years and is rightfully in position to defend his title. In preparation for important events like the world championship match, grandmasters almost always hire a team of ‘seconds’ (other grandmasters) to assist them. Their main job is to analyze moves for the opening phase of the game to maximize the charted territory, if you will, of their player. One of the most important tools they use in this analysis is the computer engine: a piece of software that objectively evaluates any chess position.

Top chess grandmasters have a bit of a love-hate relationship with computer chess engines. Although engines have become essential and invaluable tools for training and preparation, many players lament the loss of creativity caused by the extensive charting of opening sequences known as the opening book. Players take fewer risks in the opening because a well-prepared opponent will easily expose creative but unsound ideas. The capacity of top grandmasters to memorize the thousands of variations in an opening book then becomes a limiting factor. Some grandmasters, such as Carlsen, intentionally play early moves that are less analyzed but also slightly less optimal, so that they can “just play chess” rather than test the opponent’s preparation and memorization.

Today, grandmasters and amateurs alike use chess engines for training and analysis. There are many different chess engines (some paid, some open source), all of which essentially do the same thing: evaluate a chess position. When playing a game against a chess engine, the “strength” of the engine is not important for most beginner and intermediate players. An engine running on an iPhone can easily beat most amateur players; you would have to artificially dial down the engine’s strength to get a competitive game. We have reached the point where chess engines can beat any human in a game with “classical time controls” (90 minutes for the first 40 moves). The best chess engines have Elo ratings of 3200+, while the highest rating ever achieved by a human player is just shy of 2900. It is therefore no longer interesting for humans to compete against chess engines. Instead, there are now leagues featuring only chess engines, which compete against each other under fixed conditions.

Image credit: Jerauld, Brian. “A Brief Postmortem.” ChessBase.com, 27 Apr. 2015.
“2. f4 was a real cool move you played there, Garry. I think… Ok, let’s ask Stockfish.”

How it Works
Here is a very high-level overview of how chess engines work.

There are three distinct stages within a chess game that chess engines can handle differently. The opening uses an opening book, a database of predefined lines of moves. Once the engine is “out of book,” it uses its evaluation and tree search capabilities to find the best moves. Lastly, in positions with only a few pieces left on the board, an engine can use an endgame tablebase, which stores the exact outcome and best moves for every position with that material.

The most important part of a chess engine is its ability to evaluate a static position as efficiently as possible. It uses this evaluation in conjunction with a tree search to find the best move or moves in the current position. It can store evaluated positions in a hash table so it does not have to recalculate a given position more than once. The deeper the engine can search the tree, the more accurate, in theory, its evaluation of the current position and its prediction of the best move.
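To make this concrete, here is a minimal sketch of the two ingredients described above: a crude material-count evaluation and a depth-limited alpha-beta (negamax) search with a simple hash table. It is written in Python using the python-chess library for board representation; the piece values, search depth, and hash-table key are illustrative choices, not how any particular engine actually works.

```python
import chess  # python-chess, used here only for board representation and move generation

PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
                chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

def evaluate(board):
    """Static evaluation: material balance from the side to move's point of view."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score if board.turn == chess.WHITE else -score

def alphabeta(board, depth, alpha, beta, table):
    """Depth-limited negamax search with alpha-beta pruning and a hash table."""
    key = (board.board_fen(), board.turn, depth)
    if key in table:                      # position already evaluated at this depth
        return table[key]
    if board.is_checkmate():
        return -100000                    # the side to move has been mated
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    best = -float("inf")
    for move in board.legal_moves:
        board.push(move)
        score = -alphabeta(board, depth - 1, -beta, -alpha, table)
        board.pop()
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:                 # cut-off: the opponent will avoid this line
            break
    table[key] = best
    return best

def best_move(board, depth=3):
    table, best, best_score = {}, None, -float("inf")
    for move in board.legal_moves:
        board.push(move)
        score = -alphabeta(board, depth - 1, -float("inf"), float("inf"), table)
        board.pop()
        if score > best_score:
            best, best_score = move, score
    return best

print(best_move(chess.Board()))  # the move preferred by this very shallow, material-only search
```

Real engines replace the material count with hundreds of evaluation terms and add move ordering, quiescence search, and far more sophisticated pruning, but the structure is the same: a static evaluator at the leaves and a pruned tree search above it.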

Which specific algorithms make one chess engine better than another is up for debate. In general, being able to search a tree faster is not useful if your static evaluation is inaccurate. For this reason, you need to test engines against other engines, or against different versions of the same engine, to make sure that incremental improvements have the desired effect. There are also competitions that pit engines against each other; TCEC (the Top Chess Engine Championship) is one of them. These competitions accelerate the development of new evaluation techniques in chess engines.

In general, tree search is a fairly brute-force way of evaluating a position, even though there is a lot of complex theory behind optimizing this kind of algorithm. Most chess engines today are therefore “dumb” tools: you give them a position and they evaluate it. When grandmasters use an engine for preparation, there is always a human in the loop to tell the engine which positions to evaluate and when to stop evaluating. A next step in the development of chess engines is the inclusion of artificial intelligence (AI). As we saw in AlphaGo vs. Lee Sedol, including AI in board game engines significantly increases their strength and utility. AI will allow players to use engines for specific purposes. For example, we could study how a specific opponent reacts to certain positions and what their tendencies are, and generate a strategy specifically for that opponent by feeding the engine all the games that opponent has ever played. At a high level, we can learn from patterns in positions and correlate them to game outcomes given the strength of each player. With the inclusion of these new technologies and the increased strength of engines, there is no doubt that the landscape of competitive chess will change.

Hardware Considerations
Today’s strongest engines parallelize poorly, or not at all, across multiple nodes because of the current state of parallel tree search algorithms. Some attempts have been made to parallelize over multiple nodes using distributed processing algorithms, but these versions are not used extensively in the chess community. So, the approach today is to analyze different positions on different instances of the chess engine, with a human in the loop. The single-node limitation of many chess engines means that large multi-core SMP machines can significantly outperform, say, a number of laptops networked together.

Despite the single-node limitations of chess engines, clusters can still be used to run many evaluations in parallel for analysis. Clusters can also be used in developing chess engine software: simulating many games or positions is one of the only ways to make sure that changes in engine code actually make it stronger.

How to Run a Chess Engine on Rescale
Rescale currently provides a framework for running UCI chess engines. It’s a bring-your-own engine setup. If you do not provide a chess engine, it will run Stockfish 7 by default.

Once you launch a job with the chess engine it will broadcast and listen on port 30000. You will need to set up an ssh tunnel to forward a local port to the analysis node port 30000. See the video below for a complete overview of how to run Stockfish on Rescale with the client Scid:

You can run any UCI engine. Make sure you name the engine executable “engine” and upload it as an input file. Rescale will automatically use the uploaded engine:

You can even run two engines against each other. If you wanted to run, say, Komodo against Stockfish, you would start two jobs on Rescale, each running a different engine. Just make sure you forward a different local port to your second engine:

The key components of linking your UCI client to the engine running on Rescale are your ssh tunnel and your raw connection to the engine using either netcat (nc) on Linux/MacOS or plink.exe on Windows.
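To illustrate what travels over that raw connection, here is a small Python sketch that speaks the UCI protocol to the engine directly. It assumes you have already established the SSH tunnel so that local port 30000 reaches the engine on the analysis node; the example position and search depth are arbitrary.

```python
import socket

HOST, PORT = "localhost", 30000   # local end of the SSH tunnel (assumed already set up)

def send(sock, command):
    sock.sendall((command + "\n").encode())

with socket.create_connection((HOST, PORT)) as sock:
    reader = sock.makefile("r")
    send(sock, "uci")                               # ask the engine to identify itself
    for line in reader:
        print(line.rstrip())
        if line.startswith("uciok"):                # engine finished listing its options
            break
    send(sock, "position startpos moves e2e4")      # set up a position after 1. e4
    send(sock, "go depth 15")                       # search 15 plies deep
    for line in reader:
        print(line.rstrip())
        if line.startswith("bestmove"):             # search finished
            break
```

A graphical client such as Scid does exactly this exchange for you; the snippet just shows that the “engine” on the other end of the tunnel is nothing more than a program reading and writing UCI text commands.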

Armchair Chess QB
The transmission of live games with “live evaluation” on sites such as Chessbase or Chessbomb allows every chess enthusiast to be an armchair quarterback during tournaments such as the upcoming world championship. This year, take it a step further and do the analysis yourself using Chess Engines on Rescale.

This article was written by Mulyanto Poort.


Rescale offers a variety of software packages with on-demand licensing. These are popular because of their pay-as-you-go nature. Not only does on-demand licensing provide an easy way of accessing the software, but many packages are priced in a way that encourages running on more cores, providing a faster turnaround time to solution.

As far as on-demand software pricing goes, most software falls into four general per-hour pricing categories:

Proportional – you pay proportionally to the number of cores you use
Flat Rate – you pay a flat rate per hour, no matter how many cores you use
Capped – you pay proportionally up to a certain number of cores and pay a flat rate after that
Discounted – you pay a nominal rate for the first core and pay less for each subsequent core

The pricing model is shown in the Rescale UI (in this case, it is capped).

It is clear that a proportional rate does not benefit the customer when scaling up to many cores; however, the other three pricing models may reward the customer for running on more cores. In reality, the benefit depends both on the pricing model and on the scalability of the job and software. Scalability here refers to how much faster a given job runs on more cores.

What I intend to show in this blog post is that certain software pricing models provide the opportunity to run your job faster and on more cores while actually paying less.

Nomenclature
This is not a scientific paper; however, I will provide some equations to illustrate some of the concepts related to price and scaling. To make the equations easier to understand, here are some definitions:

hw – hardware
I – number of iterations
k – some per-unit constant
K – some constant
N – number of cores or processes
p – unit price
P – price
sw – software
t – time per iteration
T – total time

The Problem
An engineer has a model and wants to know the most cost-efficient way of running it on Rescale. Because the cost of the job is the sum of both hardware and software costs, we need to take into account the pricing model for each. The hardware cost is charged by Rescale on a proportional basis. The software cost depends on the independent software vendor’s (ISV’s) pricing model. The total hourly price looks something like:

P_total = P_hw + P_sw

The total cost of the job would be the product of the hourly price and the duration of the job.  This is the value we want to minimize.

Total Cost = (P_hw + P_sw) * Total Simulation Time

As explained before, the time it takes to run the job and the per-hour prices are functions of the number of cores used.

P_hw = f(N) = p_hw * N

P_sw = f(N)

T = f(N)

With this information, we can now try to optimize the cost of the entire job.

Hardware Pricing
Hardware is charged by Rescale on a proportional basis. The per-hour price is based on a per-core-hour rate, which differs between the core types and pricing plans available on Rescale. Simply put, the hourly hardware price is the product of the per-core-hour rate and the number of cores:

P_hw = p_hw * N

Software Pricing
On-demand software pricing is set by the ISVs.  As discussed before, there are several different pricing categories.  The equations for obtaining the hourly software price as a function of the number of cores are fairly straightforward:

Proportional: P_sw = k * N
Flat Rate: P_sw = K
Capped: P_sw = min(K, k * N)
Discounted: P_sw = K * f(N), for example P_sw = K * N^0.9
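As a purely illustrative sketch, the four categories can be expressed as small Python functions; the constants k and K and the 0.9 exponent below are placeholders, not actual ISV prices.

```python
def proportional(n_cores, k=1.0):
    """Pay k per core per hour."""
    return k * n_cores

def flat_rate(n_cores, K=16.0):
    """Pay K per hour, regardless of core count."""
    return K

def capped(n_cores, k=1.0, K=16.0):
    """Proportional up to the cap K, flat beyond that."""
    return min(K, k * n_cores)

def discounted(n_cores, K=1.0, exponent=0.9):
    """Each additional core costs a little less than the previous one."""
    return K * n_cores ** exponent
```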

Simulation Time and Software Scalability

The time to run a simulation over N processes / cores can be approximated by the following equation:

T_simulation = T_serial * (1 / N + k1 * (N - 1)^k2)

where T_serial is the time it takes to run the simulation in serial. The first term (1 / N) describes the raw compute time, while the second term (k1 * (N - 1)^k2) describes the communication overhead. If you are interested in the justification, read on. Otherwise skip to “Costimizing”.

The details:
There are many ways of modeling scalability. Amdahl’s law is often quoted as representing scalability in terms of the parts of the simulation that benefit from extra resources. It does not, however, address how to define or quantify that benefit. In our case, we have a fixed-size model and we want to know how it will perform when we scale it over an increasing number of hardware cores. So, for illustrative purposes, let’s make a few assumptions:

  1. The simulation is iterative
  2. The simulation is distributed over N processes on N cores
  3. Each process requires information from other processes to calculate the next iteration and uses MPI
  4. The number of iterations, I, to finish the simulation is independent of N
  5. The number of compute cycles required to complete an iteration is independent of N

Given these assumptions, the time it takes to complete an iteration is:

t_iteration = t_compute + t_communicate

if compute and communicate are synchronous operations.  Otherwise, when considering them to be asynchronous (non-blocking) operations,

t_iteration = max(t_compute, t_communicate)

Lastly, the total simulation time is easily calculated from the iteration time and the number of iterations.

T = t_iteration * I ∝ t_iteration

Here we assume that each iteration takes more or less the same amount of time.

Compute Time
For compute time, assumption (5) gives us

t_compute = t_serial / N

This tells us that, given no communication overhead, our model will exhibit linear strong scaling. It also tells us that the larger the model (t_serial), the larger t_compute, and the less detrimental the communication overhead is to the relative solution time, because t_compute dominates. The result is that larger models scale better.

Communication Time
The time to communicate is more complicated and depends, among other things, on how fast the interconnect is, how much data needs to be communicated, and the number of processes N.

Let’s simplify and break it down.

The communication time between two processes can be split into two parts: the communication overhead (latency) and the data transfer time (transfer). So, to send one message from one process to another, we can for now simply define:

t_message = t_latency + t_transfer = t_latency + (message size / transfer rate)

The transfer rate is lower for small messages and increases as the messages become bigger,  peaking at the bandwidth of the interconnect.  We can assume for now that the latency is constant for a given interconnect.

To see the effect in practice, this paper (http://mvapich.cse.ohio-state.edu/static/media/publications/abstract/liuj-sc03.pdf) investigates the performance of MPI on various interconnects.  What the paper shows is that for small message sizes, the communication time is more or less constant.

t_message ≅ k

With this knowledge, we can infer that the total time to communicate depends on the number of messages sent by each process. The number of messages in turn depends on the number of processes because, presumably, each process communicates with every other process. An equation that can then be used to model the number of messages sent, and hence the communication time, is:

t_communicate ∝ Messages = k1 * (N - 1)^k2

Furthermore, we can fold the (roughly constant) message time and the number of iterations into k1 and express the result in terms of t_serial:

t_communicate = t_serial * k1 * (N - 1)^k2

Putting it Together
The total time for the simulation is

T_simulation ∝ t_iteration = t_compute + t_communicate

when compute and communication are synchronous (blocking) operations. By substitution,

T_simulation = T_serial * (1 / N + k1 * (N - 1)^k2)

Obtaining the Constants
The constants can be obtained by fitting the equation to empirical benchmark data. The communication overhead model presented here is by no means definitive, but our internal benchmarks have shown that this equation fits almost all of our scaling numbers well.

Furthermore, we have found that usually k2 ≅ 0.5.  We have also found that k1 is a function of model size, interconnect type, and processor performance.
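As an illustration of that fitting step, here is a short Python sketch using scipy.optimize.curve_fit. The benchmark core counts and timings are made-up numbers, used only to show the mechanics.

```python
import numpy as np
from scipy.optimize import curve_fit

def sim_time(N, t_serial, k1, k2):
    """Scaling model: T = T_serial * (1/N + k1 * (N - 1)^k2)."""
    return t_serial * (1.0 / N + k1 * (N - 1) ** k2)

# Hypothetical benchmark data: core counts and measured wall-clock hours
cores = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
hours = np.array([20.0, 10.3, 5.5, 3.1, 2.0, 1.6, 1.7])

(t_serial, k1, k2), _ = curve_fit(sim_time, cores, hours,
                                  p0=[20.0, 0.01, 0.5], bounds=(0, np.inf))
print(f"t_serial = {t_serial:.1f} h, k1 = {k1:.4f}, k2 = {k2:.2f}")
```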

Costimizing
Let’s go back to our total cost equation:

Total Cost = (P_hw + P_sw) * T_simulation

The total cost now depends on the software pricing model used.

Total Cost = f(N) = (p_hw * N + P_sw) * T_serial * (1 / N + k1 * (N - 1)^k2)
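As a sketch of the optimization, the following Python snippet sweeps the core count and reports the cheapest configuration for a capped software price. All prices and model constants here are invented purely for illustration.

```python
def capped_sw_price(N, k=1.0, K=16.0):
    """Illustrative capped software pricing: proportional up to K per hour, flat after."""
    return min(K, k * N)

def total_cost(N, p_hw=0.10, t_serial=20.0, k1=0.002, k2=0.5):
    """Total job cost = (hourly hardware + software price) * simulation time."""
    t_sim = t_serial * (1.0 / N + k1 * (N - 1) ** k2)
    return (p_hw * N + capped_sw_price(N)) * t_sim

candidates = [1, 2, 4, 8, 16, 32, 64, 128]
best = min(candidates, key=total_cost)
print(f"Cheapest run: {best} cores, estimated cost ${total_cost(best):.2f}")
```

With these invented constants, the sweep lands on 64 cores at roughly a third less than the single-core cost, while the job also finishes far sooner; swap in a proportional software price and the same sweep favors far fewer cores.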

An Excel calculator can be found here [link].

How We Can Help
Rescale’s support team can help you estimate the right number of cores to run your simulation on. It is not always feasible to run benchmarks for every single use case, and as the example above shows, being a few cores off does not have a huge impact on the cost savings.

What we want you to be aware of is that, with well-scaling software and a flat-rate or capped pricing model, it is often more cost-efficient to run on more cores.

This article was written by Mulyanto Poort.


Introduction

When running HPC on Rescale, or in any traditional HPC environment for that matter, it is essential that the solver can be run in a headless environment. What that means in practice is that ISVs have made sure that their solvers can be run in batch mode using simple (or not so simple) command-line instructions. This allows users of their software to submit jobs to a headless HPC environment. Traditionally, this environment is a set of servers sitting behind a scheduler. The engineer or scientist would write a script of command-line instructions to be submitted to the scheduler. The same is true on Rescale. The user enters a set of command-line instructions to run their jobs on Rescale’s platform.

Let’s take OpenFOAM, for example. An OpenFOAM user will usually write an Allrun script and invoke it on Rescale by simply calling that script (e.g. ./Allrun) in the job’s command-line instructions.

This is easy and applies to other solvers available on Rescale: LS-Dyna, CONVERGE, Star-CCM+, NX Nastran, XFlow and many more. All solvers on Rescale are instantiated using a simple command-line instruction.

 The Headless Environment

Being able to run a solver using a command-line instruction does not mean the solver will run in batch. For example, trying to run Star-CCM+ without explicitly specifying batch mode would cause the program to launch its graphical user interface (GUI) and look for a display device, causing it to immediately exit. Star-CCM+ should, therefore, be invoked with a command-line instruction that explicitly requests batch mode (its -batch option).

This is simple enough. There is something to be said for being able to run both the batch solver and the GUI using the same program. Unfortunately, this type of implementation can be incomplete.

When ISVs decide to migrate their solver capabilities from the desktop environment to the HPC (batch) environment, they usually do so because they have implemented the ability to run their solver over more than a single machine. A solver that can only run on a single machine provides less of a benefit in an HPC environment. In initial iterations, ISVs may leave some latent artifacts of the original desktop implementation inside their batch solvers. Although these solvers can be executed from the command line, they may still require access to a display. Since, on Rescale, we still want to be able to run these “almost-headless” solvers, we make use of a tool called a virtual frame buffer.

 The X Virtual Frame Buffer

The X virtual frame buffer (Xvfb) renders a virtual display in memory, so applications that are not truly headless can use it to render graphical elements. At Rescale, we use virtual frame buffers as a last resort because there is a performance penalty to launching and running them. The use of Xvfb requires us to implement a wrapper around these solver programs. In its simplest form, this can be implemented as follows:
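Here is a minimal sketch of such a wrapper, written in Python for illustration; the display number (:1), screen geometry, and solver command line are placeholder assumptions.

```python
import os
import subprocess

DISPLAY = ":1"                                   # arbitrary display number for the virtual frame buffer
SOLVER = ["mysolver", "-batch", "input.sim"]     # placeholder solver command

# Launch a virtual frame buffer on the chosen display
xvfb = subprocess.Popen(["Xvfb", DISPLAY, "-screen", "0", "1280x1024x24"])
try:
    # Tell the environment to use the virtual display, then run the solver
    env = dict(os.environ, DISPLAY=DISPLAY)
    subprocess.run(SOLVER, env=env, check=True)
finally:
    # Clean up the frame buffer when the solver finishes (or fails)
    xvfb.terminate()
    xvfb.wait()
```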

This seems fairly simple. We launch a virtual frame buffer on a numbered display, tell our environment to use the display associated with the virtual frame buffer, launch our solver, and clean up at the end.

A Can Of Worms

One very powerful feature on Rescale is the parameter sweep / design of experiments (DOE) functionality. We can run multiple runs of a DOE in parallel, which also means that multiple runs of the DOE can land on the same server. Let’s imagine running the above script twice on the same node. Each instantiation of the script will now try to launch a frame buffer on the same display. This can lead to all sorts of problems: race conditions, process corruption, and so on. Regardless of the low-level issues this may cause, the biggest high-level issue is when the solver hangs due to problems with the virtual frame buffer. A user who initiates a DOE with 100 runs may do so at the end of the day and let the job run overnight. The next morning, that user may realize that one run has been hanging the entire night due to issues with Xvfb. The other 99 runs may have finished in a couple of hours, but the one hanging run has kept the cluster up for the entire night. This kind of situation is one that we want to avoid at all costs.

The implementation of a virtual frame buffer requires us to write all kinds of robustness provisions into our wrapper script. We may decide to only launch a single Xvfb on a single display and use that display for all of our solver instantiations. We can check whether Xvfb is running and, if it isn’t, skip the launching of the frame buffer step:
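A sketch of that check, again in Python and with the shared display number as an assumption:

```python
import os
import subprocess

DISPLAY = ":1"   # single shared display used by every solver instantiation (assumed)

def ensure_xvfb(display=DISPLAY):
    """Start Xvfb on the shared display only if it is not already running."""
    # pgrep exits non-zero when no process matches the pattern
    running = subprocess.run(["pgrep", "-f", f"Xvfb {display}"],
                             capture_output=True).returncode == 0
    if not running:
        subprocess.Popen(["Xvfb", display, "-screen", "0", "1280x1024x24"])
    os.environ["DISPLAY"] = display
```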

This has the side effect that we never know when we can shut down the frame buffer, requiring us to leave it up at all times. This may sometimes be okay depending on the requirements of the solver. If it’s not okay, we would have to increment the display number for each solver process, and clean up each frame buffer when each solver finishes. 

We can also explicitly check whether a solver is hanging. We can launch the solver in the background and interrogate its pid for the status of the program using a foreground polling loop. We can write a retry loop around the solver instantiation to restart the solver if it fails the first time. This may be the case if the frame buffer is still initializing while we are calling the solver.
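A rough sketch of that kind of supervision; the polling interval, hang timeout, and retry count are arbitrary choices:

```python
import subprocess
import time

def run_with_retry(cmd, retries=2, hang_timeout=3600, poll_interval=30):
    """Run a solver in the background, poll it, and restart it if it fails or appears hung."""
    for attempt in range(retries + 1):
        proc = subprocess.Popen(cmd)
        start = time.time()
        while proc.poll() is None:                      # solver still running
            if time.time() - start > hang_timeout:
                proc.kill()                             # assume the solver is hung
                proc.wait()
                break
            time.sleep(poll_interval)
        if proc.returncode == 0:
            return True                                 # solver finished cleanly
        time.sleep(10)                                  # give Xvfb a moment before retrying
    return False
```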

The Case of the Invalid License

One of the CFD solvers we support requires a frame buffer. A customer launched a simple job without specifying a valid license address. Two days later, he was wondering whether his job was still running. It turned out that the solver had hung within seconds of being instantiated and had been sitting idle for two days. While debugging the issue, we decided to inspect the virtual frame buffer by taking a screenshot of it.
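One way to capture such a screenshot is with the xwd utility that ships with the X server; a minimal sketch, with the display number assumed:

```python
import subprocess

# Dump the root window of the virtual display to an X Window Dump file,
# which can then be viewed or converted to see what the hidden GUI is showing.
subprocess.run(["xwd", "-root", "-display", ":1", "-out", "screenshot.xwd"], check=True)
```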

The resulting screenshot showed a window asking the user to enter a valid license location. This was obviously an artifact of the desktop implementation of the batch solver; a true headless batch program would have simply exited with a message to the user. We have since fixed this issue and are more careful when making any use of this tool, putting in the robustness provisions described above.

Virtual Frame Buffer’s Post-processing Utility

A very useful application of Xvfb is rendering post-processing graphics in a headless environment. We can call tools such as ParaView or LS-PrePost to generate movies and images of scenes they would normally render on-screen.

Here is an example that uses Xvfb, OpenFOAM and ParaView to generate a scene image: https://www.rescale.com/resources/software/openfoam/openfoam-motorbike-post/

Here is an example which uses Xvfb, LS-Dyna and LS-PrePost to generate a movie of a crash simulation: https://www.rescale.com/resources/software/ls-dyna/ls-dyna-post-processing/

Some of our users have used this capability to their advantage by creating a visual representation of their data, forgoing the need to download the raw data sets.

Lessons Learned

Since our first use of Xvfb, we have learned that it sometimes leads to adverse and unforeseen side effects. We have since worked to make all our use of Xvfb as robust as possible to prevent the worst side effect of all: the stalled job. We have also benefited greatly from Xvfb, as it allows us to run solvers that would not otherwise run in a traditional headless HPC environment, and to use certain post-processing tools to render images and movies in batch mode. We encourage ISVs who only have desktop implementations of their solvers to create solvers that run in a headless HPC environment, keeping in mind what it means to truly run a solver without a display.

This article was written by Mulyanto Poort.


The Challenge of Remote Visualization

We have received many requests to integrate some type of visualization solution on Rescale. We agree that this is an important part of some engineering workflows. However, performing visualization in the cloud, securely and in a user-friendly way, is not an easy task.

We do not want to provide a solution where our customers have to follow a complicated script to make it all work on their end. We do want to provide a wide range of solutions for a wide range of applications, including pre-processing, job progress monitoring, optimization status tracking, and post-processing. We want to provide third-party pre- and post-processing solutions such as FEMAP, Tecplot, EnSight, and ParaView to our customers, as well as make native GUIs of software like Star-CCM+, ANSYS, and AVL FIRE available. Bringing all these solutions to our customers is fraught with challenges such as network performance, handling large data sets, and utilizing GPUs for 3D visualization.

An Incremental Process

Because of all the challenges involved, we want to make integrating these visualization solutions an incremental process. We have decided to start by providing a basic remote desktop solution on a visualization node, which can be provisioned as part of the hardware for a Rescale job. This visualization node has access to all the files on each machine associated with the Rescale job through a shared file system. Our users can then access this visualization node through SSH, or access the remote desktop using VNC over SSH.

Of course, we do not want to reinvent the wheel. There are already several software-based solutions that provide excellent remote visualization technology. For example, ParaView and many other packages provide client-server visualization solutions, which perform fairly well. Our second incremental step is making these solutions available to Rescale users.

Finally, we want to improve the overall user experience for everyone. Not everyone wants to perform computationally expensive post-processing. Some only want to track the progress of their job in a GUI that is familiar to them, some may just want to reduce the size of a large data set to minimize transfer costs, and some may want to make small changes to their job setup and then rerun their job. We want to provide the capability to perform all these tasks through remote visualization.

A Remote Visualization Preview

Our initial step is to provide a simple remote desktop solution using VNC over SSH. Our solution is currently not available to everyone. If you would like to test drive the remote visualization solution, please contact support@rescale.com.

The following is a short example of using this solution with RFD tNavigator on Rescale.

The first step is to set up an SSH key, as described in a previous blog post.

Having set up a key, set up the job as normal. The only difference is that, on the hardware selection page, you will be able to select a remote visualization hardware configuration from a set of predefined configurations. If a configuration is selected, an additional node will be provisioned for visualization.


Once the clusters have started, a tunneling command will be printed to the screen. This command can be copied and pasted in a shell terminal.


Once the tunnel has been established, the visualization node can be accessed at vnc://localhost:5901/. The VNC password is ‘rescale’.


To bring up the tNavigator GUI, we can open a shell and type tnav-gui. The job files are located in /enc/mount//work. There will be one directory in /enc/mount for each compute cluster in the job.


Using the tNavigator GUI, we can now visualize the in-progress job.


Here are some more samples of remote visualization on Rescale: CONVERGE, HEEDS, and LS-PrePost.

Help us improve remote visualization.

We’d love to hear about what features are important to you when using remote visualization. Please let us know at support@rescale.com.

 

This article was written by Mulyanto Poort.