On November 11th, Norway’s Magnus Carlsen will defend his chess world championship against Sergey Karjakin of Russia. The unified championship will return to New York and American soil for the first time since 1990, when two chess legends, Kasparov and Karpov, met for the last time in a world chess championship match. Since then, the popularity of chess in the United States has slowly increased, as has the strength of its players. Just two months ago, the United States men’s team won the 42nd Chess Olympiad, its first Olympiad victory in 40 years. The team is led by top-10 players Caruana, So and Nakamura.

Nevertheless, World Champion Magnus Carlsen has dominated chess for the last 5 years and is rightfully in position to defend his world championship. In preparation for important events like the world championship match, grandmasters almost always hire a team of ‘seconds’ (other grandmasters) to assist them. The seconds’ main job is to analyze moves for the opening phase of the game to maximize the charted territory, if you will, of their player. One of the most important tools they use in this analysis is the computer engine. The computer engine is a piece of software which objectively evaluates any chess position.

Top chess grandmasters have a bit of a love-hate relationship with computer chess engines. Although engines have become an essential and invaluable tool for training and preparation, many lament the loss of creativity due to the extensive charting of opening sequences known as the opening book. Players take fewer risks in openings because a well-prepared opponent will easily expose creative but unsound ideas. The capacity of top grandmasters to memorize the thousands of variations of an opening book then becomes a limiting factor. Some grandmasters, such as Carlsen, intentionally play early moves which are less analyzed but also less optimal, to be able to “just play chess” instead of challenging the opponent’s preparation and memorization skills.

Today, grandmasters and amateurs alike use chess engines for training and analysis. There are many different chess engines (some paid, some open source), which all essentially do the same thing: evaluate a chess position. When playing a game against a chess engine, the “strength” of the engine is not important for most beginner to intermediate players. An engine that runs on an iPhone can easily beat most amateur players. You’d have to artificially dial down the strength of the engine to get a competitive game. We have gotten to the point where chess engines can beat any human in a game with “classical time controls” (90 minutes for the first 40 moves). The best chess engines have Elo ratings of 3200+, while the highest rating achieved by any human player is just shy of 2900. It is therefore no longer interesting for humans to compete against chess engines. Instead, there are now leagues that feature only chess engines, which compete against each other under fixed conditions.


Jerauld, Brian. A Brief Postmortem. Digital image. ChessBase. Chessbase.com, 27 Apr. 2015. Web. 16 Nov. 2016.
2. f4 was a real cool move you played there, Garry. I think… Ok, let’s ask Stockfish

How it Works
Here is a very high-level overview of how chess engines work.

There are three distinct stages within a chess game that can be handled differently by chess engines. The opening uses an opening book, a database of predefined lines of moves. Once the engine is “out of book”, it will use its evaluation and tree search capabilities to find its best moves. Lastly, in positions with few pieces on the board, an engine can use an endgame tablebase, which stores the winning moves for each such position.
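To make the three stages concrete, here is a minimal Python sketch of how an engine might pick where its move comes from. The position keys, the moves stored in the two dictionaries, and the `toy_search` function are all hypothetical placeholders, not data or code from any real engine. Real engines typically also check the number of pieces before probing a tablebase.

```python
def choose_move(position, opening_book, tablebase, search):
    """Return (source, move): book first, then tablebase, then tree search."""
    if position in opening_book:      # still "in book"
        return "book", opening_book[position]
    if position in tablebase:         # few pieces left, result is precomputed
        return "tablebase", tablebase[position]
    return "search", search(position)  # "out of book": evaluate and search


# Hypothetical placeholder data, keyed by made-up position names.
opening_book = {"startpos": "e2e4"}
tablebase = {"toy-endgame": "d1d7"}


def toy_search(position):
    """Stand-in for the evaluation and tree search stage."""
    return "g1f3"


print(choose_move("startpos", opening_book, tablebase, toy_search))
print(choose_move("some-middlegame", opening_book, tablebase, toy_search))
```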

The most important part of a chess engine is its ability to evaluate a static position as efficiently as possible. It uses this evaluation in conjunction with a tree search to find the best move or moves in the current position. It can store evaluated positions in a hash table so it doesn’t have to calculate a given position more than once. In theory, the deeper the engine can search the tree, the more accurate its evaluation of the current position and the better its ability to predict the best move.
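The interplay of static evaluation, tree search and hash table can be sketched on a toy game that is much simpler than chess. In the sketch below, players alternately take 1 to 3 stones and whoever takes the last stone wins. The game, the deliberately uninformative `evaluate` function, and the plain negamax search are illustrative assumptions. Real chess engines use far more elaborate evaluation and pruning.

```python
def evaluate(stones):
    """Toy static evaluation from the side to move's view: 0 means 'unknown'."""
    return 0


def negamax(stones, depth, table):
    """Depth-limited tree search; scores are from the side to move's view."""
    if stones == 0:
        return -1  # the opponent took the last stone, so the side to move lost
    if depth == 0:
        return evaluate(stones)  # search horizon reached: fall back to static eval
    key = (stones, depth)
    if key in table:  # hash table hit: this position was already searched
        return table[key]
    best = max(-negamax(stones - take, depth - 1, table)
               for take in (1, 2, 3) if take <= stones)
    table[key] = best
    return best


def best_move(stones, depth):
    """Return (score, stones_to_take) for the best move found at this depth."""
    table = {}
    return max((-negamax(stones - take, depth - 1, table), take)
               for take in (1, 2, 3) if take <= stones)


print(best_move(5, depth=1))  # too shallow: every move scores 0 ("unknown")
print(best_move(5, depth=5))  # deep enough to prove a win by taking 1 stone
```

With a depth of 1 the search cannot tell the moves apart, while a depth of 5 finds the forced win. This is the sense in which deeper search gives a more accurate evaluation.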

Which specific algorithms make a chess engine better is up for debate. In general, being able to search a tree faster is not useful if your static evaluation is inaccurate. For this reason you need to test engines against other engines, or against different versions of the same engine, to make sure that incremental improvements have the desired effect. There are also competitions which pit engines against each other. TCEC is one of those competitions. These competitions accelerate the development of new evaluation techniques in chess engines.
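Engine-versus-engine testing is usually summarized as a rating difference. The sketch below uses the standard Elo expected-score formula to turn ratings into an expected score and a match result into an estimated rating gap. The match record in the example (60 wins, 40 losses) is hypothetical.

```python
import math


def expected_score(rating_a, rating_b):
    """Expected score (between 0 and 1) of player A against player B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_difference(wins, draws, losses):
    """Estimate the rating gap implied by a match result (draws count half)."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        raise ValueError("a 0% or 100% score implies an unbounded rating gap")
    return -400.0 * math.log10(1.0 / score - 1.0)


# A 2900-rated human against a 3200-rated engine.
print(round(expected_score(2900, 3200), 3))
# Hypothetical test match between two engine versions.
print(round(elo_difference(wins=60, draws=0, losses=40), 1))
```

Small rating gaps need many games to measure reliably, which is one reason engine developers simulate large numbers of games.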

In general, tree search is a fairly brute-force-ish way of evaluating a position, even though there is a lot of complex theory behind optimizing this kind of algorithm. Most chess engines today are therefore “dumb” tools. You give an engine a position and it evaluates it. When grandmasters use engines for preparation, there’s always a human in the loop to tell the engine what positions it needs to evaluate and when it should stop evaluating. A next step in the development of chess engines is the inclusion of Artificial Intelligence (AI). As we saw in AlphaGo vs Lee Sedol, inclusion of AI in board game engines significantly increases their strength and utility. AI will allow players to use engines for specific purposes. For example, we can study how a specific opponent reacts to certain positions and what their tendencies are, and generate a strategy specifically for that opponent by feeding the engine all the games the opponent has ever played. At a high level, we can learn from patterns in positions and correlate them to outcomes of games given the strength of each player. With the inclusion of these new technologies and the increased strength of engines, there is no doubt that the landscape of competitive chess will change.

Hardware Considerations
Today’s strongest engines parallelize poorly, or not at all, across multiple nodes because of the current state of parallel tree search algorithms. Some attempts have been made to parallelize over multiple nodes using distributed process algorithms, but these versions are not being used extensively in the chess community. So, the approach today would be to analyze different positions on different instances of the chess engine using a human in the loop. The single-node limitation of many chess engines means that large multi-core SMP machines can significantly outperform, say, a number of laptops networked together.

With the single node limitations of chess engines, clusters can still be used to do many evaluations in parallel for analyses. Clusters can also be used in developing chess engine software. Simulating many games or positions is one of the only ways to make sure that changes in engine code actually make it stronger.
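Evaluating many independent positions in parallel can be sketched with a simple worker pool. Here `evaluate` is a hypothetical callable that would hand one position to one engine instance and return its score. A thread pool is assumed to be enough because the heavy work would happen inside the separate engine processes.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_many(positions, evaluate, workers=4):
    """Evaluate independent positions concurrently, one engine call per position."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so scores line up with positions.
        return dict(zip(positions, pool.map(evaluate, positions)))


def fake_evaluate(position):
    """Stand-in for a call to a real engine instance; returns a dummy score."""
    return len(position)


print(evaluate_many(["pos-a", "position-b", "p-c"], fake_evaluate))
```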

How to Run a Chess Engine on Rescale
Rescale currently provides a framework for running UCI chess engines. It’s a bring-your-own engine setup. If you do not provide a chess engine, it will run Stockfish 7 by default.

Once you launch a job with the chess engine it will broadcast and listen on port 30000. You will need to set up an ssh tunnel to forward a local port to the analysis node port 30000. See the video below for a complete overview of how to run Stockfish on Rescale with the client Scid:

You can run any UCI engine. Make sure you name the engine executable “engine” and upload it as an input file. Rescale will automatically use the uploaded engine:

You can even run 2 engines against each other. If you wanted to run, say, Komodo against Stockfish, you would start two jobs on Rescale, each running a different engine. Just make sure you forward a different local port to your second engine:

The key components of linking your UCI client to the engine running on Rescale are your ssh tunnel and your raw connection to the engine using either netcat (nc) on Linux/MacOS or plink.exe on Windows.
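For readers who prefer a script to a GUI client, the raw connection can also be sketched in Python instead of netcat. The sketch assumes an ssh tunnel is already forwarding a local port to the engine on port 30000 (for example something like `ssh -L 30000:localhost:30000 <user>@<analysis-node>`, where the user and host are placeholders). The commands `uci`, `isready`, `position`, `go depth` and `quit` and the `bestmove` reply are part of the UCI protocol. The function names are made up for this example.

```python
import socket


def uci_commands(moves, depth):
    """Build the UCI lines that set up a position and start a fixed-depth search."""
    position = "position startpos"
    if moves:
        position += " moves " + " ".join(moves)
    return ["uci", "isready", position, f"go depth {depth}"]


def parse_bestmove(line):
    """Return the move from a 'bestmove e2e4 ponder e7e5' line, else None."""
    parts = line.split()
    if len(parts) >= 2 and parts[0] == "bestmove":
        return parts[1]
    return None


def ask_engine(moves, depth=12, host="localhost", port=30000, timeout=60.0):
    """Send the commands over a raw TCP connection and wait for 'bestmove'."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("r", encoding="utf-8", newline="\n")
        for command in uci_commands(moves, depth):
            sock.sendall((command + "\n").encode("utf-8"))
        for line in reader:  # skip 'id', 'info', 'uciok', 'readyok' lines
            move = parse_bestmove(line)
            if move is not None:
                sock.sendall(b"quit\n")
                return move
    raise ConnectionError("engine closed the connection before sending bestmove")


# Usage, once the tunnel is up: ask_engine(["e2e4", "e7e5"], depth=15)
```

To talk to a second engine, you would call `ask_engine` with the other local port you forwarded.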

Armchair Chess QB
The transmission of live games with “live evaluation” on sites such as Chessbase or Chessbomb allows every chess enthusiast to be an armchair quarterback during tournaments such as the upcoming world championship. This year, take it a step further and do the analysis yourself using Chess Engines on Rescale.

This article was written by Mulyanto Poort.


Rescale offers various software packages with on-demand licensing. These packages are popular because of their pay-as-you-go nature. Not only does on-demand licensing provide an easy way of accessing the software, but many packages are priced in a way that encourages running on more cores and therefore getting a faster turnaround time to solution.

As far as on-demand software pricing goes, most software falls into four general per-hour pricing categories:

Proportional – you pay proportionally to the number of cores you use
Flat Rate – you pay a flat rate per hour, no matter how many cores you use
Capped – you pay proportionally up to a certain number of cores and pay a flat rate after that
Discounted – you pay a nominal rate for the first core and pay less for each subsequent core

The pricing model is shown in the Rescale UI (In this case it’s capped)

It is clear that a proportional rate does not benefit the customer when scaling up to many cores. The other three pricing models, however, may make it cheaper to run on more cores. In reality, the benefit depends on both the pricing model and the scalability of the job and software. Scalability here means the extent to which a given job runs faster on more cores.

What I intend to show in this blog post is that certain software pricing models provide the opportunity to run your job faster and on more cores while actually paying less.

This is not a scientific paper; however, I will provide some equations to illustrate some of the concepts related to price and scaling. To make the equations easier to understand, here are some definitions:

hw – Hardware (used as a subscript)
I – Number of iterations
k – Some per-unit constant
K – Some constant
N – Number of cores or processes
p – Unit price
P – Price
sw – Software (used as a subscript)
t – Time per iteration
T – Total time

The Problem
An engineer has a model and wants to know the most cost efficient way of running his model on Rescale.  Because the cost of the job is a sum of both hardware and software cost, we need to take into account the pricing models for each.  The hardware cost is charged by Rescale on a proportional basis.  The software cost is priced depending on the software vendor’s (ISV’s) pricing model.  The total hourly price would look something like:

$$P_{total} = P_{hw} + P_{sw}$$

The total cost of the job would be the product of the hourly price and the duration of the job.  This is the value we want to minimize.

$$\text{Total Cost} = (P_{hw} + P_{sw}) \cdot \text{Total Simulation Time}$$

As explained before, we can say that the time it takes to run the job and the per hour prices are a function of the number of cores used.

$$P_{hw} = f(N) = p_{hw} \cdot N$$

$$P_{sw} = f(N)$$

$$T = f(N)$$

With this information, we can now try to optimize the cost of the entire job.

Hardware Pricing
Hardware is charged by Rescale on a proportional basis.  The per hour price is based on a per core hour rate which differs between the different core types and pricing plans available on Rescale. Simply put, the hourly hardware price equation is the product of the per core hour rate and the number of cores:

$$P_{hw} = p_{hw} \cdot N$$

Software Pricing
On-demand software pricing is set by the ISVs.  As discussed before, there are several different pricing categories.  The equations for obtaining the hourly software price as a function of the number of cores are fairly straightforward:

Proportional – $P_{sw} = k \cdot N$
Flat Rate – $P_{sw} = K$
Capped – $P_{sw} = \min(K, k \cdot N)$
Discounted – $P_{sw} = K \cdot f(N)$, for example $P_{sw} = K \cdot N^{0.9}$
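The four pricing categories translate directly into small functions. This is only a sketch: the function names are made up, the rates in the example are hypothetical, and the discounted model uses the $N^{0.9}$ form given as an example, which is one possible choice of $f(N)$.

```python
def proportional(n, k):
    """Hourly software price that grows linearly with the core count."""
    return k * n


def flat_rate(n, K):
    """Hourly software price that ignores the core count."""
    return K


def capped(n, k, K):
    """Proportional up to the cap K, flat after that."""
    return min(K, k * n)


def discounted(n, K, exponent=0.9):
    """Each extra core costs a little less; K is the price of the first core."""
    return K * n ** exponent


# Hypothetical rates: $2 per core-hour, capped at $10 per hour.
for cores in (1, 4, 16, 64):
    print(cores, capped(cores, k=2.0, K=10.0), round(discounted(cores, K=2.0), 2))
```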

Simulation Time and Software Scalability

The time to run a simulation over N processes / cores can be approximated by the following equation:

$$T_{simulation} = T_{serial} \left( \frac{1}{N} + k_1 (N - 1)^{k_2} \right)$$

where $T_{serial}$ is the time it takes to run the simulation in serial. The $1/N$ term describes the raw compute time, while the $k_1 (N - 1)^{k_2}$ term describes the communication overhead. If you are interested in the justification, read on. Otherwise skip to “Costimizing”.

The details:
There are many ways of modeling scalability. Amdahl’s law is often quoted as representing scalability relative to the different parts of the simulation which benefit from scaling over extra resources. It does not, however, address how to define or quantify the benefit of extra resources. In our case, we have a fixed-size model and we want to know how it will perform when we scale it over an increasing number of hardware cores. So, for illustrative purposes, let’s make a few assumptions:

  1. The simulation is iterative
  2. The simulation is distributed over N processes on N cores
  3. Each process requires information from other processes to calculate the next iteration and uses MPI
  4. The number of iterations, I, to finish the simulation is independent of N
  5. The number of compute cycles required to complete an iteration is independent of N

Given these assumptions, the time it takes to complete an iteration is:

$$t_{iteration} = t_{compute} + t_{communicate}$$

if compute and communicate are synchronous operations. Otherwise, when considering them to be asynchronous (non-blocking) operations,

$$t_{iteration} = \max(t_{compute}, t_{communicate})$$

Lastly the total compute time is easily calculated from the iteration time and the number of iterations.

$$T = t_{iteration} \cdot I \propto t_{iteration}$$

Here we assume that each iteration is more or less equal.

Compute Time
For compute time, assumption (5) gives us

$$t_{compute} = \frac{t_{serial}}{N}$$

This is telling us that, given no communication overhead, our model will have linear strong scaling. It also tells us that the larger the model ($t_{serial}$), the larger $t_{compute}$ and the less detrimental effect communication overhead has on the relative solution time, because of the dominance of $t_{compute}$. The result is that larger models scale better.

Communication Time
The time to communicate is more complicated and depends, among other things, on how fast the interconnect is, how much data needs to be communicated, and the number of processes N.

Let’s simplify and break it down.

The communication time between two processes can be split into two parts: the communication overhead (latency) and the data transfer time (transfer). So, to send one message from one process to another, we can for now simply define:

$$t_{message} = t_{latency} + t_{transfer} = t_{latency} + \frac{\text{message size}}{\text{transfer rate}}$$

The transfer rate is lower for small messages and increases as the messages become bigger,  peaking at the bandwidth of the interconnect.  We can assume for now that the latency is constant for a given interconnect.

To see the effect in practice, this paper (http://mvapich.cse.ohio-state.edu/static/media/publications/abstract/liuj-sc03.pdf) investigates the performance of MPI on various interconnects.  What the paper shows is that for small message sizes, the communication time is more or less constant.

$$t_{message} \approx k$$
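A quick calculation shows why small messages take roughly constant time. The sketch below takes the transfer time as message size divided by transfer rate and, as a simplification, assumes a constant transfer rate. The latency and rate values are hypothetical and do not describe any particular interconnect.

```python
def message_time(size_bytes, latency_s, rate_bytes_per_s):
    """Time to send one message: fixed latency plus size divided by rate."""
    return latency_s + size_bytes / rate_bytes_per_s


LATENCY = 2e-6  # hypothetical: 2 microseconds
RATE = 1e9      # hypothetical: 1 GB/s

for size in (8, 64, 1024, 1_000_000):
    print(size, message_time(size, LATENCY, RATE))
```

For the 8-byte and 64-byte messages the latency dominates, so the two times differ by only a few percent. For the 1 MB message the transfer term dominates instead.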

With this knowledge we can infer that the total time to communicate depends on the number of messages being sent by each process.  The number of messages sent in turn depends on the number of processes because, presumably, each process communicates with every other process.  An equation that can then be used to model the number of messages sent is:

$$\text{Messages} = k_1 (N - 1)^{k_2} \propto t_{message}$$

Furthermore, we can normalize $k_1$ by the number of iterations and therefore $t_{serial}$:

$$t_{communicate} = t_{serial} \cdot k_1 (N - 1)^{k_2} \propto t_{message}$$

Putting it Together
The total time for the simulation is

$$T_{simulation} \propto t_{iteration} = t_{compute} + t_{communicate}$$

when compute and communication are synchronous (blocking) operations. By substitution,

$$T_{simulation} = T_{serial} \left( \frac{1}{N} + k_1 (N - 1)^{k_2} \right)$$
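The scaling equation is easy to explore numerically. In this sketch the function names are made up and the constants ($T_{serial}$ = 100 hours, $k_1$ = 0.01, $k_2$ = 0.5) are hypothetical values chosen for illustration, not benchmark results.

```python
def simulation_time(n, t_serial, k1, k2):
    """Run time on n cores: a compute term (1/n) plus a communication term."""
    return t_serial * (1.0 / n + k1 * (n - 1) ** k2)


def fastest_core_count(max_cores, t_serial, k1, k2):
    """Core count (1..max_cores) with the shortest modeled run time."""
    return min(range(1, max_cores + 1),
               key=lambda n: simulation_time(n, t_serial, k1, k2))


for cores in (1, 4, 16, 64, 128):
    print(cores, round(simulation_time(cores, 100.0, 0.01, 0.5), 2))
print("fastest:", fastest_core_count(128, 100.0, 0.01, 0.5))
```

With these constants the modeled run time falls quickly at first and then rises again once the communication term outweighs the shrinking compute term.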

Obtaining the Constants
The constants can be obtained by fitting the equation to empirical benchmark data. The communication overhead model presented here is by no means the final word. In our internal benchmarks, however, this equation has fit almost all scaling numbers well.

Furthermore, we have found that usually $k_2 \approx 0.5$. We have also found that $k_1$ is a function of model size, interconnect type, and processor performance.
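One possible way to do the fit, sketched below, is to subtract the $1/N$ compute term from the measured time ratio and run a straight-line least-squares fit in log-log space. This is an assumed approach for illustration, not necessarily the method used for the internal benchmarks. The "benchmark" times are synthetic and are generated from the model itself with known constants, so the fit can be checked.

```python
import math


def model_time(n, t_serial, k1, k2):
    """Scaling model: t_serial * (1/n + k1 * (n - 1)**k2)."""
    return t_serial * (1.0 / n + k1 * (n - 1) ** k2)


def fit_constants(cores, times, t_serial):
    """Fit k1 and k2 from log(T/T_serial - 1/N) = log(k1) + k2 * log(N - 1)."""
    xs, ys = [], []
    for n, t in zip(cores, times):
        overhead = t / t_serial - 1.0 / n  # what is left after the compute term
        if n > 1 and overhead > 0:
            xs.append(math.log(n - 1))
            ys.append(math.log(overhead))
    if len(xs) < 2:
        raise ValueError("need at least two usable benchmark points with N > 1")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k2 = sxy / sxx
    k1 = math.exp(mean_y - k2 * mean_x)
    return k1, k2


# Synthetic data generated from the model with k1 = 0.01 and k2 = 0.5.
cores = [2, 4, 8, 16, 32]
times = [model_time(n, 100.0, 0.01, 0.5) for n in cores]
k1, k2 = fit_constants(cores, times, 100.0)
print(round(k1, 4), round(k2, 4))
```

With real, noisy benchmark data a nonlinear fit on the untransformed times may behave better, since the log transform gives more weight to small overhead values.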

Costimizing
Let’s go back to our total cost equation:

$$\text{Total Cost} = (P_{hw} + P_{sw}) \cdot T_{simulation}$$

The total cost now depends on the software pricing model used.

$$\text{Total Cost} = f(N) = (p_{hw} \cdot N + P_{sw}) \cdot T_{serial} \left( \frac{1}{N} + k_1 (N - 1)^{k_2} \right)$$
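The total cost equation can be scanned over core counts to find the cheapest option for a given pricing model. The sketch below is not the Excel calculator mentioned in this post. All prices and constants in it are hypothetical, and the software price is passed in as a function of $N$ so any of the four pricing models can be plugged in.

```python
def total_cost(n, p_hw, software_price, t_serial, k1, k2):
    """(hardware + software hourly price) times the modeled run time in hours."""
    hours = t_serial * (1.0 / n + k1 * (n - 1) ** k2)
    return (p_hw * n + software_price(n)) * hours


def cheapest_core_count(max_cores, p_hw, software_price, t_serial, k1, k2):
    """Core count (1..max_cores) that minimizes the total job cost."""
    return min(range(1, max_cores + 1),
               key=lambda n: total_cost(n, p_hw, software_price, t_serial, k1, k2))


# Hypothetical inputs: $0.10 per core-hour hardware, 100-hour serial job.
common = dict(p_hw=0.10, t_serial=100.0, k1=0.005, k2=0.5)

n_flat = cheapest_core_count(128, software_price=lambda n: 20.0, **common)
n_prop = cheapest_core_count(128, software_price=lambda n: 5.0 * n, **common)
print("flat rate:", n_flat, "proportional:", n_prop)
```

With these inputs the proportional model is cheapest on a single core, while the flat rate model is cheapest on many cores, which is the effect this post set out to show.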

An Excel calculator can be found here [link]. Here is an example output:

How We Can Help
Rescale’s support team can help you estimate the right number of cores to run your simulation on. It is not always feasible to run benchmarks for every single use case, and as the example above shows, being a few cores off doesn’t have a huge impact on the cost you save.

What we want you to be aware of is that, with well-scaling software and a flat rate or capped pricing model, it is often more cost efficient to run on more cores.

This article was written by Mulyanto Poort.