The other day I was trying to solve a classic problem in computer science and came up with an interesting solution. The problem is:

You have N nodes on a network. Exactly 1 node is the master, all others are slaves. You need a way for the nodes to elect a master. This technique must have no deadlocks or race conditions. Furthermore, there is no common shared resource that the nodes can use to synchronize their actions.

For some background, this solution has "probabilistic" versus "correct" solutions. A correct solution works all the time, every time. A probabilistic solution is likely to work but there may be some conditions under which it fails. Obviously, all else being equal, we want the correct solution. But all else is rarely equal. Often, a probabilistic solution has certain advantages such as faster performance. Whether a probabilistic solution is acceptable depends on a cost/benefit analysis: what are the benefits, what is the likilihood of failure, and are there any recovery actions in the case of failure?

Most of the time the cost/benefit analysis is not that simple. It may be that the probabilistic solution works 99% of the time and is twice as fast. But in the 1% of the time that it fails, recovery is 25 times slower than if the correct algorithm were used in the first place. In this case, it's clear that the probabilistic solution is better (0.99 * 0.5 + 0.01 * 25 = 0.745) as it runs on average in 25% less time. But not all analyses are this simple.

Now for the solutions.

First, I came up with a correct algorithm. Consider this the "brute force" approach. The drawback is that it requires a RDBMS that all the nodes can share. Here, we're using the RDBMS as a cross-machine mutex semaphore. Each node executes SQL commands inside a transaction. They all try to make a trivial update to a table with 1 row. The RDBMS lets only 1 of them in and blocks all others until the first commits or rolls back its transaction. Meanwhile, the first one reads the name of the master from the database. If the name is NULL, it writes its own name & commits the transaction; it is the master. If the name is not NULL, it reads the name and tries to connect to the server of this name. If the connection succeeds, it rolls back its transaction; it knows who the master is. If the connection fails, it writes its own name & commits the transaction; it is the master.

The drawback to this is that the RDBMS presents a single point of failure for the entire network. Instead of building a fault tolerant highly available database (which usually costs 6 figures), it would be nice to have an algorithm that doesn't require any shared common resource.

Here is the second solution:

Each node has a preassigned priority and periodically holds "elections" with all the other nodes. We ensure that there are no priority collisions (it's either a preassigned integer, or if we can't guarantee integer uniqueness across the network it can be a GUID). A node holds elections by invoking "vote" on all the other nodes in the network. When calling "vote", the node passes in its own priority and gets back a boolean. Each node responds to "vote" as follows: If it is already the master, it returns FALSE. If its own priority is higher than the caller, it returns FALSE. Otherwise, it returns TRUE.

A node collects all the responses. If there are no responses, then this node becomes the master. If any responses are FALSE, then this node is not the master. If all responses are TRUE, then this node must become the master.

Observe that elections are idempotent. Thus if the network is running fine they don't change the state. Thus we can hold elections as often as we like. The election itself becomes the system health monitor.

That's it! All that is required is that each node hold an election with all other nodes on a periodic basis.

Observe that this is a probabilistic rather than correct solution. Here is a case where the network might (temporarily) have no master: Suppose the master goes down but no other node has noticed it yet. Node A is the first node to hold an election after the master went down. Node B is another node on the network whose priority is higher than A. But node B does not yet know the master is down. Node A holds its election, but at least one node (Node B) returns FALSE because B's priority is higher than A's. This means Node A cannot be the master. But Node B isn't the master yet either, because B hasn't held its own election yet. Thus the network will have no master until B decides to hold an election.

This is why each node must hold elections periodically. Interestingly, I came up with a way to eliminate this possibility. Hint: it involves forced elections and making "vote" return a 3-state flag instead of a boolean. It's left as an exercise for the reader ;)