In many undergraduate Computing courses, students learn about the FLP Impossibility (Fischer, Lynch, and Paterson) result. As data and computation scale, distributed systems (scaling services across multiple computers) have become a de facto standard pattern for building scalable infrastructure.

The FLP theorem answers the following fundamental question on consensus

In an asynchronous distributed system, is there a proven deterministic consensus algorithm that can satisfy agreement, validity, termination, and fault tolerance?

It is impossible to implement consensus algorithms that can tolerate even one node fault, as stated by the “Impossibility” term in Fisher, Lynch, and Paterson’s theorem.

Due to the common occurrence of failures of a node in large distributed systems such as Raft and Zookeeper, the FLP result can be counterintuitive and even frustrating.

During my study of Raft, I encountered this disparity between the theoretical FLP result and real-world consensus algorithms. By writing this article, I hope to shed some light on the subject.

  • A practical application of the FLP Theorem in real-world systems.
  • The FLP Impossibility result is overcome by industry-recognized consensus algorithms, such as Raft.

The problems associated with Distributed Consensus

So why do we need to worry about distributed consensus? We often need the machines (nodes) to maintain a form of consensus on the states. Distributions of consensus can take many forms, so distributed systems are bound to experience one of these variants of consensus. Here are some problems that have been shown to be synonymous with consensus:

  • In Raft, the leader is elected by voting for a node to be the primary node.
  • The primary node broadcasts operations to the secondary nodes in a primary/secondary replication process (e.g., primary nodes will broadcast operations to secondary nodes)

How FLP Impossibility Theorem works

Identifying the claims of the Impossibility Theorem and its context is necessitated before we can reframe the FLP Theorem as a practical result.

Asynchronous Network Model

It is necessary to have an asynchronous network model for the FLP impossibility theorem. What are the characteristics of the asynchronous model, and how do they differ from the synchronous model?

 

Sender

Sender

Unbounded Message Delay

$

Node Failure

$

Receiver

Receiver

It is impossible for the receiver to distinguish between sender cases 1 and 2

We cannot detect node failures accurately under an asynchronous network model.

As a result, if a node does not receive messages, it cannot be certain whether the sender node has crashed or that the message got delayed. In an asynchronous model, the message propagation delay between nodes is bounded but finite. This means the node cannot tell exactly whether the sender node has crashed or if the message just got delayed.  The asynchronous network model, however, contains message delays bounded by timeout mechanisms, so that node crashes can be reliably detected.

Furthermore, FLP theorem asserts i) that message channels do not drop messages and ii) that malicious parties are not involved.

A triangle of impossibility

According to the asynchronous network model, it is impossible to have all three properties simultaneously for example Agreement, Fault Tolerance & Termination.

According to the FLP theorem, we cannot attain all three properties of Termination, Agreement, and Fault Tolerance in an asynchronous network. We outline these three properties as follows

  • Nodes eventually terminate (liveness), if they haven’t failed.
  • If all nodes have the same initial input, then that value should be the only one possible for the decision. Agreement (Safety): Every node (even those that failed after deciding) should decide on the same value.
  • Fault Tolerance: If a node fails, then the protocol must also work.

Note that the FLP Theorem does NOT claim that:

  • It is generally impossible to achieve distributed consensus.
  • It is not possible to have all three properties. In fact, Raft has demonstrated that we can have Fault Tolerance and Agreement, but not Termination.

Instead, FLP Theorem claims that:

  • According to the asynchronous network model, it is not possible to have all three properties at once.
  • A synchronized network model (where message delays are limited) has an algorithm that can satisfy all three properties.

To come up with a consensus algorithm that meets business needs, we can make sacrifices (one of the three properties) or relax constraints (e.g. on the asynchronous model).

 

Taking a closer look at constraints in real-world systems

After understanding what FLP Theorem is about, we can finally explore how we can apply this knowledge to real-world systems. As we saw in the previous section, the restrictions of the asynchronous model are often relaxed. A comparison is provided below between constraints imposed on asynchronous models and those imposed on real-world distributed systems.

Asynchronous Model Real-world System
  • Reliable message channels, unbound delays
  • No clocks.
  • Crash failures can not be detected reliably.

 

  • Resend message until receive ACK, message timeout
  • Wall clocks synchronized between nodes within a time limit
  • A predetermined timeout is usually used to detect failures

It is interesting to note that in the real world, failures can usually be detected via timeout mechanisms, rather than an asynchronous model. Using this partially synchronous network model, a pre-determined timeout provides a user-definable limit on message delays.

Moreover, consensus algorithms choose to sacrifice one of the three impossibility triangle properties often. In a distributed system, node failures are almost inevitable. Thus, any consensus protocol must take fault tolerance into consideration, as it is impossible for any protocol to have both fault tolerance AND termination property.

In Raft, for example, termination is sacrificed in order to achieve fault tolerance and agreement properties (in theory, split votes for Raft can be repeated infinitely many times without ever terminating). In fact, many consensus algorithms make use of random algorithms in order to reduce the chances that the protocol will not terminate or make progress.

The FLP Theorem can be relaxed in practice and we can sacrifice either agreement or termination in order to achieve distributed consensus that meets our business goals.