Computability of Fault-Tolerant Distributed Protocols

4 Computability of Fault-Tolerant Distributed Protocols

4.1 Resilience

The strand of work here is concerned with the robust or fault-tolerant implementation of distributed programs. More precisely, the interest here is in wait-free or t-resilient (0 Ł t Ł n-1 where n is the number of processors involved, wait-free being n-1-resilient) implementations on a distributed machine composed of several units communicating through a shared memory via atomic read/write or FIFO queues etc. or through synchronous/asynchronous message passing. This means that the processes executed on the n processors must be as loosely coupled as possible so that even if t processors fail to terminate, the others will carry on their computation and find a correct partial result (as observed on the non-faulty ones).

Consider as an example a machine with two processors, P₁ and P₂, communicating by atomic reads and writes on a shared memory. Each of these processes has a local binary variable x₁ (for P₁) and x₂ (for P₂) respectively. The binary consensus problem is to design an algorithm for having P₁ and P₂ agree on one value, which, moreover, has to be a value that one of them started with. Therefore if x₁=0 and x₂=1 at the beginning, we want them to be x₁=0 and x₂=0 or x₁=1 and x₂=1 at the end. Of course this is complicated by the fact that we ask the solution to be wait-free. If one of the two processes fails (the other process being unable to know whether this process is dead or just very slow) then the non-faulty one must terminate with one of the values P₁ or P₂ started with originally.

It is unfortunately proved in [Fisher et al., 1985] that this is not possible -- in fact it was proven for the equivalent case of an asynchronous message-passing system, in the general 1-resilient case. The article used an argument from graph theory (but already quite geometric in nature). It has in fact originated the geometric point of view on this field of research as we are going to explain ⁹ .

4.2 From graph to simplicial sets

The main argument can be understood in broad terms (``similarity chain'') as a connectedness result. If we represent the local states of each processor P_i (i=1,2), at some point of the execution of a program, by a vertex (P_i,x_i) and the global states of the system by an (un-oriented) edge linking the two local states which compose them, then it is easy to see that the semantics of atomic reads and writes imposes that the reachable global states starting from one initial global state have to form a connected graph. There is no way a program on such a machine can ``break'' connectivity. This implies that the binary consensus cannot be implemented here since we must be able to reach from the global state ((P₁,0),(P₂,1)) two disconnected global states ((P₁,0),(P₂,0)) and ((P₁,1),(P₂,1)) as shown on Figure 7.

Figure 7: The binary consensus decision map from initial global states to final global states

Figure 8: The synchronous protocol complex (for 3 processes) after one round

Further work generalizes this simple geometric argument to the needed higher-dimensio-nal cases. In [Biran et al., 1988] a characterization of a class of problems solvable in asynchronous message-passing systems in the presence of a single failure was given. No generalization to more failures has since been solved using the same kinds of graph techniques. It was then a rather shared belief that one would have to use more powerful techniques in that case. The conjecture [Chaudhuri, 1990] that the k-set agreement problem (a kind of weak consensus; all is required is that the non-faulty processes eventually agree on a subset of at most k values taken from the input values) cannot be solved in certain asynchronous systems was finally proven in three different papers independently, [Borowsky and Gafni, 1993], [Saks and Zaharoglou, 1993] and [Herlihy and Shavit, 1993].

The basic idea of [Herlihy and Shavit, 1993] is to generalize such pictures as the one of Figure 7 using simplicial sets instead of graphs (which are special cases of the former). Simplicial sets are made up of vertices, edges but also triangles, simplexes in general glued altogether. A simplex of dimension n represents the global state, at some point of the execution, of n processes. Vertices are still pairs composed of the name of the process, together with its local state. Again, given an initial global state, the semantics of the operations of the distributed machine we want to study is defined by the reachable global states, at any time. For instance, Figure 8 shows the simplicial set, called protocol complex, after one round of communication on a synchronous message-passing machine, which broadcasts the local states to all processes at each step. If there is a simplicial map from it to some suitable set of global states, respecting the specification of the problem, then there exists a corresponding wait-free protocol. This enables us also to compute how many rounds of communication might be necessary to solve a given problem. Notice that the simplexes of the protocol complex are really the schedules, and this should be related to the directed homotopy approach of Section 2. This has been hinted in [Goubault, 1996a], [Goubault, 1996b], [Goubault, 1997] for simple cases only.

4.3 Some results

It was also proved in [Herlihy and Shavit, 1993] that in a shared-memory model with single reader/single writer registers providing atomic read and write operations, k-set agreement requires at least ë f/k ű + 1 rounds where f is the number of processes that can fail.

The renaming task (processes must try to agree on a smaller set of names between each other than the original set of names), proposed in [Attiya et al., 1990] was also finally solved in [Herlihy and Shavit, 1993]. There is a wait-free protocol for the renaming task in certain asynchronous systems if the output name space is sufficiently large. It was already known that there is a wait-free solution for the renaming task for 2n+1 or more output names on a system of n+1 asynchronous processors and none for n+2 or fewer output names. M. Herlihy and N. Shavit refined this result and showed that there was no solution for strictly less that 2n+1 output names.

Not only impossibility results can be given but also constructive means for finding algorithms follow from this work (for instance [Herlihy and Shavit, 1994]).

More generally, datatypes do matter for computability results. It is known for quite a long time that consensus for two processes can be solved with shared memory with atomic reads and writes plus a FIFO queue, or plus an atomic test&set operation. If we define the consensus number of a data type as the maximal number of asynchronous processors (having atomic read and write) on which it can implement wait-free consensus, then,

atomic read/write registers have consensus number 1,
test&set and fetch&add registers, queues, and stacks have consensus number 2,
n-register assignment has consensus number 2n-2,
load-locked, store-conditional and compare-and-swap registers have consensus number Ą.

These facts motivated the introduction of the following general problem, dealing with the power of the architecture of distributed machines. We say that a datatype, or object, is an (m,j)-consensus object if it allows any set of m processes to solve j-set agreement tasks. In [Herlihy and Rajsbaum, 1994], Herlihy and Rajsbaum (see also [Borowsky and Gafni, 1993]) proved that is is impossible to implement (n+1,k)-consensus using (m,j)-consensus objects if n/k > m/j.

Further work on this can be found in [Jayanti, 1993], [Jayanti, 1997] and [Schenk, 1997]. In particular, the problem identified in [Jayanti, 1997] is that the hierarchy of data objects as briefly sketched above is not robust, in the sense that it is possible to implement some datatypes with consensus number k using several datatypes with consensus numbers strictly less than k.

This of course is due to some subtle interactions between the use of these data objects and we could hope that the more general study of the schedules with such datatypes could lead to some better classifications.

In ``Algebraic Spans'' in a forthcoming issue of Mathematical Structures in Computer Science, M. Herlihy and S. Rajsbaum introduce a new tool (related to the one described in [Herlihy and Rajsbaum, 1995] for proving impossibility results, based on a core theorem of algebraic topology, the acyclic carrier theorem, which unifies, generalizes and extends earlier results.