Paxos algorithm
Encyclopedia
Paxos is a family of protocols for solving consensus
in a network of unreliable processors.
Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.
Consensus protocols are the basis for the state machine approach
to distributed computing, as suggested by Leslie Lamport
and surveyed by Fred Schneider. The state machine approach is a technique for converting an algorithm into a fault-tolerant, distributed implementation. Ad-hoc techniques may leave important cases of failures unresolved. The principled approach proposed by Lamport et al. ensures all cases are handled safely.
The Paxos protocol was described and named in 1990, after a fictional legislative consensus system used in the Ionian Paxos
islands of Greece. It was widely circulated as a technical report in that period, but was not published until 1998. The topic predates the protocol. For example, state machine replication is a common style of programming within the Virtual synchrony
execution model, which was introduced in 1985. The virtually synchronous membership agreement protocol published by Birman and Joseph in connection with the Isis system, in 1987, was very similar to the leader-based Paxos protocol shown below. In 1988, Lynch
, Dwork
and Stockmeyer
had demonstrated the solvability of consensus in a broad family of "partially synchronous" systems. Paxos has strong similarities to a protocol used for agreement in viewstamped replication, first published by Oki and Liskov
in 1988, in the context of distributed transactions. Notwithstanding this prior work, Paxos offered a particularly elegant formalism, and included one of the earliest proofs of safety for a fault-tolerant distributed consensus protocol.
The Paxos family of protocols includes a spectrum of trade-offs between the number of processors, number of message delays before learning the agreed value, the activity level of individual participants, number of messages sent, and types of failures. Although no fault-tolerant consensus protocol can guarantee progress (a result proved in a paper by Fischer, Lynch and Paterson), Paxos guarantees safety (freedom from inconsistency), and the conditions that could prevent it from making progress are difficult to provoke.
Client: The Client issues a request to the distributed system, and waits for a response. For instance, a write request on a file in a distributed file server.
Acceptor: The Acceptors act as the fault-tolerant "memory" of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum.
Proposer: A Proposer advocates a client request, attempting to convince the Acceptors to agree on it, and acting as a coordinator to move the protocol forward when conflicts occur.
Learner: Learners act as the replication factor for the protocol. Once a Client request has been agreed on by the Acceptors, the Learner may take action (i.e.: execute the request and send a response to the client). To improve availability of processing, additional Learners can be added.
Leader: Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved on that case.
Quorums are defined as subsets of the set of Acceptors such that any two subsets (that is, any two Quorums) share at least one member. Typically, a Quorum is any majority of participating Acceptors. For example, given the set of Acceptors {A,B,C,D}, a majority Quorum would be any three Acceptors: {A,B,C}, {A,C,D}, {A,B,D}, {B,C,D}. More generally, arbitrary positive weights can be assigned to Acceptors and a Quorum defined as any subset of Acceptors with the summary weight greater than half of the total weight of all Acceptors.
Non-triviality:Only proposed values can be learned.
Consistency:At most one value can be learned (i.e., two different learners cannot learn different values).
Liveness(C;L):If value C has been proposed, then eventually learner L will learn some value (if sufficient processors remain non-faulty).
By merging roles, the protocol "collapses" into an efficient client-master-replica style deployment, typical of the database community. The benefit of the Paxos protocols (including implementations with merged roles) is the guarantee of its safety properties.
A typical implementation's message flow is covered in the section Typical Multi-Paxos deployment.
The next failure case is when a Proposer fails after proposing a value, but before agreement is reached. Ignoring Leader election, an example message flow is as follows:
The most complex case is when multiple Proposers believe themselves to be Leaders. For instance the current leader may fail and later recover, but the other Proposers have already re-elected a new leader. The recovered leader has not learned this yet and attempts to begin a round in conflict with the current leader.
If the leader is relatively stable, phase 1 becomes unnecessary. Thus, it is possible to skip phase 1 for future instances of the protocol with the same leader.
To achieve this, the instance number I is included along with each value. Multi-Paxos reduces the failure-free message delay (proposal to learning) from 4 delays to 2 delays.
This reduction in processor requirements comes at the expense of liveness; if too many main processors fail in a short time, the system must halt until the auxiliary processors can reconfigure the system. During stable periods, the auxiliary processors take no part in the protocol.
Intuitively, if the leader has no value to propose, then a client could send an Accept! message to the Acceptors directly. The Acceptors would respond as in Basic Paxos, sending Accepted messages to the leader and every Learner achieving two message delays from Client to Learner.
If the leader detects a collision, it resolves the collision by sending Accept! messages for a new round which are Accepted as usual. This coordinated recovery technique requires four message delays from Client to Learner.
The final optimization occurs when the leader specifies a recovery technique in advance, allowing the Acceptors to perform the collision recovery themselves. Thus, uncoordinated collision recovery can occur in three message delays (and only two message delays if all Learners are also Acceptors).
In such cases, the conflicting operations can both be accepted, avoiding the delays required for resolving conflicts and re-proposing the rejected operation.
This concept is further generalized into ever-growing sets of commutative operations, some of which are known to be stable (and thus may be executed). The protocol tracks these sets of operations, ensuring that all proposed commutative operations of one set are stabilized before allowing any non-commuting operation to become stable.
Byzantine Paxos adds an extra message (Verify) which acts to distribute knowledge and verify the actions of the other processors:
Fast Byzantine Paxos removes this extra delay, since the client sends commands directly to the Acceptors.
Note the Accepted message in Fast Byzantine Paxos is sent to all Acceptors and all Learners, while Fast Paxos sends Accepted messages only to Learners):
The failure scenario is the same for both protocols; Each Learner waits to receive F+1 identical messages from different Acceptors. If this does not occur, the Acceptors themselves will also be aware of it (since they exchanged each other's messages in the broadcast round), and correct Acceptors will re-broadcast the agreed value:
Consensus (computer science)
Consensus is a problem in distributed computing that encapsulates the task of group agreement in the presence of faults.In particular, any process in the group may fail at any time. Consensus is fundamental to core techniques in fault tolerance, such as state machine replication.- Problem...
in a network of unreliable processors.
Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.
Consensus protocols are the basis for the state machine approach
State machine replication
-State Machine Definition:For the subsequent discussion a State Machine will be defined as the following tuple of values :* A set of States* A set of Inputs* A set of Outputs...
to distributed computing, as suggested by Leslie Lamport
Leslie Lamport
Leslie Lamport is an American computer scientist. A graduate of the Bronx High School of Science, he received a B.S. in mathematics from the Massachusetts Institute of Technology in 1960, and M.A. and Ph.D. degrees in mathematics from Brandeis University, respectively in 1963 and 1972...
and surveyed by Fred Schneider. The state machine approach is a technique for converting an algorithm into a fault-tolerant, distributed implementation. Ad-hoc techniques may leave important cases of failures unresolved. The principled approach proposed by Lamport et al. ensures all cases are handled safely.
The Paxos protocol was described and named in 1990, after a fictional legislative consensus system used in the Ionian Paxos
Paxi
Paxi or Paxoi is the name given to the smallest group of the Ionian Islands . In Greek it is a plural form and it refers to a complex of islands, the largest of which are Paxos and Antipaxos...
islands of Greece. It was widely circulated as a technical report in that period, but was not published until 1998. The topic predates the protocol. For example, state machine replication is a common style of programming within the Virtual synchrony
Virtual synchrony
Virtual synchrony is an interprocess messaging passing technology. Virtual synchrony systems allow programs running in a network to organize themselves into process groups, and to send messages to groups...
execution model, which was introduced in 1985. The virtually synchronous membership agreement protocol published by Birman and Joseph in connection with the Isis system, in 1987, was very similar to the leader-based Paxos protocol shown below. In 1988, Lynch
Nancy Lynch
Nancy Ann Lynch is a professor at the Massachusetts Institute of Technology. She is the NEC Professor of Software Science and Engineering in the EECS department and heads the Theory of Distributed Systems research group at MIT's Computer Science and Artificial Intelligence Laboratory.She is the...
, Dwork
Cynthia Dwork
Cynthia Dwork is a distinguished scientist at Microsoft Research who works on distributed computing, cryptography, and e-mail spam prevention....
and Stockmeyer
Larry Stockmeyer
Larry Joseph Stockmeyer was a computer scientist. He was one of the pioneers in the field of computational complexity theory, and he also worked in the field of distributed computing...
had demonstrated the solvability of consensus in a broad family of "partially synchronous" systems. Paxos has strong similarities to a protocol used for agreement in viewstamped replication, first published by Oki and Liskov
Barbara Liskov
Barbara Liskov is a computer scientist. She is currently the Ford Professor of Engineering in the MIT School of Engineering's Electrical Engineering and Computer Science department and an Institute Professor at the Massachusetts Institute of Technology.-Life and career:She earned her BA in...
in 1988, in the context of distributed transactions. Notwithstanding this prior work, Paxos offered a particularly elegant formalism, and included one of the earliest proofs of safety for a fault-tolerant distributed consensus protocol.
The Paxos family of protocols includes a spectrum of trade-offs between the number of processors, number of message delays before learning the agreed value, the activity level of individual participants, number of messages sent, and types of failures. Although no fault-tolerant consensus protocol can guarantee progress (a result proved in a paper by Fischer, Lynch and Paterson), Paxos guarantees safety (freedom from inconsistency), and the conditions that could prevent it from making progress are difficult to provoke.
Preliminaries
In order to simplify the presentation of Paxos, the following assumptions and definitions are made explicit. Techniques to broaden the applicability are known in the literature, and are not covered in this article; please see references for further reading.Processors
- Processors operate at arbitrary speed.
- Processors may experience failures.
- Processors with stable storage may re-join the protocol after failures (following a crash-recovery failure model).
- Processors do not collude, lie, or otherwise attempt to divert the protocol. (That is, Byzantine failures don't occur. See Byzantine Paxos for a solution which tolerates failures that arise from arbitrary/malicious behavior of the processes.)
Network
- Processors can send messages to any other processor.
- Messages are sent asynchronously and may take arbitrarily long to deliver.
- Messages may be lost, reordered, or duplicated.
- Messages are delivered without corruption. (That is, Byzantine failures don't occur. See Byzantine Paxos for a solution which tolerates corrupted messages that arise from arbitrary/malicious behavior of the messaging chanels.)
Number of processors
In general, a consensus algorithm can make progress using 2F+1 processors despite the simultaneous failure of any F processors. However, using reconfiguration, a protocol may be employed which survives more total failures if they do not occur too rapidly:- "For example, it takes seven servers to tolerate three [ed: simultaneous] failures. In many systems, the best way to achieve the desired degree of fault tolerance is to reconfigure the system to replace failed servers by spares. With reconfiguration, a system that uses three active servers and two spares can tolerate a total of three failures, if a failed server can be replaced by a spare before another failure occurs. Reconfiguration therefore allows fewer processors to tolerate the same total number of failures, though not the same number of simultaneous failures. (In most systems, simultaneous failures are much less likely than successive ones.)"
Roles
Paxos describes the actions of the processes by their roles in the protocol: client, acceptor, proposer, learner, and leader. In typical implementations, a single processor may play one or more roles at the same time. This does not affect the correctness of the protocol—it is usual to coalesce roles to improve the latency and/or number of messages in the protocol.Client: The Client issues a request to the distributed system, and waits for a response. For instance, a write request on a file in a distributed file server.
Acceptor: The Acceptors act as the fault-tolerant "memory" of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum.
Proposer: A Proposer advocates a client request, attempting to convince the Acceptors to agree on it, and acting as a coordinator to move the protocol forward when conflicts occur.
Learner: Learners act as the replication factor for the protocol. Once a Client request has been agreed on by the Acceptors, the Learner may take action (i.e.: execute the request and send a response to the client). To improve availability of processing, additional Learners can be added.
Leader: Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved on that case.
Quorums
Quorums express the safety properties of Paxos by ensuring at least some surviving processor retains knowledge of the results.Quorums are defined as subsets of the set of Acceptors such that any two subsets (that is, any two Quorums) share at least one member. Typically, a Quorum is any majority of participating Acceptors. For example, given the set of Acceptors {A,B,C,D}, a majority Quorum would be any three Acceptors: {A,B,C}, {A,C,D}, {A,B,D}, {B,C,D}. More generally, arbitrary positive weights can be assigned to Acceptors and a Quorum defined as any subset of Acceptors with the summary weight greater than half of the total weight of all Acceptors.
Proposal Number & Agreed Value
Each attempt to define an agreed value v is performed with proposals which may or may not be accepted by Acceptors. Each proposal is uniquely numbered for a given Proposer. The value corresponding to a numbered proposal can be computed as part of running the Paxos protocol, but does not have to.Safety and liveness properties
In order to guarantee safety, Paxos defines three safety properties and ensures they are always held, regardless of the pattern of failures:Non-triviality:Only proposed values can be learned.
Consistency:At most one value can be learned (i.e., two different learners cannot learn different values).
Liveness(C;L):If value C has been proposed, then eventually learner L will learn some value (if sufficient processors remain non-faulty).
Typical deployment
In most deployments of Paxos, each participating process acts in three roles; Proposer, Acceptor and Learner. This reduces the message complexity significantly, without sacrificing correctness:By merging roles, the protocol "collapses" into an efficient client-master-replica style deployment, typical of the database community. The benefit of the Paxos protocols (including implementations with merged roles) is the guarantee of its safety properties.
A typical implementation's message flow is covered in the section Typical Multi-Paxos deployment.
Basic Paxos
This protocol is the most basic of the Paxos family. Each instance of the Basic Paxos protocol decides on a single output value. The protocol proceeds over several rounds. A successful round has two phases. A Proposer should not initiate Paxos if it cannot communicate with at least a Quorum of Acceptors:Phase 1a: Prepare
- A Proposer (the leader) creates a proposal identified with a number N. This number must be greater than any previous proposal number used by this Proposer. Then, it sends a Prepare message containing this proposal to a Quorum of Acceptors.
Phase 1b: Promise
- If the proposal's number N is higher than any previous proposal number received from any Proposer by the Acceptor, then the Acceptor must return a promise to ignore all future proposals having a number lesser than N. If the Acceptor accepted a proposal at some point in the past, it must include the previous proposal number and previous value in its response to the Proposer.
- Otherwise, the Acceptor can ignore the received proposal. It does not have to answer in this case for Paxos to work. However, for the sake of optimization, sending a denial (NackNAK (protocol message)The NAK protocol message is sent in many communications protocols to negatively acknowledge or reject a previously received message, or to indicate some kind of error....
) response would tell the Proposer that it can stop its attempt to create consensus with proposal N.
Phase 2a: Accept Request
- If a Proposer receives enough promises from a Quorum of Acceptors, it needs to set a value to its proposal. If any Acceptors had previously accepted any proposals, then they'll have sent their values to the Proposer, who now must set the value of its proposal to the value associated with the highest proposal number reported by the Acceptors. If none of the Acceptors had accepted a proposal up to this point, then the Proposer may choose any value for its proposal.
- The Proposer sends an Accept Request message to a Quorum of Acceptors with the chosen value for its proposal.
Phase 2b: Accepted
- If an Acceptor receives an Accept Request message for a proposal N, it must accept it if and only ifIf and only ifIn logic and related fields such as mathematics and philosophy, if and only if is a biconditional logical connective between statements....
it has not already promised to only consider proposals having an identifier greater than N. In this case, it should register the corresponding value v and send an Accepted message to the Proposer and every Learner. Else, it can ignore the Accept Request.
- Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number.
- Notice that when Acceptors accept a request, they also acknowledge the leadership of the Proposer. Hence, Paxos can be used to select a leader in a cluster of nodes.
- Here is a graphic representation of the Basic Paxos protocol. Note that the values returned in the Promise message are null the first time a proposal is made, since no Acceptor has accepted a value before in this round.
Message flow: Basic Paxos
(first round is successful)
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(1)
| |<---------X--X--X | | Promise(1,{null,null,null})
| X--------->|->|->| | | Accept!(1,V)
| |<---------X--X--X------>|->| Accepted(1,V)
|<---------------------------------X--X Response
| | | | | | |
Error cases in basic Paxos
The simplest error cases are the failure of a redundant Learner, or failure of an Acceptor when a Quorum of Acceptors remains live. In these cases, the protocol requires no recovery. No additional rounds or messages are required, as shown below:Message flow: Basic Paxos, failure of Acceptor
(Quorum size = 2 Acceptors)
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(1)
| | | | ! | | !! FAIL !!
| |<---------X--X | | Promise(1,{null, null})
| X--------->|->| | | Accept!(1,V)
| |<---------X--X--------->|->| Accepted(1,V)
|<---------------------------------X--X Response
| | | | | |
Message flow: Basic Paxos, failure of redundant Learner
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(1)
| |<---------X--X--X | | Promise(1,{null,null,null})
| X--------->|->|->| | | Accept!(1,V)
| |<---------X--X--X------>|->| Accepted(1,V)
| | | | | | ! !! FAIL !!
|<---------------------------------X Response
| | | | | |
The next failure case is when a Proposer fails after proposing a value, but before agreement is reached. Ignoring Leader election, an example message flow is as follows:
Message flow: Basic Paxos, failure of Proposer
(re-election not shown, one instance, two rounds)
Client Leader Acceptor Learner
| | | | | | |
X----->| | | | | | Request
| X------------>|->|->| | | Prepare(1)
| |<------------X--X--X | | Promise(1,{null, null, null})
| | | | | | |
| | | | | | | !! Leader fails during broadcast !!
| X------------>| | | | | Accept!(1,Va)
| ! | | | | |
| | | | | | | !! NEW LEADER !!
| X--------->|->|->| | | Prepare(2)
| |<---------X--X--X | | Promise(2,{(1, V), null, null})
| X--------->|->|->| | | Accept!(2,V)
| |<---------X--X--X------>|->| Accepted(2,V)
|<---------------------------------X--X Response
| | | | | | |
The most complex case is when multiple Proposers believe themselves to be Leaders. For instance the current leader may fail and later recover, but the other Proposers have already re-elected a new leader. The recovered leader has not learned this yet and attempts to begin a round in conflict with the current leader.
Message flow: Basic Paxos, dueling Proposers
(one instance, four unsuccessful rounds)
Client Proposer Acceptor Learner
| | | | | | |
X----->| | | | | | Request
| X------------>|->|->| | | Prepare(1)
| |<------------X--X--X | | Promise(1,{null,null,null})
| ! | | | | | !! LEADER FAILS
| | | | | | | !! NEW LEADER (knows last number was 1)
| X--------->|->|->| | | Prepare(2)
| |<---------X--X--X | | Promise(2,{null,null,null})
| | | | | | | | !! OLD LEADER recovers
| | | | | | | | !! OLD LEADER tries 2, denied
| X------------>|->|->| | | Prepare(2)
| |<------------X--X--X | | Nack(2)
| | | | | | | | !! OLD LEADER tries 3
| X------------>|->|->| | | Prepare(3)
| |<------------X--X--X | | Promise(3,{null,null,null})
| | | | | | | | !! NEW LEADER proposes, denied
| | X--------->|->|->| | | Accept!(2,Va)
| | |<---------X--X--X | | Nack(3)
| | | | | | | | !! NEW LEADER tries 4
| | X--------->|->|->| | | Prepare(4)
| | |<---------X--X--X | | Promise(4,{null,null,null})
| | | | | | | | !! OLD LEADER proposes, denied
| X------------>|->|->| | | Accept!(3,Vb)
| |<------------X--X--X | | Nack(4)
| | | | | | | | ... and so on ...
Multi-Paxos
A typical deployment of Paxos requires a continuous stream of agreed values acting as commands to a distributed state machine. If each command is the result of a single instance of the Basic Paxos protocol, a significant amount of overhead would result.If the leader is relatively stable, phase 1 becomes unnecessary. Thus, it is possible to skip phase 1 for future instances of the protocol with the same leader.
To achieve this, the instance number I is included along with each value. Multi-Paxos reduces the failure-free message delay (proposal to learning) from 4 delays to 2 delays.
Message flow: Multi-Paxos, start
(first instance with new leader)
Client Proposer Acceptor Learner
| | | | | | | --- First Request ---
X-------->| | | | | | Request
| X--------->|->|->| | | Prepare(N)
| |<---------X--X--X | | Promise(N,I,{Va,Vb,Vc})
| X--------->|->|->| | | Accept!(N,I,Vn)
| |<---------X--X--X------>|->| Accepted(N,I,Vn)
|<---------------------------------X--X Response
| | | | | | |
Message flow: Multi-Paxos, steady-state
(subsequent instances with same leader)
Client Proposer Acceptor Learner
| | | | | | | --- Following Requests ---
X-------->| | | | | | Request
| X--------->|->|->| | | Accept!(N+1,I,W)
| |<---------X--X--X------>|->| Accepted(N+1,I,W)
|<---------------------------------X--X Response
| | | | | | |
Typical Multi-Paxos deployment
The most common deployment of the Paxos family is Multi-Paxos, specialized for participating processors to each be Proposers, Acceptors and Learners. The message flow may be optimized as depicted here:Message flow: Collapsed Multi-Paxos, start
(first instance with new leader)
Client Servers
| | | | --- First Request ---
X-------->| | | Request
| X->|->| Prepare(N)
| |<-X--X Promise(N,I,{Va,Vb,Vc})
| X->|->| Accept!(N,I,Vn)
| |<-X--X Accepted(N,I)
|<--------X | | Response
| | | |
Message flow: Collapsed Multi-Paxos, steady state
(subsequent instances with same leader)
Client Servers
X-------->| | | Request
| X->|->| Accept!(N+1,I,W)
| |<-X--X Accepted(N+1,I)
|<--------X | | Response
| | | |
Optimizations
A number of optimizations reduce message complexity and size. These optimizations are summarized below:-
- "We can save messages at the cost of an extra message delay by having a single distinguished learner that informs the other learners when it finds out that a value has been chosen. Acceptors then send Accepted messages only to the distinguished learner. In most applications, the roles of leader and distinguished learner are performed by the same processor.
-
- "A leader can send its Prepare and Accept! messages just to a quorum of acceptors. As long as all acceptors in that quorum are working and can communicate with the leader and the learners, there is no need for acceptors not in the quorum to do anything.
-
- "Acceptors do not care what value is chosen. They simply respond to Prepare and Accept! messages to ensure that, despite failures, only a single value can be chosen. However, if an acceptor does learn what value has been chosen, it can store the value in stable storage and erase any other information it has saved there. If the acceptor later receives a Prepare or Accept! message, instead of performing its Phase1b or Phase2b action, it can simply inform the leader of the chosen value.
-
- "Instead of sending the value v, the leader can send a hash of v to some acceptors in its Accept! messages. A learner will learn that v is chosen if it receives Accepted messages for either v or its hash from a quorum of acceptors, and at least one of those messages contains v rather than its hash. However, a leader could receive Promise messages that tell it the hash of a value v that it must use in its Phase2a action without telling it the actual value of v. If that happens, the leader cannot execute its Phase2a action until it communicates with some process that knows v."
-
- "A proposer can send its proposal only to the leader rather than to all coordinators. However, this requires that the result of the leader-selection algorithm be broadcast to the proposers, which might be expensive. So, it might be better to let the proposer send its proposal to all coordinators. (In that case, only the coordinators themselves need to know who the leader is.)
-
- "Instead of each acceptor sending Accepted messages to each learner, acceptors can send their Accepted messages to the leader and the leader can inform the learners when a value has been chosen. However, this adds an extra message delay.
-
- "Finally, observe that phase 1 is unnecessary for round 1 .. The leader of round 1 can begin the round by sending an Accept! message with any proposed value."
Cheap Paxos
Cheap Paxos extends Basic Paxos to tolerate F failures with F+1 main processors and F auxiliary processors by dynamically reconfiguring after each failure.This reduction in processor requirements comes at the expense of liveness; if too many main processors fail in a short time, the system must halt until the auxiliary processors can reconfigure the system. During stable periods, the auxiliary processors take no part in the protocol.
-
- "With only two processors p and q, one processor cannot distinguish failure of the other processor from failure of the communication medium. A third processor is needed. However, that third processor does not have to participate in choosing the sequence of commands. It must take action only in case p or q fails, after which it does nothing while either p or q continues to operate the system by itself. The third processor can therefore be a small/slow/cheap one, or a processor primarily devoted to other tasks."
Message flow: Cheap Multi-Paxos
3 main Acceptors, 1 Auxiliary Acceptor, Quorum size = 3, showing failure of one main processor and subsequent reconfiguration
{ Acceptors }
Proposer Main Aux Learner
| | | | | | -- Phase 2 --
X----------->|->|->| | | Accept!(N,I,V)
| | | ! | | --- FAIL! ---
|<-----------X--X--------------->| Accepted(N,I,V)
| | | | | -- Failure detected (only 2 accepted) --
X----------->|->|------->| | Accept!(N,I,V) (re-transmit, include Aux)
|<-----------X--X--------X------>| Accepted(N,I,V)
| | | | | -- Reconfigure : Quorum = 2 --
X----------->|->| | | Accept!(N,I+1,W) (Aux not participating)
|<-----------X--X--------------->| Accepted(N,I+1,W)
| | | | |
Fast Paxos
Fast Paxos generalizes Basic Paxos to reduce end-to-end message delays. In Basic Paxos, the message delay from client request to learning is 3 message delays. Fast Paxos allows 2 message delays, but requires the Client to send its request to multiple destinations.Intuitively, if the leader has no value to propose, then a client could send an Accept! message to the Acceptors directly. The Acceptors would respond as in Basic Paxos, sending Accepted messages to the leader and every Learner achieving two message delays from Client to Learner.
If the leader detects a collision, it resolves the collision by sending Accept! messages for a new round which are Accepted as usual. This coordinated recovery technique requires four message delays from Client to Learner.
The final optimization occurs when the leader specifies a recovery technique in advance, allowing the Acceptors to perform the collision recovery themselves. Thus, uncoordinated collision recovery can occur in three message delays (and only two message delays if all Learners are also Acceptors).
Message flow: Fast Paxos, non-conflicting
Client Leader Acceptor Learner
| | | | | | | |
| X--------->|->|->|->| | | Any(N,I,Recovery)
| | | | | | | |
X------------------->|->|->|->| | | Accept!(N,I,W)
| |<---------X--X--X--X------>|->| Accepted(N,I,W)
|<------------------------------------X--X Response(W)
| | | | | | | |
Message flow: Fast Paxos, conflicting proposals
Conflicting proposals with uncoordinated recovery. Note: the protocol does not specify how to handle the dropped client request.
Client Leader Acceptor Learner
| | | | | | | | |
| | X------->|->|->|->| | | Any(N,I,Recovery)
| | | | | | | | |
| | | | | | | | | !! Concurrent conflicting proposals
| | | | | | | | | !! received in different order
| | | | | | | | | !! by the Acceptors
| X--------------?|-?|-?|-?| | | Accept!(N,I,V)
X-----------------?|-?|-?|-?| | | Accept!(N,I,W)
| | | | | | | | |
| | | | | | | | | !! Acceptors disagree on value
| | |<-------X--X->|->|----->|->| Accepted(N,I,V)
| | |<-------|<-|<-X--X----->|->| Accepted(N,I,W)
| | | | | | | | |
| | | | | | | | | !! Detect collision & recover
| | |<-------X--X--X--X----->|->| Accepted(N+1,I,W)
|<---------------------------------X--X Response(W)
| | | | | | | | |
Message flow: Fast Paxos, collapsed roles
(merged Acceptor/Learner roles)
Client Servers
| | | | | |
| | X->|->|->| Any(N,I,Recovery)
| | | | | |
| | | | | | !! Concurrent conflicting proposals
| | | | | | !! received in different order
| | | | | | !! by the Servers
| X--------?|-?|-?|-?| Accept!(N,I,V)
X-----------?|-?|-?|-?| Accept!(N,I,W)
| | | | | |
| | | | | | !! Servers disagree on value
| | X--X->|->| Accepted(N,I,V)
| | |<-|<-X--X Accepted(N,I,W)
| | | | | |
| | | | | | !! Detect collision & recover
|<-----------X--X--X--X Response(W)
| | | | | |
Generalized Paxos
Generalized consensus explores the relationship between the operations of a distributed state machine and the consensus protocol used to maintain consistency of that state machine. The main discovery involves optimizations of the consensus protocol when conflicting proposals could be applied to the state machine in any order. i.e.: The operations proposed by the conflicting proposals are commutative operations of the state machine.In such cases, the conflicting operations can both be accepted, avoiding the delays required for resolving conflicts and re-proposing the rejected operation.
This concept is further generalized into ever-growing sets of commutative operations, some of which are known to be stable (and thus may be executed). The protocol tracks these sets of operations, ensuring that all proposed commutative operations of one set are stabilized before allowing any non-commuting operation to become stable.
Example
- In order to illustrate Generalized Paxos, this example shows a message flow between two concurrently executing clients and a distributed state machine performing the operations of a read/write register with 2 independent register addresses (A and B).
- Commutativity Table; marked cells denote interference:
Read(A) Write(A) Read(B) Write(B)
Read(A) | | X | | |
Write(A)| X | X | | |
Read(B) | | | | X |
Write(B)| | | X | X |
- Proposed Series of operations (global order):
1:Read(A)
2:Read(B)
3:Write(B)
4:Read(B)
5:Read(A)
6:Write(A)
7:Read(A)
- One possible permutation allowed by commutes:
{ 1:Read(A), 2:Read(B), 5:Read(A) }
{ 3:Write(B), 6:Write(A) }
{ 4:Read(B), 7:Read(A) }
- Observations:
-
5:Read(A)
may commute in front of3:Write(B)/4:Read(B)
pair. -
4:Read(B)
may commute behind the3:Write(B)/6:Write(A)
pair. - In practice, a commute occurs only when operations are proposed concurrently.
-
Message flow: Generalized Paxos (example)
Responses not shown. Note: message abbreviations differ from previous message flows due to specifics of the protocol, see for a full discussion.
{ Acceptors }
Client Leader Acceptor Learner
| | | | | | | | !! New Leader Begins Round
| | X----->|->|->| | | Prepare(N)
| | |<-----X--X--X | | Promise(N,null)
| | X----->|->|->| | | Phase2Start(N,null)
| | | | | | | |
| | | | | | | | !! Concurrent commuting proposals
| X--------?|-----?|-?|-?| | | Propose(ReadA)
X-----------?|-----?|-?|-?| | | Propose(ReadB)
| | X------X-------------->|->| Accepted(N,)
| | |<--------X--X-------->|->| Accepted(N,)
| | | | | | | |
| | | | | | | | !! No Conflict, both accepted
| | | | | | | | Stable =
| | | | | | | |
| | | | | | | | !! Concurrent conflicting proposals
X-----------?|-----?|-?|-?| | | Propose()
| X--------?|-----?|-?|-?| | | Propose(ReadB)
| | | | | | | |
| | X------X-------------->|->| Accepted(N,. )
| | |<--------X--X-------->|->| Accepted(N,. )
| | | | | | | |
| | | | | | | | !! Conflict detected, leader chooses
| | | | | | | | commutative order:
| | | | | | | | V =
| | | | | | | |
| | X----->|->|->| | | Phase2Start(N+1,V)
| | |<-----X--X--X-------->|->| Accepted(N+1,V)
| | | | | | | | Stable =.
| | | | | | | |
| | | | | | | |
| | | | | | | | !! More conflicting proposals
X-----------?|-----?|-?|-?| | | Propose(WriteA)
| X--------?|-----?|-?|-?| | | Propose(ReadA)
| | | | | | | |
| | X------X-------------->|->| Accepted(N+2,. )
| | |<--------X--X-------->|->| Accepted(N+2,. )
| | | | | | | |
| | | | | | | | !! Leader chooses order W
| | X----->|->|->| | | Phase2Start(N+2,W)
| | |<-----X--X--X-------->|->| Accepted(N+2,W)
| | | | | | | | Stable =.
| | | | | | | |.
| | | | | | | |
| | | | | | | |
Generalized Paxos vs. Fast Multi-Paxos
- The message flow above shows Generalized Paxos performing agreement on seven values in (nominally) 10 message delays. Fast Multi-Paxos would require 15-17 delays for the same sequence (3 delays for each of the three concurrent proposals with uncoordinated recovery, plus at least 2 delays for the eventual re-submission of the three rejected proposals, concurrent re-proposals may add two additional delays).
Byzantine Paxos
Paxos may also be extended to support arbitrary failures of the participants, including lying, fabrication of messages, collusion with other participants, selective non-participation, etc. These types of failures are called Byzantine failures, after the solution popularized by Lamport.Byzantine Paxos adds an extra message (Verify) which acts to distribute knowledge and verify the actions of the other processors:
Message flow: Byzantine Multi-Paxos, steady state
Client Proposer Acceptor Learner
| | | | | | |
X-------->| | | | | | Request
| X--------->|->|->| | | Accept!(N,I,V)
| | X<>X<>X | | Verify(N,I,V) - BROADCAST
| |<---------X--X--X------>|->| Accepted(N,V)
|<---------------------------------X--X Response(V)
| | | | | | |
Fast Byzantine Paxos removes this extra delay, since the client sends commands directly to the Acceptors.
Note the Accepted message in Fast Byzantine Paxos is sent to all Acceptors and all Learners, while Fast Paxos sends Accepted messages only to Learners):
Message flow: Fast Byzantine Multi-Paxos, steady state
Client Acceptor Learner
| | | | | |
X----->|->|->| | | Accept!(N,I,V)
| X<>X<>X------>|->| Accepted(N,I,V) - BROADCAST
|<-------------------X--X Response(V)
| | | | | |
The failure scenario is the same for both protocols; Each Learner waits to receive F+1 identical messages from different Acceptors. If this does not occur, the Acceptors themselves will also be aware of it (since they exchanged each other's messages in the broadcast round), and correct Acceptors will re-broadcast the agreed value:
Message flow: Fast Byzantine Multi-Paxos, failure
Client Acceptor Learner
| | | ! | | !! One Acceptor is faulty
X----->|->|->! | | Accept!(N,I,V)
| X<>X<>X------>|->| Accepted(N,I,{V,W}) - BROADCAST
| | | ! | | !! Learners receive 2 different commands
| | | ! | | !! Correct Acceptors notice error and choose
| X<>X<>X------>|->| Accepted(N,I,V) - BROADCAST
|<-------------------X--X Response(V)
| | | ! | |
Virtually Synchronous Paxos
Although the virtual synchrony model and Paxos are sometimes viewed as competing options for creating state machine replicated services and similar applications, recent work has demonstrated the possibility of combining the two models into a single unified solution. This approach uses virtual synchrony to support execution epochs, within which membership is determined by a consensus protocol and stable. Fault-tolerant protocols can then be defined on a per-epoch basis. The approach offers better performance than other options for making the membership of a Paxos service dynamic, while strengthening the virtual synchrony model.Production use of Paxos
- Google uses the Paxos algorithm in their Chubby distributed lock serviceDistributed lock managerA distributed lock manager provides distributed software applications with a means to synchronize their accesses to shared resources....
in order to keep replicas consistent in case of failure. Chubby is used by BigtableBigTableBigTable is a compressed, high performance, and proprietary database system built on Google File System , Chubby Lock Service, SSTable and a few other Google technologies; it is currently not distributed nor is it used outside of Google, although Google offers access to it as part of their Google...
which is now in production in Google Analytics and other products. Apache ZooKeeperApache ZooKeeperApache ZooKeeper is a software project of the Apache Software Foundation, providing an open source centralized configuration service and naming registry for large distributed systems. ZooKeeper is a sub project of Hadoop....
is the open source implementation. - IBM supposedly uses the Paxos algorithm in their IBM SAN Volume ControllerIBM SAN Volume ControllerIn computer data storage, the IBM SAN Volume Controller is a block storage virtualization appliance that belongs to the IBM System Storage product family. SVC implements an indirection, or "virtualization", layer in a Fibre Channel storage area network .- Architecture :SVC is always deployed as a...
product to implement a general purpose fault-tolerant virtual machine used to run the configuration and control components of the storage virtualizationStorage VirtualizationStorage virtualization or storage virtualisation is a concept and term used within computer science. Specifically, storage systems may use virtualization concepts as a tool to enable better functionality and more advanced features within the storage system.Broadly speaking, a 'storage system' is...
services offered by the cluster. This implementation features: dynamic quorum (which considers power domains and extends the Paxos protocol to an optional quorum disk to maintain fault-tolerance down to clusters as small as two nodes); concurrent ballots of batched requests broadcast and collated using an overlay binary tree network for efficiency; automatic reintegration of restarted nodes without stalling the cluster (by state delta transfer using an overlay hypercube network followed by catch-up of ballots committed during the state transfer) and an underlying view management algorithm used to select a leader and gracefully handle asymmetric network partitions. - Microsoft uses Paxos in the Autopilot cluster management service from Bing.
- WANDiscoWANdiscoWANdisco, Inc. is a United States based software company involved in the production of Subversion, a software versioning and revision control system.-History:WANdisco was incorporated in 2005...
have implemented Paxos within their DConE active-active replication technology.
External links
- Leslie Lamport's home page
- Paxos Made Simple
- Revisiting the Paxos Algorithm
- Paxos Commit
- Yale University's Wiki article
- Google Whitepaper: Chubby Distributed Lock Service
- Google Whitepaper: Bigtable A Distributed Storage System for Structured Data
- Survey of Paxos Algorithms (2007)
- Mencius - Circular rotating Paxos for geo-distributed systems
- WANdisco - Active-Active Replication solutions for Subversion, CVS & JIRA