Self-stabilization
Self-stabilization is a concept of fault-tolerance in distributed computing. A distributed system that is self-stabilizing will end up in a correct state no matter what state it is initialized with; that correct state is reached after a finite number of execution steps.
At first glance, the guarantee of self-stabilization may seem less promising than that of more traditional fault tolerance, which aims to guarantee that the system always remains in a correct state under certain kinds of state transitions. However, traditional fault tolerance cannot always be achieved: for example, it cannot be achieved when the system is started in an incorrect state or is corrupted by an intruder. Moreover, because of their complexity, distributed systems are very hard to debug and analyze, so it is very hard to prevent them from reaching an incorrect state. Indeed, some forms of self-stabilization are incorporated into most modern computer and telecommunications networks, since they give these networks the ability to cope with faults that were not foreseen in the design of the algorithm.
Many years after Edsger Dijkstra's seminal 1974 paper, the concept remains important, as it provides a foundation for self-managing computer systems and fault-tolerant systems. In recognition, Dijkstra's paper received the 2002 ACM PODC Influential-Paper Award, one of the highest recognitions in the distributed computing community. Moreover, after Dijkstra's death, the award was renamed the Dijkstra Prize.
History
E.W. Dijkstra presented the concept of self-stabilization in 1974, prompting further research in this area. He also presented the first self-stabilizing algorithms that did not rely on strong assumptions about the system. Some earlier protocols used in practice did in fact stabilize, but only assuming the existence of a clock global to the system and a known upper bound on the duration of each system transition. It was only ten years later, when Leslie Lamport pointed out the importance of Dijkstra's work, that researchers directed their attention to this elegant fault-tolerance concept.
Overview
A distributed algorithm is self-stabilizing if, starting from an arbitrary state, it is guaranteed to converge to a legitimate state and to remain in the set of legitimate states thereafter. A state is legitimate if, starting from this state, the algorithm satisfies its specification. The property of self-stabilization enables a distributed algorithm to recover from a transient fault regardless of its nature. Moreover, a self-stabilizing algorithm does not have to be initialized, as it eventually starts to behave correctly regardless of its initial state.

Dijkstra's paper, which introduced the concept of self-stabilization, presents an example in the context of a "token ring": a network of computers ordered in a circle, such that exactly one of them is supposed to "hold a token" at any given time (a small simulation of Dijkstra's rules follows the two observations below).
- Not holding a token is a correct state for each computer in this network, since the token can be held by another computer. However, if every computer is in the state of "not holding a token" then the network altogether is not in a correct state.
- Similarly, if more than one computer "holds a token" then this is not a correct state for the network, although it cannot be observed to be incorrect by viewing any computer individually. Since every computer can "observe" only the states of its two neighbors, it is hard for the computers to decide whether the network altogether is in a correct state.
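As a concrete illustration, here is a minimal Python sketch of Dijkstra's K-state token ring, his first example algorithm. The move rules follow the 1974 paper; the scheduler, function names, and printout are illustrative choices rather than part of the original algorithm. A machine "holds a token" exactly when its move rule is enabled.

```python
import random

def privileged(x, k):
    """Indices of machines currently 'holding a token'.

    Machine 0 (the bottom machine) holds one when its counter equals the
    last machine's; machine i > 0 holds one when its counter differs from
    machine i - 1's. At least one machine is always privileged.
    """
    toks = [0] if x[0] == x[-1] else []
    return toks + [i for i in range(1, len(x)) if x[i] != x[i - 1]]

def step(x, k):
    """Fire one enabled machine, chosen arbitrarily; this stands in for an
    asynchronous scheduler."""
    i = random.choice(privileged(x, k))
    if i == 0:
        x[0] = (x[0] + 1) % k   # the bottom machine increments its counter
    else:
        x[i] = x[i - 1]         # any other machine copies its left neighbor

# Start from an arbitrary, possibly illegal state: several tokens may be held.
n, k = 5, 7                     # Dijkstra's proof needs k at least n
x = [random.randrange(k) for _ in range(n)]
for t in range(30):
    print(f"t={t:2d} state={x} tokens={privileged(x, k)}")
    step(x, k)
# After finitely many steps exactly one token remains and circulates forever.
```

Run long enough, the trace shows the number of simultaneously held tokens dropping to one and never rising again, which is precisely the convergence-then-closure behavior defined below.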
The first self-stabilizing algorithms did not detect errors explicitly in order to subsequently repair them. Instead, they constantly pushed the system towards a legitimate state, even without explicitly detecting error states. Since traditional methods for detecting an error were often very difficult and time-consuming, such behavior was considered desirable. (The method described in the paper cited above collects a huge amount of information from the whole network into one place and then attempts to determine whether the collected global state is correct; even that determination alone can be a hard task.)
Efficiency improvements
More recently, researchers have presented newer methods for lightweight error detection in self-stabilizing systems using local checking. The term local refers to a part of a computer network: when local detection is used, a computer in the network is not required to communicate with the entire network in order to detect an error; the error can be detected by having each computer communicate only with its nearest neighbors. These local detection methods simplified the task of designing self-stabilizing algorithms considerably, because the error detection mechanism and the recovery mechanism can be designed separately. Newer algorithms based on these detection methods also turned out to be much more efficient. Moreover, these papers suggested rather efficient general transformers that turn non-self-stabilizing algorithms into self-stabilizing ones. The idea is to:
- run the non-self-stabilizing protocol and, at the same time,
- detect faults during its execution using the above-mentioned detection methods,
- then apply a (self-stabilizing) "reset" protocol to return the system to some predetermined initial state, and, finally,
- restart the given (non-self-stabilizing) protocol.
The combination of these four parts is self-stabilizing (a schematic sketch of the transformer loop follows). Initial self-stabilizing protocols were also presented in the above papers; more efficient reset protocols were presented later.
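The loop below is a minimal sketch of that transformer, under assumed data structures: a graph as an adjacency dict, per-node state, and a user-supplied local predicate. The names locally_check, transformed_step, protocol_step, and reset_state are illustrative, and a real system would run a self-stabilizing reset protocol rather than the instant reset shown here.

```python
def locally_check(graph, state, local_pred):
    """Local checking: every node tests a predicate against each neighbor
    only. The global state is deemed legitimate iff all local checks pass."""
    return all(local_pred(state[u], state[v]) for u in graph for v in graph[u])

def transformed_step(graph, state, local_pred, protocol_step, reset_state):
    """One round of the transformed protocol: check locally, reset on a
    fault, otherwise run one step of the underlying protocol."""
    if not locally_check(graph, state, local_pred):
        for u in graph:                 # fault detected by neighbors alone:
            state[u] = reset_state(u)   # return to a predetermined state
    else:
        protocol_step(graph, state)     # continue the given (non-self-
                                        # stabilizing) protocol
```

Because each node consults only its neighbors, detection costs one message exchange per edge per round, instead of collecting the entire global state in one place.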
Additional efficiency was introduced with the notion of time-adaptive protocols. The idea behind these is that when only a small number of errors occurs, the recovery time can (and should) be made short. Dijkstra's original self-stabilization algorithms do not have this property.
A useful property of self-stabilizing algorithms is that they can be composed of layers if the layers do not exhibit any circular dependencies. The stabilization time of the composition is then bounded by the sum of the individual stabilization times of each layer.
Time complexity
The time complexity of a self-stabilizing algorithm is measured in (asynchronous) rounds or cycles.
- A round is a shortest execution trace in which each processor executes at least one step.
- Similarly, a cycle is a shortest execution trace in which each processor executes at least one complete iteration of its repeatedly executed list of commands.
It is also interesting to measure the output stabilization time. For that, a subset of the state variables is defined to be externally visible (the output), and certain states of the output are defined to be correct (legitimate). The set of the outputs of all the components of the system is said to have stabilized at the time when it becomes correct and stays correct indefinitely, unless additional faults occur. The output stabilization time is the number of (asynchronous) rounds until the output stabilizes.
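As a rough illustration of measuring stabilization time, the snippet below reuses privileged and step from the token-ring sketch above and counts scheduler steps until the ring first becomes legitimate. Scheduler steps are only a stand-in for asynchronous rounds, not an exact round count.

```python
import random

def steps_to_stabilize(n, k):
    """Count scheduler steps until exactly one token is held.

    Stopping at the first legitimate configuration is sound because closure
    guarantees the ring stays legitimate afterwards (absent new faults).
    """
    x = [random.randrange(k) for _ in range(n)]
    t = 0
    while len(privileged(x, k)) != 1:
        step(x, k)
        t += 1
    return t

trials = 200
avg = sum(steps_to_stabilize(7, 9) for _ in range(trials)) / trials
print(f"average steps to stabilize: {avg:.1f}")
```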
Definition
A system is self-stabilizing if and only if (both properties are checked exhaustively for a tiny instance in the sketch after this list):
- Starting from any state, it is guaranteed that the system will eventually reach a correct state (convergence).
- Given that the system is in a correct state, it is guaranteed to stay in a correct state, provided that no fault happens (closure).
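On a finite instance both conditions can be checked mechanically. The sketch below does this for Dijkstra's K-state ring with three machines: closure, by checking every successor of every legitimate state, and convergence, by verifying that the illegitimate part of the transition graph is acyclic (with a finite state space, an execution that avoids legitimate states forever would have to repeat a state). The helper names are illustrative.

```python
from itertools import product

def privileged(x, k):
    """Token-holding machines under Dijkstra's K-state rules (as above)."""
    toks = [0] if x[0] == x[-1] else []
    return toks + [i for i in range(1, len(x)) if x[i] != x[i - 1]]

def successors(x, k):
    """All states reachable by firing one privileged machine."""
    out = []
    for i in privileged(x, k):
        y = list(x)
        y[i] = (y[0] + 1) % k if i == 0 else y[i - 1]
        out.append(tuple(y))
    return out

def legitimate(x, k):
    return len(privileged(x, k)) == 1      # exactly one token in the ring

n, k = 3, 4
states = list(product(range(k), repeat=n))

# Closure: firing the unique token in a legitimate state stays legitimate.
assert all(legitimate(y, k)
           for x in states if legitimate(x, k)
           for y in successors(x, k))

# Convergence: depth-first search for a cycle among illegitimate states.
color = {}                                 # 1 = on DFS stack, 2 = finished
def no_illegitimate_cycle(x):
    color[x] = 1
    for y in successors(x, k):
        if legitimate(y, k) or color.get(y) == 2:
            continue
        if color.get(y) == 1 or not no_illegitimate_cycle(y):
            return False                   # back edge: a livelock would exist
    color[x] = 2
    return True

assert all(legitimate(x, k) or color.get(x) == 2 or no_illegitimate_cycle(x)
           for x in states)
print("convergence and closure hold on this instance")
```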
A system is said to be randomized self-stabilizing if and only if it is self-stabilizing and the expected number of rounds needed to reach a correct state is bounded by some constant.
Designing self-stabilization in the above-mentioned sense is well known to be a difficult job. In fact, a class of distributed algorithms does not have the property of local checking: the legitimacy of the network state cannot be evaluated by a single process. The most obvious case is Dijkstra's token ring defined above: no process can detect whether the network state is legitimate in the case where more than one token is present in non-neighboring processes. This suggests that self-stabilization of a distributed system is a sort of group intelligence, where each component takes local actions based on its local knowledge, and these actions together eventually guarantee global convergence.
To help overcome the difficulty of designing self-stabilization as defined above, other types of stabilization were devised. For instance, weak stabilization is the property that a distributed system has a possibility of reaching its legitimate behavior from every possible state. Weak stabilization is easier to design, as it only guarantees a possibility of convergence for some runs of the distributed system, rather than convergence for every run; a reachability check along these lines is sketched below.
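On a finite instance, weak stabilization reduces to plain reachability: from every state, some legitimate state must be reachable. A minimal sketch, reusing legitimate, successors, states, and k from the definition example above (names illustrative):

```python
def reaches_legitimate(x0, k):
    """Search the transition graph: is some legitimate state reachable?"""
    seen, frontier = {x0}, [x0]
    while frontier:
        y = frontier.pop()
        if legitimate(y, k):
            return True
        for z in successors(y, k):
            if z not in seen:
                seen.add(z)
                frontier.append(z)
    return False

# Weak stabilization: a legitimate state is reachable from every state.
assert all(reaches_legitimate(x, k) for x in states)
```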
A self-stabilizing algorithm is silent if and only if it converges to a global state where the values of communication registers used by the algorithm remain fixed.
Related work
An extension of the concept of self-stabilization is that of superstabilization. The intent here is to cope with dynamic distributed systems that undergo topological changes. In classical self-stabilization theory, arbitrary changes are viewed as errors, and no guarantees are given until the system has stabilized again. With superstabilizing systems, there is a passage predicate that is always satisfied while the system's topology is reconfigured.