Redundancy (engineering)
In engineering, redundancy is the duplication of critical components or functions of a system with the intention of increasing the reliability of the system, usually in the form of a backup or fail-safe.
In many safety-critical systems, such as fly-by-wire and hydraulic systems in aircraft, some parts of the control system may be triplicated, which is formally termed triple modular redundancy (TMR). An error in one component may then be out-voted by the other two. In a triply redundant system, the system has three subcomponents, all three of which must fail before the system fails. Since each one rarely fails, and the subcomponents are expected to fail independently, the probability of all three failing is calculated to be extremely small; it is often outweighed by other risk factors, such as human error. Redundancy may also be known by the terms "majority voting systems" or "voting logic".
Forms of redundancy
There are four major forms of redundancy:
- Hardware redundancy, such as dual modular redundancy (DMR) and triple modular redundancy (TMR)
- Information redundancy, such as error detection and correction methods
- Time redundancy, including transient fault detection methods such as alternate logic
- Software redundancy, such as N-version programming
A modified form of software redundancy, applied to hardware, may be:
- Distinct functional redundancy, such as both mechanical and hydraulic braking in a car. Applied to software, this means code that is written independently and is distinctly different, but that produces the same results for the same inputs.
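As a minimal sketch of distinct functional redundancy in software, the Python example below computes an integer square root with two differently structured routines and accepts the result only when both agree. The function names (`isqrt_newton`, `isqrt_bisect`, `redundant_isqrt`) and the choice of problem are illustrative assumptions, not taken from any particular system.

```python
def isqrt_newton(n: int) -> int:
    """Version A: integer square root via Newton's iteration."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    x = n
    while True:
        y = (x + n // x) // 2
        if y >= x:  # the iteration has stopped decreasing
            return x
        x = y


def isqrt_bisect(n: int) -> int:
    """Version B: integer square root via binary search."""
    if n < 0:
        raise ValueError("n must be non-negative")
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def redundant_isqrt(n: int) -> int:
    """Run both versions and accept the result only if they agree."""
    a, b = isqrt_newton(n), isqrt_bisect(n)
    if a != b:
        # With only two versions a disagreement can be detected but not out-voted.
        raise RuntimeError(f"versions disagree: {a} != {b}")
    return a
```

With two versions, a disagreement can only be detected; masking a faulty version by majority would need at least three.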
Function of redundancy
The two functions of redundancy are passive redundancy and active redundancy. Both use extra capacity to prevent performance decline from exceeding specification limits without human intervention.
Passive redundancy uses excess capacity to reduce the impact of component failures. One common form of passive redundancy is the extra strength of cabling and struts used in bridges. This extra strength allows some structural components to fail without bridge collapse. The extra strength used in the design is called the margin of safety.
Eyes and ears provide working examples of passive redundancy. Vision loss in one eye does not cause blindness but depth perception is impaired. Hearing loss in one ear does not cause deafness but directionality is impaired. Performance decline is commonly associated with passive redundancy when a limited number of failures occur.
Active redundancy eliminates performance decline by monitoring the performance of individual devices, and this monitoring is used in voting logic. The voting logic is linked to switching that automatically reconfigures components. Error detection and correction and the Global Positioning System (GPS) are two examples of active redundancy.
Electrical power distribution provides an example of active redundancy. Several power lines connect each generation facility with customers. Each power line includes monitors that detect overload. Each power line also includes circuit breakers. The combination of power lines provides excess capacity. Circuit breakers disconnect a power line when monitors detect an overload. Power is redistributed across the remaining lines.
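The power-line example can be illustrated with a deliberately simplified Python sketch. It assumes that load is shared equally among the lines still in service, which is not how real networks behave (actual flows depend on network impedances), and the function name `redistribute` and its parameters are hypothetical. A line whose share exceeds its capacity is disconnected, and the load is then redistributed across the remaining lines.

```python
def redistribute(total_load, capacities, tripped=()):
    """Return {line index: carried load}, or {} if every line has been disconnected.

    Simplifying assumption: load is split equally among the lines in service.
    """
    out_of_service = set(tripped)
    active = [i for i in range(len(capacities)) if i not in out_of_service]
    while active:
        share = total_load / len(active)
        # The monitor on each line detects an overload ...
        overloaded = [i for i in active if share > capacities[i]]
        if not overloaded:
            return {i: share for i in active}
        # ... and its circuit breaker disconnects the line.
        active = [i for i in active if i not in overloaded]
    return {}  # no line left in service


capacities = [50.0, 50.0, 50.0]
normal = redistribute(90.0, capacities)                       # three lines share the load
one_lost = redistribute(90.0, capacities, tripped=[0])        # two lines absorb it
two_lost = redistribute(90.0, capacities, tripped=[0, 1])     # the last line overloads too
```

In this sketch the excess capacity absorbs the loss of one line, while losing two lines overloads the third and leaves no line in service.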
Voting logic
Voting logic uses performance monitoring to determine how to reconfigure individual components so that operation continues without violating specification limitations of the overall system. Voting logic often involves computers, but systems composed of items other than computers may also be reconfigured using voting logic. Circuit breakers are an example of a form of non-computer voting logic.
Electrical power systems use power scheduling to reconfigure active redundancy. Computing systems adjust the production output of each generating facility when other generating facilities are suddenly lost. This prevents blackout conditions during major events such as earthquakes.
The simplest voting logic in computing systems involves two components: primary and alternate. They both run similar software, but the output from the alternate remains inactive during normal operation. The primary monitors itself and periodically sends an activity message to the alternate as long as everything is OK. All outputs from the primary stop, including the activity message, when the primary detects a fault. The alternate activates its output and takes over from the primary after a brief delay when the activity message ceases. Errors in voting logic can cause both to have all outputs active at the same time, can cause both to have all outputs inactive at the same time, or outputs can flutter on and off.
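The primary/alternate arrangement can be sketched as a small discrete-time simulation in Python. The class names, the tick-based clock, and the `timeout_ticks` parameter (standing in for the "brief delay") are assumptions made for illustration; a real system would exchange activity messages over a communication link.

```python
from dataclasses import dataclass


@dataclass
class Primary:
    healthy: bool = True

    @property
    def output_active(self) -> bool:
        # All outputs stop when the primary detects a fault.
        return self.healthy

    def heartbeat(self) -> bool:
        # The activity message is sent only while everything is OK.
        return self.healthy


@dataclass
class Alternate:
    timeout_ticks: int = 3
    missed: int = 0
    output_active: bool = False

    def tick(self, heartbeat_received: bool) -> None:
        if heartbeat_received:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.timeout_ticks:
            # Take over after the delay; in this sketch the takeover latches.
            self.output_active = True


def simulate(fault_at: int, ticks: int, timeout_ticks: int = 3):
    """Return a list of (primary_active, alternate_active) pairs, one per tick."""
    primary, alternate = Primary(), Alternate(timeout_ticks=timeout_ticks)
    history = []
    for t in range(ticks):
        if t == fault_at:
            primary.healthy = False
        alternate.tick(primary.heartbeat())
        history.append((primary.output_active, alternate.output_active))
    return history


history = simulate(fault_at=2, ticks=6)
```

In this run both outputs are inactive for two ticks between the fault and the takeover. The failure modes described above (both active, both inactive, or fluttering) would correspond to errors in this logic, such as an activity message that is lost while the primary is still healthy.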
A more reliable form of voting logic involves an odd number of devices, three or more. All perform identical functions, and the outputs are compared by the voting logic. The voting logic establishes a majority when there is a disagreement, and the majority acts to deactivate the output from any device that disagrees. A single fault will not interrupt normal operation. This technique is used in avionics systems, such as those responsible for operation of the Space Shuttle.
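A majority voter over an odd number of devices can be sketched in a few lines of Python. The function name `majority_vote` and its return shape are assumptions for illustration; the sketch compares outputs for exact equality, whereas real voters for analog signals typically allow a tolerance.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (majority value, indices of devices whose output should be deactivated)."""
    n = len(outputs)
    if n < 3 or n % 2 == 0:
        raise ValueError("expected an odd number of devices, three or more")
    value, count = Counter(outputs).most_common(1)[0]
    if count <= n // 2:
        raise RuntimeError("no majority: too many devices disagree")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


# Device 2 has a fault; the other two out-vote it.
value, dissenters = majority_vote([7, 7, 9])
```

Here `value` is the agreed output and `dissenters` lists the devices to deactivate, so a single fault does not interrupt operation.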
Calculating the probability of system failure
Each duplicate component added to the system decreases the probability of system failure according to the formula:

$$p = \prod_{i=1}^{n} p_i$$

where:
- $n$ - number of components
- $p_i$ - probability of component $i$ failing
- $p$ - the probability of all components failing (system failure)
This formula assumes independence of failure events. That means that the probability of a component B failing given that a component A has already failed is the same as that of B failing when A has not failed. There are situations where this is unreasonable, such as two power supplies connected to the same socket: if the socket failed, both supplies would fail.
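Under the independence assumption, the probability that every component fails is the product of the individual failure probabilities. The Python sketch below evaluates that product; the function name and the 1% figure are illustrative assumptions, not measured data.

```python
from math import prod


def system_failure_probability(component_failure_probs):
    """Probability that all components fail, assuming independent failures."""
    probs = list(component_failure_probs)
    if not probs:
        raise ValueError("at least one component is required")
    if any(not 0.0 <= p <= 1.0 for p in probs):
        raise ValueError("each probability must lie in [0, 1]")
    return prod(probs)


# Illustrative numbers: one, two, and three components that each fail 1% of the time.
single = system_failure_probability([0.01])
double = system_failure_probability([0.01, 0.01])
triple = system_failure_probability([0.01, 0.01, 0.01])
```

With these illustrative numbers, each added duplicate reduces the system failure probability by a factor of 100, but only as long as the failures really are independent.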
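It also assumes that only one component is needed to keep the system running. If $k$ components are needed for the system to survive, out of $n$, the probability of failure is

$$p = \sum_{j=0}^{k-1} \binom{n}{j} (1-q)^{j} q^{n-j},$$

assuming all components have equal probability, $q$, of failure (that is, $p_i = q$ for every $i$). For $k = 1$ this reduces to $q^n$, in agreement with the product formula.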
This model is probably unrealistic in that it assumes that components are not replaced in time when they fail.
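The case in which several components must survive can be checked numerically with a short Python sketch. It assumes independent failures, an equal failure probability for every component, and no repair, as in the model above; the parameter names `n`, `k`, and `component_failure_prob` are chosen here for illustration. The system fails when fewer than `k` of the `n` components are still working, so the sketch sums the binomial probabilities of 0 to `k - 1` survivors.

```python
from math import comb


def k_out_of_n_failure_probability(n, k, component_failure_prob):
    """Probability of system failure when at least k of n components must work."""
    if not 1 <= k <= n:
        raise ValueError("require 1 <= k <= n")
    q = component_failure_prob
    if not 0.0 <= q <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Sum over j = number of surviving components, for j below the required k.
    return sum(comb(n, j) * (1 - q) ** j * q ** (n - j) for j in range(k))


# Illustrative numbers: three components, each failing with probability 0.1.
one_needed = k_out_of_n_failure_probability(3, 1, 0.1)  # all three must fail
two_needed = k_out_of_n_failure_probability(3, 2, 0.1)  # two or more failures are fatal
```

Requiring two of three components instead of one raises the failure probability in this example from about 0.001 to about 0.028.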
See also
- Degeneracy (biology)
- Common mode failure
- Data redundancy
- Double switching
- Fault-tolerant design
- Radiation hardening
- Factor of safety
- Reliability engineering
- Reliability theory of aging and longevity
- Safety engineering
- Self-healing ring
- MTBF