Error detection and correction
Encyclopedia
In information theory
and coding theory
with applications in computer science
and telecommunication
, error detection and correction or error control are techniques that enable reliable delivery of digital data over unreliable communication channels. Many communication channels are subject to channel noise
, and thus errors may be introduced during transmission from the source to a receiver. Error detection techniques allow detecting such errors, while error correction enables reconstruction of the original data.
(i.e., some extra data) to a message, which receivers can use to check consistency of the delivered message, and to recover data determined to be erroneous. Error-detection and correction schemes can be either systematic
or non-systematic: In a systematic scheme, the transmitter sends the original data, and attaches a fixed number of check bits (or parity data), which are derived from the data bits by some deterministic algorithm
. If only error detection is required, a receiver can simply apply the same algorithm to the received data bits and compare its output with the received check bits; if the values do not match, an error has occurred at some point during the transmission. In a system that uses a non-systematic code, the original message is transformed into an encoded message that has at least as many bits as the original message.
Good error control performance requires the scheme to be selected based on the characteristics of the communication channel. Common channel models include memory-less models where errors occur randomly and with a certain probability, and dynamic models where errors occur primarily in bursts. Consequently, error-detecting and correcting codes can be generally distinguished between random-error-detecting/correcting and burst-error-detecting/correcting. Some codes can also be suitable for a mixture of random errors and burst errors.
If the channel capacity
cannot be determined, or is highly varying, an error-detection scheme may be combined with a system for retransmissions of erroneous data. This is known as automatic repeat request (ARQ), and is most notably used in the Internet. An alternate approach for error control is hybrid automatic repeat request (HARQ), which is a combination of ARQ and error-correction coding.
ARQ and FEC may be combined, such that minor errors are corrected without retransmission, and major errors are corrected via a request for retransmission: this is called hybrid automatic repeat-request (HARQ).
(or checksum
algorithm). A hash function adds a fixed-length tag to a message, which enables receivers to verify the delivered message by recomputing the tag and comparing it with the one provided.
There exists a vast variety of different hash function designs. However, some are of particularly widespread use because of either their simplicity or their suitability for detecting certain kinds of errors (e.g., the cyclic redundancy check
's performance in detecting burst errors).
Random-error-correcting code
s based on minimum distance coding can provide a suitable alternative to hash functions when a strict guarantee on the minimum number of errors to be detected is desired. Repetition codes, described below, are special cases of error-correcting codes: although rather inefficient, they find applications for both error correction and detection due to their simplicity.
Repetition codes are very inefficient, and can be susceptible to problems if the error occurs in exactly the same place for each group (e.g., "1010 1010 1010" in the previous example would be detected as correct). The advantage of repetition codes is that they are extremely simple, and are in fact used in some transmissions of numbers station
s.
Extensions and variations on the parity bit mechanism are horizontal redundancy checks, vertical redundancy checks, and "double," "dual," or "diagonal" parity (used in RAID-DP).
sum of message code words of a fixed word length (e.g., byte values). The sum may be negated by means of a ones'-complement operation prior to transmission to detect errors resulting in all-zero messages.
Checksum schemes include parity bit
s, check digit
s, and longitudinal redundancy check
s. Some checksum schemes, such as the Luhn algorithm
and the Verhoeff algorithm
, are specifically designed to detect errors commonly introduced by humans in writing down or remembering identification numbers.
designed to detect accidental changes to digital data in computer networks. It is characterized by specification of a so-called generator polynomial, which is used as the divisor
in a polynomial long division
over a finite field
, taking the input data as the dividend
, and where the remainder
becomes the result.
Cyclic codes have favorable properties in that they are well suited for detecting burst errors. CRCs are particularly easy to implement in hardware, and are therefore commonly used in digital networks
and storage devices such as hard disk drives.
Even parity is a special case of a cyclic redundancy check, where the single-bit CRC is generated by the divisor x + 1.
, provided that changes of the data are only accidental (i.e., due to transmission errors). Any modification to the data will likely be detected through a mismatching hash value. Furthermore, given some hash value, it is infeasible to find some input data (other than the one given) that will yield the same hash value. Message authentication code
s, also called keyed cryptographic hash functions, provide additional protection against intentional modification by an attacker.
, d, can detect up to d − 1 errors in a code word. Using minimum-distance-based error-correcting codes for error detection can be suitable if a strict limit on the minimum number of errors to be detected is desired.
Codes with minimum Hamming distance d = 2 are degenerate cases of error-correcting codes, and can be used to detect single errors. The parity bit
is an example of a single-error-detecting code.
The Berger code
is an early example of a unidirectional error(-correcting) code that can detect any number of errors on an asymmetric channel, provided that only transitions of cleared bits to set bits or set bits to cleared bits can occur.
.
Usually, when the transmitter does not receive the acknowledgment before the timeout occurs (i.e., within a reasonable amount of time after sending the data frame), it retransmits the frame until it is either correctly received or the error persists beyond a predetermined number of retransmissions.
Three types of ARQ protocols are Stop-and-wait ARQ
, Go-Back-N ARQ
, and Selective Repeat ARQ
.
ARQ is appropriate if the communication channel has varying or unknown capacity
, such as is the case on the Internet. However, ARQ requires the availability of a back channel, results in possibly increased latency
due to retransmissions, and requires the maintenance of buffers and timers for retransmissions, which in the case of network congestion
can put a strain on the server and overall network capacity.
data, or parity data, to a message, such that it can be recovered by a receiver even when a number of errors (up to the capability of the code being used) were introduced, either during the process of transmission, or on storage. Since the receiver does not have to ask the sender for retransmission of the data, a back-channel
is not required in forward error correction, and it is therefore suitable for simplex communication
such as broadcasting
. Error-correcting codes are frequently used in lower-layer
communication, as well as for reliable storage in media such as CDs
, DVD
s, hard disk
s, and RAM.
Error-correcting codes are usually distinguished between convolutional code
s and block code
s:
Shannon's theorem is an important theorem in forward error correction, and describes the maximum information rate at which reliable communication is possible over a channel that has a certain error probability or signal-to-noise ratio
(SNR). This strict upper limit is expressed in terms of the channel capacity
. More specifically, the theorem says that there exist codes such that with increasing encoding length the probability of error on a discrete memoryless channel can be made arbitrarily small, provided that the code rate
is smaller than the channel capacity. The code rate is defined as the fraction k/n of k source symbols and n encoded symbols.
The actual maximum code rate allowed depends on the error-correcting code used, and may be lower. This is because Shannon's proof was only of existential nature, and did not show how to construct codes which are both optimal and have efficient encoding and decoding algorithms.
is a combination of ARQ and forward error correction. There are two basic approaches:
The latter approach is particularly attractive on an erasure channel when using a rateless erasure code
.
(FEC). By the time an ARQ system discovers an error and re-transmits it, the re-sent data will arrive too late to be any good.
Applications where the transmitter immediately forgets the information as soon as it is sent (such as most television cameras) cannot use ARQ; they must use FEC
because when an error occurs, the original data is no longer available. (This is also why FEC
is used in data storage systems such as RAID
and distributed data store
).
Applications that use ARQ must have a return channel
. Applications that have no return channel cannot use ARQ.
Applications that require extremely low error rates (such as digital money transfers) must use ARQ.
s and Reed-Muller codes. The Reed-Muller code was well suited to the noise the spacecraft was subject to (approximately matching a bell curve), and was implemented at the Mariner spacecraft for missions between 1969 and 1977.
The Voyager 1
and Voyager 2
missions, which started in 1977, were designed to deliver color imaging amongst scientific information of Jupiter
and Saturn
. This resulted in increased coding requirements, and thus the spacecraft were supported by (optimally Viterbi-decoded
) convolutional codes that could be concatenated with an outer Golay (24,12,8) code
. The Voyager 2 probe additionally supported an implementation of a Reed-Solomon code: the concatenated Reed-Solomon-Viterbi (RSV) code allowed for very powerful error correction, and enabled the spacecraft's extended journey to Uranus
and Neptune
.
The CCSDS currently recommends usage of error correction codes with performance similar to the Voyager 2 RSV code as a minimum. Concatenated codes are increasingly falling out of favor with space missions, and are replaced by more powerful codes such as Turbo code
s or LDPC codes.
The different kinds of deep space and orbital missions that are conducted suggest that trying to find a "one size fits all" error correction system will be an ongoing problem for some time to come. For missions close to earth the nature of the channel
noise
is different from that of a spacecraft on an interplanetary mission experiences. Additionally, as a spacecraft increases its distance from earth, the problem of correcting for noise gets larger.
bandwidth continues to grow, fueled by the desire to deliver television (including new channels and High Definition TV) and IP data. Transponder availability and bandwidth constraints have limited this growth, because transponder capacity is determined by the selected modulation
scheme and Forward error correction
(FEC) rate.
Overview
A "parity track" was present on the first magnetic tape data storage
in 1951. The "Optimal Rectangular Code" used in group code recording
tapes not only detects but also corrects single-bit errors.
Some file format
s, particularly archive formats, include a checksum (most often CRC32) to detect corruption and truncation and can employ redundancy and/or parity file
s to recover portions of corrupted data.
Reed Solomon codes
are used in compact disc
s to correct errors caused by scratches.
Modern hard drives use CRC codes to detect and Reed-Solomon codes to correct minor errors in sector reads, and to recover data from sectors that have "gone bad" and store that data in the spare sectors.
RAID
systems use a variety of error correction techniques, to correct errors when a hard drive completely fails.
memory may provide increased protection against soft error
s by relying on error correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for high fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation
.
Error-correcting memory controllers traditionally use Hamming code
s, although some use triple modular redundancy
.
Interleaving
allows distributing the effect of a single cosmic ray potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event upset
(SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and the illusion of an error-free memory system may be maintained.
Information theory
Information theory is a branch of applied mathematics and electrical engineering involving the quantification of information. Information theory was developed by Claude E. Shannon to find fundamental limits on signal processing operations such as compressing data and on reliably storing and...
and coding theory
Coding theory
Coding theory is the study of the properties of codes and their fitness for a specific application. Codes are used for data compression, cryptography, error-correction and more recently also for network coding...
with applications in computer science
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
and telecommunication
Telecommunication
Telecommunication is the transmission of information over significant distances to communicate. In earlier times, telecommunications involved the use of visual signals, such as beacons, smoke signals, semaphore telegraphs, signal flags, and optical heliographs, or audio messages via coded...
, error detection and correction or error control are techniques that enable reliable delivery of digital data over unreliable communication channels. Many communication channels are subject to channel noise
Noise (electronics)
Electronic noise is a random fluctuation in an electrical signal, a characteristic of all electronic circuits. Noise generated by electronic devices varies greatly, as it can be produced by several different effects...
, and thus errors may be introduced during transmission from the source to a receiver. Error detection techniques allow detecting such errors, while error correction enables reconstruction of the original data.
Definitions
The general definitions of the terms are as follows:- Error detection is the detection of errors caused by noise or other impairments during transmission from the transmitter to the receiver.
- Error correction is the detection of errors and reconstruction of the original, error-free data.
Introduction
The general idea for achieving error detection and correction is to add some redundancyRedundancy (information theory)
Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Informally, it is the amount of wasted "space" used to transmit certain data...
(i.e., some extra data) to a message, which receivers can use to check consistency of the delivered message, and to recover data determined to be erroneous. Error-detection and correction schemes can be either systematic
Systematic code
In coding theory, a systematic code is any error-correcting code in which the input data is embedded in the encoded output. Conversely, in a non-systematic code the output does not contain the input symbols....
or non-systematic: In a systematic scheme, the transmitter sends the original data, and attaches a fixed number of check bits (or parity data), which are derived from the data bits by some deterministic algorithm
Deterministic algorithm
In computer science, a deterministic algorithm is an algorithm which, in informal terms, behaves predictably. Given a particular input, it will always produce the same output, and the underlying machine will always pass through the same sequence of states...
. If only error detection is required, a receiver can simply apply the same algorithm to the received data bits and compare its output with the received check bits; if the values do not match, an error has occurred at some point during the transmission. In a system that uses a non-systematic code, the original message is transformed into an encoded message that has at least as many bits as the original message.
Good error control performance requires the scheme to be selected based on the characteristics of the communication channel. Common channel models include memory-less models where errors occur randomly and with a certain probability, and dynamic models where errors occur primarily in bursts. Consequently, error-detecting and correcting codes can be generally distinguished between random-error-detecting/correcting and burst-error-detecting/correcting. Some codes can also be suitable for a mixture of random errors and burst errors.
If the channel capacity
Channel capacity
In electrical engineering, computer science and information theory, channel capacity is the tightest upper bound on the amount of information that can be reliably transmitted over a communications channel...
cannot be determined, or is highly varying, an error-detection scheme may be combined with a system for retransmissions of erroneous data. This is known as automatic repeat request (ARQ), and is most notably used in the Internet. An alternate approach for error control is hybrid automatic repeat request (HARQ), which is a combination of ARQ and error-correction coding.
Implementation
Error correction may generally be realized in two different ways:- Automatic repeat request (ARQ) (sometimes also referred to as backward error correction): This is an error control technique whereby an error detection scheme is combined with requests for retransmission of erroneous data. Every block of data received is checked using the error detection code used, and if the check fails, retransmission of the data is requested – this may be done repeatedly, until the data can be verified.
- Forward error correctionForward error correctionIn telecommunication, information theory, and coding theory, forward error correction or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels....
(FEC): The sender encodes the data using an error-correcting code (ECC) prior to transmission. The additional information (redundancyRedundancy (information theory)Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Informally, it is the amount of wasted "space" used to transmit certain data...
) added by the code is used by the receiver to recover the original data. In general, the reconstructed data is what is deemed the "most likely" original data.
ARQ and FEC may be combined, such that minor errors are corrected without retransmission, and major errors are corrected via a request for retransmission: this is called hybrid automatic repeat-request (HARQ).
Error detection schemes
Error detection is most commonly realized using a suitable hash functionHash function
A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...
(or checksum
Checksum
A checksum or hash sum is a fixed-size datum computed from an arbitrary block of digital data for the purpose of detecting accidental errors that may have been introduced during its transmission or storage. The integrity of the data can be checked at any later time by recomputing the checksum and...
algorithm). A hash function adds a fixed-length tag to a message, which enables receivers to verify the delivered message by recomputing the tag and comparing it with the one provided.
There exists a vast variety of different hash function designs. However, some are of particularly widespread use because of either their simplicity or their suitability for detecting certain kinds of errors (e.g., the cyclic redundancy check
Cyclic redundancy check
A cyclic redundancy check is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data...
's performance in detecting burst errors).
Random-error-correcting code
Forward error correction
In telecommunication, information theory, and coding theory, forward error correction or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels....
s based on minimum distance coding can provide a suitable alternative to hash functions when a strict guarantee on the minimum number of errors to be detected is desired. Repetition codes, described below, are special cases of error-correcting codes: although rather inefficient, they find applications for both error correction and detection due to their simplicity.
Repetition codes
A repetition code is a coding scheme that repeats the bits across a channel to achieve error-free communication. Given a stream of data to be transmitted, the data is divided into blocks of bits. Each block is transmitted some predetermined number of times. For example, to send the bit pattern "1011", the four-bit block can be repeated three times, thus producing "1011 1011 1011". However, if this twelve-bit pattern was received as "1010 1011 1011" – where the first block is unlike the other two – it can be determined that an error has occurred.Repetition codes are very inefficient, and can be susceptible to problems if the error occurs in exactly the same place for each group (e.g., "1010 1010 1010" in the previous example would be detected as correct). The advantage of repetition codes is that they are extremely simple, and are in fact used in some transmissions of numbers station
Numbers station
A numbers station is a shortwave radio station of uncertain origin. In the 1950s, Time magazine reported that the numbers stations first appeared shortly after World War II and were using a format that had been used to send weather data during that war.Numbers stations generally broadcast...
s.
Parity bits
A parity bit is a bit that is added to a group of source bits to ensure that the number of set bits (i.e., bits with value 1) in the outcome is even or odd. It is a very simple scheme that can be used to detect single or any other odd number (i.e., three, five, etc.) of errors in the output. An even number of flipped bits will make the parity bit appear correct even though the data is erroneous.Extensions and variations on the parity bit mechanism are horizontal redundancy checks, vertical redundancy checks, and "double," "dual," or "diagonal" parity (used in RAID-DP).
Checksums
A checksum of a message is a modular arithmeticModular arithmetic
In mathematics, modular arithmetic is a system of arithmetic for integers, where numbers "wrap around" after they reach a certain value—the modulus....
sum of message code words of a fixed word length (e.g., byte values). The sum may be negated by means of a ones'-complement operation prior to transmission to detect errors resulting in all-zero messages.
Checksum schemes include parity bit
Parity bit
A parity bit is a bit that is added to ensure that the number of bits with the value one in a set of bits is even or odd. Parity bits are used as the simplest form of error detecting code....
s, check digit
Check digit
A check digit is a form of redundancy check used for error detection, the decimal equivalent of a binary checksum. It consists of a single digit computed from the other digits in the message....
s, and longitudinal redundancy check
Longitudinal redundancy check
In telecommunication, a longitudinal redundancy check or horizontal redundancy check is a form of redundancy check that is applied independently to each of a parallel group of bit streams...
s. Some checksum schemes, such as the Luhn algorithm
Luhn algorithm
The Luhn algorithm or Luhn formula, also known as the "modulus 10" or "mod 10" algorithm,is a simple checksum formula used to validate a variety of identification numbers, such as credit card numbers, IMEI numbers, National Provider Identifier numbers in US and Canadian Social Insurance Numbers...
and the Verhoeff algorithm
Verhoeff algorithm
The Verhoeff algorithm, a checksum formula for error detection first published in 1969, was developed by Dutch mathematician Jacobus Verhoeff . Like the more widely known Luhn algorithm, it works with strings of decimal digits of any length...
, are specifically designed to detect errors commonly introduced by humans in writing down or remembering identification numbers.
Cyclic redundancy checks (CRCs)
A cyclic redundancy check (CRC) is a single-burst-error-detecting cyclic code and non-secure hash functionHash function
A hash function is any algorithm or subroutine that maps large data sets to smaller data sets, called keys. For example, a single integer can serve as an index to an array...
designed to detect accidental changes to digital data in computer networks. It is characterized by specification of a so-called generator polynomial, which is used as the divisor
Divisor
In mathematics, a divisor of an integer n, also called a factor of n, is an integer which divides n without leaving a remainder.-Explanation:...
in a polynomial long division
Polynomial long division
In algebra, polynomial long division is an algorithm for dividing a polynomial by another polynomial of the same or lower degree, a generalised version of the familiar arithmetic technique called long division...
over a finite field
Finite field
In abstract algebra, a finite field or Galois field is a field that contains a finite number of elements. Finite fields are important in number theory, algebraic geometry, Galois theory, cryptography, and coding theory...
, taking the input data as the dividend
Dividend
Dividends are payments made by a corporation to its shareholder members. It is the portion of corporate profits paid out to stockholders. When a corporation earns a profit or surplus, that money can be put to two uses: it can either be re-invested in the business , or it can be distributed to...
, and where the remainder
Remainder
In arithmetic, the remainder is the amount "left over" after the division of two integers which cannot be expressed with an integer quotient....
becomes the result.
Cyclic codes have favorable properties in that they are well suited for detecting burst errors. CRCs are particularly easy to implement in hardware, and are therefore commonly used in digital networks
Computer network
A computer network, often simply referred to as a network, is a collection of hardware components and computers interconnected by communication channels that allow sharing of resources and information....
and storage devices such as hard disk drives.
Even parity is a special case of a cyclic redundancy check, where the single-bit CRC is generated by the divisor x + 1.
Cryptographic hash functions
A cryptographic hash function can provide strong assurances about data integrityData integrity
Data Integrity in its broadest meaning refers to the trustworthiness of system resources over their entire life cycle. In more analytic terms, it is "the representational faithfulness of information to the true state of the object that the information represents, where representational faithfulness...
, provided that changes of the data are only accidental (i.e., due to transmission errors). Any modification to the data will likely be detected through a mismatching hash value. Furthermore, given some hash value, it is infeasible to find some input data (other than the one given) that will yield the same hash value. Message authentication code
Message authentication code
In cryptography, a message authentication code is a short piece of information used to authenticate a message.A MAC algorithm, sometimes called a keyed hash function, accepts as input a secret key and an arbitrary-length message to be authenticated, and outputs a MAC...
s, also called keyed cryptographic hash functions, provide additional protection against intentional modification by an attacker.
Error-correcting codes
Any error-correcting code can be used for error detection. A code with minimum Hamming distanceHamming distance
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different...
, d, can detect up to d − 1 errors in a code word. Using minimum-distance-based error-correcting codes for error detection can be suitable if a strict limit on the minimum number of errors to be detected is desired.
Codes with minimum Hamming distance d = 2 are degenerate cases of error-correcting codes, and can be used to detect single errors. The parity bit
Parity bit
A parity bit is a bit that is added to ensure that the number of bits with the value one in a set of bits is even or odd. Parity bits are used as the simplest form of error detecting code....
is an example of a single-error-detecting code.
The Berger code
Berger code
In telecommunication, a Berger code is a unidirectional error detecting code, named after its inventor, J. M. Berger. Berger codes can detect all unidirectional errors. Unidirectional errors are errors that only flip ones into zeroes or only zeroes into ones, such as in asymmetric channels...
is an early example of a unidirectional error(-correcting) code that can detect any number of errors on an asymmetric channel, provided that only transitions of cleared bits to set bits or set bits to cleared bits can occur.
Automatic repeat request
Automatic Repeat reQuest (ARQ) is an error control method for data transmission that makes use of error-detection codes, acknowledgment and/or negative acknowledgment messages, and timeouts to achieve reliable data transmission. An acknowledgment is a message sent by the receiver to indicate that it has correctly received a data frameData frame
In computer networking and telecommunication, a frame is a digital data transmission unit or data packet that includes frame synchronization, i.e. a sequence of bits or symbols making it possible for the receiver to detect the beginning and end of the packet in the stream of symbols or bits...
.
Usually, when the transmitter does not receive the acknowledgment before the timeout occurs (i.e., within a reasonable amount of time after sending the data frame), it retransmits the frame until it is either correctly received or the error persists beyond a predetermined number of retransmissions.
Three types of ARQ protocols are Stop-and-wait ARQ
Stop-and-wait ARQ
Stop-and-wait ARQ is a method used in telecommunications to send information between two connected devices. It ensures that information is not lost due to dropped packets and that packets are received in the correct order. It is the simplest kind of automatic repeat-request method...
, Go-Back-N ARQ
Go-Back-N ARQ
Go-Back-N ARQ is a specific instance of the automatic repeat request protocol, in which the sending process continues to send a number of frames specified by a window size even without receiving an acknowledgement packet from the receiver...
, and Selective Repeat ARQ
Selective Repeat ARQ
Selective Repeat ARQ / Selective Reject ARQ is a specific instance of the Automatic Repeat-Request protocol used for communications.-Concept:...
.
ARQ is appropriate if the communication channel has varying or unknown capacity
Channel capacity
In electrical engineering, computer science and information theory, channel capacity is the tightest upper bound on the amount of information that can be reliably transmitted over a communications channel...
, such as is the case on the Internet. However, ARQ requires the availability of a back channel, results in possibly increased latency
Latency (engineering)
Latency is a measure of time delay experienced in a system, the precise definition of which depends on the system and the time being measured. Latencies may have different meaning in different contexts.-Packet-switched networks:...
due to retransmissions, and requires the maintenance of buffers and timers for retransmissions, which in the case of network congestion
Network congestion
In data networking and queueing theory, network congestion occurs when a link or node is carrying so much data that its quality of service deteriorates. Typical effects include queueing delay, packet loss or the blocking of new connections...
can put a strain on the server and overall network capacity.
Error-correcting code
An error-correcting code (ECC) or forward error correction (FEC) code is a system of adding redundantRedundancy (information theory)
Redundancy in information theory is the number of bits used to transmit a message minus the number of bits of actual information in the message. Informally, it is the amount of wasted "space" used to transmit certain data...
data, or parity data, to a message, such that it can be recovered by a receiver even when a number of errors (up to the capability of the code being used) were introduced, either during the process of transmission, or on storage. Since the receiver does not have to ask the sender for retransmission of the data, a back-channel
Back-channel
-In telecommunications:A back-channel is typically a low-speed, or less-than-optimal, transmission channel in the opposite direction to the main channel.-In IT Security:...
is not required in forward error correction, and it is therefore suitable for simplex communication
Simplex communication
Simplex communication refers to communication that occurs in one direction only. Two definitions have arisen over time: a common definition, which is used in ANSI standard and elsewhere, and an ITU-T definition...
such as broadcasting
Broadcasting
Broadcasting is the distribution of audio and video content to a dispersed audience via any audio visual medium. Receiving parties may include the general public or a relatively large subset of thereof...
. Error-correcting codes are frequently used in lower-layer
OSI model
The Open Systems Interconnection model is a product of the Open Systems Interconnection effort at the International Organization for Standardization. It is a prescription of characterizing and standardizing the functions of a communications system in terms of abstraction layers. Similar...
communication, as well as for reliable storage in media such as CDs
Compact Disc
The Compact Disc is an optical disc used to store digital data. It was originally developed to store and playback sound recordings exclusively, but later expanded to encompass data storage , write-once audio and data storage , rewritable media , Video Compact Discs , Super Video Compact Discs ,...
, DVD
DVD
A DVD is an optical disc storage media format, invented and developed by Philips, Sony, Toshiba, and Panasonic in 1995. DVDs offer higher storage capacity than Compact Discs while having the same dimensions....
s, hard disk
Hard disk
A hard disk drive is a non-volatile, random access digital magnetic data storage device. It features rotating rigid platters on a motor-driven spindle within a protective enclosure. Data is magnetically read from and written to the platter by read/write heads that float on a film of air above the...
s, and RAM.
Error-correcting codes are usually distinguished between convolutional code
Convolutional code
In telecommunication, a convolutional code is a type of error-correcting code in which* each m-bit information symbol to be encoded is transformed into an n-bit symbol, where m/n is the code rate and...
s and block code
Block code
In coding theory, block codes refers to the large and important family of error-correcting codes that encode data in blocks.There is a vast number of examples for block codes, many of which have a wide range of practical applications...
s:
- Convolutional codes are processed on a bit-by-bit basis. They are particularly suitable for implementation in hardware, and the Viterbi decoderViterbi decoderA Viterbi decoder uses the Viterbi algorithm for decoding a bitstream that has beenencoded using forward error correction based on a convolutional code....
allows optimal decoding.
- Block codes are processed on a block-by-blockBlock (telecommunications)In telecommunications a block is one of:* A group of bits or digits that is transmitted as a unit and that may be encoded for error-control purposes....
basis. Early examples of block codes are repetition codeRepetition codeIn coding theory, the repetition code is one of the most basic error-correcting codes. In order to transmit a message over a noisy channel that may corrupt the transmission in a few places, the idea of the repetition code is to just repeat the message several times. The hope is that the channel...
s, Hamming codeHamming codeIn telecommunication, Hamming codes are a family of linear error-correcting codes that generalize the Hamming-code invented by Richard Hamming in 1950. Hamming codes can detect up to two and correct up to one bit errors. By contrast, the simple parity code cannot correct errors, and can detect only...
s and multidimensional parity-check codeMultidimensional parity-check codeA multidimensional parity-check code is a simple type of error correcting code that operates by arranging the message into a multidimensional grid, and calculating a parity digit for each row and column...
s. They were followed by a number of efficient codes, Reed-Solomon codes being the most notable due to their current widespread use. Turbo codeTurbo codeIn information theory, turbo codes are a class of high-performance forward error correction codes developed in 1993, which were the first practical codes to closely approach the channel capacity, a theoretical maximum for the code rate at which reliable communication is still possible given a...
s and low-density parity-check codeLow-density parity-check codeIn information theory, a low-density parity-check code is a linear error correcting code, a method of transmitting a message over a noisy transmission channel, and is constructed using a sparse bipartite graph...
s (LDPC) are relatively new constructions that can provide almost optimal efficiency.
Shannon's theorem is an important theorem in forward error correction, and describes the maximum information rate at which reliable communication is possible over a channel that has a certain error probability or signal-to-noise ratio
Signal-to-noise ratio
Signal-to-noise ratio is a measure used in science and engineering that compares the level of a desired signal to the level of background noise. It is defined as the ratio of signal power to the noise power. A ratio higher than 1:1 indicates more signal than noise...
(SNR). This strict upper limit is expressed in terms of the channel capacity
Channel capacity
In electrical engineering, computer science and information theory, channel capacity is the tightest upper bound on the amount of information that can be reliably transmitted over a communications channel...
. More specifically, the theorem says that there exist codes such that with increasing encoding length the probability of error on a discrete memoryless channel can be made arbitrarily small, provided that the code rate
Code rate
In telecommunication and information theory, the code rate of a forward error correction code is the proportion of the data-stream that is useful...
is smaller than the channel capacity. The code rate is defined as the fraction k/n of k source symbols and n encoded symbols.
The actual maximum code rate allowed depends on the error-correcting code used, and may be lower. This is because Shannon's proof was only of existential nature, and did not show how to construct codes which are both optimal and have efficient encoding and decoding algorithms.
Hybrid schemes
Hybrid ARQHybrid ARQ
Hybrid automatic repeat request is a combination of high-rate forward error-correcting coding, and ARQ error-control for detectable-but-uncorrectable errors. In standard ARQ, redundant bits are added to data to be transmitted using an error-detecting code such as cyclic redundancy check...
is a combination of ARQ and forward error correction. There are two basic approaches:
- Messages are always transmitted with FEC parity data (and error-detection redundancy). A receiver decodes a message using the parity information, and requests retransmission using ARQ only if the parity data was not sufficient for successful decoding (identified through a failed integrity check).
- Messages are transmitted without parity data (only with error-detection information). If a receiver detects an error, it requests FEC information from the transmitter using ARQ, and uses it to reconstruct the original message.
The latter approach is particularly attractive on an erasure channel when using a rateless erasure code
Fountain code
In coding theory, fountain codes are a class of erasure codes with the property that a potentially limitless sequence of encoding symbols can be generated from a given set of source symbols such that the original source symbols can ideally be recovered from any subset of the encoding symbols of...
.
Applications
Applications that require low latency (such as telephone conversations) cannot use Automatic Repeat reQuest (ARQ); they must use Forward Error CorrectionForward error correction
In telecommunication, information theory, and coding theory, forward error correction or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels....
(FEC). By the time an ARQ system discovers an error and re-transmits it, the re-sent data will arrive too late to be any good.
Applications where the transmitter immediately forgets the information as soon as it is sent (such as most television cameras) cannot use ARQ; they must use FEC
Forward error correction
In telecommunication, information theory, and coding theory, forward error correction or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels....
because when an error occurs, the original data is no longer available. (This is also why FEC
Forward error correction
In telecommunication, information theory, and coding theory, forward error correction or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels....
is used in data storage systems such as RAID
RAID
RAID is a storage technology that combines multiple disk drive components into a logical unit...
and distributed data store
Distributed data store
A distributed data store is a blurred concept and means either a distributed database where users store their information on a number of nodes, or a network in which a user stores their information on a number of peer network nodes ....
).
Applications that use ARQ must have a return channel
Return channel
In communications systems that use star topologies, the return channel is the transmission link from a user terminal to the central hub....
. Applications that have no return channel cannot use ARQ.
Applications that require extremely low error rates (such as digital money transfers) must use ARQ.
The Internet
In a typical TCP/IP stack, error control is performed at multiple levels:- Each EthernetEthernetEthernet is a family of computer networking technologies for local area networks commercially introduced in 1980. Standardized in IEEE 802.3, Ethernet has largely replaced competing wired LAN technologies....
frameData frameIn computer networking and telecommunication, a frame is a digital data transmission unit or data packet that includes frame synchronization, i.e. a sequence of bits or symbols making it possible for the receiver to detect the beginning and end of the packet in the stream of symbols or bits...
carries a CRC-32Cyclic redundancy checkA cyclic redundancy check is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data...
checksumChecksumA checksum or hash sum is a fixed-size datum computed from an arbitrary block of digital data for the purpose of detecting accidental errors that may have been introduced during its transmission or storage. The integrity of the data can be checked at any later time by recomputing the checksum and...
. Frames received with incorrect checksums are discarded by the receiver hardware. - The IPv4IPv4Internet Protocol version 4 is the fourth revision in the development of the Internet Protocol and the first version of the protocol to be widely deployed. Together with IPv6, it is at the core of standards-based internetworking methods of the Internet...
header contains a checksum protecting the contents of the header. Packets with mismatching checksums are dropped within the network or at the receiver. - The checksum was omitted from the IPv6IPv6Internet Protocol version 6 is a version of the Internet Protocol . It is designed to succeed the Internet Protocol version 4...
header in order to minimize processing costs in network routing and because current link layerLink LayerIn computer networking, the link layer is the lowest layer in the Internet Protocol Suite , the networking architecture of the Internet . It is the group of methods or protocols that only operate on a host's link...
technology is assumed to provide sufficient error detection (see also RFC 3819). - UDPUser Datagram ProtocolThe User Datagram Protocol is one of the core members of the Internet Protocol Suite, the set of network protocols used for the Internet. With UDP, computer applications can send messages, in this case referred to as datagrams, to other hosts on an Internet Protocol network without requiring...
has an optional checksum covering the payload and addressing information from the UDP and IP headers. Packets with incorrect checksums are discarded by the operating systemOperating systemAn operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...
network stack. The checksum is optional under IPv4, only, because the Data-Link layer checksum may already provide the desired level of error protection. - TCPTransmission Control ProtocolThe Transmission Control Protocol is one of the core protocols of the Internet Protocol Suite. TCP is one of the two original components of the suite, complementing the Internet Protocol , and therefore the entire suite is commonly referred to as TCP/IP...
provides a checksum for protecting the payload and addressing information from the TCP and IP headers. Packets with incorrect checksums are discarded within the network stack, and eventually get retransmitted using ARQ, either explicitly (such as through triple-ack) or implicitly due to a timeoutTimeout (telecommunication)In telecommunication and related engineering , the term timeout or time-out has several meanings, including...
.
Deep-space telecommunications
Development of error-correction codes was tightly coupled with the history of deep-space missions due to the extreme dilution of signal power over interplanetary distances, and the limited power availability aboard space probes. Whereas early missions sent their data uncoded, starting from 1968 digital error correction was implemented in the form of (sub-optimally decoded) convolutional codeConvolutional code
In telecommunication, a convolutional code is a type of error-correcting code in which* each m-bit information symbol to be encoded is transformed into an n-bit symbol, where m/n is the code rate and...
s and Reed-Muller codes. The Reed-Muller code was well suited to the noise the spacecraft was subject to (approximately matching a bell curve), and was implemented at the Mariner spacecraft for missions between 1969 and 1977.
The Voyager 1
Voyager 1
The Voyager 1 spacecraft is a 722-kilogram space probe launched by NASA in 1977, to study the outer Solar System and eventually interstellar space. Operating for as of today , the spacecraft receives routine commands and transmits data back to the Deep Space Network. At a distance of as of...
and Voyager 2
Voyager 2
The Voyager 2 spacecraft is a 722-kilogram space probe launched by NASA on August 20, 1977 to study the outer Solar System and eventually interstellar space...
missions, which started in 1977, were designed to deliver color imaging amongst scientific information of Jupiter
Jupiter
Jupiter is the fifth planet from the Sun and the largest planet within the Solar System. It is a gas giant with mass one-thousandth that of the Sun but is two and a half times the mass of all the other planets in our Solar System combined. Jupiter is classified as a gas giant along with Saturn,...
and Saturn
Saturn
Saturn is the sixth planet from the Sun and the second largest planet in the Solar System, after Jupiter. Saturn is named after the Roman god Saturn, equated to the Greek Cronus , the Babylonian Ninurta and the Hindu Shani. Saturn's astronomical symbol represents the Roman god's sickle.Saturn,...
. This resulted in increased coding requirements, and thus the spacecraft were supported by (optimally Viterbi-decoded
Viterbi decoder
A Viterbi decoder uses the Viterbi algorithm for decoding a bitstream that has beenencoded using forward error correction based on a convolutional code....
) convolutional codes that could be concatenated with an outer Golay (24,12,8) code
Binary Golay code
In mathematics and electronics engineering, a binary Golay code is a type of error-correcting code used in digital communications. The binary Golay code, along with the ternary Golay code, has a particularly deep and interesting connection to the theory of finite sporadic groups in mathematics....
. The Voyager 2 probe additionally supported an implementation of a Reed-Solomon code: the concatenated Reed-Solomon-Viterbi (RSV) code allowed for very powerful error correction, and enabled the spacecraft's extended journey to Uranus
Uranus
Uranus is the seventh planet from the Sun. It has the third-largest planetary radius and fourth-largest planetary mass in the Solar System. It is named after the ancient Greek deity of the sky Uranus , the father of Cronus and grandfather of Zeus...
and Neptune
Neptune
Neptune is the eighth and farthest planet from the Sun in the Solar System. Named for the Roman god of the sea, it is the fourth-largest planet by diameter and the third largest by mass. Neptune is 17 times the mass of Earth and is slightly more massive than its near-twin Uranus, which is 15 times...
.
The CCSDS currently recommends usage of error correction codes with performance similar to the Voyager 2 RSV code as a minimum. Concatenated codes are increasingly falling out of favor with space missions, and are replaced by more powerful codes such as Turbo code
Turbo code
In information theory, turbo codes are a class of high-performance forward error correction codes developed in 1993, which were the first practical codes to closely approach the channel capacity, a theoretical maximum for the code rate at which reliable communication is still possible given a...
s or LDPC codes.
The different kinds of deep space and orbital missions that are conducted suggest that trying to find a "one size fits all" error correction system will be an ongoing problem for some time to come. For missions close to earth the nature of the channel
Channel (communications)
In telecommunications and computer networking, a communication channel, or channel, refers either to a physical transmission medium such as a wire, or to a logical connection over a multiplexed medium such as a radio channel...
noise
Noise (electronics)
Electronic noise is a random fluctuation in an electrical signal, a characteristic of all electronic circuits. Noise generated by electronic devices varies greatly, as it can be produced by several different effects...
is different from that of a spacecraft on an interplanetary mission experiences. Additionally, as a spacecraft increases its distance from earth, the problem of correcting for noise gets larger.
Satellite broadcasting (DVB)
The demand for satellite transponderTransponder
In telecommunication, the term transponder has the following meanings:...
bandwidth continues to grow, fueled by the desire to deliver television (including new channels and High Definition TV) and IP data. Transponder availability and bandwidth constraints have limited this growth, because transponder capacity is determined by the selected modulation
Modulation
In electronics and telecommunications, modulation is the process of varying one or more properties of a high-frequency periodic waveform, called the carrier signal, with a modulating signal which typically contains information to be transmitted...
scheme and Forward error correction
Forward error correction
In telecommunication, information theory, and coding theory, forward error correction or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels....
(FEC) rate.
Overview
- QPSK coupled with traditional Reed Solomon and Viterbi codes have been used for nearly 20 years for the delivery of digital satellite TV.
- Higher order modulation schemes such as 8PSK, 16QAMQuadrature amplitude modulationQuadrature amplitude modulation is both an analog and a digital modulation scheme. It conveys two analog message signals, or two digital bit streams, by changing the amplitudes of two carrier waves, using the amplitude-shift keying digital modulation scheme or amplitude modulation analog...
and 32QAM have enabled the satellite industry to increase transponder efficiency by several orders of magnitude. - This increase in the information rate in a transponder comes at the expense of an increase in the carrier power to meet the threshold requirement for existing antennas.
- Tests conducted using the latest chipsets demonstrate that the performance achieved by using Turbo Codes may be even lower than the 0.8 dBDecibelThe decibel is a logarithmic unit that indicates the ratio of a physical quantity relative to a specified or implied reference level. A ratio in decibels is ten times the logarithm to base 10 of the ratio of two power quantities...
figure assumed in early designs.
Data storage
Error detection and correction codes are often used to improve the reliability of data storage media.A "parity track" was present on the first magnetic tape data storage
Magnetic tape data storage
Magnetic tape data storage uses digital recording on to magnetic tape to store digital information. Modern magnetic tape is most commonly packaged in cartridges and cassettes. The device that performs actual writing or reading of data is a tape drive...
in 1951. The "Optimal Rectangular Code" used in group code recording
Group Code Recording
In computer science, group code recording refers to several distinct but related encoding methods for magnetic media. The first, used in 6250 cpi magnetic tape, is an error-correcting code combined with a run length limited encoding scheme...
tapes not only detects but also corrects single-bit errors.
Some file format
File format
A file format is a particular way that information is encoded for storage in a computer file.Since a disk drive, or indeed any computer storage, can store only bits, the computer must have some way of converting information to 0s and 1s and vice-versa. There are different kinds of formats for...
s, particularly archive formats, include a checksum (most often CRC32) to detect corruption and truncation and can employ redundancy and/or parity file
Parity file
Parity files are files that are created to accompany data files, and are used to preserve data integrity and assist in data recovery. They are useful when data files are transmitted or stored on less-than-perfect media such as newsgroup messages, satellite transmission, or optical disk...
s to recover portions of corrupted data.
Reed Solomon codes
Cross-Interleaved Reed-Solomon Coding
In the compact disc system, cross-interleaved Reed-Solomon code provides error detection and error correction. CIRC adds to every three data bytes one redundant parity byte.-Overview:...
are used in compact disc
Compact Disc
The Compact Disc is an optical disc used to store digital data. It was originally developed to store and playback sound recordings exclusively, but later expanded to encompass data storage , write-once audio and data storage , rewritable media , Video Compact Discs , Super Video Compact Discs ,...
s to correct errors caused by scratches.
Modern hard drives use CRC codes to detect and Reed-Solomon codes to correct minor errors in sector reads, and to recover data from sectors that have "gone bad" and store that data in the spare sectors.
RAID
RAID
RAID is a storage technology that combines multiple disk drive components into a logical unit...
systems use a variety of error correction techniques, to correct errors when a hard drive completely fails.
Error-correcting memory
DRAMDynamic random access memory
Dynamic random-access memory is a type of random-access memory that stores each bit of data in a separate capacitor within an integrated circuit. The capacitor can be either charged or discharged; these two states are taken to represent the two values of a bit, conventionally called 0 and 1...
memory may provide increased protection against soft error
Soft error
In electronics and computing, a soft error is an error in a signal or datum which is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but is not assumed to...
s by relying on error correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for high fault-tolerant applications, such as servers, as well as deep-space applications due to increased radiation
Cosmic ray
Cosmic rays are energetic charged subatomic particles, originating from outer space. They may produce secondary particles that penetrate the Earth's atmosphere and surface. The term ray is historical as cosmic rays were thought to be electromagnetic radiation...
.
Error-correcting memory controllers traditionally use Hamming code
Hamming code
In telecommunication, Hamming codes are a family of linear error-correcting codes that generalize the Hamming-code invented by Richard Hamming in 1950. Hamming codes can detect up to two and correct up to one bit errors. By contrast, the simple parity code cannot correct errors, and can detect only...
s, although some use triple modular redundancy
Triple modular redundancy
In computing, triple modular redundancy is a fault tolerant form of N-modular redundancy, in which three systems perform a process and that result is processed by a voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the...
.
Interleaving
Interleaving
In computer science and telecommunication, interleaving is a way to arrange data in a non-contiguous way to increase performance.It is typically used:* In error-correction coding, particularly within data transmission, disk storage, and computer memory....
allows distributing the effect of a single cosmic ray potentially upsetting multiple physically neighboring bits across multiple words by associating neighboring bits to different words. As long as a single event upset
Single event upset
A single event upset is a change of state caused by ions or electro-magnetic radiation striking a sensitive node in a micro-electronic device, such as in a microprocessor, semiconductor memory, or power transistors. The state change is a result of the free charge created by ionization in or close...
(SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error correcting code), and the illusion of an error-free memory system may be maintained.
See also
- Forward error correctionForward error correctionIn telecommunication, information theory, and coding theory, forward error correction or channel coding is a technique used for controlling errors in data transmission over unreliable or noisy communication channels....
- Link adaptationLink adaptationLink adaptation, or adaptive coding and modulation , is a term used in wireless communications to denote the matching of the modulation, coding and other signal and protocol parameters to the conditions on the radio link Link adaptation, or adaptive coding and modulation (ACM), is a term used in...
- List of algorithms for error detection and correction
- List of checksum algorithms
- List of error-correcting codes
- Reliability (computer networking)Reliability (computer networking)In computer networking, a reliable protocol is one that provides reliability properties with respect to the delivery of data to the intended recipient, as opposed to an unreliable protocol, which does not provide notifications to the sender as to the delivery of transmitted data.A reliable...
External links
- The on-line textbook: Information Theory, Inference, and Learning Algorithms, by David MacKay, contains chapters on elementary error-correcting codes; on the theoretical limits of error-correction; and on the latest state-of-the-art error-correcting codes, including low-density parity-check codeLow-density parity-check codeIn information theory, a low-density parity-check code is a linear error correcting code, a method of transmitting a message over a noisy transmission channel, and is constructed using a sparse bipartite graph...
s, turbo codeTurbo codeIn information theory, turbo codes are a class of high-performance forward error correction codes developed in 1993, which were the first practical codes to closely approach the channel capacity, a theoretical maximum for the code rate at which reliable communication is still possible given a...
s, and fountain codes. - Compute parameters of linear codes - an on-line interface for generating and computing parameters (e.g. minimum distanceMinimum distanceThe term minimum distance is used in several ways:* In geometry, the minimum distance of a collection of points P in a space is the smallest distance between any two points of the space....
, covering radius) of linear error-correcting codesLinear codeIn coding theory, a linear code is an error-correcting code for which any linear combination of codewords is also a codeword. Linear codes are traditionally partitioned into block codes and convolutional codes, although Turbo codes can be seen as a hybrid of these two types. Linear codes allow for...
. - ECC Page