Soft error
In electronics and computing, a soft error is an error in which a signal or datum is wrong. Errors may be caused by a defect, usually understood to be either a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but it is not assumed to imply such a mistake or breakage. After observing a soft error, there is no implication that the system is any less reliable than before.
In a computer's memory system, a soft error changes an instruction in a program or a data value. Soft errors can typically be remedied by cold booting the computer. A soft error does not damage a system's hardware; the only damage is to the data being processed.
There are two types of soft errors:
* chip-level soft error:
These errors occur when radioactive atoms in the chip's material decay and release alpha particles into the chip.
Because an alpha particle carries a positive charge and kinetic energy, it can hit a memory cell and cause the cell to change state.
The energy deposited is so small that it does not damage the structure of the chip. Chip-level errors are rare because modern chip materials contain very few radioactive atoms, so the average time between alpha-induced upsets in a given memory is long.
* system-level soft error:
These errors occur when the data being processed is disturbed by noise, typically while the data is on a data bus. The computer interprets the noise as a data bit, which can cause errors in addressing or in processing program code. The bad data bit can even be saved in memory and cause problems at a later time.
If detected, a soft error may be corrected by rewriting correct data in place of erroneous data. Highly reliable systems use error correction to correct soft errors on the fly. However, in many systems, it may be impossible to determine the correct data, or even to discover that an error is present at all. In addition, before the correction can occur, the system may have crashed, in which case the recovery procedure must include a reboot.
Soft errors involve changes to data (the electrons in a storage circuit, for example) but not changes to the physical circuit itself (the atoms). If the data is rewritten, the circuit will work perfectly again.
Soft errors can occur on transmission lines, in digital logic, analog circuits, magnetic storage, and elsewhere, but are most commonly known in semiconductor storage.
Critical charge
Whether a circuit experiences a soft error depends on the energy of the incoming particle, the geometry of the impact, the location of the strike, and the design of the logic circuit. Logic circuits with higher capacitance and higher logic voltages are less likely to suffer an error. This combination of capacitance and voltage is described by the critical charge parameter, Qcrit, the minimum electron charge disturbance needed to change the logic level. A higher Qcrit means fewer soft errors. Unfortunately, a higher Qcrit also means a slower logic gate and higher power dissipation. Reduction in chip feature size and supply voltage, desirable for many reasons, decreases Qcrit. Thus, the importance of soft errors increases as chip technology advances.
In a logic circuit, Qcrit is defined as the minimum amount of induced charge required at a circuit node to cause a voltage pulse to propagate from that node to the output and be of sufficient duration and magnitude to be reliably latched. Since a logic circuit contains many nodes that may be struck, and each node may be of unique capacitance and distance from output, Qcrit is typically characterized on a per-node basis.
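As a rough illustration of why smaller capacitance and lower voltage reduce Qcrit, the sketch below uses the common first-order approximation Qcrit ≈ C × V (node capacitance times supply voltage). The approximation, the function names, and the example node values are assumptions made for illustration; they are not figures from the text, and a real Qcrit also depends on drive current and pulse timing.

```python
ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs per electron (exact SI value)

def critical_charge_coulombs(node_capacitance_f, supply_voltage_v):
    """First-order estimate Qcrit ~ C * V; ignores drive current and timing."""
    return node_capacitance_f * supply_voltage_v

def charge_in_electrons(charge_c):
    """Express a charge as an equivalent number of electrons."""
    return charge_c / ELEMENTARY_CHARGE_C

# Hypothetical nodes: an older, larger node and a scaled-down one.
old_node = critical_charge_coulombs(10e-15, 3.3)  # 10 fF at 3.3 V
new_node = critical_charge_coulombs(1e-15, 1.0)   # 1 fF at 1.0 V

for label, q in (("old", old_node), ("new", new_node)):
    print(f"{label}: {q * 1e15:.1f} fC, about {charge_in_electrons(q):,.0f} electrons")
```

Under these assumed values the scaled node needs roughly thirty times less disturbing charge to flip, which is the trend the paragraph above describes.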
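Alpha particles from package decay
Soft errors became widely known with the introduction of dynamic RAM in the 1970s. In these early devices, chip packaging materials contained small amounts of radioactive contaminants. Very low decay rates are needed to avoid excess soft errors, and chip companies have occasionally suffered problems with contamination ever since. It is extremely hard to maintain the material purity needed. Controlling alpha particle emission rates for critical packaging materials to less than 0.001 counts per hour per cm² (cph/cm²) is required for reliable performance of most circuits. For comparison, the count rate of a typical shoe's sole is between 0.1 and 10 cph/cm².
Package radioactive decay usually causes a soft error by alpha particle emission. The positively charged alpha particle travels through the semiconductor and disturbs the distribution of electrons there. If the disturbance is large enough, a digital signal can change from a 0 to a 1 or vice versa. In combinational logic, this effect is transient, perhaps lasting a fraction of a nanosecond, and this has led to the challenge of soft errors in combinational logic mostly going unnoticed. In sequential logic such as latches and RAM, even this transient upset can become stored for an indefinite time, to be read out later. Thus, designers are usually much more aware of the problem in storage circuits.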
Cosmic rays creating energetic neutrons and protons
Once the electronics industry had determined how to control package contaminants, it became clear that other causes were also at work. James F. Ziegler led a program of work at IBM which culminated in the publication of a number of papers (Ziegler and Lanford, 1979) demonstrating that cosmic rays also could cause soft errors. Indeed, in modern devices, cosmic rays may be the predominant cause. Although the primary particle of the cosmic ray does not generally reach the Earth's surface, it creates a shower of energetic secondary particles. At the Earth's surface approximately 95% of the particles capable of causing soft errors are energetic neutrons, with the remainder composed of protons and pions (Ziegler, 1996). This flux of energetic neutrons is typically referred to as "cosmic rays" in the soft error literature. Neutrons are uncharged and cannot disturb a circuit on their own, but undergo neutron capture by the nucleus of an atom in a chip. This process may result in the production of charged secondaries, such as alpha particles and oxygen nuclei, which can then cause soft errors.
Cosmic ray flux depends on altitude. For the common reference location of 40.7° N, 74° W at 0 meters (sea level in New York City, NY, USA) the flux is approximately 14 neutrons/cm²/hour. Burying a system in a cave reduces the rate of cosmic-ray induced soft errors to a negligible level. In the lower levels of the atmosphere, the flux increases by a factor of about 2.2 for every 1000 m (1.3 for every 1000 ft) increase in altitude above sea level. Computers operated on top of mountains experience an order of magnitude higher rate of soft errors compared to sea level. The rate of upsets in aircraft may be more than 300 times the sea level upset rate. This is in contrast to package decay induced soft errors, which do not change with location. A model of the energetic neutron flux is presented in (Gordon & Goldhagen, 2004). An online calculator for this model is available at www.seutest.com.
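The altitude scaling quoted above can be turned into a small calculation. The sketch below simply extrapolates the stated factor of 2.2 per 1000 m from the sea-level reference flux; the function names are illustrative, and this is not the Gordon & Goldhagen model. It is only meaningful for the lower atmosphere, so it should not be extended to aircraft altitudes.

```python
SEA_LEVEL_FLUX = 14.0     # neutrons / cm^2 / hour at the New York City reference
FACTOR_PER_1000_M = 2.2   # growth factor quoted for the lower atmosphere

def neutron_flux(altitude_m):
    """Exponential extrapolation of the quoted scaling; low altitudes only."""
    return SEA_LEVEL_FLUX * FACTOR_PER_1000_M ** (altitude_m / 1000.0)

def relative_rate(altitude_m):
    """Flux at the given altitude relative to sea level."""
    return neutron_flux(altitude_m) / SEA_LEVEL_FLUX

for altitude in (0, 1000, 1600, 3000):
    print(f"{altitude:>5} m: {neutron_flux(altitude):6.1f} n/cm^2/h "
          f"({relative_rate(altitude):.1f}x sea level)")
```

At 3000 m the factor is 2.2 cubed, a little over 10, which matches the order-of-magnitude increase stated for mountain-top computers.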
The average rate of cosmic-ray soft errors is inversely related to sunspot activity. That is, the average number of cosmic-ray soft errors decreases during the active portion of the sunspot cycle and increases during the quiet portion. This counterintuitive result occurs for two reasons. First, the sun does not generally produce cosmic ray particles with energy above 1 GeV that are capable of penetrating to the Earth's upper atmosphere and creating particle showers, so changes in the solar flux do not directly influence the number of errors. Second, the increase in solar flux during an active sun period reshapes the Earth's magnetic field, providing some additional shielding against higher energy cosmic rays and so decreasing the number of particles that create showers. The effect is fairly small in any case, resulting in a ±7% modulation of the energetic neutron flux in New York City. Other locations are similarly affected.
Energetic neutrons produced by cosmic rays may lose most of their kinetic energy and reach thermal equilibrium with their surroundings as they are scattered by materials. The resulting neutrons are simply referred to as thermal neutrons and have a characteristic kinetic energy (kT) of about 25 millielectron-volts at 25°C. Thermal neutrons are also produced by environmental radiation sources such as the decay of naturally occurring uranium or thorium. The thermal neutron flux from sources other than cosmic-ray showers may still be noticeable in an underground location and may be an important contributor to soft errors for some circuits.
Thermal neutrons
Neutrons that have lost kinetic energy until they are in thermal equilibrium with their surroundings are an important cause of soft errors for some circuits. At low energies many neutron capture reactions become much more probable and result in fission of certain materials, creating charged secondaries as fission byproducts. For some circuits the capture of a thermal neutron by the nucleus of the B-10 isotope of boron is particularly important. This nuclear reaction is an efficient producer of an alpha particle, a Li-7 nucleus and a gamma ray. Either of the charged particles (alpha or Li-7) may cause a soft error if produced in very close proximity, approximately 5 micrometers, to a critical circuit node. The capture cross section for B-11 is 6 orders of magnitude smaller and does not contribute to soft errors (Baumann et al., 1995).
Boron has been used in borophosphosilicate glass (BPSG), the insulator in the interconnection layers of integrated circuits, particularly in the lowest one. The inclusion of boron lowers the melt temperature of the glass, providing better reflow and planarization characteristics. In this application the glass is formulated with a boron content of 4% to 5% by weight. Naturally occurring boron is 20% B-10, with the remainder the B-11 isotope. Soft errors are caused by the high level of B-10 in this critical lower layer of some older integrated circuit processes. Boron-11, used at low concentrations as a p-type dopant, does not contribute to soft errors. Integrated circuit manufacturers eliminated borated dielectrics by the 150 nm process node, largely due to this problem.
In critical designs, depleted boron, consisting almost entirely of boron-11, is used to avoid this effect and therefore to reduce the soft error rate. Boron-11 is a by-product of the nuclear industry.
For applications in medical electronic devices this soft error mechanism may be extremely important. Neutrons are produced during high energy cancer radiation therapy using photon beam energies above 10 MV. These neutrons are moderated as they are scattered from the equipment and walls in the treatment room, resulting in a thermal neutron flux that is about 40 × 10^6 times higher than the normal environmental neutron flux. This high thermal neutron flux will generally result in a very high rate of soft errors and consequent circuit upset (Wilkinson et al., 2005; Franco et al., 2005).
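Other causes
Soft errors can also be caused by random noise or signal integrity problems, such as inductive or capacitive crosstalk. However, in general, these sources represent a small contribution to the overall soft error rate when compared to radiation effects.
Soft error mitigation
A designer can attempt to minimize the rate of soft errors by judicious device design, choosing the right semiconductor, package and substrate materials, and the right device geometry. Often, however, this is limited by the need to reduce device size and voltage, to increase operating speed and to reduce power dissipation. The susceptibility of devices to upsets is described in the industry using the JEDEC JESD-89 standard.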
One technique that can be used to reduce the soft error rate in digital circuits is called radiation hardening. This involves increasing the capacitance at selected circuit nodes in order to increase their effective Qcrit value. This reduces the range of particle energies that can upset the logic value of the node. Radiation hardening is often accomplished by increasing the size of transistors that share a drain/source region at the node. Since the area and power overhead of radiation hardening can be restrictive to design, the technique is often applied selectively to the nodes predicted to have the highest probability of causing soft errors if struck. Tools and models that can predict which nodes are most vulnerable are the subject of past and current research in the area of soft errors.
Correcting soft errors
A memory design may use forward error correction, incorporating redundant data into each word to create an error correcting code. Alternatively, roll-back error correction can be used, detecting the soft error with an error-detecting code such as parity, and rewriting correct data from another source. This technique is often used for write-through cache memories.
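To make the idea of redundant data in each word concrete, here is a minimal sketch of a single-error-correcting Hamming(7,4) code. The choice of this particular code and all function names are assumptions for illustration; the text does not name a code, and real memories commonly use wider codes over 64-bit words.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits at positions 1..7.
    Parity bits sit at positions 1, 2 and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(word):
    """Return (corrected_word, error_position); position 0 means no error seen."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s4   # syndrome is the 1-based index of the bad bit
    if position:
        w[position - 1] ^= 1          # flip the faulty bit back
    return w, position

def hamming74_data(word):
    """Extract the 4 data bits from a 7-bit code word."""
    return [word[2], word[4], word[5], word[6]]

data = [1, 0, 1, 1]
stored = hamming74_encode(data)
upset = list(stored)
upset[4] ^= 1                         # simulate a soft error in one cell
fixed, where = hamming74_correct(upset)
print("error at position", where, "- data recovered:", hamming74_data(fixed) == data)
```

This code corrects any single flipped bit per word but miscorrects when two bits in the same word are wrong.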
Soft errors in logic circuits are sometimes detected and corrected using the techniques of fault-tolerant design. These often include the use of redundant circuitry or computation of data, and typically come at the cost of circuit area, decreased performance, and/or higher power consumption. The concept of triple modular redundancy (TMR) can be employed to ensure very high soft-error reliability in logic circuits. In this technique, three identical copies of a circuit compute on the same data in parallel and their outputs are fed into majority voting logic, which returns the value that occurred in at least two of the three cases. In this way, the failure of one circuit due to a soft error is discarded, assuming the other two circuits operated correctly. In practice, however, few designers can afford the greater than 200% circuit area and power overhead required, so it is usually only selectively applied. Another common concept to correct soft errors in logic circuits is temporal (or time) redundancy, in which one circuit operates on the same data multiple times and compares subsequent evaluations for consistency. This approach, however, often incurs performance overhead, area overhead (if copies of latches are used to store data), and power overhead, though it is considerably more area-efficient than modular redundancy.
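The majority voting step of TMR can be sketched in a few lines. The example below models the three circuit copies as three calls to the same function and injects a single bit flip into one result. The function `compute` and the helper names are hypothetical stand-ins, not part of any real design flow.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority of three integer words."""
    return (a & b) | (a & c) | (b & c)

def flip_bit(word, bit_index):
    """Model a soft error by inverting one bit of a word."""
    return word ^ (1 << bit_index)

def compute(x):
    """Stand-in for the protected logic; any deterministic function works."""
    return (x * 3 + 1) & 0xFF

x = 0x5A
copies = [compute(x), compute(x), compute(x)]
copies[1] = flip_bit(copies[1], 6)    # soft error in one of the three copies
voted = majority_vote(*copies)
print("voted result matches fault-free result:", voted == compute(x))
```

The voter masks any error confined to one copy. If the same bit is wrong in two copies, the wrong value wins the vote, which is why the technique assumes the other two circuits operated correctly.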
Traditionally, DRAM has had the most attention in the quest to reduce or work around soft errors, because DRAM has comprised the majority share of susceptible device surface area in desktop and server computer systems (hence the prevalence of ECC RAM in server computers). Hard figures for DRAM susceptibility are hard to come by, and vary considerably across designs, fabrication processes, and manufacturers. 256-kilobit DRAMs built with 1980s technology could have clusters of five or six bits flip from a single alpha particle. Modern DRAMs have much smaller feature sizes, so the deposition of a similar amount of charge could easily cause many more bits to flip.
The design of error detection and correction circuits is helped by the fact that soft errors usually are localised to a very small area of a chip. Usually, only one cell of a memory is affected, although high energy events can cause a multi-cell upset. Conventional memory layout usually places one bit of many different correction words adjacent on a chip. So, even a multi-cell upset leads to only a number of separate single-bit upsets in multiple correction words, rather than a multi-bit upset in a single correction word. So, an error correcting code needs only to cope with a single bit in error in each correction word in order to cope with all likely soft errors. The term 'multi-cell' is used for upsets affecting multiple cells of a memory, whatever correction words those cells happen to fall in. 'Multi-bit' is used when multiple bits in a single correction word are in error.
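The effect of placing bits of different correction words next to each other can be sketched as follows. The row geometry (4 words of 8 bits), the two mapping functions, and the upset pattern are all assumptions chosen to illustrate the layout idea, not a description of any specific memory.

```python
WORDS_PER_ROW = 4
BITS_PER_WORD = 8

def interleaved_owner(column):
    """Physical column -> (word index, bit index) with bit interleaving."""
    return column % WORDS_PER_ROW, column // WORDS_PER_ROW

def contiguous_owner(column):
    """Same row without interleaving: each word occupies adjacent columns."""
    return column // BITS_PER_WORD, column % BITS_PER_WORD

def errors_per_word(upset_columns, owner):
    """Count how many flipped cells land in each correction word."""
    counts = [0] * WORDS_PER_ROW
    for column in upset_columns:
        word, _bit = owner(column)
        counts[word] += 1
    return counts

upset = [9, 10, 11]   # one strike upsetting three physically adjacent cells
print("interleaved:", errors_per_word(upset, interleaved_owner))
print("contiguous: ", errors_per_word(upset, contiguous_owner))
```

With interleaving, the three-cell upset becomes three single-bit errors in different words, each within reach of a single-error-correcting code. Without it, one word receives a multi-bit error.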
In logic circuits, three masking effects determine whether a single event upset (SEU) will propagate to become a soft error: electrical masking, logical masking, and temporal (or timing-window) masking. An SEU is logically masked if its propagation is blocked from reaching an output latch because off-path gate inputs prevent a logical transition of that gate's output. An SEU is electrically masked if the signal is attenuated by the electrical properties of gates on its propagation path such that the resulting pulse is of insufficient magnitude to be reliably latched. An SEU is temporally masked if the erroneous pulse reaches an output latch, but does not occur close enough to when the latch is actually triggered to hold.
If all three masking effects fail to occur, the propagated pulse becomes latched and the output of the logic circuit will be an erroneous value. In the context of circuit operation, this erroneous output value may be considered a soft error event. However, from a microarchitectural-level standpoint, the affected result may not change the output of the currently-executing program. For instance, the erroneous data could be overwritten before use, masked in subsequent logic operations, or simply never be used. If erroneous data does not affect the output of a program, it is considered to be an example of microarchitectural masking.
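Logical masking in particular is easy to demonstrate. The sketch below uses a hypothetical two-gate circuit, out = (a AND b) OR c, flips the AND gate's output, and lists the input patterns for which the output is unchanged. Electrical and temporal masking depend on analog and timing behavior and are not modeled here.

```python
from itertools import product

def circuit(a, b, c, flip_internal=False):
    """out = (a AND b) OR c, with an optional upset on the AND gate's output."""
    n = a & b
    if flip_internal:
        n ^= 1          # transient upset at the internal node
    return n | c

def logically_masked_cases():
    """Input patterns for which the upset does not change the output."""
    masked = []
    for a, b, c in product((0, 1), repeat=3):
        if circuit(a, b, c) == circuit(a, b, c, flip_internal=True):
            masked.append((a, b, c))
    return masked

masked = logically_masked_cases()
print(len(masked), "of 8 input patterns mask the upset:", masked)
```

The upset is masked exactly when the off-path input c is 1, because the OR gate's output is then fixed regardless of the struck node.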
While many electronic systems have a mean time between failures (MTBF) that exceeds the expected lifetime of the circuit, the soft error rate (SER) may still be unacceptable to the manufacturer or customer. For instance, many failures per million circuits due to soft errors can be expected in the field if the system does not have adequate soft error protection. The failure of even a few products in the field, particularly if catastrophic, can tarnish the reputation of the product and of the company that designed it. Also, in safety- or cost-critical applications where the cost of system failure far outweighs the cost of the system itself, a 1% chance of soft error failure per lifetime may be too high to be acceptable to the customer. Therefore, it is advantageous to design for low SER when manufacturing a system in high volume or when extremely high reliability is required.
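The point that an MTBF longer than the product lifetime can still mean many field failures can be checked with a short calculation. The sketch below assumes a constant failure rate expressed in FIT (failures per 10^9 device-hours, a common reliability unit) and an exponential failure model. The FIT value and the five-year lifetime are hypothetical numbers, not data from the text.

```python
import math

HOURS_PER_YEAR = 8760.0

def mtbf_hours(fit):
    """FIT = failures per 1e9 device-hours, so MTBF = 1e9 / FIT hours."""
    return 1e9 / fit

def failure_probability(fit, lifetime_hours):
    """P(at least one failure) for a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-fit * lifetime_hours / 1e9)

fit = 1000.0                     # hypothetical soft error rate
lifetime = 5 * HOURS_PER_YEAR    # hypothetical 5-year service life

p = failure_probability(fit, lifetime)
print(f"MTBF: {mtbf_hours(fit) / HOURS_PER_YEAR:.0f} years")
print(f"P(failure in lifetime): {p:.2%}")
print(f"Expected affected units per million: {1e6 * p:,.0f}")
```

With these assumed numbers the MTBF is over a century, yet roughly four percent of units would see a soft error failure within five years, well above the 1% figure mentioned above.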
Electronics
Electronics is the branch of science, engineering and technology that deals with electrical circuits involving active electrical components such as vacuum tubes, transistors, diodes and integrated circuits, and associated passive interconnection technologies...
and computing
Computing
Computing is usually defined as the activity of using and improving computer hardware and software. It is the computer-specific part of information technology...
, a soft error is an error
Error
The word error entails different meanings and usages relative to how it is conceptually applied. The concrete meaning of the Latin word "error" is "wandering" or "straying". Unlike an illusion, an error or a mistake can sometimes be dispelled through knowledge...
in a signal or datum which is wrong. Errors may be caused by a defect, usually understood either to be a mistake in design or construction, or a broken component. A soft error is also a signal or datum which is wrong, but is not assumed to imply such a mistake or breakage. After observing a soft error, there is no implication that the system is any less reliable than before.
An error occurence in a computer's memory system that changes an instruction in a program or a data value. Soft errors typically can be remedied by cold booting the computer. A soft error will not damage a system's hardware; the only damage is to the data that is being processed.
There are two types of soft errors:
* chip-level soft error:
These errors occur when the radioactive atoms in the chip's material decay and release alpha particles into the chip.
Because an alpha particle contains a positive charge and kinetic energy, the particle can hit a memory cell and cause the cell to change state to a different value.
The atomic reaction is so tiny that it does not damage the actual structure of the chip. Chip-level errors are rare because modern memory is so stable that it would take a typical computer with a large memory capacity at least 10 years before the radioactive elements of the chip's materials begin to decay.
* system-level soft error:
These errors occur when the data being processed is hit with a noise phenomenon, typically when the data is on a data bus. The computer tries to interpret the noise as a data bit, which can cause errors in addressing or processing program code. The bad data bit can even be saved in memory and cause problems at a later time.
If detected, a soft error may be corrected by rewriting correct data in place of erroneous data. Highly reliable systems use error correction to correct soft errors on the fly. However, in many systems, it may be impossible to determine the correct data, or even to discover that an error is present at all. In addition, before the correction can occur, the system may have crashed
Crash (computing)
A crash in computing is a condition where a computer or a program, either an application or part of the operating system, ceases to function properly, often exiting after encountering errors. Often the offending program may appear to freeze or hang until a crash reporting service documents...
, in which case the recovery procedure must include a reboot.
Soft errors involve changes to data — the electrons in a storage circuit, for example — but not changes to the physical circuit itself, the atoms. If the data is rewritten, the circuit will work perfectly again.
Soft errors can occur on transmission lines, in digital logic, analog circuits, magnetic storage, and elsewhere, but are most commonly known in semiconductor storage.
Critical charge
Whether a circuit experiences a soft error depends on the energy of the incoming particle, the geometry of the impact, the location of the strike, and the design of the logic circuit. Logic circuits with higher capacitanceCapacitance
In electromagnetism and electronics, capacitance is the ability of a capacitor to store energy in an electric field. Capacitance is also a measure of the amount of electric potential energy stored for a given electric potential. A common form of energy storage device is a parallel-plate capacitor...
and higher logic voltages are less likely to suffer an error. This combination of capacitance and voltage is described by the critical charge
Electric charge
Electric charge is a physical property of matter that causes it to experience a force when near other electrically charged matter. Electric charge comes in two types, called positive and negative. Two positively charged substances, or objects, experience a mutual repulsive force, as do two...
parameter, Qcrit, the minimum electron charge disturbance needed to change the logic level. A higher Qcrit means fewer soft errors. Unfortunately, a higher Qcrit also means a slower logic gate and a higher power dissipation. Reduction in chip feature size and supply voltage, desirable for many reasons, decreases Qcrit. Thus, the importance of soft errors increases as chip technology advances.
In a logic circuit, Qcrit is defined as the minimum amount of induced charge required at a circuit node to cause a voltage pulse to propagate from that node to the output and be of sufficient duration and magnitude to be reliably latched. Since a logic circuit contains many nodes that may be struck, and each node may be of unique capacitance and distance from output, Qcrit is typically characterized on a per-node basis.
Alpha particles from package decay
Soft errors became widely known with the introduction of dynamic RAM in the 1970s. In these early devices, chip packaging materials contained small amounts of radioactive contaminants. Very low decay rates are needed to avoid excess soft errors, and chip companies have occasionally suffered problems with contamination ever since. It is extremely hard to maintain the material purity needed. Controlling alpha particle emission rates for critical packaging materials to less than a level of 0.001 counts per hour per cm2 (cph/cm2) is required for reliable performance of most circuits. For comparison, the count rate of a typical shoe's sole is between 0.1 and 10 cph/cm2.Package radioactive decay usually causes a soft error by alpha particle
Alpha particle
Alpha particles consist of two protons and two neutrons bound together into a particle identical to a helium nucleus, which is classically produced in the process of alpha decay, but may be produced also in other ways and given the same name...
emission. The positively charged alpha particle travels through the semiconductor and disturbs the distribution of electrons there. If the disturbance is large enough, a digital
Digital
A digital system is a data technology that uses discrete values. By contrast, non-digital systems use a continuous range of values to represent information...
signal can change from a 0 to a 1 or vice versa. In combinational logic
Combinational logic
In digital circuit theory, combinational logic is a type of digital logic which is implemented by boolean circuits, where the output is a pure function of the present input only. This is in contrast to sequential logic, in which the output depends not only on the present input but also on the...
, this effect is transient, perhaps lasting a fraction of a nanosecond, and this has led to the challenge of soft errors in combinational logic mostly going unnoticed. In sequential logic such as latches and RAM, even this transient upset can become stored for an indefinite time, to be read out later. Thus, designers are usually much more aware of the problem in storage circuits.
Cosmic rays creating energetic neutrons and protons
Once the electronics industry had determined how to control package contaminants, it became clear that other causes were also at work. James F. Ziegler led a program of work at IBMIBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
which culminated in the publication of a number of papers (Ziegler and Lanford, 1979) demonstrating that cosmic rays also could cause soft errors. Indeed, in modern devices, cosmic rays may be the predominant cause. Although the primary particle of the cosmic ray does not generally reach the Earth's surface, it creates a shower of energetic secondary particles. At the Earth's surface approximately 95% of the particles capable of causing soft errors are energetic neutrons with the remainder composed of protons and pions (Ziegler, 1996). This flux of energetic neutrons is typically referred to as "cosmic rays" in the soft error literature. Neutrons are uncharged and cannot disturb a circuit on their own, but undergo neutron capture
Neutron capture
Neutron capture is a kind of nuclear reaction in which an atomic nucleus collides with one or more neutrons and they merge to form a heavier nucleus. Since neutrons have no electric charge they can enter a nucleus more easily than positively charged protons, which are repelled...
by the nucleus of an atom in a chip. This process may result in the production of charged secondaries, such as alpha particles and oxygen nuclei, which can then cause soft errors.
Cosmic ray flux depends on altitude. For the common reference location of 40.7N, 74W at 0 meters (sea level in New York City, NY, USA) the flux is approximately 14 neutrons / cm2/hour. Burying a system in a cave reduces the rate of cosmic-ray induced soft errors to a negligible level. In the lower levels of the atmosphere, the flux increases by a factor of about 2.2 for every 1000 m (1.3 for every 1000 ft) increase in altitude above sea level. Computers operated on top of mountains experience an order of magnitude higher rate of soft errors compared to sea level. The rate of upsets in aircraft
Aircraft
An aircraft is a vehicle that is able to fly by gaining support from the air, or, in general, the atmosphere of a planet. An aircraft counters the force of gravity by using either static lift or by using the dynamic lift of an airfoil, or in a few cases the downward thrust from jet engines.Although...
may be more than 300 times the sea level upset rate. This is in contrast to package decay induced soft errors, which do not change with location. A model of the energetic neutron flux is presented in (Gordon & Goldhagen, 2004). An online calculator for this model is available at www.seutest.com.
The average rate of cosmic-ray soft errors is inversely proportional to sunspot activity. That is, the average number of cosmic-ray soft errors decreases during the active portion of the sunspot cycle and increases during the quiet portion. This counterintuitive result occurs for two reasons. The sun does not generally produce cosmic ray particles with energy above 1 GeV that are capable of penetrating to the Earth's upper atmosphere and creating particle showers, so the changes in the solar flux do not directly influence the number of errors. Further, the increase in the solar flux during an active sun period does have the effect of reshaping the Earth's magnetic field providing some additional shielding against higher energy cosmic rays, resulting in a decrease in the number of particles creating showers. The effect is fairly small in any case resulting in a +/- 7% modulation of the energetic neutron flux in New York City. Other locations are similarly affected.
Energetic neutrons produced by cosmic rays may lose most of their kinetic energy and reach thermal equilibrium with their surroundings as they are scattered by materials. The resulting neutrons are simply referred to as thermal neutrons and have an average kinetic energy of about 25 millielectron-volts at 25°C. Thermal neutrons are also produced by environmental radiation sources such as the decay of naturally occurring uranium or thorium. The thermal neutron flux from sources other than cosmic-ray showers may still be noticeable in an underground location and an important contributor to soft errors for some circuits.
Thermal neutrons
Neutrons that have lost kinetic energy until they are in thermal equilibrium with their surroundings are an important cause of soft errors for some circuits. At low energies many neutron captureNeutron capture
Neutron capture is a kind of nuclear reaction in which an atomic nucleus collides with one or more neutrons and they merge to form a heavier nucleus. Since neutrons have no electric charge they can enter a nucleus more easily than positively charged protons, which are repelled...
reactions become much more probable and result in fission of certain materials creating charged secondaries as fission byproducts. For some circuits the capture of a thermal neutron by the nucleus of the B-10 isotope of boron is particularly important. This nuclear reaction is an efficient producer of an alpha particle, Li-7 nucleus and gamma ray. Either of the charged particles (alpha or Li-7) may cause a soft error if produced in very close proximity, approximately 5 micrometers, to a critical circuit node. The capture cross section for B-11 is 6 orders of magnitude smaller and does not contribute to soft errors (Baumann et al., 1995)
Boron has been used in BPSG
Borophosphosilicate glass
Borophosphosilicate glass, commonly known as BPSG, is a type of silicate glass that includes additives of both boron and phosphorus. Silicate glasses such as PSG and borophosphosilicate glass are commonly used in semiconductor device fabrication for intermetal layers, i.e., insulating layers...
, the insulator in the interconnection layers of integrated circuits, particularly in the lowest one. The inclusion of boron lowers the melt temperature of the glass providing better reflow and planarization characteristics. In this application the glass is formulated with a boron content of 4% to 5% by weight. Naturally occurring boron is 20% B-10 with the remainder the B-11 isotope. Soft errors are caused by the high level of B-10 in this critical lower layer of some older integrated circuit processes. Boron-11, used at low concentrations as a p-type dopant, does not contribute to soft errors. Integrated circuit manufacturers eliminated borated dielectrics by the 150 nm process node, largely due to this problem.
In critical designs, depleted boron—consisting almost entirely of Boron-11 is used, to avoid this effect and therefore to reduce the soft error rate. Boron-11 is a by-product of the nuclear industry
Nuclear power
Nuclear power is the use of sustained nuclear fission to generate heat and electricity. Nuclear power plants provide about 6% of the world's energy and 13–14% of the world's electricity, with the U.S., France, and Japan together accounting for about 50% of nuclear generated electricity...
.
For applications in medical electronic devices this soft error mechanism may be extremely important. Neutrons are produced during high energy cancer radiation therapy using photon beam energies above 10 MV. These neutrons are moderated as they are scattered from the equipment and walls in the treatment room resulting in a thermal neutron flux that is about 40x106 higher than the normal environmental neutron flux. This high thermal neutron flux will generally result in a very high rate of soft errors and consequent circuit upset (Wilkinson et al., 2005), (Franco et al., 2005).
Other causes
Soft errors can also be caused by random noise or signal integrity problems, such as inductive or capacitive crosstalk. In general, however, these sources make only a small contribution to the overall soft error rate compared to radiation effects.
Soft error mitigation
A designer can attempt to minimize the rate of soft errors by judicious device design, choosing the right semiconductor, package and substrate materials, and the right device geometry. Often, however, this is limited by the need to reduce device size and voltage, to increase operating speed and to reduce power dissipation. The susceptibility of devices to upsets is described in the industry using the JEDEC JESD-89 standard.
One technique that can be used to reduce the soft error rate in digital circuits is called radiation hardening. This involves increasing the capacitance at selected circuit nodes in order to increase their effective Qcrit value. This reduces the range of particle energies that can upset the logic value of the node. Radiation hardening is often accomplished by increasing the size of transistors that share a drain/source region at the node. Since the area and power overhead of radiation hardening can be restrictive to design, the technique is often applied selectively to nodes which are predicted to have the highest probability of resulting in soft errors if struck. Tools and models that can predict which nodes are most vulnerable are the subject of past and current research in the area of soft errors.
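The link between node capacitance and Qcrit can be illustrated with a common first-order approximation, Qcrit ≈ C × V. This is only a sketch: real critical charge also depends on transistor drive strength and pulse shape, and the function name and the capacitance and voltage values below are hypothetical.

```python
def critical_charge_fc(node_capacitance_ff, supply_voltage_v):
    """First-order estimate of critical charge in femtocoulombs.

    Assumes Qcrit ~ C_node * V_dd (femtofarads * volts = femtocoulombs).
    This ignores restoring current from the driving transistors.
    """
    return node_capacitance_ff * supply_voltage_v


# Hypothetical node: 2 fF at a 1.0 V supply, then hardened to 4 fF.
baseline = critical_charge_fc(2.0, 1.0)
hardened = critical_charge_fc(4.0, 1.0)
print(f"baseline Qcrit ~ {baseline} fC, hardened Qcrit ~ {hardened} fC")
```

Under this approximation, doubling the node capacitance doubles the charge a particle strike must deposit to flip the node, which is why the added capacitance costs area and power.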
Correcting soft errors
Designers can choose to accept that soft errors will occur, and design systems with appropriate error detection and correction to recover gracefully. Typically, a semiconductor memory design might use forward error correction, incorporating redundant data into each word to create an error correcting code. Alternatively, roll-back error correction can be used, detecting the soft error with an error-detecting code such as parity, and rewriting correct data from another source. This technique is often used for write-through cache memories.
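The two approaches above can be sketched in a few lines. The example below is illustrative only: it uses a Hamming(7,4) code as one possible forward error correcting code, and a hypothetical ParityCache class to show roll-back correction in a write-through cache. Neither is taken from a specific memory design.

```python
def hamming74_encode(nibble):
    """Encode a 4-bit value as a 7-bit Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits):
    """Correct at most one flipped bit and return the 4-bit value."""
    code = [0] + list(bits)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position        # non-zero syndrome names the bad bit
    if syndrome:
        code[syndrome] ^= 1
    data = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data))


def parity(word):
    """Even-parity bit of an integer word."""
    return bin(word).count("1") & 1


class ParityCache:
    """Hypothetical write-through cache with parity-based roll-back."""

    def __init__(self, backing_store):
        self.backing_store = backing_store
        self.lines = {}                 # address -> (word, parity bit)

    def write(self, address, word):
        self.backing_store[address] = word      # write-through copy
        self.lines[address] = (word, parity(word))

    def read(self, address):
        if address in self.lines:
            word, stored_parity = self.lines[address]
            if parity(word) == stored_parity:
                return word
        # Miss or parity mismatch: refetch the good copy from the backing store.
        word = self.backing_store[address]
        self.lines[address] = (word, parity(word))
        return word


# Forward error correction: one flipped bit is repaired in place.
codeword = hamming74_encode(0b1011)
codeword[2] ^= 1
assert hamming74_decode(codeword) == 0b1011

# Roll-back correction: the corrupted line is detected and reloaded.
cache = ParityCache({})
cache.write(0x10, 0b1011)
cache.lines[0x10] = (0b1010, cache.lines[0x10][1])   # simulate a bit flip
assert cache.read(0x10) == 0b1011
```

Parity alone can only detect an odd number of flipped bits, which is why the roll-back scheme depends on a clean copy existing elsewhere, as it does behind a write-through cache.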
Soft errors in logic circuits are sometimes detected and corrected using the techniques of fault tolerant design. These often include the use of redundant circuitry or computation of data, and typically come at the cost of circuit area, decreased performance, and/or higher power consumption. The concept of triple modular redundancy (TMR) can be employed to ensure very high soft-error reliability in logic circuits. In this technique, three identical copies of a circuit compute on the same data in parallel and their outputs are fed into majority voting logic, which returns the value that occurred in at least two of the three cases. In this way, the failure of one circuit due to a soft error is discarded, assuming the other two circuits operated correctly. In practice, however, few designers can afford the greater than 200% circuit area and power overhead required, so it is usually applied only selectively. Another common concept to correct soft errors in logic circuits is temporal (or time) redundancy, in which one circuit operates on the same data multiple times and compares subsequent evaluations for consistency. This approach often incurs performance overhead, area overhead (if copies of latches are used to store data), and power overhead, though it is considerably more area-efficient than modular redundancy.
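Both redundancy schemes can be modelled in software as a rough illustration. In the sketch below the function names, the UpsetOnce fault injector, and the retry limit are all assumptions made for the example; real implementations are built from gates and latches rather than function calls.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority, as a TMR voter would compute it."""
    return (a & b) | (a & c) | (b & c)


def tmr(compute, value, upset_masks=(0, 0, 0)):
    """Run three copies of `compute`; XOR masks model soft errors per copy."""
    outputs = [compute(value) ^ mask for mask in upset_masks]
    return majority_vote(*outputs)


def temporal_redundancy(compute, value, max_attempts=4):
    """Re-evaluate one circuit until two consecutive results agree."""
    previous = compute(value)
    for _ in range(max_attempts - 1):
        current = compute(value)
        if current == previous:
            return current
        previous = current
    raise RuntimeError("evaluations never agreed")


class UpsetOnce:
    """Hypothetical fault injector: corrupts the result of one chosen call."""

    def __init__(self, func, upset_call, mask):
        self.func = func
        self.upset_call = upset_call
        self.mask = mask
        self.calls = 0

    def __call__(self, value):
        self.calls += 1
        result = self.func(value)
        if self.calls == self.upset_call:
            return result ^ self.mask
        return result


def increment(value):
    return value + 1


# One of three copies is upset; the voter still returns the correct result.
assert tmr(increment, 41, upset_masks=(0, 0b100, 0)) == 42

# The first evaluation is upset; later evaluations agree on the right value.
assert temporal_redundancy(UpsetOnce(increment, 1, 0b1), 41) == 42
```

The trade-off described above is visible in the sketch: the TMR version needs three evaluations' worth of hardware at once, while the temporal version reuses one evaluator but needs at least two passes.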
Traditionally, DRAM has had the most attention in the quest to reduce or work around soft errors, because DRAM has comprised the majority share of susceptible device surface area in desktop and server computer systems (see the prevalence of ECC RAM in server computers). Hard figures for DRAM susceptibility are hard to come by, and vary considerably across designs, fabrication processes, and manufacturers. 256-kilobit DRAMs built with 1980s technology could have clusters of five or six bits flip from a single alpha particle. Modern DRAMs have much smaller feature sizes, so the deposition of a similar amount of charge could easily cause many more bits to flip.
The design of error detection and correction circuits is helped by the fact that soft errors usually are localised to a very small area of a chip. Usually, only one cell of a memory is affected, although high energy events can cause a multi-cell upset. Conventional memory layout usually places one bit of many different correction words adjacent on a chip. So, even a multi-cell upset leads to only a number of separate single-bit upsets in multiple correction words, rather than a multi-bit upset in a single correction word. An error correcting code therefore needs only to cope with a single bit in error in each correction word in order to cope with all likely soft errors. The term 'multi-cell' is used for upsets affecting multiple cells of a memory, whatever correction words those cells happen to fall in. 'Multi-bit' is used when multiple bits in a single correction word are in error.
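The effect of interleaving correction words can be sketched with a simple address mapping. The mapping below (physical cell index modulo the number of interleaved words) is an assumed layout chosen for illustration; actual memory arrays differ in detail.

```python
def physical_to_logical(cell_index, num_words):
    """Map a physical cell in a row to (word, bit) under bit interleaving.

    Assumed layout: adjacent cells belong to different correction words.
    """
    return cell_index % num_words, cell_index // num_words


def bits_hit_per_word(first_cell, burst_length, num_words):
    """Count how many bits of each word a run of adjacent upsets touches."""
    hits = {}
    for cell in range(first_cell, first_cell + burst_length):
        word, _bit = physical_to_logical(cell, num_words)
        hits[word] = hits.get(word, 0) + 1
    return hits


# A 4-cell upset across 4 interleaved words: one bit per word (multi-cell).
interleaved = bits_hit_per_word(first_cell=5, burst_length=4, num_words=4)
assert max(interleaved.values()) == 1

# The same upset with no interleaving lands in one word (multi-bit).
contiguous = bits_hit_per_word(first_cell=5, burst_length=4, num_words=1)
assert contiguous == {0: 4}
```

In this model a single-error-correcting code per word is sufficient as long as the upset spans no more adjacent cells than there are interleaved words.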
Soft errors in combinational logic
The three natural masking effects in combinational logic that determine whether a single event upset (SEU) will propagate to become a soft error are electrical masking, logical masking, and temporal (or timing-window) masking. An SEU is logically masked if its propagation is blocked from reaching an output latch because off-path gate inputs prevent a logical transition of that gate's output. An SEU is electrically masked if the signal is attenuated by the electrical properties of gates on its propagation path such that the resulting pulse is of insufficient magnitude to be reliably latched. An SEU is temporally masked if the erroneous pulse reaches an output latch, but it does not occur close enough to the moment when the latch is actually triggered to hold.
If all three masking effects fail to occur, the propagated pulse becomes latched and the output of the logic circuit will be an erroneous value. In the context of circuit operation, this erroneous output value may be considered a soft error event. However, from a microarchitectural-level standpoint, the affected result may not change the output of the currently-executing program. For instance, the erroneous data could be overwritten before use, masked in subsequent logic operations, or simply never be used. If erroneous data does not affect the output of a program, it is considered to be an example of microarchitectural masking.
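Logical masking in particular is easy to demonstrate by exhaustive simulation. The sketch below uses a made-up three-input circuit, out = (a AND b) OR c, and models an SEU as a full logic flip of the internal node; electrical and temporal masking are not modelled.

```python
from itertools import product


def circuit(a, b, c, upset_n1=False):
    """Evaluate out = (a AND b) OR c, optionally flipping internal node n1."""
    n1 = a & b
    if upset_n1:
        n1 ^= 1                 # model an SEU as an inverted node value
    return n1 | c


def logically_masked_inputs():
    """Return the input vectors for which an upset on n1 never reaches out."""
    return [
        (a, b, c)
        for a, b, c in product((0, 1), repeat=3)
        if circuit(a, b, c) == circuit(a, b, c, upset_n1=True)
    ]


masked = logically_masked_inputs()
# The off-path input c = 1 forces the OR gate's output, hiding the upset.
assert all(c == 1 for _a, _b, c in masked)
print(f"{len(masked)} of 8 input vectors logically mask the upset")
```

For this circuit the upset is masked exactly when the off-path input c is 1, which matches the description above: the off-path gate input prevents a transition at the gate's output.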
Soft error rate
Soft error rate (SER) is the rate at which a device or system encounters or is predicted to encounter soft errors. It is typically expressed as either the number of failures-in-time (FIT) or the mean time between failures (MTBF). One FIT is equivalent to 1 error per billion hours of device operation. MTBF is usually given in years of device operation. To put it in perspective, 1 year MTBF is equal to approximately 114,077 FIT (approximately 10^9 / (24 × 365.25)).
While many electronic systems have an MTBF that exceeds the expected lifetime of the circuit, the SER may still be unacceptable to the manufacturer or customer. For instance, many failures per million circuits due to soft errors can be expected in the field if the system does not have adequate soft error protection. The failure of even a few products in the field, particularly if catastrophic, can tarnish the reputation of the product and the company that designed it. Also, in safety- or cost-critical applications where the cost of system failure far outweighs the cost of the system itself, a 1% chance of soft error failure per lifetime may be too high to be acceptable to the customer. Therefore, it is advantageous to design for low SER when manufacturing a system in high volume or when extremely high reliability is required.
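The unit conversions above can be checked with a short script. This is a sketch under stated assumptions: a year is taken as 365.25 days, a gigabyte as 8192 megabits, and the lifetime failure probability assumes a constant failure rate (an exponential model), which the text does not specify. The function names are chosen for the example.

```python
import math

HOURS_PER_YEAR = 24 * 365.25    # 8766 hours, assuming a 365.25-day year
FIT_HOURS = 1e9                 # one FIT is one failure per 10^9 device-hours


def mtbf_years_to_fit(mtbf_years):
    """Convert a mean time between failures in years to FIT."""
    return FIT_HOURS / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit):
    """Convert a FIT rate to a mean time between failures in years."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


def errors_per_day(fit_per_mbit, megabits):
    """Expected errors per day for a memory of the given size."""
    return fit_per_mbit * megabits * 24 / FIT_HOURS


def failure_probability(fit, years):
    """Chance of at least one failure, assuming a constant failure rate."""
    return 1.0 - math.exp(-fit / FIT_HOURS * years * HOURS_PER_YEAR)


print(round(mtbf_years_to_fit(1)))               # 114077
print(errors_per_day(1000, 8192))                # about 0.2 per day per GB
print(errors_per_day(5000, 8192))                # about 1 per day per GB
```

The last two lines reproduce the DRAM figure quoted under External links, where 1000 to 5000 FIT per Mbit corresponds to roughly 0.2 to 1 error per day per gigabyte.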
External links
- Book on "Architecture Design for Soft Errors" by Shubu Mukherjee, published by Elsevier, Inc. Book review by Max Baron of Microprocessor Report (May 27, 2008), “Dr. Shubu Mukherjee’s book is a welcome surprise: books by architecture leaders in major companies are few and far between. Written from the viewpoint of a working engineer, the book describes sources of soft errors and solutions involving device, logic, and architecture design to reduce the effects of soft errors”
- Ionizing Radiation Effects in MOS Devices and Circuits by Tso Ping Ma and Paul V. Dressendorfer. The first comprehensive overview describing the effects of ionizing radiation on MOS devices, as well as how to design, fabricate, and test integrated circuits intended for use in a radiation environment.
- Radiation Effects And Soft Errors In Integrated Circuits And Electronic Devices by Dan Fleetwood and Ron D Schrimpf, Vanderbilt University, Nashville, Tennessee, USA. A collection of the most important concepts in Radiation Effects by two pioneers in this field.
- Soft Errors in Electronic Memory - A White Paper - A good summary paper with many references - Tezzaron Jan 2004. Concludes that 1000–5000 FIT per Mbit (0.2–1 error per day per Gbyte) is a typical DRAM soft error rate.
- Benefits of Chipkill-Correct ECC for PC Server Main Memory - A 1997 discussion of SDRAM reliability - some interesting information on "soft errors" from cosmic rays, especially with respect to Error-correcting code schemes
- Soft errors' impact on system reliability - Ritesh Mastipuram and Edwin C Wee, Cypress Semiconductor, 2004
- Scaling and Technology Issues for Soft Error Rates - A Johnston - 4th Annual Research Conference on Reliability Stanford University, October 2000
- Evaluation of LSI Soft Errors Induced by Terrestrial Cosmic rays and Alpha Particles - H. Kobayashi, K. Shiraishi, H. Tsuchiya, H. Usuki (all of Sony), and Y. Nagai, K. Takahisa (Osaka University), 2001.
- SELSE Workshop Website - Website for the workshop on the System Effects of Logic Soft Errors
- TRAD Tests & Radiations - A company dedicated to Single events and soft error Test, solutions and products
- iRoC Technologies - A company dedicated to Soft Error solutions and products
- EADS Nucletudes - A company dedicated to hardening systems in harsh electromagnetic and radiative environments