RAID
RAID is a storage technology that combines multiple disk drive components into a logical unit. Data is distributed across the drives in one of several ways, called "RAID levels", depending on what level of redundancy and performance (via parallel communication) is required. RAID is an example of storage virtualization and was first defined by David A. Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley, in 1987. Marketers representing industry RAID manufacturers later attempted to reinvent the term to describe a redundant array of independent disks, as a means of dissociating a low-cost expectation from RAID technology.
RAID is now used as an umbrella term for computer data storage schemes that can divide and replicate data among multiple physical drives. The physical drives are said to be in a RAID, which is accessed by the operating system as one single drive. The different schemes or architectures are named by the word RAID followed by a number (e.g., RAID 0, RAID 1). Each scheme provides a different balance between two key goals: increased data reliability and increased input/output performance.
A RAID is not a substitute for backing up data. In parity configurations, a RAID protects from catastrophic data loss caused by physical damage or errors on a single drive within the array (or two drives in, say, RAID 6). However, a true backup system has other important features, such as the ability to restore an earlier version of data, which is needed both to protect against software errors that write unwanted data to secondary storage and to recover from user error and malicious data deletion. A RAID can be overwhelmed by a catastrophic failure that exceeds its recovery capacity and, of course, the entire array is at risk of physical damage by fire, natural disaster, and human forces, while backups can be stored off-site. A RAID is also vulnerable to controller failure, because it is not always possible to migrate a RAID to a new, different controller without data loss.
The distribution of data across multiple drives can be managed either by dedicated computer hardware or by software. A software solution may be part of the operating system, or it may be part of the firmware and drivers supplied with a hardware RAID controller. Software RAID can be implemented as:
- a layer that abstracts multiple devices, thereby providing a single virtual device;
- a logical volume manager, which allows a system to use logical volumes which can be resized or moved; often, features like RAID or snapshots are also supported;
- a component of the file system, organizing data across multiple storage devices directly (without needing the help of a third-party logical volume manager).
Software RAID has advantages and disadvantages compared to hardware RAID. The software must run on a host server attached to storage, and the server's processor must dedicate processing time to run the RAID software; the additional processing capacity required for RAID 0 and RAID 1 is low, but parity-based arrays require more complex data processing during write or integrity-checking operations. As the rate of data processing increases with the number of drives in the array, so does the processing requirement. Furthermore, all the buses between the processor and the drive controller must carry the extra data required by RAID, which may cause congestion.
Fortunately, over time, the increase in commodity CPU speed has been consistently greater than the increase in drive throughput; the percentage of host CPU time required to saturate a given number of drives has decreased. For instance, under 100% usage of a single core on a 2.1 GHz Intel "Core2" CPU, the Linux software RAID subsystem (md) as of version 2.6.26 is capable of calculating parity information at 6 GB/s; however, a three-drive RAID 5 array using drives capable of sustaining a write operation at 100 MB/s only requires parity to be calculated at the rate of 200 MB/s, which requires the resources of just over 3% of a single CPU core.
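The arithmetic in the preceding paragraph can be restated as a short calculation. This is a minimal sketch using the text's own figures, assuming the three-drive RAID 5 writes full stripes:

```python
# Rough estimate of host CPU share spent on RAID 5 parity,
# using the figures quoted above.

xor_rate = 6_000        # MB/s of parity one core can compute (md, 2.1 GHz Core2)
drive_write_rate = 100  # MB/s sustained write per drive
data_drives = 2         # a three-drive RAID 5 has two data blocks per stripe

# Parity must be computed over every data block written.
parity_input_rate = data_drives * drive_write_rate  # 200 MB/s

cpu_share = parity_input_rate / xor_rate
print(f"~{cpu_share:.1%} of one core")  # ~3.3%, "just over 3%" as stated
```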
Furthermore, software RAID implementations may employ more sophisticated algorithms than hardware RAID implementations (e.g. drive scheduling and command queueing), and thus, may be capable of better performance.
Another concern with software implementations is the process of booting the associated operating system. For instance, consider a computer being booted from a RAID 1 (mirrored drives); if the first drive in the RAID 1 fails, then a first-stage boot loader might not be sophisticated enough to attempt loading the second-stage boot loader from the second drive as a fallback. In contrast, a RAID 1 hardware controller typically has explicit programming to decide that a drive has malfunctioned and that the next drive should be used. Some second-stage boot loaders (such as GNU GRUB) are capable of loading a kernel from a RAID 1.
For data safety, the write-back cache of an operating system or individual drive might need to be turned off in order to ensure that as much data as possible is actually written to secondary storage before some failure (such as a loss of power); unfortunately, turning off the write-back cache has a performance penalty that can be significant, depending on the workload and command queuing support. In contrast, a hardware RAID controller may carry a dedicated battery-powered write-back cache of its own, thereby allowing for efficient operation that is also relatively safe. It is possible to avoid such problems with a software controller by constructing a RAID from safer components: for instance, each drive could have its own battery or capacitor on its own write-back cache, the drives could implement write atomicity in various ways, and the entire RAID or computing system could be powered by a UPS.
Finally, a software RAID controller that is built into an operating system usually uses proprietary data formats and RAID levels, so an associated RAID usually cannot be shared between operating systems as part of a multi-boot setup. However, such a RAID may be moved between computers that share the same operating system; in contrast, such mobility is more difficult when using a hardware RAID controller, because both computers must provide compatible hardware controllers. Also, if the hardware controller fails, data could become unrecoverable unless a hardware controller of the same type is obtained.
Most software implementations allow a RAID to be created from partitions rather than entire physical drives. For instance, an administrator could divide each drive of an odd number of drives into two partitions, mirror partitions across drives, and stripe a volume across the mirrored partitions to emulate IBM's RAID 1E configuration. Using partitions in this way also allows for constructing multiple RAIDs at various RAID levels from the same set of drives; for example, one could have a very robust RAID 1 for important files and a less robust RAID 5 or RAID 0 for less important data, all using the same set of underlying drives. (Some BIOS-based controllers offer similar features, e.g. Intel Matrix RAID.) Using two partitions from the same drive in the same RAID, however, puts data at risk if the drive fails; for instance, a RAID 1 built from two partitions of the same physical drive offers no protection against the failure of that drive.
On a desktop system, a hardware RAID controller may be an expansion card connected to a bus (e.g., PCI or PCIe), or a component integrated into the motherboard. There are controllers supporting most types of drive technology, such as IDE/ATA, SATA, SCSI, SSA, Fibre Channel, and sometimes even a combination. The controller and drives may be in a stand-alone enclosure, rather than inside a computer, and the enclosure may be directly attached to a computer, or connected via a SAN.
Most hardware implementations provide a read/write cache, which, depending on the I/O workload, improves performance. In most systems, the write cache is non-volatile (i.e. battery-protected), so pending writes are not lost in the event of a power failure.
Hardware implementations provide guaranteed performance, add no computational overhead to the host computer, and can support many operating systems; the controller simply presents the RAID as another logical drive.
Initially, the term "RAID controller" implied that the controller does the processing. However, while a controller without a dedicated RAID chip is often described by a manufacturer as a "RAID controller", it is rarely made clear that the burden of RAID processing is borne by the host computer's central processing unit rather than by the RAID controller itself. Thus, this type is sometimes called "fake" RAID; Adaptec calls it "HostRAID".
Moreover, a firmware controller can often only support certain types of hard drive to form the RAID that it manages (e.g. SATA for Intel Matrix RAID, as there is neither SCSI nor PATA support in modern Intel ICH southbridges; however, motherboard makers implement RAID controllers outside of the southbridge on some motherboards).
Both hardware and software RAIDs with redundancy may support the use of a hot spare drive: a drive physically installed in the array which is inactive until an active drive fails, whereupon the system automatically replaces the failed drive with the spare, rebuilding the array with the spare drive included. This reduces the mean time to recovery (MTTR), but does not completely eliminate it. As with non-hot-spare systems, subsequent additional failure(s) in the same RAID redundancy group before the array is fully rebuilt can cause data loss. Rebuilding can take several hours, especially on busy systems.
If drives are procured and installed at the same time, several drives are more likely to fail at about the same time than is the case for unrelated drives, so rapid replacement of a failed drive is important. RAID 6 without a spare uses the same number of drives as RAID 5 with a hot spare and protects data against failure of up to two drives, but requires a more advanced RAID controller. Furthermore, a hot spare can be shared by multiple RAID sets.
Failure rate: Two different kinds of failure rates are applicable to RAID systems. Logical failure is defined as the loss of a single drive, and its rate is equal to the sum of the individual drives' failure rates. System failure is defined as loss of data, and its rate depends on the type of RAID. For RAID 0 this is equal to the logical failure rate, as there is no redundancy. For other types of RAID, it will be less than the logical failure rate, potentially very small, and its exact value will depend on the type of RAID, the number of drives employed, the vigilance and alacrity of its human administrators, and chance (improbable events do occur, though infrequently).
Mean time to data loss (MTTDL): In this context, the average time before a loss of data in a given array. The mean time to data loss of a given RAID may be higher or lower than that of its constituent hard drives, depending upon what type of RAID is employed. The referenced report assumes times to data loss are exponentially distributed, so that 63.2% of all data loss will occur between time 0 and the MTTDL (for an exponential distribution, P(T ≤ MTTDL) = 1 − 1/e ≈ 0.632).
Mean time to recovery (MTTR): In arrays that include redundancy for reliability, this is the time following a failure to restore an array to its normal failure-tolerant mode of operation. This includes time to replace a failed drive mechanism and time to rebuild the array (to replicate data for redundancy).
Unrecoverable bit error rate (UBE): This is the rate at which a drive will be unable to recover data after application of cyclic redundancy check (CRC) codes and multiple retries.
Write cache reliability: Some RAID systems use RAM write cache to increase performance. A power failure can result in data loss unless this sort of buffer has a supplementary battery that ensures the buffer has time to write from RAM to secondary storage before the drive powers down.
Atomic write failure: Also known by various terms such as torn writes, torn pages, incomplete writes, interrupted writes, non-transactional, etc.
Such failure-rate calculations typically assume that drive failures are independent. In practice, the drives are often the same age (with similar wear) and subject to the same environment. Since many drive failures are due to mechanical issues (which are more likely on older drives), this violates those assumptions; failures are in fact statistically correlated. In practice, the chance of a second failure before the first has been recovered (causing data loss) is not as unlikely as for random failures. In a study including about 100,000 drives, the probability of two drives in the same cluster failing within one hour was observed to be four times larger than predicted by the exponential statistical distribution, which characterises processes in which events occur continuously and independently at a constant average rate. The probability of two failures within the same 10-hour period was twice as large as that predicted by an exponential distribution.
A common assumption is that "server-grade" drives fail less frequently than consumer-grade drives. Two independent studies (one by Carnegie Mellon University and the other by Google) have shown that the "grade" of a drive does not relate to the drive's failure rate.
Very few storage systems provide support for atomic writes, and even fewer specify their rate of failure in providing this semantic. Note that during the act of writing an object, a RAID storage device will usually be writing all redundant copies of the object in parallel, although overlapped or staggered writes are more common when a single RAID processor is responsible for multiple drives. Hence an error that occurs during the process of writing may leave the redundant copies in different states, and furthermore may leave the copies in neither the old nor the new state. A little-known failure mode is that delta logging relies on the original data being either in the old or the new state so as to enable backing out the logical change, yet few storage systems provide an atomic write semantic for a RAID.
While the battery-backed write cache may partially solve the problem, it is applicable only to a power failure scenario.
Since transactional support is not universally present in hardware RAID, many operating systems include transactional support to protect against data loss during an interrupted write. Novell NetWare, starting with version 3.x, included a transaction tracking system. Microsoft introduced transaction tracking via the journaling feature in NTFS. ext4 has journaling with checksums; ext3 has journaling without checksums but an "append-only" option, or ext3cow (Copy on Write). If the journal itself in a filesystem is corrupted, though, this can be problematic. The journaling in the NetApp WAFL file system gives atomicity by never updating the data in place, as does ZFS. An alternative method to journaling is soft updates, which are used in some BSD-derived systems' implementations of UFS.
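As an illustration of the journaling idea described above (logging an intended update so that recovery leaves the data in either the old or the new state, never a torn mixture), here is a minimal sketch; the file names and entry format are invented for the example, and real filesystem journals additionally checksum entries to detect torn journal writes:

```python
import json
import os

JOURNAL = "journal.log"  # hypothetical journal file for this sketch

def atomic_update(path, new_bytes):
    """Write-ahead journaling: record the intent, then apply it.

    Crash before the journal is durable: recovery ignores it (old state).
    Crash after: recovery replays it (new state). Either way the update
    is all-or-nothing from the reader's point of view.
    """
    entry = {"path": path, "data": new_bytes.hex()}
    with open(JOURNAL, "w") as j:
        json.dump(entry, j)
        j.flush()
        os.fsync(j.fileno())   # journal is durable before data is touched
    with open(path, "wb") as f:
        f.write(new_bytes)
        f.flush()
        os.fsync(f.fileno())
    os.remove(JOURNAL)         # commit point: update fully applied

def recover():
    """Replay a complete journal entry left over after a crash."""
    if os.path.exists(JOURNAL):
        with open(JOURNAL) as j:
            entry = json.load(j)   # a real journal would verify a checksum here
        with open(entry["path"], "wb") as f:
            f.write(bytes.fromhex(entry["data"]))
        os.remove(JOURNAL)
```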
An unrecoverable bit error can present as a sector read failure. Some RAID implementations protect against this failure mode by remapping the bad sector, using the redundant data to retrieve a good copy of the data, and rewriting that good data to the newly mapped replacement sector. The UBE rate is typically specified at 1 bit in 10^15 for enterprise-class drives (SCSI, FC, SAS) and 1 bit in 10^14 for desktop-class drives (IDE/ATA/PATA, SATA). Increasing drive capacities and large RAID 5 redundancy groups have led to an increasing inability to successfully rebuild a RAID group after a drive failure, because an unrecoverable sector is found on the remaining drives. Double protection schemes such as RAID 6 attempt to address this issue, but suffer from a very high write penalty.
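To see why growing capacities at these UBE rates threaten RAID 5 rebuilds, the following sketch estimates the chance of hitting at least one unrecoverable read error while reading every surviving drive end to end during a rebuild; the drive counts and capacities below are illustrative assumptions, not figures from the text:

```python
import math

def p_ure_during_rebuild(surviving_drives, drive_capacity_tb, bits_per_ure):
    """Probability of at least one unrecoverable read error while
    reading every surviving drive in full during a rebuild."""
    bits_read = surviving_drives * drive_capacity_tb * 1e12 * 8
    # P(no URE) = (1 - 1/bits_per_ure) ** bits_read, computed stably.
    return -math.expm1(bits_read * math.log1p(-1 / bits_per_ure))

# Desktop-class drives (1 bit in 10^14): rebuilding a four-drive RAID 5
# of 2 TB drives means reading the 3 surviving drives in full.
print(f"{p_ure_during_rebuild(3, 2.0, 1e14):.0%}")  # roughly 38%

# Enterprise-class drives (1 bit in 10^15) fare much better.
print(f"{p_ure_during_rebuild(3, 2.0, 1e15):.0%}")  # roughly 5%
```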
A power failure can mean a significant loss of any data queued in a volatile write cache. Often a battery protects the write cache, mostly solving the problem: if a write fails because of a power failure, the controller may complete the pending writes as soon as it is restarted. This solution still has potential failure cases: the battery may have worn out, the power may be off for too long, the drives could be moved to another controller, or the controller itself could fail. Some systems provide the capability of testing the battery periodically; however, this leaves the system without a fully charged battery for several hours.
An additional concern regarding write cache reliability exists for devices equipped with a write-back cache: a caching system which reports data as written as soon as it is written to cache, as opposed to the non-volatile medium. The safer cache technique is write-through, which reports transactions as written only when they are written to the non-volatile medium.
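To make the distinction concrete, here is a minimal sketch (invented class names, in-memory stand-ins for the disk and cache) contrasting when the two techniques acknowledge a write:

```python
class WriteThroughCache:
    """Reports a write only after it reaches the non-volatile medium."""
    def __init__(self, disk):
        self.disk = disk            # dict standing in for stable storage
    def write(self, block, data):
        self.disk[block] = data     # data hits the medium first...
        return "acknowledged"       # ...then success is reported

class WriteBackCache:
    """Reports a write as soon as it is cached; flushes later.
    Fast, but a power loss before flush() drops acknowledged writes."""
    def __init__(self, disk):
        self.disk = disk
        self.pending = {}           # volatile cache contents
    def write(self, block, data):
        self.pending[block] = data
        return "acknowledged"       # reported written while still volatile
    def flush(self):
        self.disk.update(self.pending)
        self.pending.clear()
    def power_failure(self):
        self.pending.clear()        # volatile contents are lost

disk = {}
cache = WriteBackCache(disk)
cache.write(0, b"important")
cache.power_failure()               # the acknowledged write is gone
assert 0 not in disk
```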
Software RAID alleviates the concern of controller dependence: because the setup is not hardware-dependent, but runs on ordinary drive controllers, the array can be reassembled on different hardware. Additionally, individual drives of a RAID 1 (in software and most hardware implementations) can be read like normal drives when removed from the array, so no RAID system is required to retrieve the data. Inexperienced data recovery firms typically have a difficult time recovering data from RAID drives, with the exception of RAID 1 drives with a conventional data structure.
Drives that spend too long on internal error recovery may be dropped from an array by the controller. A fix specific to Western Digital's desktop drives used to be known: a utility called WDTLER.exe could limit a drive's error recovery time by enabling TLER (time-limited error recovery), which caps the error recovery time at 7 seconds. Around September 2009, Western Digital disabled this feature in their desktop drives (e.g., the Caviar Black line), making such drives unsuitable for use in a RAID. However, Western Digital enterprise-class drives are shipped from the factory with TLER enabled. Similar technologies are used by Seagate, Samsung, and Hitachi. For non-RAID usage, an enterprise-class drive with a short error recovery timeout that cannot be changed is less suitable than a desktop drive.
In late 2010, the Smartmontools program began supporting the configuration of ATA Error Recovery Control, allowing the tool to configure many desktop-class hard drives for use in a RAID.
While drive mean time between failures (MTBF) has increased over time, this increase has not kept pace with the increased storage capacity of the drives. The time to rebuild the array after a single drive failure, as well as the chance of a second failure during a rebuild, have increased over time.
A RAID also requires proper maintenance. In particular, the array must be monitored, and any failures detected and dealt with promptly. Failure to do so leaves the array running in a degraded state, vulnerable to further failures; eventually enough failures may accumulate that the entire array becomes inoperable, resulting in data loss and downtime. In this case, any protection the array provides merely delays the loss. The operator must know how to detect failures or verify the healthy state of the array, identify which drive failed, have replacement drives available, and know how to replace a drive and initiate a rebuild of the array.
In 1978, Norman Ken Ouchi at IBM was awarded U.S. patent 4,092,732, titled "System for recovering data stored in failed memory unit." The claims for this patent describe what would later be termed RAID 5 with full stripe writes. This 1978 patent also mentions that drive mirroring or duplexing (what would later be termed RAID 1) and protection with dedicated parity (what would later be termed RAID 4) were prior art at that time.
The term RAID was first defined by David A. Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley, in 1987. They studied the possibility of using two or more drives to appear as a single device to the host system, and published the paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)" in June 1988 at the SIGMOD conference.
This specification suggested a number of prototype RAID levels, or combinations of drives. Each had theoretical advantages and disadvantages. Over the years, different implementations of the RAID concept have appeared. Most differ substantially from the original idealized RAID levels, but the numbered names have remained. This can be confusing, since one implementation of RAID 5, for example, can differ substantially from another. RAID 3 and RAID 4 are often confused and even used interchangeably.
One of the early uses of RAID 0 and 1 was the Crosfield Electronics Studio 9500 page layout system, based on the Python workstation. The Python workstation was a Crosfield-managed international development using PERQ 3B electronics, benchMark Technology's Viper display system, and Crosfield's own RAID and fibre-optic network controllers. RAID 0 was particularly important to these workstations, as it dramatically sped up image manipulation for the pre-press markets. Volume production started in Peterborough, England, in early 1987.
Non-RAID drive architectures also exist, and are often referred to, similarly to RAID, by standard acronyms, several tongue-in-cheek. A single drive is referred to as a SLED (Single Large Expensive Disk/Drive), by contrast with RAID, while an array of drives without any additional control (accessed simply as independent drives) is referred to, even in a formal context such as an equipment specification, as a JBOD (Just a Bunch Of Disks). Simple concatenation is referred to as a "span".
Standard levels
A number of standard schemes have evolved which are referred to as levels. There were five RAID levels originally conceived, but many more variations have evolved, notably several nested levels and many non-standard levels (mostly proprietary). RAID levels and their associated data formats are standardised by SNIA in the Common RAID Disk Drive Format (DDF) standard.
Following is a brief textual summary of the most commonly used RAID levels.
- RAID 0 (block-level striping without parity or mirroring) has no (or zero) redundancy. It provides improved performance and additional storage but no fault tolerance; hence, simple stripe sets are normally referred to as RAID 0. Any drive failure destroys the array, and the likelihood of failure increases with more drives in the array (at a minimum, catastrophic data loss is almost twice as likely compared to single drives without RAID). A single drive failure destroys the entire array because, when data is written to a RAID 0 volume, the data is broken into fragments called blocks. The number of blocks is dictated by the stripe size, which is a configuration parameter of the array. The blocks are written to their respective drives simultaneously on the same sector. This allows smaller sections of the entire chunk of data to be read off the drives in parallel, increasing bandwidth (see the block-mapping sketch after this list). RAID 0 does not implement error checking, so any error is uncorrectable. More drives in the array means higher bandwidth, but greater risk of data loss.
- In RAID 1 (mirroring without parity or striping), data is written identically to multiple drives, thereby producing a "mirrored set"; at least 2 drives are required to constitute such an array. While more constituent drives may be employed, many implementations deal with a maximum of only 2; of course, it might be possible to use such a limited level 1 RAID itself as a constituent of a level 1 RAID, effectively masking the limitation. The array continues to operate as long as at least one drive is functioning. With appropriate operating system support, there can be increased read performance, and only a minimal write performance reduction; implementing RAID 1 with a separate controller for each drive in order to perform simultaneous reads (and writes) is sometimes called multiplexing (or duplexing when there are only 2 drives).
- In RAID 2 (bit-level striping with dedicated Hamming-code parity), all disk spindle rotation is synchronized, and data is striped such that each sequential bit is on a different drive. Hamming-code parity is calculated across corresponding bits and stored on at least one parity drive.
- In RAID 3 (byte-level striping with dedicated parity), all disk spindle rotation is synchronized, and data is striped so each sequential byte is on a different drive. Parity is calculated across corresponding bytes and stored on a dedicated parity drive.
- RAID 4 (block-level striping with dedicated parity) is identical to RAID 5 (see below), but confines all parity data to a single drive. In this setup, files may be distributed between multiple drives. Each drive operates independently, allowing I/O requests to be performed in parallel. However, the use of a dedicated parity drive could create a performance bottleneck: because the parity data must be written to a single, dedicated parity drive for each block of non-parity data, the overall write performance may depend a great deal on the performance of this parity drive.
- RAID 5 (block-level striping with distributed parity) distributes parity along with the data and requires all drives but one to be present to operate; the array is not destroyed by a single drive failure. Upon drive failure, any subsequent reads can be calculated from the distributed parity such that the drive failure is masked from the end user. However, a single drive failure results in reduced performance of the entire array until the failed drive has been replaced and the associated data rebuilt. Additionally, there is the potentially disastrous RAID 5 write hole, a data corruption issue caused by non-atomic writes.
- RAID 6 (block-level striping with double distributed parity) provides fault tolerance of two drive failures; the array continues to operate with up to two failed drives. This makes larger RAID groups more practical, especially for high-availability systems. This becomes increasingly important as large-capacity drives lengthen the time needed to recover from the failure of a single drive. Single-parity RAID levels are as vulnerable to data loss as a RAID 0 array until the failed drive is replaced and its data rebuilt; the larger the drive, the longer the rebuild takes. Double parity gives additional time to rebuild the array without the data being at risk if a single additional drive fails before the rebuild is complete.
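As referenced in the RAID 0 entry above, striping maps logically sequential blocks onto successive drives so they can be transferred in parallel. Here is a minimal sketch of that mapping, assuming a simple round-robin layout for illustration:

```python
def raid0_location(block, n_drives):
    """Map a logical block number to (drive, stripe) under RAID 0
    striping: consecutive blocks land on consecutive drives."""
    return block % n_drives, block // n_drives

# Eight sequential blocks across a four-drive stripe set.
for block in range(8):
    drive, stripe = raid0_location(block, n_drives=4)
    print(f"logical block {block} -> drive {drive}, stripe {stripe}")
```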
The following table provides an overview of the most important parameters of standard RAID levels. In each case:
- Array space efficiency is given as an expression in terms of the number of drives, n; this expression designates a value between 0 and 1, representing the fraction of the sum of the drives' capacities that is available for use. For example, if 3 drives are arranged in RAID 3, this gives an array space efficiency of 1 − 1/n = 2/3 (approximately 66%); thus, if each drive in this example has a capacity of 250 GB, then the array has a total capacity of 750 GB but the capacity that is usable for data storage is only 500 GB.
- Array failure rate is given as an expression in terms of the number of drives, n, and the drive failure rate, r (which is assumed to be identical and independent for each drive). For example, if each of 3 drives has a failure rate of 5% over the next 3 years, and these drives are arranged in RAID 3, then this gives an array failure rate of n(n − 1)r² = 3 × 2 × 0.05² = 1.5% over the next 3 years.
Level | Description | Minimum # of drives | Space Efficiency | Fault Tolerance | Array Failure Rate** | Read Benefit | Write Benefit
---|---|---|---|---|---|---|---
RAID 0 | Block-level striping without parity or mirroring. | 2 | 1 | 0 (none) | 1 − (1 − r)^n | nX | nX
RAID 1 | Mirroring without parity or striping. | 2 | 1/n | n − 1 drives | r^n | nX | 1X
RAID 2 | Bit-level striping with dedicated Hamming-code parity. | 3 | 1 − 1/n · log2(n − 1) | 1 drive; can also repair a corrupted bit when the bit's corresponding data and parity are good | variable | variable | variable
RAID 3 | Byte-level striping with dedicated parity. | 3 | 1 − 1/n | 1 drive | n(n − 1)r² | (n − 1)X | (n − 1)X*
RAID 4 | Block-level striping with dedicated parity. | 3 | 1 − 1/n | 1 drive | n(n − 1)r² | (n − 1)X | (n − 1)X*
RAID 5 | Block-level striping with distributed parity. | 3 | 1 − 1/n | 1 drive | n(n − 1)r² | (n − 1)X* | (n − 1)X*
RAID 6 | Block-level striping with double distributed parity. | 4 | 1 − 2/n | 2 drives | n(n − 1)(n − 2)r³ | (n − 2)X* | (n − 2)X*
* Assumes hardware is fast enough to support;
** Assumes independent, identical rate of failure amongst drives
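The table's space efficiency and failure rate expressions can be evaluated directly. Here is a minimal sketch using the single-parity and double-parity formulas above (n drives, per-drive failure rate r over some fixed period):

```python
# Space efficiency and array failure rate per the table above,
# for the single-parity and double-parity levels.

def raid5_space_efficiency(n):      # also RAID 3 and RAID 4
    return 1 - 1 / n

def raid5_failure_rate(n, r):       # fails when any 2 of n drives fail
    return n * (n - 1) * r**2

def raid6_failure_rate(n, r):       # fails when any 3 of n drives fail
    return n * (n - 1) * (n - 2) * r**3

n, r = 3, 0.05  # the worked example above: 3 drives, 5% over 3 years
print(f"usable fraction:    {raid5_space_efficiency(n):.1%}")  # 66.7%
print(f"array failure rate: {raid5_failure_rate(n, r):.1%}")   # 1.5%
```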
Nested (hybrid) RAID
In what was originally termed hybrid RAID, many storage controllers allow RAID levels to be nested. The elements of a RAID may be either individual drives or RAIDs themselves. Nesting more than two deep is unusual.
As there is no basic RAID level numbered larger than 9, nested RAIDs are usually unambiguously described by attaching the numbers indicating the RAID levels, sometimes with a "+" in between. The order of the digits in a nested RAID designation is the order in which the nested array is built: For a RAID 1+0, drives are first combined into multiple level 1 RAIDs that are themselves treated as single drives to be combined into a single RAID 0; the reverse structure is also possible (RAID 0+1).
The final RAID is known as the top array. When the top array is a RAID 0 (such as in RAID 1+0 and RAID 5+0), most vendors omit the "+" (yielding RAID 10 and RAID 50, respectively).
- RAID 0+1: striped sets in a mirrored set (minimum four drives; even number of drives) provides fault tolerance and improved performance but increases complexity.
- The key difference from RAID 1+0 is that RAID 0+1 creates a second striped set to mirror a primary striped set. The array continues to operate with one or more drives failed in the same mirror set, but if drives fail on both sides of the mirror the data on the RAID system is lost.
- RAID 1+0: (a.k.a. RAID 10) mirrored sets in a striped set (minimum four drives; even number of drives) provides fault tolerance and improved performance but increases complexity.
- The key difference from RAID 0+1 is that RAID 1+0 creates a striped set from a series of mirrored drives. The array can sustain multiple drive losses so long as no mirror loses all its drives.
- RAID 5+1: mirrored striped set with distributed parity (some manufacturers label this as RAID 53).
Whether an array runs as RAID 0+1 or RAID 1+0 in practice is often determined by the evolution of the storage system. A RAID controller might support upgrading a RAID 1 array to a RAID 1+0 array on the fly, but require a lengthy offline rebuild to upgrade from RAID 1 to RAID 0+1. With nested arrays, sometimes the path of least disruption prevails over achieving the preferred configuration.
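The resiliency difference between the two orderings can be checked by brute force. The following sketch assumes four drives with an invented labeling: mirrored pairs for RAID 1+0, mirrored stripe sets for RAID 0+1; it counts which two-drive failure combinations each layout survives:

```python
from itertools import combinations

DRIVES = ["a1", "a2", "b1", "b2"]

def raid10_ok(failed):
    # RAID 1+0: mirrored pairs (a1,a2) and (b1,b2), striped together.
    # Survives as long as no mirror pair loses both members.
    return not ({"a1", "a2"} <= failed or {"b1", "b2"} <= failed)

def raid01_ok(failed):
    # RAID 0+1: stripe sets (a1,b1) and (a2,b2), mirrored.
    # Survives as long as at least one stripe set is fully intact.
    return not ({"a1", "b1"} & failed) or not ({"a2", "b2"} & failed)

for layout, ok in (("RAID 1+0", raid10_ok), ("RAID 0+1", raid01_ok)):
    survived = sum(ok(set(pair)) for pair in combinations(DRIVES, 2))
    print(f"{layout}: survives {survived} of 6 two-drive failures")
# RAID 1+0 survives 4 of 6; RAID 0+1 survives 2 of 6.
```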
RAID Parity
Many RAID levels employ an error protection scheme called "parity". Most use the simple XOR parity described in this section, but RAID 6 uses two separate parities based respectively on addition and multiplication in a particular Galois field, or on Reed–Solomon error correction. XOR parity calculation is a widely used method in information technology to provide fault tolerance in a given set of data.
In Boolean logic, there is an operation called exclusive or (XOR), meaning "one or the other, but not neither and not both." For example:
0 XOR 0 = 0
0 XOR 1 = 1
1 XOR 0 = 1
1 XOR 1 = 0
The XOR operator is central to how parity data is created and used within an array. It is used both for the protection of data, as well as for the recovery of missing data.
As an example, consider a simple RAID made up of 6 drives (4 for data, 1 for parity, and 1 for use as a hot spare), where each drive has only a single byte worth of storage (a '-' represents a bit whose value does not matter at this point in the discussion). Let the following data be written to the data drives:
Drive #1: 00101010 (Data)
Drive #2: 10001110 (Data)
Drive #3: 11110111 (Data)
Drive #4: 10110101 (Data)
Drive #5: -------- (Hot Spare)
Drive #6: -------- (Parity)
Every time data is written to the data drives, a parity value is calculated in order to be able to recover from a data drive failure. To calculate the parity for this RAID, a bitwise XOR of each drive's data is calculated as follows, the result of which is the parity data:
00101010 XOR 10001110 XOR 11110111 XOR 10110101 = 11100110
The parity data 11100110 is then written to the dedicated parity drive (Drive #6). Suppose Drive #3 fails. In order to restore the contents of Drive #3, the same XOR calculation is performed using the data of all the remaining data drives and (as a substitute for Drive #3) the parity value (11100110) stored in Drive #6:
00101010 XOR 10001110 XOR 11100110 XOR 10110101 = 11110111
With the complete contents of Drive #3 recovered, the data is written to the hot spare, and the RAID can continue operating.
At this point the dead drive has to be replaced with a working one of the same size. Depending on the implementation, either the new drive takes over as a new hot spare and the old hot spare continues to act as a data drive of the array, or the original hot spare's contents are automatically copied to the new drive by the array controller, allowing the original hot spare to return to its original purpose as an emergency standby drive. The resulting array is identical to its pre-failure state.
This same basic XOR principle applies to parity within RAID groups regardless of capacity or number of drives. As long as there are enough drives present to allow for an XOR calculation to take place, parity can be used to recover data from any single drive failure. (A minimum of three drives must be present in order for parity to be used for fault tolerance, because the XOR operator requires two operands, and a place to store the result).
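The parity arithmetic above is easy to verify programmatically. Here is a minimal sketch in Python, using the example's byte values; `xor_parity` is a helper invented for the illustration:

```python
from functools import reduce

def xor_parity(blocks):
    """Bitwise XOR of all blocks: the parity used in the example above."""
    return reduce(lambda a, b: a ^ b, blocks)

# The four data bytes written to Drives #1-#4.
data = {1: 0b00101010, 2: 0b10001110, 3: 0b11110111, 4: 0b10110101}

parity = xor_parity(data.values())          # written to Drive #6
assert parity == 0b11100110

# Drive #3 fails: XOR the surviving data drives with the parity.
surviving = [v for d, v in data.items() if d != 3]
rebuilt = xor_parity(surviving + [parity])  # copied onto the hot spare
assert rebuilt == data[3]                   # 11110111 recovered
print(f"parity = {parity:08b}, rebuilt Drive #3 = {rebuilt:08b}")
```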
RAID 10 versus RAID 5 in Relational Databases
A common opinion (and one which serves to illustrate the dynamics of proper RAID deployment) is that RAID 10 is inherently better for relational databases than RAID 5, because RAID 5 requires the recalculation and redistribution of parity data on a per-write basis. While this may have been a hurdle in past RAID 5 implementations, the task of parity recalculation and redistribution within modern storage area network (SAN) appliances is performed as a back-end process transparent to the host, not as an in-line process which competes with existing I/O (i.e., the RAID controller handles this as a housekeeping task to be performed during a particular spindle's idle timeslices, so as not to disrupt any pending I/O from the host). The "write penalty" inherent to RAID 5 has been effectively masked since the late 1990s by a combination of improved controller design, larger amounts of cache, and faster drives. The effect of a write penalty when using RAID 5 is mostly a concern when the workload cannot be destaged efficiently from the SAN controller's write cache.
In some enterprise-level SAN hardware, any writes generated by the host are simply stored in a small, mirrored NVRAM or SSD cache, acknowledged far faster than a write to disk, and later synchronized with the drives when the controller sees fit to do so from an efficiency standpoint. From the host's perspective, an individual write to a RAID 10 volume is no faster than an individual write to a RAID 5 volume, given a sufficient amount of write cache; a difference between them becomes apparent only when the write cache at the SAN controller level is overwhelmed, so that the SAN appliance must reject or throttle further write requests. SAN appliances generally service multiple hosts that compete for controller cache, any SSD cache, and spindle time.
The choice between RAID 10 and RAID 5 for the purpose of housing a relational database depends upon a number of factors (spindle availability, cost, business risk, etc.) but, from a performance standpoint, it depends mostly on the type of I/O expected for a particular database application. For databases that are expected to be exclusively or strongly read-biased, RAID 10 is often chosen because it offers a slight speed improvement over RAID 5 on sustained reads and sustained randomized writes. If a database is expected to be strongly write-biased, RAID 5 becomes the more attractive option, since RAID 5 does not suffer from the same write handicap inherent in RAID 10; all spindles in a RAID 5 can be utilized to write simultaneously, whereas only half the members of a RAID 10 can be used. However, for reasons similar to what has eliminated the "read penalty" in RAID 5, the 'write penalty' of RAID 10 has been largely masked by improvements in controller cache efficiency and drive throughput.
What causes RAID 5 to be slightly slower than RAID 10 on sustained reads is the fact that RAID 5 has parity data interleaved within normal data. For every read pass in RAID 5, there is a probability that a read head may need to traverse a region of parity data. The cumulative effect of this is a slight performance drop compared to RAID 10, which does not use parity and therefore never encounters a circumstance where the data underneath a head is of no use. In the vast majority of situations, however, most relational databases housed on RAID 10 perform equally well in RAID 5. The strengths and weaknesses of each type only become an issue in atypical deployments, or deployments on overcommitted hardware. Often, any measurable differences between the two formats are masked by structural deficiencies at the host layer, such as poor database maintenance or sub-optimal I/O configuration settings.
There are, however, considerations other than performance. RAID 5 and other non-mirror-based arrays offer a lower degree of resiliency than RAID 10, by virtue of RAID 10's mirroring strategy. In a RAID 10, I/O can continue despite multiple drive failures; by comparison, in a RAID 5 array, any failure involving more than one drive renders the array itself unusable, as parity recalculation becomes impossible to perform. Thus, RAID 10 is frequently favored because it provides the lowest level of risk.
Additionally, the time required to rebuild data on a hot spare in a RAID 10 is significantly less than in a RAID 5, because all the remaining spindles in a RAID 5 rebuild must participate in the process, whereas only the hot spare and one mirror are required in a RAID 10. Thus, in comparison to a RAID 5, a RAID 10 has a smaller window of opportunity during which a second drive failure could cause array failure.
Modern SAN design largely masks any performance hit while a RAID is in a degraded state, by virtue of being able to perform rebuild operations either in-band or out-of-band with respect to existing I/O traffic. Given the rarity of drive failures in general, and the exceedingly low probability of multiple concurrent drive failures within the same RAID, the choice of RAID 5 over RAID 10 often comes down to the preference of the storage administrator, particularly when weighed against other factors such as cost, throughput requirements, and physical spindle availability.
In short, the choice between RAID 5 and RAID 10 involves a complicated mixture of factors. There is no one-size-fits-all solution, as the choice of one over the other must be dictated by everything from the I/O characteristics of the database, to business risk, to worst case degraded-state throughput, to the number and type of drives present in the array itself. Over the course of the life of a database, one may even see situations where RAID 5 is initially favored, but RAID 10 slowly becomes the better choice, and vice versa.
New RAID classification
In 1996, the RAID Advisory Board introduced an improved classification of RAID systems. It divides RAID into three types:
- Failure-resistant (systems that protect against loss of data due to drive failure).
- Failure-tolerant (systems that protect against loss of data access due to failure of any single component).
- Disaster-tolerant (systems that consist of two or more independent zones, either of which provides access to stored data).
The original "Berkeley" RAID classifications are still kept as an important historical reference point and also to recognize that RAID Levels 0–6 successfully define all known data mapping and protection schemes for disk-based storage systems. Unfortunately, the original classification caused some confusion due to the assumption that higher RAID levels imply higher redundancy and performance; this confusion has been exploited by RAID system manufacturers, and it has given birth to the products with such names as RAID-7, RAID-10, RAID-30, RAID-S, etc. Consequently, the new classification describes the data availability characteristics of a RAID system, leaving the details of its implementation to system manufacturers.
- Failure-resistant disk systems (FRDS) (meets a minimum of criteria 1–6)
- Protection against data loss and loss of access to data due to drive failure
- Reconstruction of failed drive content to a replacement drive
- Protection against data loss due to a "write hole"
- Protection against data loss due to host and host I/O bus failure
- Protection against data loss due to replaceable unit failure
- Replaceable unit monitoring and failure indication
- Failure-tolerant disk systems (FTDS) (meets a minimum of criteria 1–15)
- Disk automatic swap and hot swap
- Protection against data loss due to cache failure
- Protection against data loss due to external power failure
- Protection against data loss due to a temperature out of operating range
- Replaceable unit and environmental failure warning
- Protection against loss of access to data due to device channel failure
- Protection against loss of access to data due to controller module failure
- Protection against loss of access to data due to cache failure
- Protection against loss of access to data due to power supply failure
- Disaster-tolerant disk systems (DTDS) (meets a minimum of criteria 1–21)
- Protection against loss of access to data due to host and host I/O bus failure
- Protection against loss of access to data due to external power failure
- Protection against loss of access to data due to component replacement
- Protection against loss of data and loss of access to data due to multiple drive failures
- Protection against loss of access to data due to zone failure
- Long-distance protection against loss of data due to zone failure
Non-standard levels
Many configurations other than the basic numbered RAID levels are possible, and many companies, organizations, and groups have created their own non-standard configurations, in many cases designed to meet the specialised needs of a small niche group. Most of these non-standard RAID levels are proprietary.
- Storage Computer Corporation (now defunct) used to call a cached version of RAID 3 and RAID 4 "RAID 7".
- EMC Corporation used to offer RAID S as an alternative to RAID 5 on their Symmetrix systems. Their latest generations of Symmetrix, the DMX and the V-Max series, do not support RAID S (instead they support RAID 1, RAID 5, and RAID 6).
- The ZFS filesystem, available in Solaris, OpenSolaris, and FreeBSD, offers RAID-Z, which solves RAID 5's write hole problem.
- Hewlett-Packard's Advanced Data Guarding (ADG) is a form of RAID 6.
- NetApp's Data ONTAP uses RAID-DP (also referred to as "double", "dual", or "diagonal" parity), a form of RAID 6 that, unlike many RAID 6 implementations, does not use distributed parity as in RAID 5. Instead, two dedicated parity drives with separate parity calculations are used; this is a modification of RAID 4 with an extra parity drive.
- Accusys Triple Parity (RAID TP) implements three independent parities by extending RAID 6 algorithms on its FC-SATA and SCSI-SATA RAID controllers, tolerating the failure of three drives.
- Linux MD RAID10 implements a general RAID driver that defaults to a standard RAID 1 with two drives and a standard RAID 1+0 with four drives, but can use any number of drives, including odd numbers. MD RAID10 can run striped and mirrored even with only two drives, using the f2 layout (mirroring with striped reads, giving the read performance of RAID 0; normal Linux software RAID 1 does not stripe reads, but can read in parallel).
- Hewlett-Packard's EVA series arrays implement vRAID: vRAID-0, vRAID-1, vRAID-5, and vRAID-6. The vRAID levels are closely aligned with nested RAID levels: vRAID-1 is actually a RAID 1+0 (RAID 10), vRAID-5 is actually a RAID 5+0 (RAID 50), and so on.
- IBM (among others) has implemented RAID 1E (Level 1 Enhanced). It requires a minimum of three drives and is similar to a RAID 1+0 array, but can be implemented with either an even or odd number of drives. The total available storage is half the raw capacity.
- Hadoop has a RAID system that generates a parity file by XOR-ing a stripe of blocks in a single HDFS file (see the sketch following this list).
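As an illustration of the XOR parity idea such schemes share, the following is a minimal Python sketch (the block contents and sizes are invented for the example): parity is the bytewise XOR of the data blocks, and any single lost block can be recovered by XOR-ing the parity with the survivors.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks; the result is the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Illustrative 4-byte 'blocks' standing in for drive stripes.
data = [b'\x01\x02\x03\x04', b'\x10\x20\x30\x40', b'\xaa\xbb\xcc\xdd']
parity = xor_blocks(data)

# Recover a lost block by XOR-ing the parity with the surviving blocks.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
recovered = xor_blocks(survivors + [parity])
assert recovered == data[lost_index]
```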
Data backup
A RAID system used as secondary storage is not an alternative to backing up data. In parity configurations, a RAID protects from catastrophic data loss caused by physical damage or errors on a single drive within the array (or two drives in, say, RAID 6). However, a true backup system has other important features, such as the ability to restore an earlier version of data, which is needed both to protect against software errors that write unwanted data to secondary storage, and also to recover from user error
and malicious data deletion. A RAID can be overwhelmed by catastrophic failure that exceeds its recovery capacity and, of course, the entire array is at risk of physical damage by fire, natural disaster, and human forces, while backups can be stored off-site. A RAID is also vulnerable to controller failure because it is not always possible to migrate a RAID to a new, different controller without data loss.
Implementations
The distribution of data across multiple drives can be managed either by dedicated hardware or by software. A software solution may be part of the operating system, or it may be part of the firmware and drivers supplied with a hardware RAID controller.
Software-based RAID
Software RAID implementations are now provided by many operating systems. Software RAID can be implemented as:
- a layer that abstracts multiple devices, thereby providing a single virtual device (e.g. Linux's md);
- a more generic logical volume manager (provided with most server-class operating systems, e.g. Veritas or LVM);
- a component of the file system (e.g. ZFS or Btrfs).
Volume manager support
Server-class operating systems typically provide logical volume management, which allows a system to use logical volumes that can be resized or moved. Often, features like RAID or snapshots are also supported.
- Vinum is a logical volume manager supporting RAID 0, RAID 1, and RAID 5. It is part of the base distribution of the FreeBSD operating system, and versions exist for NetBSD, OpenBSD, and DragonFly BSD.
- Solaris SVM supports RAID 1 for the boot filesystem, and adds RAID 0 and RAID 5 support (and various nested combinations) for data drives.
- Linux LVM supports RAID 0 and RAID 1.
- HP's OpenVMS provides a form of RAID 1 called "volume shadowing", giving the possibility to mirror data locally and at remote cluster systems.
File system support
Some advanced file systems are designed to organize data across multiple storage devices directly, without needing the help of a third-party logical volume manager.
- ZFS supports equivalents of RAID 0, RAID 1, RAID 5 (RAID-Z), RAID 6 (RAID-Z2), a triple-parity version (RAID-Z3), and any nested combination of those, such as 1+0. ZFS is the native file system on Solaris, and is also available on FreeBSD.
- Btrfs supports RAID 0, RAID 1, and RAID 10 (RAID 5 and 6 are under development).
Other support
Many operating systems provide basic RAID functionality independently of volume management.
- Apple's Mac OS X Server and Mac OS X support RAID 0, RAID 1, and RAID 1+0.
- FreeBSD supports RAID 0, RAID 1, RAID 3, and RAID 5, and all nestings, via GEOM modules and ccd.
- Linux's md supports RAID 0, RAID 1, RAID 4, RAID 5, RAID 6, and all nestings. Certain reshaping/resizing/expanding operations are also supported.
- Microsoft's server operating systems support RAID 0, RAID 1, and RAID 5. Some Microsoft desktop operating systems also support RAID: Windows XP Professional supports RAID 0, in addition to spanning multiple drives, but only when using dynamic disks and volumes. Windows XP can be modified to support RAID 0, 1, and 5.
- NetBSD supports RAID 0, RAID 1, RAID 4, and RAID 5, and all nestings, via its software implementation, named RAIDframe.
- OpenBSD aims to support RAID 0, RAID 1, RAID 4, and RAID 5 via its software implementation, softraid.
- FlexRAID (for Linux and Windows) is a snapshot RAID implementation.
Software RAID has advantages and disadvantages compared to hardware RAID. The software must run on a host server attached to storage, and the server's processor must dedicate processing time to run the RAID software; the additional processing capacity required for RAID 0 and RAID 1 is low, but parity-based arrays require more complex data processing during write or integrity-checking operations. As the rate of data processing increases with the number of drives in the array, so does the processing requirement. Furthermore, all the buses between the processor and the drive controller must carry the extra data required by RAID, which may cause congestion.
Fortunately, over time, the increase in commodity CPU speed has been consistently greater than the increase in drive throughput; the percentage of host CPU time required to saturate a given number of drives has decreased. For instance, under 100% usage of a single core on a 2.1 GHz Intel "Core2" CPU, the Linux software RAID subsystem (md) as of version 2.6.26 is capable of calculating parity information at 6 GB/s; however, a three-drive RAID 5 array using drives capable of sustaining a write operation at 100 MB/s only requires parity to be calculated at the rate of 200 MB/s, which requires the resources of just over 3% of a single CPU core.
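These figures can be checked with simple arithmetic; the short Python sketch below merely restates the numbers quoted above (the rates are the article's examples, not measurements).

```python
# Figures quoted above: md parity throughput on one 2.1 GHz core, and a
# three-drive RAID 5 (two data members plus one parity per stripe)
# writing at 100 MB/s per drive.
parity_rate = 6_000      # MB/s of parity the CPU core can compute
drive_write_rate = 100   # MB/s sustained per drive
data_drives = 2          # three-drive RAID 5: two data, one parity

# Parity must be computed over every data byte written.
required = drive_write_rate * data_drives   # 200 MB/s
cpu_fraction = required / parity_rate       # roughly 0.033
print(f"Parity load: {required} MB/s, {cpu_fraction:.1%} of one core")
```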
Furthermore, software RAID implementations may employ more sophisticated algorithms than hardware RAID implementations (e.g. drive scheduling and command queueing), and thus, may be capable of better performance.
Another concern with software implementations is the process of booting the associated operating system. For instance, consider a computer being booted from a RAID 1 (mirrored drives); if the first drive in the RAID 1 fails, then a first-stage boot loader might not be sophisticated enough to attempt loading the second-stage boot loader from the second drive as a fallback. In contrast, a hardware RAID 1 controller typically has explicit programming to decide that a drive has malfunctioned and that the next drive should be used. At least the following second-stage boot loaders are capable of loading a kernel from a RAID 1:
- LILO (for Linux).
- Some configurations of GRUB.
. - The boot loader for FreeBSD.
- The boot loader for NetBSD.
For data safety, the write-back cache of an operating system or individual drive might need to be turned off in order to ensure that as much data as possible is actually written to secondary storage before a failure (such as a loss of power); unfortunately, turning off the write-back cache has a performance penalty that can be significant, depending on the workload and command queuing support. In contrast, a hardware RAID controller may carry a dedicated battery-powered write-back cache of its own, allowing efficient operation that is also relatively safe. It is possible to avoid such problems with a software controller by constructing the RAID from safer components: for instance, each drive could have its own battery or capacitor backing its write-back cache, the drives could implement write atomicity in various ways, and the entire RAID or computing system could be powered by a UPS (uninterruptible power supply).
Finally, a software RAID controller that is built into an operating system usually uses proprietary data formats and RAID levels, so an associated RAID usually cannot be shared between operating systems as part of a multi-boot setup. However, such a RAID may be moved between computers that share the same operating system. In contrast, such mobility is more difficult with a hardware RAID controller, because both computers must provide compatible hardware controllers; furthermore, if the hardware controller fails, data could become unrecoverable unless a hardware controller of the same type is obtained.
Most software implementations allow a RAID to be created from partitions rather than entire physical drives. For instance, an administrator could divide each drive of an odd number of drives into two partitions, mirror partitions across drives, and stripe a volume across the mirrored partitions to emulate IBM's RAID 1E configuration. Using partitions in this way also allows multiple RAIDs at various RAID levels to be constructed from the same set of drives. For example, one could have a very robust RAID 1 for important files and a less robust RAID 5 or RAID 0 for less important data, all using the same set of underlying drives. (Some BIOS-based controllers offer similar features, e.g. Intel Matrix RAID.) Using two partitions from the same drive in the same RAID puts data at risk if the drive fails; for instance:
- A RAID 1 across partitions from the same drive makes all the data inaccessible if the single drive fails.
- Consider a RAID 5 composed of four drives, three of which are 250 GB and one of which is 500 GB, with the 500 GB drive split into two 250 GB partitions. A failure of the 500 GB drive would then remove two underlying 'drives' from the array, causing a failure of the entire array (sketched below).
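A minimal sketch of that failure mode, using hypothetical drive and partition names, shows why a RAID member list should be checked against physical spindles:

```python
# Hypothetical layout from the example above: a RAID 5 built from five
# 250 GB member 'drives', two of which are partitions on the same
# physical 500 GB drive.
partition_to_drive = {
    "p1": "driveA", "p2": "driveB", "p3": "driveC",
    "p4": "driveD", "p5": "driveD",   # p4 and p5 share the 500 GB drive
}

def raid5_survives(failed_drive):
    """RAID 5 tolerates the loss of at most one member."""
    lost = [p for p, d in partition_to_drive.items() if d == failed_drive]
    return len(lost) <= 1

for drive in sorted(set(partition_to_drive.values())):
    verdict = "array survives" if raid5_survives(drive) else "array lost"
    print(f"{drive} fails -> {verdict}")   # only driveD destroys the array
```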
Hardware-based RAID
Hardware RAID controllers use proprietary data layouts, so it is not usually possible to span controllers from different manufacturers. They do not require processor resources, the BIOS can boot from them, and tighter integration with the device driver may offer better error handling. On a desktop system, a hardware RAID controller may be an expansion card connected to a bus (e.g., PCI or PCIe) or a component integrated into the motherboard. There are controllers supporting most types of drive technology, such as IDE/ATA, SATA, SCSI, SSA, Fibre Channel, and sometimes even a combination. The controller and drives may be in a stand-alone enclosure, rather than inside a computer, and the enclosure may be directly attached to a computer or connected via a SAN.
Most hardware implementations provide a read/write cache, which, depending on the I/O workload, improves performance. In most systems, the write cache is non-volatile (i.e. battery-protected), so pending writes are not lost in the event of a power failure.
Hardware implementations provide guaranteed performance, add no computational overhead to the host computer, and can support many operating systems; the controller simply presents the RAID as another logical drive.
Firmware/driver-based RAID
A RAID implemented at the level of an operating system is not always compatible with the system's boot process, and it is generally impractical for desktop versions of Windows (as described above). However, hardware RAID controllers are expensive and proprietary. To fill this gap, cheap "RAID controllers" were introduced that do not contain a dedicated RAID controller chip, but simply a standard drive controller chip with special firmware and drivers; during early-stage bootup the RAID is implemented by the firmware, and once the operating system has been more completely loaded, the drivers take over control. Consequently, such controllers may not work when driver support is not available for the host operating system.
Initially, the term "RAID controller" implied that the controller does the processing. However, while a controller without a dedicated RAID chip is often described by a manufacturer as a "RAID controller", it is rarely made clear that the burden of RAID processing is borne by the host computer's central processing unit rather than by the controller itself. Thus, this type is sometimes called "fake RAID"; Adaptec calls it "HostRAID".
Moreover, a firmware controller can often only support certain types of hard drive to form the RAID that it manages (e.g. SATA for Intel Matrix RAID, as there is neither SCSI nor PATA support in modern Intel ICH southbridges; however, motherboard makers implement RAID controllers outside of the southbridge on some motherboards).
Hot spares
Both hardware and software RAIDs with redundancy may support the use of a hot spare drive: a drive physically installed in the array that remains inactive until an active drive fails, at which point the system automatically replaces the failed drive with the spare and rebuilds the array onto it. This reduces the mean time to recovery (MTTR), but does not completely eliminate it. As with non-hot-spare systems, subsequent additional failures in the same RAID redundancy group before the array is fully rebuilt can cause data loss. Rebuilding can take several hours, especially on busy systems.
Drives procured and installed at the same time are sometimes considered more likely to fail at about the same time than unrelated drives would be, so rapid replacement of a failed drive is important. RAID 6 without a spare uses the same number of drives as RAID 5 with a hot spare and protects data against the failure of up to two drives, but requires a more advanced RAID controller. Furthermore, a hot spare can be shared by multiple RAID sets.
Data scrubbing
Data scrubbing is the periodic reading and checking, by the RAID controller, of all the blocks in a RAID, including those not otherwise accessed. This allows bad blocks to be detected before they are used.
Reliability terms
Failure rate: Two different kinds of failure rates are applicable to RAID systems. Logical failure is defined as the loss of any single drive, and its rate is equal to the sum of the individual drives' failure rates. System failure is defined as loss of data, and its rate depends on the type of RAID: for RAID 0 it is equal to the logical failure rate, as there is no redundancy; for other types of RAID it is less than the logical failure rate, potentially very small, with its exact value depending on the type of RAID, the number of drives employed, the vigilance and alacrity of its human administrators, and chance (improbable events do occur, though infrequently).
Mean time to data loss (MTTDL): In this context, the average time before a loss of data in a given array. Mean time to data loss of a given RAID may be higher or lower than that of its constituent hard drives, depending upon what type of RAID is employed. The referenced report assumes times to data loss are exponentially distributed, so that 63.2% of all data loss will occur between time 0 and the MTTDL.
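Although this article gives no formula, a classical back-of-the-envelope approximation for the MTTDL of a RAID 5, under the same independence and exponential-failure assumptions, is MTBF^2 / (N * (N - 1) * MTTR): data is lost when a second drive fails during the repair window that follows a first failure. A small Python sketch with purely illustrative numbers:

```python
def mttdl_raid5(mtbf_hours, mttr_hours, n_drives):
    """Classical approximation for RAID 5 under independent, exponential
    failures: loss occurs when a second drive fails within the MTTR
    window opened by a first failure."""
    return mtbf_hours ** 2 / (n_drives * (n_drives - 1) * mttr_hours)

# Illustrative numbers only: 1,000,000 h MTBF, 24 h rebuild, 8 drives.
hours = mttdl_raid5(1_000_000, 24, 8)
print(f"MTTDL ~ {hours:,.0f} hours ~ {hours / 8766:,.0f} years")
```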
Mean time to recovery (MTTR): In arrays that include redundancy for reliability, this is the time following a failure to restore an array to its normal failure-tolerant mode of operation. This includes time to replace a failed drive mechanism and time to rebuild the array (to replicate data for redundancy).
Unrecoverable bit error rate (UBE): This is the rate at which a drive will be unable to recover data after application of cyclic redundancy check (CRC) codes and multiple retries.
Write cache reliability: Some RAID systems use RAM write cache to increase performance. A power failure can result in data loss unless this buffer is backed by a supplementary battery that gives it time to write its contents to secondary storage before the system powers down.
Atomic write failure: Also known by various terms such as torn writes, torn pages, incomplete writes, interrupted writes, non-transactional, etc.
Correlated failures
The theory behind the error correction in RAID assumes that drive failures are independent. Given this assumption, it is possible to calculate how often failures occur and to arrange the array to make data loss arbitrarily improbable.
In practice, the drives in an array are often the same age (with similar wear) and subject to the same environment. Since many drive failures are due to mechanical issues (which are more likely on older drives), this violates the independence assumption; failures are in fact statistically correlated. The chance of a second failure before the first has been recovered from (causing data loss) is therefore higher than for random failures. In a study of about 100,000 drives, the probability of two drives in the same cluster failing within one hour was observed to be four times larger than predicted by the exponential statistical distribution, which characterises processes in which events occur continuously and independently at a constant average rate. The probability of two failures within the same 10-hour period was twice as large as predicted by an exponential distribution.
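Under the exponential (independent-failure) model that these studies tested, the chance of a second failure during a rebuild window can be computed directly. The sketch below uses illustrative numbers; per the observations above, real-world probabilities can be several times higher.

```python
import math

def p_second_failure(n_remaining, window_hours, mtbf_hours):
    """Probability that at least one of the remaining drives fails within
    the window, assuming independent exponential failures (the very
    assumption the cited study found to understate reality)."""
    rate = n_remaining / mtbf_hours   # combined failure rate of survivors
    return 1 - math.exp(-rate * window_hours)

# Illustrative: 7 surviving drives, 24 h rebuild, 1,000,000 h MTBF each.
print(f"{p_second_failure(7, 24, 1_000_000):.4%}")
```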
A common assumption is that "server-grade" drives fail less frequently than consumer-grade drives. Two independent studies (one by Carnegie Mellon University and the other by Google) have shown that the "grade" of a drive does not relate to its failure rate.
Atomicity
This is a little-understood and rarely mentioned failure mode for redundant storage systems that do not utilize transactional features. Database researcher Jim Gray wrote "Update in Place is a Poison Apple" during the early days of relational database commercialization. However, this warning largely went unheeded and fell by the wayside upon the advent of RAID, which many software engineers mistook as solving all data storage integrity and reliability problems. Many software programs update a storage object "in place"; that is, they write the new version of the object onto the same secondary storage addresses as the old version. While the software may also log some delta information elsewhere, it expects the storage to present "atomic write semantics", meaning that the write either occurred in its entirety or did not occur at all.
However, very few storage systems provide support for atomic writes, and even fewer specify their rate of failure in providing this semantic. During the act of writing an object, a RAID storage device will usually write all redundant copies of the object in parallel, although overlapped or staggered writes are more common when a single RAID processor is responsible for multiple drives. An error that occurs during the writing process may therefore leave the redundant copies in different states, and furthermore may leave the copies in neither the old nor the new state. The little-known failure mode is that delta logging relies on the original data being in either the old or the new state, so as to enable backing out the logical change; yet few storage systems provide atomic write semantics on a RAID.
While the battery-backed write cache may partially solve the problem, it is applicable only to a power failure scenario.
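A minimal sketch of the failure mode described above, with an invented in-memory stand-in for two mirrored copies, shows how an interrupted in-place update leaves the redundancy in an ambiguous state:

```python
# Sketch only: dictionaries stand in for the two redundant copies of a
# mirrored storage object.
copy_a = {"object": "old version"}
copy_b = {"object": "old version"}

def update_in_place(new_value, crash_between_writes=False):
    copy_a["object"] = new_value          # first redundant copy written
    if crash_between_writes:
        raise RuntimeError("power lost")  # second copy never written
    copy_b["object"] = new_value

try:
    update_in_place("new version", crash_between_writes=True)
except RuntimeError:
    pass

# The mirrors now disagree; without atomic-write semantics the array
# cannot tell which copy, if either, reflects the intended state.
print(copy_a, copy_b)
```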
Since transactional support is not universally present in hardware RAID, many operating systems include transactional support to protect against data loss during an interrupted write. Novell NetWare, starting with version 3.x, included a transaction tracking system. Microsoft introduced transaction tracking via the journaling feature in NTFS. ext4 has journaling with checksums; ext3 has journaling without checksums but with an "append-only" option, or ext3cow (copy-on-write). If the journal itself in a filesystem is corrupted, though, this can be problematic. The journaling in NetApp's WAFL file system provides atomicity by never updating data in place, as does ZFS. An alternative to journaling is soft updates, which are used in some BSD-derived systems' implementations of UFS.
Unrecoverable data
An unrecoverable bit error can present as a sector read failure. Some RAID implementations protect against this failure mode by remapping the bad sector, using the redundant data to retrieve a good copy of the data, and rewriting that good data to the newly mapped replacement sector. The UBE rate is typically specified as 1 bit in 10^15 for enterprise-class drives (SCSI, FC, SAS) and 1 bit in 10^14 for desktop-class drives (IDE/ATA/PATA, SATA). Increasing drive capacities and large RAID 5 redundancy groups have led to an increasing inability to successfully rebuild a RAID group after a drive failure, because an unrecoverable sector is found on the remaining drives. Double-protection schemes such as RAID 6 attempt to address this issue, but suffer from a very high write penalty.
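The impact of these UBE rates on a rebuild can be estimated with a standard back-of-the-envelope calculation (not taken from this article): treat each bit read as an independent trial at the quoted error rate. The drive sizes below are illustrative.

```python
import math

def p_rebuild_hits_ure(bytes_read, bits_per_error):
    """Probability of at least one unrecoverable read error while reading
    bytes_read, assuming independent errors at the quoted UBE rate."""
    bits = bytes_read * 8
    return 1 - math.exp(-bits / bits_per_error)

# Illustrative: rebuilding a six-drive RAID 5 of 2 TB drives reads the
# five surviving drives in full, i.e. 10 TB.
read = 10 * 10**12
print(f"desktop (1 in 10^14):    {p_rebuild_hits_ure(read, 1e14):.1%}")
print(f"enterprise (1 in 10^15): {p_rebuild_hits_ure(read, 1e15):.1%}")
```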
Write cache reliability
The drive system can acknowledge a write operation as soon as the data is in the cache, without waiting for it to be physically written. This typically occurs in old, non-journaled systems such as FAT32, or when the Linux/Unix "writeback" option is chosen without protections like "soft updates" (trading data reliability for I/O speed). A power outage or system hang, such as a BSOD (Blue Screen of Death), can mean a significant loss of any data queued in such a cache.
Often the write cache is protected by a battery, which mostly solves the problem: if a write fails because of a power failure, the controller can complete the pending writes as soon as it is restarted. This solution still has potential failure cases: the battery may have worn out, the power may be off for too long, the drives could be moved to another controller, or the controller itself could fail. Some systems can test the battery periodically, but this leaves the system without a fully charged battery for several hours.
An additional concern regards devices equipped with a write-back cache: a caching system that reports data as written as soon as it reaches the cache, rather than the non-volatile medium. The safer technique is write-through, which reports transactions as written only when they are written to the non-volatile medium.
Equipment compatibility
The methods used by various RAID controllers to store data are not necessarily compatible, so it may not be possible to read a RAID on different hardware, with the exception of RAID 1, which is typically represented as plain identical copies of the original data on each drive. Consequently, a non-drive hardware failure may require the use of identical hardware to recover the data, and an identical configuration may have to be reassembled without triggering a rebuild that would overwrite the data. Software RAID, such as that implemented in the Linux kernel, alleviates this concern: the setup is not hardware-dependent, runs on ordinary drive controllers, and allows the array to be reassembled. Additionally, individual drives of a RAID 1 (in software and most hardware implementations) can be read like normal drives when removed from the array, so no RAID system is required to retrieve the data. Inexperienced data recovery firms typically have a difficult time recovering data from RAID drives, with the exception of RAID 1 drives with a conventional data structure.
Data recovery in the event of a failed array
With larger drive capacities, the odds of a drive failure during rebuild are not negligible. In that event, the difficulty of extracting data from a failed array must be considered. Only a RAID 1 (mirror) stores all data on each drive in the array. Although it may depend on the controller, an individual drive from a RAID 1 can usually be read as a single conventional drive; this means a damaged RAID 1 can often be easily recovered if at least one component drive is in working condition. If the damage is more severe, some or all data can often be recovered by professional data recovery specialists. However, other RAID levels (like RAID 5) present much more formidable obstacles to data recovery.
Drive error recovery algorithms
Many modern drives have internal error recovery algorithms that can take upwards of a minute to recover and re-map data that the drive fails to read easily. Frequently, a RAID controller is configured to drop a component drive (that is, to assume it has failed) if the drive has been unresponsive for eight seconds or so; this can cause the controller to drop a good drive that simply has not been given enough time to complete its internal error recovery procedure. Consequently, desktop drives can be quite risky in a RAID, and so-called enterprise-class drives limit this error recovery time in order to avoid the problem.
A fix once existed for Western Digital's desktop drives: a utility called WDTLER.exe could enable TLER (time-limited error recovery), which limits the error recovery time to seven seconds. Around September 2009, Western Digital disabled this feature in their desktop drives (e.g., the Caviar Black line), making such drives unsuitable for use in a RAID.
However, Western Digital enterprise-class drives are shipped from the factory with TLER enabled. Similar technologies are used by Seagate, Samsung, and Hitachi. For non-RAID usage, an enterprise-class drive with a short error recovery timeout that cannot be changed is less suitable than a desktop drive.
In late 2010, the Smartmontools program began supporting the configuration of ATA Error Recovery Control, allowing the tool to configure many desktop-class hard drives for use in a RAID.
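As a usage illustration (not prescribed by this article), the relevant smartmontools feature is the SCT Error Recovery Control log. The sketch below shells out to smartctl and assumes it is installed, that the drive supports SCT ERC, that the process has sufficient privileges, and that /dev/sda is the intended device.

```python
import subprocess

device = "/dev/sda"  # placeholder for an actual drive

# Query the current SCT Error Recovery Control settings.
subprocess.run(["smartctl", "-l", "scterc", device], check=True)

# Set read and write recovery timeouts; values are in tenths of a
# second, so 70 means 7.0 seconds.
subprocess.run(["smartctl", "-l", "scterc,70,70", device], check=True)
```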
Recovery time is increasing
Drive capacity has grown at a much faster rate than transfer speed, and error rates have fallen only a little in comparison. Therefore, larger-capacity drives may take hours, if not days, to rebuild. Rebuild speed is also constrained if the entire array remains in operation at reduced capacity. Given a RAID with only one drive of redundancy (RAID 3, 4, or 5), a second failure would cause complete failure of the array. Even though individual drives' mean time between failures (MTBF) has increased over time, this increase has not kept pace with the increased storage capacity of the drives. The time to rebuild the array after a single drive failure, as well as the chance of a second failure during a rebuild, has increased over time.
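A lower bound on rebuild time follows from capacity and sustained transfer rate alone; the sketch below uses illustrative figures rather than measured ones.

```python
# Every byte of the replacement drive must be written (and, for parity
# RAID, the survivors read), so capacity / rate bounds the rebuild time
# from below; contention with live I/O only makes it longer.
def min_rebuild_hours(capacity_tb, rate_mb_s):
    return capacity_tb * 1e6 / rate_mb_s / 3600   # 1 TB = 1e6 MB

for tb in (0.5, 2, 8):
    print(f"{tb} TB at 100 MB/s: at least {min_rebuild_hours(tb, 100):.1f} h")
```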
Operator skills, correct operation
In order to provide the desired protection against physical drive failure, a RAID must be properly set up and maintained by an operator with sufficient knowledge of the chosen RAID configuration, the array controller (hardware or software), and failure detection and recovery. Unskilled handling of the array at any stage may exacerbate the consequences of a failure and result in downtime and full or partial loss of data that might otherwise be recoverable.
In particular, the array must be monitored, and any failures detected and dealt with promptly. Failure to do so leaves the array running in a degraded state, vulnerable to further failures, until the entire array becomes inoperable, resulting in data loss and downtime. In this case, any protection the array provides merely delays the loss.
The operator must know how to detect failures or verify healthy state of the array, identify which drive failed, have replacement drives available, and know how to replace a drive and initiate a rebuild of the array.
Other problems
While RAID may protect against physical drive failure, the data is still exposed to destruction by operators, software, hardware, and viruses. Many studies cite operator fault as the most common source of malfunction, such as a server operator replacing the incorrect drive in a faulty RAID and disabling the system (even temporarily) in the process. Most well-designed systems include separate backup systems that hold copies of the data but do not allow much interaction with it; most copy the data and remove the copy from the computer for safe storage.
History
Norman Ken Ouchi at IBM was awarded U.S. patent 4,092,732 in 1978, titled "System for recovering data stored in failed memory unit". The claims of this patent describe what would later be termed RAID 5 with full stripe writes. The patent also mentions that drive mirroring or duplexing (what would later be termed RAID 1) and protection with a dedicated parity drive (what would later be termed RAID 4) were prior art at the time.
The term RAID was first defined by David A. Patterson, Garth A. Gibson, and Randy Katz at the University of California, Berkeley, in 1987. They studied the possibility of using two or more drives to appear as a single device to the host system, and in June 1988 published the paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)" at the SIGMOD conference.
The paper suggested a number of prototype RAID levels, or combinations of drives, each with theoretical advantages and disadvantages. Over the years, different implementations of the RAID concept have appeared. Most differ substantially from the original idealized RAID levels, but the numbered names have remained. This can be confusing, since one implementation of RAID 5, for example, can differ substantially from another. RAID 3 and RAID 4 are often confused and even used interchangeably.
One of the early uses of RAID 0 and 1 was the Crosfield Electronics Studio 9500 page layout system, based on the Python workstation. The Python workstation was a Crosfield-managed international development using PERQ 3B electronics, benchMark Technology's Viper display system, and Crosfield's own RAID and fibre-optic network controllers. RAID 0 was particularly important to these workstations, as it dramatically sped up image manipulation for the pre-press market. Volume production started in Peterborough, England, in early 1987.
Non-RAID drive architectures
Non-RAID drive architectures also exist and are often referred to, like RAID, by standard acronyms, several of them tongue-in-cheek. A single drive is referred to as a SLED (Single Large Expensive Disk/Drive), by contrast with RAID, while an array of drives without any additional control (accessed simply as independent drives) is referred to, even in formal contexts such as equipment specifications, as a JBOD (Just a Bunch Of Disks). Simple concatenation is referred to as a "span".
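The following is a minimal Python sketch of the address mapping behind a span, under hypothetical drive sizes: each logical block address falls on exactly one member drive, with no striping and no redundancy, so losing one drive loses only the data stored on it.

def span_lookup(lba, drive_sizes):
    """Map a logical block address to (drive index, block within drive)."""
    for i, size in enumerate(drive_sizes):
        if lba < size:
            return i, lba
        lba -= size
    raise ValueError("address beyond the end of the span")

# Three drives of 100, 250, and 50 blocks appear as one 400-block volume.
assert span_lookup(0,   [100, 250, 50]) == (0, 0)
assert span_lookup(120, [100, 250, 50]) == (1, 20)
assert span_lookup(399, [100, 250, 50]) == (2, 49)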
See also
- Standard RAID levels
- Disk array controller
- Redundant Array of Inexpensive Nodes
- Redundant Array of Independent Memory (RAIM)
- Stable storage
- Hard drives
- Disk array
- Storage area network (SAN)