COEN 180
Disk Arrays

In many applications, such as databases, the bottleneck of disk use is not capacity but the number of operations a disk can sustain per second. This was true in 1990, when RAIDs appeared on the scene, and it has become truer as disk capacity improvements outrun disk performance improvements. Parallelizing accesses by choosing many small disks over a single large disk, however, damages the overall reliability of the system.

Mirroring (where two disks are "exact" mirrors of each other) has long been used (e.g., by Tandem) to provide higher reliability in settings where the consequences of a hard drive failure are bad enough to justify the price of the second drive. Mirrored disks have a small read-performance advantage, because a read request can be serviced by whichever disk has the smaller access time. This advantage is almost negated by the fact that a write needs to be served by both disks; however, a typical workload has more reads than writes, and write response time is less important.

At a time when disk drives were still expensive, RAID Level 5 (see below) emerged (at several locations, though the Berkeley group around R. Katz was the first to publish) to address the cost problem.

Multiple disks allow faster access to large objects through striping. In striping, a large object is broken into components that are then distributed over as many disks as there are components. Access to the object is then faster, since the component accesses proceed in parallel.
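As a minimal sketch of striping, the following maps a logical byte offset to a disk and an offset on that disk, assuming a fixed stripe-unit size and round-robin placement (the constants and names are illustrative, not from any particular array):

```python
STRIPE_UNIT = 4096   # bytes per stripe unit (assumed)
NUM_DISKS = 4        # number of disks in the array (assumed)

def locate(byte_offset):
    """Map a logical byte offset to (disk, offset_on_disk)."""
    unit = byte_offset // STRIPE_UNIT       # which stripe unit overall
    disk = unit % NUM_DISKS                 # round-robin over the disks
    stripe = unit // NUM_DISKS              # which stripe (row) on that disk
    offset_on_disk = stripe * STRIPE_UNIT + byte_offset % STRIPE_UNIT
    return disk, offset_on_disk
```

Consecutive stripe units land on consecutive disks, so a large sequential read touches all disks in parallel.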

RAIDs and RAID Levels

The key idea behind RAID is to store the block-wise parity on another disk. The mathematics is fairly simple. If x1, x2, ..., xn are data bytes stored at the same position in corresponding blocks on n disks, and if p = x1 ^ x2 ^ ... ^ xn is the bitwise parity of the data, then x1 is the bitwise exclusive-or of all the other data items and p, i.e., x1 = p ^ x2 ^ ... ^ xn, and similarly for every other data item. Thus, if one disk fails, we can reconstruct its contents from the remaining disks.
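This parity calculation can be sketched in a few lines; the helper names are illustrative, and each "block" is simply a bytes object of equal length:

```python
from functools import reduce

def parity(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """Rebuild the single missing data block from the survivors and the parity."""
    return parity(surviving_blocks + [parity_block])

data = [b"\x0f\xf0", b"\xaa\x55", b"\x12\x34"]
p = parity(data)
# Losing any one block, e.g. data[1], is recoverable:
assert reconstruct([data[0], data[2]], p) == data[1]
```

Reconstruction is the same XOR operation as parity generation, which is what makes the scheme so cheap to implement.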

In the first RAID (Redundant Array of Inexpensive Disks, later Redundant Array of Independent Disks) paper by Patterson, Gibson, and Katz, five different RAID levels were distinguished. RAID Level 0 is a JBOD (Just a Bunch Of Disks). RAID Level 1 pairs up disks and mirrors them. RAID Level 4 has a dedicated parity disk, and RAID Level 5 distributes the parity cyclically through the RAID.

A write operation to any data block now has to update the parity block as well. If many blocks in the same reliability group have changed, we can simply recalculate the parity block; for example, if all n data blocks change, then we now have to write n+1 blocks. If only one data block has changed, however, we use the so-called small write operation. For simplicity of notation, assume that data item D1a is changing to D1a' in the situation depicted in Figure 1. We can calculate that the new parity is P' = D1a'^D1b^D1c^D1d = D1a'^D1a^D1a^D1b^D1c^D1d = D1a'^D1a^P, where we use the fact that D1a^D1a cancels out. Thus: the new parity is the old parity XORed with the delta of the data. In other words, to update the parity we read the old data, XOR it with the new data, read the old parity, calculate the new parity via an XOR operation, and then write the new data and the new parity. Ignoring data transfer times, this operation costs three latencies and one seek at each disk. At a mirrored disk, we would have only one latency and one seek at each disk.
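The small-write rule above can be checked directly in code. This sketch (illustrative names) computes the new parity from only the old data, the new data, and the old parity, and verifies it against a full recomputation over a four-block reliability group:

```python
def xor(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data, new_data, old_parity):
    """New parity = old parity XOR (delta of the data)."""
    delta = xor(old_data, new_data)
    return xor(old_parity, delta)

# Cross-check against full parity recomputation for a 4-block group:
d = [bytes([i] * 4) for i in (1, 2, 3, 4)]
p = xor(xor(d[0], d[1]), xor(d[2], d[3]))
d0_new = bytes([9] * 4)
p_new = small_write(d[0], d0_new, p)
assert p_new == xor(xor(d0_new, d[1]), xor(d[2], d[3]))
```

Note that the other data blocks d[1]..d[3] are never read by small_write; that is the entire point of the optimization.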

We can avoid the small write penalty by logging the parity updates (the XOR of old and new data). The logged parity update is stored in a buffer area on disk. From time to time, the buffer is swept and the parity updates are applied. Since writes typically come in bursts, many of the parity updates are to the same parity block.
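A minimal sketch of this idea, assuming an in-memory dictionary stands in for the on-disk buffer area (all names are illustrative): deltas destined for the same parity block are merged with XOR as they are logged, so a burst of writes to one group collapses into a single parity update at sweep time.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

log = {}  # parity block address -> accumulated XOR delta

def log_update(parity_addr, old_data, new_data):
    """Log the parity delta of one small write, merging with earlier deltas."""
    delta = xor(old_data, new_data)
    if parity_addr in log:
        log[parity_addr] = xor(log[parity_addr], delta)
    else:
        log[parity_addr] = delta

def sweep(parity_blocks):
    """Apply all buffered deltas to the parity blocks, then clear the log."""
    for addr, delta in log.items():
        parity_blocks[addr] = xor(parity_blocks[addr], delta)
    log.clear()
```

Merging is correct because XOR is associative and commutative: applying the combined delta once gives the same parity as applying each delta separately.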

If a disk has failed, its data is only available in implicit form. Whenever a data block located on the failed disk is read, the system reads all data blocks and the parity block in the same reliability group in order to reconstruct the data. A write to a block on the failed disk accesses all the surviving data disks in order to calculate and then write the new parity. After a failure, the system can walk systematically through the array, replacing the parity data with the reconstructed data. At the end of the process, the disk array no longer has any parity data and has thus converted itself into a JBOD, but it has not lost any data.
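This walk can be sketched as follows. For simplicity the sketch assumes the parity block is the last block of each reliability group (in a real Level 5 RAID the parity position rotates) and that the failed disk held data; the names are illustrative:

```python
from functools import reduce

def xor_all(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def convert_to_jbod(stripes, failed):
    """Each stripe is a list of blocks with the parity block last.
    Rebuild the failed disk's block in each stripe and store it
    where the parity used to live; no parity remains afterwards."""
    for stripe in stripes:
        survivors = [b for i, b in enumerate(stripe) if i != failed]
        rebuilt = xor_all(survivors)   # survivors include the parity block
        stripe[-1] = rebuilt           # parity slot now holds plain data
        stripe[failed] = None          # the failed disk stays empty
    return stripes
```

After the sweep, every block in the array is data, so no further failure is tolerated, but nothing has been lost.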

Figure 1: RAID Levels 4 (top row) and 5. Data items D1a, D1b, D1c, and D1d form a reliability group to which we add P1=D1a^D1b^D1c^D1d, etc.

Distributed Sparing and Clustering

Administering storage centers is a major cost factor. Furthermore, replacing a failed disk with a fresh one carries a major risk of data loss, since in deployment, often enough, the wrong disk is pulled. For this reason, most RAIDs contain a spare disk. The idea is to automatically detect a failed disk, reconstruct the data on the spare disk, and only then ask the operator to replace the failed disk. An operator mistake at this point is not fatal. Jai Menon (IBM Almaden) observed that distributing the spare disk, just as a Level 5 RAID distributes the parity disk of a Level 4 RAID, offers better performance while the spare capacity is not in use.

Figure 2: Distributed Sparing (bottom) lowers the load of individual disks.

After a failure, all disks in a Level 5 RAID are busy with reconstruction. All blocks in the disk array are involved in some way, either as providers of data or as recipients of reconstructed blocks. As a consequence, the load on the individual disks forces the reconstruction to proceed slowly. Since reconstruction time is a major factor in the mean time to data loss of the disk array, array architectures with smaller numbers of disks per reliability group are attractive. We could build such arrays by simply adding more Level 5 RAIDs to the ensemble. However, it is advantageous to spread the reconstruction load over all the disks. This approach is called declustering. In a declustered disk array, blocks on different disks are grouped into Level-5-RAID-like reliability groups.

Higher Resilience

Disk arrays can employ more powerful erasure-correcting codes. One possibility, discussed in the early Berkeley papers, places each disk in two (or more) reliability groups, organized, e.g., as a square whose rows and columns each contain an added parity disk. Even after declustering, this arrangement suffers from a large overhead of parity data. There are better codes, such as EvenOdd (Blaum et al., IBM Almaden), that place disks in a reliability group and add two parity disks. Alternatives use Reed-Solomon codes and allow very large groups. An example of such a deployment is the Kohinoor project at Microsoft Research, Mountain View, aimed presumably at storing the data for Hotmail. It would use groups of 257 disks, including three parity disks.
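The square arrangement can be sketched as follows (illustrative names; each grid entry stands for a disk's block). Every row and every column gets its own XOR parity, so any two failed data disks can be rebuilt: two failures in the same row are each recovered through their columns, and vice versa.

```python
from functools import reduce

def xor_all(blocks):
    """Bytewise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def row_col_parity(grid):
    """grid: an n x n list of equal-length blocks.
    Returns (row parities, column parities)."""
    rows = [xor_all(row) for row in grid]
    cols = [xor_all(list(col)) for col in zip(*grid)]
    return rows, cols
```

For an n x n grid, this costs 2n parity disks for n^2 data disks, which is the large overhead mentioned above; EvenOdd achieves two-failure tolerance with only two parity disks per group.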


The time for RAID Level 5 has come and largely gone. With the introduction of higher-capacity disks and slowly sinking prices, disk capacity is the bottleneck in fewer and fewer storage centers; transactions (in other words: actuators) are. Therefore, instead of Level 5, most RAIDs nowadays employ mirroring and sparing. An exception is the area of "deep storage," where archival or semi-archival data is stored in large disk farms. We can expect a resurgence of RAID Level 5 strategies with other storage media, such as MEMS or distributed main memory, where capacity is the dominant cost factor.
© 2003 Thomas Schwarz, S.J., COEN, SCU