Multiple disks allow faster access to large objects through striping. In striping, a large object is broken into components, which are then distributed over as many disks as there are components. Access to the object is then faster, since the components can be read or written in parallel.
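A minimal sketch of such a layout, assuming simple block-level round-robin striping (the function name and parameters are illustrative, not taken from any particular system): block b of an object lands on disk b mod n at offset b div n.

    # Sketch of block-level striping over n disks (hypothetical helper).
    # Block b of an object lands on disk b % n at block offset b // n,
    # so consecutive blocks can be accessed in parallel.
    def stripe_location(block_number: int, num_disks: int) -> tuple[int, int]:
        disk = block_number % num_disks
        offset = block_number // num_disks
        return disk, offset

    # Example: with 4 disks, blocks 0..7 map to
    # (0,0) (1,0) (2,0) (3,0) (0,1) (1,1) (2,1) (3,1)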
In the first RAID (Redundant Array of Inexpensive Disks, later Redundant Array of Independent Disks) paper by Patterson, Gibson, and Katz, five different RAID levels were distinguished. RAID Level 0 is a JBOD (Just a Bunch of Disks). RAID Level 1 pairs up disks and mirrors them. RAID Level 4 has a dedicated parity disk, and RAID Level 5 distributes the parity cyclically through the array.
A write operation to any data block now also has to update the parity block. If many blocks in the same reliability group have changed, we can simply recalculate the parity block; for example, if all n data blocks change, then we write the n data blocks plus the parity block, n+1 blocks in total. If, however, only one data block has changed, we use the so-called small write operation. For simplicity of notation, assume that data item D1a changes to D1a' in the situation depicted in Figure 1. We can calculate the new parity as P' = D1a'^D1b^D1c^D1d = D1a'^D1a^(D1a^D1b^D1c^D1d) = D1a'^D1a^P, where we use the fact that D1a^D1a cancels out. Thus: the new parity is the old parity XORed with the delta of the data. In other words, to update the parity we read the old data, XOR it with the new data, read the old parity, calculate the new parity with another XOR operation, and then write the new data and the new parity. Ignoring data transfer times, this operation costs three latencies and one seek at each disk: the seek, an average rotational latency until the block passes under the head for the read, and then a full rotation (two average latencies) until the block comes around again for the write. At a mirrored disk, we would incur only one latency and one seek at each disk.
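The small write can be sketched as follows. This is a minimal illustration in Python; the function names and the modeling of blocks as byte strings are assumptions of the sketch, not part of any particular RAID implementation.

    # Small-write parity update: P' = P ^ (D_old ^ D_new).
    # Blocks are modeled as equal-length byte strings; XOR is bytewise.
    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def small_write(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
        delta = xor_blocks(old_data, new_data)   # read old data, XOR with new data
        return xor_blocks(old_parity, delta)     # read old parity, XOR in the delta

    # The caller then writes new_data to the data disk and the
    # returned block to the parity disk.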
We can avoid the small write penalty by logging the parity updates (the XOR of old and new data). The logged parity updates are stored in a buffer area on disk. From time to time, the buffer is swept and the parity updates are applied. Since writes typically come in bursts, many of the parity updates are to the same parity block and can be merged.
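A minimal sketch of this idea, again with illustrative names and reusing xor_blocks from the sketch above: an in-memory dictionary stands in for the on-disk buffer area, and deltas destined for the same parity block are merged by XOR as they arrive, so the sweep touches each parity block only once.

    # Parity-update logging: accumulate XOR deltas per parity block,
    # then apply them all in one sweep.  The dict stands in for the
    # on-disk buffer area; parity_blocks stands in for the parity disk.
    log: dict[int, bytes] = {}

    def log_update(parity_block_no: int, old_data: bytes, new_data: bytes) -> None:
        delta = xor_blocks(old_data, new_data)
        if parity_block_no in log:                           # bursts hit the same block:
            delta = xor_blocks(log[parity_block_no], delta)  # merge deltas by XOR
        log[parity_block_no] = delta

    def sweep(parity_blocks: dict[int, bytes]) -> None:
        for block_no, delta in log.items():
            parity_blocks[block_no] = xor_blocks(parity_blocks[block_no], delta)
        log.clear()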
If a disk has failed, its data is only available in implicit form. Whenever a data block located on the failed disk is read, the system reads all data blocks and the parity block in the same reliability group in order to reconstruct the data. A write to such a block has to access all surviving data disks in order to calculate and then write the new parity. After a failure, the system can walk systematically through the array, replacing the parity data with the reconstructed data. At the end of this process, the disk array no longer contains any parity data and has thus converted itself into a JBOD, but it has not lost any data.
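Degraded-mode reads can be sketched the same way, reusing xor_blocks from above; the only assumption is that the missing block equals the XOR of the parity block with all surviving data blocks of the group.

    # Reconstruct the block of a failed disk: XOR the parity block with
    # all surviving data blocks in the same reliability group.
    def reconstruct(surviving_data: list[bytes], parity: bytes) -> bytes:
        result = parity
        for block in surviving_data:
            result = xor_blocks(result, block)
        return result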
Figure 1: RAID Levels 4 (top row) and 5. Data items D1a, D1b, D1c, and D1d form a reliability group, to which we add the parity P1 = D1a^D1b^D1c^D1d, etc.
After a failure, all disks in a Level 5 RAID are busy with reconstruction. Every block in the disk array is involved, either as a provider of data or as the recipient of reconstructed blocks. As a consequence, the load on the individual disks forces the reconstruction to proceed slowly. Since reconstruction time is a major factor in the mean time to data loss of the disk array, array architectures with a smaller number of disks per reliability group are attractive. We could build such arrays by simply combining several small Level 5 RAIDs into an ensemble. However, it is advantageous to spread the reconstruction load over all the disks. This technique is called declustering. In a declustered disk array, blocks on different disks are grouped into Level-5-RAID-like reliability groups, laid out so that each disk shares groups with many other disks.
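As one hypothetical illustration of a declustered layout (the placement rule below is a simple round-robin scheme chosen for the sketch, not the layout of any particular system): reliability groups of size g rotate over all n > g disks, so that a failed disk's groups draw their reconstruction reads from all surviving disks rather than from a fixed subset.

    # Declustered layout sketch: reliability groups of size g are placed
    # on rotating subsets of the n disks, so a failed disk's groups
    # spread their reconstruction reads over the whole array.
    def declustered_groups(num_disks: int, group_size: int, num_groups: int):
        groups = []
        for i in range(num_groups):
            start = (i * group_size) % num_disks
            groups.append([(start + j) % num_disks for j in range(group_size)])
        return groups

    # Example: declustered_groups(7, 4, 7) places 7 groups of size 4 on
    # 7 disks; each disk appears in 4 groups, and if disk 0 fails, its
    # four groups draw reconstruction reads from all six surviving disks.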
Figure 2: Distributed sparing (bottom) lowers the load on individual disks by spreading the spare space over all disks instead of dedicating a single spare disk.