RedShark Replay: Phil Rhodes explains the background and the (then) latest developments in the ever-evolving world of RAID (first published in July 2013).
I can't wait until my computer contains no moving parts whatsoever. The first parts that fail on computers are fans, and the second are hard disks. They're also comparatively slow, heavy, power hungry, and they make an irritating noise. For all these reasons and more, everyone was very pleased when companies like OCZ and Micron started making noises about solid state disks in the late 2000s, although as I recall we weren't quite so pleased about the prices. Even in 2013, solid state storage is six or seven times the price per gigabyte of spinning disks, so people who need to store trivial little quantities of data such as stereoscopic feature-length productions at 4K resolution and 60 frames per second will need to look elsewhere for storage technology.
All of which is a complicated way of saying that we are, for the immediate future, still going to be storing things on traditional hard disks. These disks are neither large, fast nor reliable enough to store huge, valuable data, so we may have to keep using them in the form of arrays for some time to come. Most people are familiar with the idea of using several hard disks in cooperation to achieve better reliability or higher performance. This article is about exactly how we do that, and how some of the traditional approaches that are used by other industries don't necessarily suit film and TV work, even though they're widely used.
To evaluate these techniques, we really need to understand what makes hard disks fail. Disks store information on circular platters, often made of glass for stability, which are coated in a particulate magnetic medium. Information used to guide the magnetic pickups over the appropriate area of the disk, as well as to gauge its rotational speed, is factory-encoded onto the disk and referred to as Servo Information. User data is written among the servo information. Tracks are concentric circles (not a single spiral, as on a CD or LP) and are, of course, microscopically tiny. Tracks are broken up radially into chunks called sectors, each of which can usually store a few thousand bytes and is the smallest individual part of the disk that can be read or written. Guiding the pickup over the right microscopically-tiny area of the disk, recognising the pulses of magnetism thereon, and making sense of the signal received all rely on extremely precise mechanical alignment. Most hard disks that aren't destroyed by abuse fail because of wear in the bearings that support the stack of platters, or in the Read/Write heads. Mistreatment such as shock can cause the heads to actually contact the surface of the disk, damaging both. Heat exacerbates wear by spoiling mechanical tolerances with thermal expansion and causing lubricating oils to thin out. As with any high precision mechanical device, it's disturbingly easy to kill or contribute to the death of a hard disk while it's running.
The Simplest Approach
If we want more reliability, the simplest approach is to copy the data onto two physically separate disks. Doing something more than once because you’re afraid that one of the attempts might fail is a natural reaction to unreliability, so it's not very surprising that this was quickly automated.
Systems which do this were the first to be developed, and the technique was named a Redundant Array of Inexpensive Disks (or sometimes Independent Disks); this simplest form is RAID-1. Inexpensive is a relative term, but even if you're only a small-scale business, it isn't that difficult to find data that's worth a £100 investment in a second disk. The system is relatively easy to implement, since Writing to the disk array simply involves duplicating each Write operation to the same sector on a second disk, while Reading from it simply requires the two copies of the data to be compared to see if there's been a problem.
While that's a good approach to mitigating reliability issues, RAID-1 does not in itself improve performance more than very slightly. In fact it may be slightly slower than a single disk, because tiny manufacturing tolerances in the disks mean that one will always complete an operation slightly before the other, and in all cases the computer system must wait for both disks to finish the requested operation before more work can be done. Performance is therefore limited to that of the slowest disk in each case.
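As a minimal sketch of the mirroring idea described above, the toy model below treats each disk as a Python dictionary of sectors. The names `Mirror` and `mirrored_op_time` are hypothetical, invented purely for illustration; a real controller does this work in firmware or a driver, not like this.

```python
class Mirror:
    """Toy RAID-1 pair: each 'disk' is a dict mapping sector number to bytes."""

    def __init__(self):
        self.disks = [{}, {}]

    def write(self, sector, data):
        # Every Write is duplicated to the same sector on both disks.
        for disk in self.disks:
            disk[sector] = data

    def read(self, sector):
        # Read both copies and compare them to see if there's been a problem.
        first, second = (disk.get(sector) for disk in self.disks)
        if first != second:
            raise IOError(f"mirror mismatch on sector {sector}")
        return first


def mirrored_op_time(per_disk_times_ms):
    # The array has finished only when the slowest disk has finished.
    return max(per_disk_times_ms)
```

If one disk takes 8.1 ms and its twin 8.4 ms for the same operation, `mirrored_op_time([8.1, 8.4])` gives 8.4: the pair is never faster than its slower member.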
Achieving better performance, on the other hand, is just as easy. To take a film-making example, if we have a video sequence of frames that requires 100 megabytes of data per second for playback but our hard disk is capable of only 50, we can simply use two disks, storing alternate frames on each one and Reading from them alternately. This is highly effective, but it doesn't help reliability. In fact, the combined system is now (broadly speaking) twice as subject to failure as it once was, since the data on each disk is useless without the other. As such, it is referred to as RAID-0.
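The alternate-frame example can be sketched in a few lines. This is an illustration only: real arrays stripe fixed-size blocks rather than whole frames, the function names are invented here, and the failure arithmetic assumes disks fail independently with the same probability.

```python
def stripe_frames(frames, n_disks=2):
    """Deal frames out to the disks in turn, like cards."""
    disks = [[] for _ in range(n_disks)]
    for index, frame in enumerate(frames):
        disks[index % n_disks].append(frame)
    return disks


def read_striped(disks):
    """Reassemble the original order by Reading from the disks alternately."""
    total = sum(len(disk) for disk in disks)
    n_disks = len(disks)
    return [disks[i % n_disks][i // n_disks] for i in range(total)]


def array_failure_probability(p_disk, n_disks):
    """Chance that at least one disk fails, which loses the whole stripe set."""
    return 1 - (1 - p_disk) ** n_disks
```

With an assumed 5% chance of any one disk failing over some period, `array_failure_probability(0.05, 2)` comes out at 9.75%, which is the "broadly speaking, twice" of the text.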
Needless to say, the lack of performance of RAID-1 and the decreased reliability of RAID-0 led bright minds to conjure up a combination, where data is duplicated across disks for redundancy, called mirroring, and split between disks for performance, called striping, simultaneously. With the crippling lack of imagination common to the technical industries, this is referred to as RAID-10. This is usually arranged such that several pairs of disks in RAID-1 are then each treated as a single disk and combined in RAID-0, such that both of the disks in any of the several RAID-1 pairs must fail before data is lost.
Importantly, the alternate configuration, RAID-01, where two sets of several disks are each striped together into identical high-performance RAID-0 sets and the two sets are then combined in RAID-1, is less reliable. In this case, the simultaneous failure of any two disks where one is in each RAID-0 half will cause the entire array to fail. In short, make mirrors, and stripe them. Don't make stripes, and mirror them.
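To put numbers on the difference, the sketch below simply counts which two-disk failures are fatal in each layout. It is a simplified model with an assumed disk numbering (adjacent disks form the RAID-1 pairs; the first and second halves form the RAID-0 sets), and the function names are hypothetical.

```python
from itertools import combinations


def raid10_lost(failed, n_disks):
    # Mirrored pairs are (0,1), (2,3), ...; data is lost only if both
    # members of the same pair have failed.
    return any({a, a + 1} <= failed for a in range(0, n_disks, 2))


def raid01_lost(failed, n_disks):
    # Two striped halves; one failure in each half kills both stripe sets.
    half = n_disks // 2
    return any(d < half for d in failed) and any(d >= half for d in failed)


def fatal_pairs(lost, n_disks):
    """Return (fatal two-disk failures, all possible two-disk failures)."""
    pairs = list(combinations(range(n_disks), 2))
    fatal = sum(1 for pair in pairs if lost(set(pair), n_disks))
    return fatal, len(pairs)
```

For a six-disk array, `fatal_pairs(raid10_lost, 6)` gives 3 fatal combinations out of 15, while `fatal_pairs(raid01_lost, 6)` gives 9 out of 15.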
Problems of the Combined Approach
The combined approach of RAID-10 is simple and effective, but not without problems. It is space-inefficient, generally requiring twice as much hard disk space to build the array as will eventually be made available for actual storage. The redundancy, and thus the increased reliability, of RAID-10 comes simply from duplication. The efficiency of RAID-10 is therefore 50%.
Alternatives to this have long existed, and the most commonly encountered are RAID-5 and RAID-6. A detailed discussion of the algorithms in use is beyond the scope of this article, but briefly, the approach is to store data split across at least two disks, plus parity data stored on a third. The parity data can be used to recover the contents of either of the two data disks as long as the other survives, and the parity data can always be recalculated from the data (modern implementations intermix the parity and data, but the fundamental approach is the same). Therefore, any one of the three disks may fail, and the data can be recovered. The efficiency of a three-disk RAID-5 is 67% and that of a four-disk array is 75%, rising further as disks are added. RAID-5 has become very popular, since there's a perception that it is cheaper because it uses fewer hard disks to achieve a similar level of reliable storage. It also splits data up across several disks, which should improve performance.
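For the curious, the parity trick can be shown in a few lines. RAID-5 parity is conventionally a bytewise exclusive-OR (XOR) of the data blocks; the sketch below assumes exactly that, with two made-up eight-byte "blocks" standing in for real sectors and hypothetical function names.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def efficiency(n_disks, parity_disks=1):
    """Fraction of raw capacity left for data once parity is accounted for."""
    return (n_disks - parity_disks) / n_disks


data_a = b"FRAME-01"                 # block on disk A
data_b = b"FRAME-02"                 # block on disk B
parity = xor_bytes(data_a, data_b)   # block on disk C

# Disk A dies: rebuild its block from the survivor and the parity.
recovered_a = xor_bytes(data_b, parity)

# Disk B dies instead: the same operation works the other way round.
recovered_b = xor_bytes(data_a, parity)
```

Here `recovered_a` equals `data_a` and `recovered_b` equals `data_b`, and `efficiency(4)` gives 0.75, against 0.5 for mirroring.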
Practically speaking though, it can be hard to make RAID-5 arrays which are very fast at the same cost as RAID-10 for the same amount of space. While Reading a fully-working RAID-5 array is relatively straightforward, Writing to it is complicated by the need to update the parity data for each sector Written. Individual files as presented to the user may bridge more than one sector, and it is necessary to know the entire contents of the sector in order to update the parity information. As a result, Writing a single file that's much smaller than a sector may require two sectors to be Read and the parity calculated and Written back to disk, and that's before we've got to the point of Writing the data we actually wanted to Write.
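One common way of handling such a small Write is a read-modify-write cycle, sketched below under the assumption of XOR parity. The `small_write` function and its list-of-blocks "stripe" are hypothetical; the point is only to count the disk operations involved.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(stripe, data_idx, parity_idx, new_data):
    """Update one data block and its parity; return the I/O it cost."""
    io = {"reads": 0, "writes": 0}

    old_data = stripe[data_idx]        # Read the block being replaced
    io["reads"] += 1
    old_parity = stripe[parity_idx]    # Read the matching parity block
    io["reads"] += 1

    # Remove the old data's contribution from the parity, add the new one.
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)

    stripe[data_idx] = new_data        # Write the data we actually wanted
    io["writes"] += 1
    stripe[parity_idx] = new_parity    # Write the updated parity
    io["writes"] += 1
    return io
```

One logical Write therefore costs two Reads and two Writes in this model, where RAID-10 would need just two Writes.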
Mitigating this problem requires RAID array controllers (usually a plug-in card for a workstation) with intelligent software and high performance electronics. These controllers are expensive, perhaps very expensive: they cost as much as several SATA hard disks, and even more for the very best-performing models. This can more than offset the cost of the additional disks required for the same amount of RAID-10 space. And finally, RAID-5 suffers a difficult problem referred to as the Write Hole. Any data Written to the array requires at least two writes to two separate hard disks: one to update the data, and one to update the parity. Because two separate operations are involved, this is not atomic: one of the operations may fail, perhaps due to a sudden power outage. In this case, where the data and parity do not agree, we have no way of telling which is correct, and crucially this will not be noticed until the RAID controller attempts to Read the data in question. If the data is overwritten in the meantime, no problem. If it isn't, it's impossible to tell whether the data is right, or whether the parity is right, and the resulting error is not recoverable.
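The Write Hole is easier to see in a toy simulation. The sketch below assumes XOR parity and a made-up three-disk stripe; `simulate_write_hole` is a hypothetical name, and the "power cut" is simply the parity update being skipped.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def simulate_write_hole():
    disk_a, disk_b = b"AAAA", b"BBBB"
    parity = xor_bytes(disk_a, disk_b)

    # New data reaches disk A, then the power fails before the parity
    # update is written. Nothing notices at this point.
    disk_a = b"CCCC"
    consistent = xor_bytes(disk_a, disk_b) == parity

    # Much later, disk B fails and is rebuilt from disk A and the
    # stale parity. The result is silently wrong.
    rebuilt_b = xor_bytes(disk_a, parity)
    return consistent, rebuilt_b
```

Calling `simulate_write_hole()` reports that data and parity no longer agree, and the rebuilt block is not the `b"BBBB"` that was originally stored, even though disk B's own contents were never touched by the interrupted Write.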
Consequences of RAID-5 Degradation
While a Write Hole is rare, and may also affect RAID-1 or RAID-10 in exceptional circumstances, things get worse for RAID-5. When a RAID-10 array suffers a failed disk, or in the lingo of the field becomes degraded, it doesn't lose much performance. Reading each of the underlying RAID-1 arrays is not significantly slower (in fact, as we saw above, it may actually be very slightly faster) when one of the duplicated disks is out of action. When a RAID-5 array degrades, each disk sector must be recovered by Reading information from the good drives and performing a mathematical reconstruction. This means that degraded RAID-5 arrays are often very, very slow. You may not have lost your data, but you have probably lost your ability to work on it, and in a critical deadline situation, this may be just as bad.
This same speed issue means that repairing a failed RAID-5 array, by removing the failed disk and inserting a new one, can also be extremely slow. Some RAID-5 arrays can take a day or more to rebuild, which is both highly inconvenient and beckons disaster: the likelihood of a second disk in the RAID-5 array failing while the recovery is in progress can become significant. Failing, in this instance, may mean something as simple as having an unrecoverable read error on a single sector. This is something which, according to the drive manufacturers' expected error rates, is actually highly likely, because the entire array must be read in order to rebuild the parity and data information. If this happens, you lose your data. Horror stories of this occurring are easy to come by. Your narrator has several of his own. It's horrifyingly commonplace.
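A back-of-envelope calculation shows why. The figures below are assumptions rather than anything from a particular datasheet: an error rate of one unrecoverable read error per 10^14 bits read, of the sort often quoted for desktop-class drives, and a four-disk RAID-5 of 2 TB disks with one disk already failed.

```python
import math


def p_unrecoverable_error(bytes_to_read, errors_per_bit=1e-14):
    """Chance of at least one unrecoverable read error over a full read.

    Computes 1 - (1 - r)**bits in a numerically stable way, assuming
    errors are independent and occur at the assumed rate r per bit.
    """
    bits = bytes_to_read * 8
    return -math.expm1(bits * math.log1p(-errors_per_bit))


surviving_disks = 3        # assumed: four-disk RAID-5, one disk failed
disk_bytes = 2e12          # assumed: 2 TB per disk
p = p_unrecoverable_error(surviving_disks * disk_bytes)
print(f"Chance of a read error during rebuild: {p:.0%}")  # roughly 38%
```

Under those assumptions, well over a third of rebuilds would hit a read error somewhere, and the figure climbs quickly as disks get bigger or more numerous.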
RAID-10, by comparison, is extremely easy to rebuild. The failed disk of the degraded RAID-1 pair is replaced, and the contents of the surviving disk copied onto it. This is quick and easy and minimises the chances of a further failure.
Using a Degraded Array
Continuing to use a degraded array is, of course, not without risk. A single disk failure can then destroy the data on the array. This is a risk worth taking in some circumstances, but it remains the case that RAID-5 is generally less reliable in this instance than RAID-10. Usually, a workstation-level RAID-5 can tolerate the loss of only one disk. The loss of any further disk destroys the data. A RAID-10 with a single failed disk can tolerate the failure of any disk other than the remaining good disk in the initially degraded pair. It is only slightly less reliable than it was before, whereas the degraded RAID-5 is much less reliable than a single hard disk.
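The comparison can be reduced to two small pieces of arithmetic. This is a simplified sketch that assumes disks fail independently with equal probability; the function names and the 5% figure used afterwards are illustrative assumptions only.

```python
def next_failure_fatal_fraction(level, n_disks):
    """Of the surviving disks in a degraded array, what fraction would
    destroy the data if they were the next to fail?"""
    survivors = n_disks - 1
    if level == "raid5":
        return 1.0              # losing any survivor is fatal
    if level == "raid10":
        return 1 / survivors    # only the failed disk's partner matters
    raise ValueError(f"unknown RAID level: {level}")


def p_any_failure(p_disk, disks_at_risk):
    """Chance that at least one of the disks the data depends on fails."""
    return 1 - (1 - p_disk) ** disks_at_risk
```

For an eight-disk array with one disk down, a degraded RAID-5 depends on all seven survivors, so with an assumed 5% failure chance per disk `p_any_failure(0.05, 7)` is about 30%. The degraded RAID-10 depends on a single specific disk, so its figure is `p_any_failure(0.05, 1)`, or 5%.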
I am at pains to point out here that I fully appreciate that there are features of RAID-5 (or more specifically of the RAID-5 controllers people often use) which are often useful. RAID-5 arrays are often expandable, without having to move all the data off them (though this could in theory be done with RAID-10, I'm not aware it ever has been). The remote management capabilities and other convenient administrative niceties are often better than those of lower-end products. And yes, if you are particularly sensitive to things like space, power consumption, or air conditioning requirements, RAID-5 will get you more protected storage for less of those things. But those sorts of concerns are usually those of facilities where the arrays are made of more reliable, smaller capacity-per-disk “enterprise” drives. Those facilities are also huge, and their Fibre Channel networks are so comprehensive that speed is barely a problem no matter what RAID level is applied. For all these reasons, I have long recommended RAID-10 for workstations and intermediate-level shared storage; it is both better and usually cheaper.
RAID-6 and the Future of Storage
The industry is not blind to this situation and there is a solution to many of the problems of RAID-5. With that engineer's imagination in play again, we're talking about RAID-6. Strictly speaking this means any RAID which can tolerate simultaneous failures of any two disks, using the same fundamental approach as with RAID-5, but with two independent sets of parity information rather than one. This does make RAID-6 fundamentally less efficient in terms of storage overhead than RAID-5, but as we've seen, it may be necessary for current RAID-5 users to start using RAID-6 simply to maintain the same level of security, in the face of ever-expanding drive and array capacities.
The future may lie in things like Sun's ZFS filesystem, which is not so much a RAID system as a new approach to managing hard disks. It offers several RAID options, referred to as RAID-Z1 through RAID-Z3. These provide similar redundancy levels to RAID-5 and RAID-6 but with a considerably improved ability to recover from some of the problems that bedevil RAID-5. The minutiae of how RAID-Z and the underlying ZFS filesystems work are beyond the scope of an article like this, but in particular, there are improved data-integrity checks which make it possible to check the validity of data even when part of a redundant array has failed. Rebuilds can also be considerably faster on disks that aren't full, as only the data space that's been used will be rewritten. A RAID-5 controller doesn't work in cooperation with a file-system, so it can't know which parts of the disk are useful data and which aren't. There are disadvantages to this arrangement: the close cooperation between ZFS and RAID-Z means that it's difficult to imagine a hardware RAID-Z controller, and the CPU loading for RAID-Z would be similar to that of a conventional software implementation of RAID-5 or 6 providing a similar level of redundancy.
Because ZFS and the associated RAID systems were developed at Sun, we'd be forgiven for assuming that they were principally intended for server as opposed to workstation applications, and the design does seem to be aimed mainly at increased reliability as opposed to increased performance. In practice, RAID-Z exists only as a piece of software, not as a plug-in card, and it may outperform the sort of cheap RAID-5 that's often advertised as a feature of desktop computer motherboards simply through better software engineering. I spoke to Matt Ahrens, a software architect at Delphix who was co-founder of the ZFS project at Sun in the early 2000s, who concurred with the view that RAID-10 is preferable to either RAID-5/6 or RAID-Z for high performance work such as video editing. The main barrier to entry at the time of writing is that the Linux version of ZFS is described as “production ready”, but isn't fully performance-optimised. There is only a read-only version of ZFS for Windows and the Mac versions are variously described as not entirely ready for the bigtime, so the only place you could easily apply a mature RAID-Z implementation to video editing is with Lightworks on Linux or over a network to a Linux server.
Regardless of what you choose to do, the approach of copying data manually onto two separate disks (or a magnetic tape such as LTO etc.) still makes sense. RAID protects against hardware failure, not human factors. Accidentally formatting the wrong disk will still cause problems. RAID is not backup. And before anyone gets any big ideas about making RAIDs out of solid state disks, know that some of those are somewhat less reliable (in terms of absolute error rates) than the spinning metal we were trying to avoid in the first place.