Joe Bloggs was tired but happy. His latest article for Exceptional Geowhizicist was finally complete. He pressed the print button, then leaned back and stretched his legs. Click. His monitor went black and the printer fell ominously silent. Only the whine of his disk drive spinning down gave testimony to his evening's work. It conjured up images of data irretrievably flowing down a drain. Gritting his teeth, he flipped off the power switches on the dead equipment, then crawled under the desk to switch on the power bar he had just kicked.
Pick an ending.
- Joe's equipment powered up with no data corruption and he was back on track in a few minutes.
- Joe lost his final draft but was able to reconstruct it from the previous draft and his notes.
- Joe's file system was mildly corrupted and he had to reconstruct his paper from old printouts retrieved from the recycle bin.
- Joe's file system was critically corrupted. He had to reinstall his word processor from scratch and recover from his handwritten drafts. (Fortunately, after reading the January 1998 Recorder, he had purchased a silver-plated Parker 88 Place Vendome ballpoint in an attempt to change his techno-geek image.) However, his custom-tweaked copy of Zork IV was gone, causing him to break down and cry.
Joe experienced an inconvenience and was able to recover (except for Zork) but it could have been a disaster. Suppose his cooling fan had failed, causing the power supply to overheat and ignite. Suppose one of his kids had bumped the reset button or fed toast to his CD-ROM drive. Suppose a power surge due to lightning had fried his motherboard. He could have lost several years' worth of work or missed a critical deadline. In this article I will review some typical causes and effects of computing disasters and discuss ways to reduce their frequency, if possible, and to recover, if necessary.
A computing disaster can be loosely defined as a significant loss of availability of data or computational capability. This can range from the merely inconvenient, such as Joe experienced, to the truly catastrophic, where a large corporation is brought to a standstill. There are two major approaches to minimizing the effects of disaster: prevention and recovery. Preventive measures are taken before a disaster and either preclude it or reduce its frequency. Recovery measures reduce the time and cost of restoring serviceability after a disaster. For common events, prevention is usually the best approach, while for rare events, recovery is usually best. To illustrate the range of disasters that might befall a computer system and how a systems manager could address them, consider the following hypothetical but plausible scenarios.
- Accidental overwriting or deletion of a file. This is a prime candidate for prevention. If periodic copies of working files are made, damage will be limited to the work lost between the current and the last copy. The costs of this scheme are limited to the disk space and time required to make a copy.
- Minor file system corruption or loss of data in memory, caused by sudden power loss. Prevention includes such things as routing cables appropriately (so users or operators don't kick them out), educating users not to turn the power off or reboot without a system administrator's authorization, and using an Uninterruptible Power Supply (UPS). A UPS is a battery-powered auxiliary power source that sits between an electrical outlet and the computer it protects. If input power is lost, it will generate power from its batteries to run the computer. UPSs vary in capacity from the personal, providing perhaps five minutes of run time for a personal computer, to the corporate, providing over an hour for a hundred workstations. Loss of data in memory is a concern when a long-running job is killed without having saved any output. The user can only hope that the UPS will last until input power is restored. It is often possible to structure long seismic processing jobs so they write intermediate datasets, reducing lost time in the event of a shutdown. Preventive costs range from the trivial for cable routing and user education through several hundred dollars per workstation for UPS protection. If corruption occurs despite these measures, data can be recovered from archival backups, provided these have been made and are current. Tape drives are commonly used for this. They range in cost from a few hundred dollars to several thousand, depending on their capacity and speed. The former would be suitable for a single personal computer and the latter for a small network of UNIX workstations.
- Complete loss of a disk, caused by unpredictable hardware failure. This type of failure is particularly disastrous for file systems spanning multiple disks, as the loss of a single disk will destroy the entire file system. Until recently, the only reasonable approach to this was backup to, and recovery from, a tape archive. Now, however, multidisk file systems can benefit from RAID (Redundant Array of Inexpensive Disks) units. These spread data over multiple disks with enough redundancy to recover the entire file system should one disk fail. In that case, after replacement of the bad disk, the unit would rebuild its redundancy. Should a second disk fail before the rebuild completes, or the RAID hardware itself fail, the file system would be destroyed. Thus, although RAID significantly reduces the frequency of data loss, it does not eliminate it. Backups of critical data are still important. RAID can be very expensive, costing tens of thousands of dollars for an eight-disk system, but without it multidisk file systems are truly "disasters waiting to happen."
- Complete loss of the network itself, caused by hardware failure or power loss. A UPS can provide emergency power for network hardware as well as for computers. Although the failure of network switches does not usually lead to data loss, it can idle all computers on the network. The best approach to this type of failure is to have network experts on call and a ready supply of spare hardware.
- Complete loss of a corporate network and all its computers, due to fire. This is a rare but potentially catastrophic event. Its prevention is well beyond the scope of this article. Given a well-formulated and implemented disaster recovery plan, recovery would be relatively straightforward, even if expensive. Without such a plan it could be a nightmare. For a smooth recovery, key personnel would need to be available, hardware acquired and configured, software reinstalled, and data recovered from backups. It is essential that backups be stored separately from the computers and their network to prevent their destruction by the same event. With a disaster of this magnitude, insurance, financing, and management are essential, in addition to technical expertise, for a successful and timely recovery.
Although these hypothetical disasters vary widely in cause and effect, there is a common procedure which is required for successful recovery: data backups. These may be made on floppy disk, tape, WORM disk or some other medium. The type of medium is not critical but the backups must be current and reliable. Corporate computers are usually backed up transparently by system administrators, so users needn't be concerned. With personal computers, however, it is usually the user's responsibility to design and implement a prevention and recovery plan. My advice is simple: make periodic copies of working files; acquire and use a backup unit; and verify your backups.
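For readers who like to see advice in concrete form, the first and last of those recommendations (periodic copies and verification) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a substitute for a proper backup unit: the function names `snapshot`, `verify`, and `sha256_of`, the timestamped naming scheme, and the use of a SHA-256 checksum for verification are all choices made for this example and do not come from any particular backup product.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(working_file: Path, backup_dir: Path) -> Path:
    """Copy working_file into backup_dir under a timestamped name.

    Hypothetical naming scheme: <stem>.<YYYYMMDD-HHMMSS><suffix>.
    Two snapshots taken within the same second would share a name,
    and the later one would overwrite the earlier.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{working_file.stem}.{stamp}{working_file.suffix}"
    shutil.copy2(working_file, target)  # copy2 also preserves timestamps
    return target


def verify(working_file: Path, backup_file: Path) -> bool:
    """Return True if the backup's contents match the working file's."""
    return sha256_of(working_file) == sha256_of(backup_file)
```

A call such as `copy = snapshot(Path("article.txt"), Path("backups"))` followed by `verify(Path("article.txt"), copy)` (both paths are placeholders) would make one copy and confirm it is readable and identical. Note that a copy kept on the same disk protects against accidental overwriting or deletion only; it does nothing for the disk failure and fire scenarios above, which still call for a separate, separately stored backup.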