Backups are central to any data protection strategy, but by some estimates more than half of all backups fail either in whole or in part. When you look at why they fail, the same issues come up again and again. Below is a list of the common problems that cause backup failure, in decreasing order of frequency.
Media failure ranks at the top of nearly every list of reasons backups and restores fail. For this reason, it's important to treat your backup media with respect and use it intelligently.
In the case of tape, this means making sure you follow the vendor's directions for handling and storage, replacing the tapes regularly and cleaning the drives according to the manufacturer's schedule. It also means discarding any suspicious tapes.
Don't assume disk-based backup protects you from media-related failures. While the incidence of media-related failures is considerably lower with disk than tape, failures still occur.
For example, SATA disk arrays are often used in backups because they cost less and backups can usually get by with lower performance systems. However, it's a mistake to equate "lower performance" with "lower reliability." Saving money by using backup arrays that don't have features like redundant power supplies and hot spare disks leaves data at risk.
In spite of its No. 2 ranking, human error is probably the most prolific cause of backup failures. For example, if tapes are improperly stored between uses, is the resulting failure a media failure or a human error? Usually there's a significant component of human error in any backup failure.
The best safeguard against human error in backups is to train those involved to follow best practices. Make sure that the people performing backups and restores understand exactly what they need to do -- and what not to do.
It is also a good idea to take the person out of the loop as much as possible. Ideally, backups should not require any human action. Be especially cautious of situations where backup isn't part of someone's main duties -- for instance, someone in a branch office who's been asked to make a backup tape every night.
Sometimes new software or new versions of software can cause backup failures. For example, Service Pack 2 (SP2) for Windows XP turns on the firewall by default. When Microsoft released SP2, a lot of network backups failed because the backup software wasn't designed to work through a firewall.
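A failure like the firewall one can be caught before the backup window with a simple preflight check from the backup server to each client. The sketch below only tests whether a TCP connection can be opened. The host name and port number in the usage note are hypothetical placeholders, not values from any particular backup product, so substitute whatever your backup software documents.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_reachable("branch-office-pc", 10000)` before the nightly job would flag a client whose firewall now blocks the (assumed) agent port. A successful connection shows only that the port is open, not that the backup itself will succeed.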
More commonly, the problem is misconfiguration. Modern backup software is extremely flexible; in other words, you have a lot of options to choose from and choosing the wrong options can result in incomplete backups or backups that fail totally.
A related problem is that backup configurations are no more static than anything else in a modern storage environment. As resources are added and shifted and priorities change, the list of files to be backed up needs to change as well.
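One way to notice when the backup list has fallen behind the environment is to compare the locations that currently hold data against the paths the backup job includes. The following is a minimal sketch of that comparison, assuming both lists are available as plain POSIX-style path strings. The function name and the example paths are illustrative and not taken from any backup product.

```python
from pathlib import PurePosixPath


def find_uncovered(data_roots, backup_includes):
    """Return the data roots that no backup include path covers.

    A root counts as covered when it equals an include path or sits
    somewhere beneath one.
    """
    includes = [PurePosixPath(p) for p in backup_includes]
    uncovered = []
    for root in data_roots:
        root_path = PurePosixPath(root)
        covered = any(
            inc == root_path or inc in root_path.parents for inc in includes
        )
        if not covered:
            uncovered.append(root)
    return uncovered
```

For example, with data in `/srv/db`, `/srv/mail/archive` and `/home`, and a backup job that includes only `/srv/mail` and `/home`, the function reports `/srv/db` as uncovered. Running a check like this whenever storage is added or moved gives an early warning that the backup configuration needs updating.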
Tape drives, libraries, disk arrays and other backup hardware can also fail. Most of the causes and failure conditions for backup hardware are the same as for other kinds of hardware, but there are a few conditions that are specific to backup systems.
For example, drift produces a particularly nasty kind of failure in tape drives. As the drive ages, the heads slowly wander out of alignment. As a result, other drives can't read the tape -- and the drive can't read a tape it wrote some time ago. The nasty part of this is that the drive can almost always read a tape it just wrote, so the tape passes an immediate verification step in the backup process without complaint.
Backing up over a network increases efficiency by reducing the number of backup devices. However, it also introduces another point of failure into the backup process. Everything from a failed or flaky HBA to a misconfigured switch can cause a backup to fail.
Network problems are a less prolific source of backup failures because the network, whether LAN or SAN, is used for much more than just backup, so problems tend to become obvious before they can hurt your backups.
How to fix backup failures
Whatever the cause, the best way to keep backup failures from damaging your organization is to verify your backups by performing regular test restores. Regular testing won't prevent backup failures, but it will help you notice a problem and fix it before you really need those backups and get a nasty surprise.
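A test restore is most useful when its result is checked rather than assumed. The sketch below shows one possible check: restore to a scratch directory, then compare SHA-256 checksums of the restored files against the originals. It assumes the source files have not changed since the backup was taken. In practice you might compare against checksums recorded at backup time instead. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Compare every file under source_dir with its restored copy.

    Returns a list of (relative_path, problem) tuples, where problem is
    "missing" or "mismatch". An empty list means the restore matched.
    """
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append((rel.as_posix(), "missing"))
        elif file_digest(src) != file_digest(restored):
            problems.append((rel.as_posix(), "mismatch"))
    return problems
```

Running `verify_restore` against a sample of data after each scheduled test restore turns a silent failure, such as a tape that can no longer be read correctly, into an explicit report.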
Rick Cook has been writing about mass storage since the days when the term meant an 80 K floppy disk. The computers he learned on used ferrite cores and magnetic drums. For the last 20 years, he has been a freelance writer specializing in storage and other computer issues.