Most CIOs will say yes, they have a disaster recovery strategy with policies and processes in place that govern how people and technologies will function in the event of an extended outage. But how sure are these same CIOs that their strategies will actually work in a real disaster? IT organizations need to give their data recovery solutions a reality check by addressing the following questions and issues:
How do you plan to recover systems?
Whether virtualization thrives in your environment or you are still in the physical systems world, you should ensure that when you need to run your IT systems from an alternate location, the facility is available and able to handle the workload necessary to run a near-normal operation. Sizing of systems, therefore, is critical. You cannot randomly set up systems and hope they will work. Doing so is either a sure recipe for disaster or a total waste of money. The real acid test is to simulate the workload on these systems and measure their performance.
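As a first pass before any workload simulation, the sizing question can be checked on paper. The following is a minimal sketch, assuming you have measured peak CPU and memory figures for each workload; the `Workload` class, the `capacity_shortfall` function, the 25% headroom and all of the numbers are hypothetical and chosen only for illustration. It does not replace simulating the real workload.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    cpu_cores: float   # peak cores observed in production (hypothetical figure)
    memory_gb: float   # peak memory observed in production (hypothetical figure)


def capacity_shortfall(workloads, site_cpu_cores, site_memory_gb, headroom=0.25):
    """Return how far the alternate site falls short, per resource.

    headroom is an assumed safety margin added on top of the measured peaks.
    A value of 0.0 for a resource means the site has enough of it.
    """
    need_cpu = sum(w.cpu_cores for w in workloads) * (1 + headroom)
    need_mem = sum(w.memory_gb for w in workloads) * (1 + headroom)
    return {
        "cpu_cores": max(0.0, need_cpu - site_cpu_cores),
        "memory_gb": max(0.0, need_mem - site_memory_gb),
    }


# Illustrative workloads and site capacity; none of these come from a real site.
workloads = [
    Workload("erp", cpu_cores=16, memory_gb=64),
    Workload("mail", cpu_cores=8, memory_gb=32),
]
shortfall = capacity_shortfall(workloads, site_cpu_cores=24, site_memory_gb=128)
print(shortfall)
```

With these made-up figures the site has enough memory but is short on CPU once the margin is applied, which is the kind of gap a load test would otherwise reveal only at the worst possible time.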
With virtualization, this challenge becomes interesting, in a good way. You now have the ability to mobilize your virtual machines so your operating system images are available at the press of a button. Consistency is key, though -- the more "one-offs" in your environment, the greater the risk of some application not working correctly.
The recovery of systems is necessary not only for applications, but also for the accompanying infrastructure support protocols and applications such as Active Directory, authentication and authorization domains, domain name servers, Lightweight Directory Access Protocol and Network Time Protocol. These services are often neglected and can affect the overall recovery timelines.
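One way to keep these supporting services from being forgotten is to write down what depends on what and derive the recovery order from that. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the service names and the dependencies between them are assumptions made for illustration, not a statement of how any particular environment is wired.

```python
from graphlib import TopologicalSorter

# Each service maps to the services it depends on.
# These dependencies are assumed for illustration only.
depends_on = {
    "ntp": set(),
    "dns": set(),
    "active_directory": {"dns", "ntp"},
    "ldap": {"dns"},
    "erp_app": {"active_directory", "ldap"},
}

# static_order() yields every service after all of its dependencies.
recovery_order = list(TopologicalSorter(depends_on).static_order())
print(recovery_order)
```

Whatever the exact order printed, the infrastructure services always come before the application that needs them, and a circular dependency raises `graphlib.CycleError` instead of surfacing in the middle of a recovery.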
Networking readiness at alternate site
Your IP networking stack at the alternate site needs to be ready to kick in and accept incoming traffic on demand. As with systems, you need to ensure that it is sized appropriately so there is no meltdown when real traffic starts flowing through it. You also have to work with your service providers to ensure that your "face to the Internet" -- public-facing websites, virtual private network gateways, Web load balancers, traffic distribution engines, firewalls, intranet access points and other critical access points -- is available via or at the alternate location.
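A basic readiness check for those access points can be automated. The following is a minimal sketch, assuming each access point can be represented as a host and TCP port; the function name `unreachable_endpoints` and the example host names are hypothetical. It only confirms that something accepts a connection, and says nothing about whether the network is sized for real traffic.

```python
import socket


def unreachable_endpoints(endpoints, timeout=3.0):
    """Return the (host, port) pairs that do not accept a TCP connection."""
    failed = []
    for host, port in endpoints:
        try:
            # Open and immediately close a connection; success means the port answers.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Covers refused connections, timeouts and name-resolution failures.
            failed.append((host, port))
    return failed


# Example call (host names are placeholders, not real systems):
# unreachable_endpoints([("www.dr.example.com", 443), ("vpn.dr.example.com", 443)])
```

Running a check like this on a schedule against the alternate site gives early warning when a gateway or load balancer there has quietly stopped answering.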
Reconsider tape as part of a tiered data recovery solution
Tape has a place in your data recovery solution -- as a backup plan, not as the primary recovery method. Relying only on tape to recover your core applications and data is risky and time-consuming, but tape is very effective when time is not of the essence.
A data recovery solution needs to be tiered with a mix of solutions that work best for the particular applications or data types all tied together by a central console -- something that brings all of these solutions under a single umbrella for ease of execution.
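The idea of tiering can be sketched as a simple rule that assigns each application a recovery method based on how long it can stay down. The thresholds, the method names other than tape, and the application names below are all assumptions made for illustration; a real plan would use the organization's own tolerances and technologies.

```python
def recovery_tier(max_downtime_hours):
    """Pick a recovery method from how long an application can stay down.

    The thresholds and method names are illustrative assumptions.
    """
    if max_downtime_hours <= 1:
        return "replication to alternate site"
    if max_downtime_hours <= 24:
        return "disk-based backup"
    # Tape is kept for data where time is not of the essence.
    return "tape"


# Hypothetical applications and their tolerable downtime in hours.
applications = {"order_entry": 0.5, "file_shares": 12, "archive": 72}
plan = {name: recovery_tier(hours) for name, hours in applications.items()}
print(plan)
```

Keeping the mapping in one place is a small-scale version of the single umbrella described above: anyone executing the recovery can see which method applies to which application.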
Keep it simple
A simple data recovery solution is one that has the least amount of customization and is implemented with out-of-the-box technologies. If you find yourself creating full-time positions to support it, it's time for a reality check. It's not about proving how well someone can script; it's a matter of how easily the solution can work.
When disaster strikes and you need to recover your environment, you need to be able to do so swiftly and with 100% success. There is no room for error and no option to redo it -- you only have one shot at making it work. In such situations, a simple solution allows you to focus on the nontechnical aspects of getting your business back up and running. In practice, this means your data recovery solution should not have too many hidden dependencies, customized (and nontrivial) components or one-offs. The solution has to be designed on the assumption that the person executing the recovery may not be the person who designed or implemented it.
Avoid single points of failure in your disaster recovery strategy
A single point of failure in any data recovery solution isn't good. If the entire solution rests on the shoulders of a single employee in the organization, I'd rate this risk higher than losing your core business as a result of a power failure. Sure, no one is indispensable, but can your business afford to be put at risk each time an employee leaves?
The learning curve for out-of-the-box solutions, on the other hand, isn't that steep. For one thing, such products are often well documented, and most vendors offer some sort of formal training that allows you to have more than one person trained to manage them.
They say practice makes perfect. That could not be any more true of your data recovery solution -- put it through its paces and make it work for you.
Ashish Nadkarni is a principal consultant at GlassHouse Technologies Inc., an independent consulting firm focusing on transforming IT infrastructure.