Over the years I have pontificated plenty on some ways to approach building a disaster recovery/business continuity plan. My advice has included realizing that the recovery and continuity of technology are just portions of what a disaster recovery/business continuity plan should include. It's important to prioritize the elements of the plan based on business risks and to focus redundancy on only those systems and processes that you cannot live without for even a short period of time.
Because I am sure my advice made plenty of sense, you have certainly done all of that. But now it is time to test your disaster recovery/business continuity (DR/BC) plan. How can you test a plan in a meaningful way -- a way that helps you identify and resolve gaps but without doing any damage to the organization (or your reputation)?
In my experience, there are four possible ways to test a DR/BC plan. These are:
- Think about an emergency and mentally walk through your response.
- Accidentally cause an emergency and trigger the plan as you recover from your mistake.
- Postpone your testing until there is an actual emergency and use that actual emergency to test your plan in real time.
- Plot out a pragmatic, phased method that tests your DR/BC plan without doing any damage to the organization (or your reputation).
I have learned the hard way that the first three approaches leave a lot to be desired.
Approach #1 is sort of like having a fire drill but telling the people involved to merely imagine leaving the building and gathering at their assigned location -- an exercise that becomes meaningless in a real fire.
Approach #2 works but leaves quite a bit of damage in its wake. A few years ago our database locked up completely and shut down the business. We were dead in the water and so launched our DR/BC plan to recover our systems. We learned a lot about the gaps in our plan but lost a lot of credibility in the process. If I did similar testing across the whole organization and for all types of risks, I would have a lot of unplanned downtime. And my goal in life is zero unplanned downtime.
Approach #3 is very high risk because our DR/BC plans usually have some gaps -- sometimes significant gaps -- and so waiting for a real emergency to test our plans opens us to the repercussions from exposing those gaps.
Align DR/BC plan test with routine maintenance
That leaves us with Approach #4.
This approach takes some time and thinking because we want to do the testing in logical phases. And those phases need to align to the overall DR/BC plan as well as to the needs of the organization. This approach also requires a high level of coordination with the non-IT side of the organization as you will be testing specific sub-systems and processes that people use every day.
For example, to test the recovery from a phone system outage, you will need to include client support, sales and logistics -- that is, anyone who uses the phone system to conduct the mission-critical business of the organization.
Continuing the example of testing a phone system outage, your phased test plan might coincide with some planned maintenance of the phone system -- why take down the system any more than needed? As you plan the maintenance window, involve the rest of the organization so that everyone involved triggers and uses the DR/BC plan. With the phones down, how do you find out the status of employees? How do your clients contact you? How do you track shipments? For the testing, create some sort of a "war room" to track the execution of the subset of the DR/BC plan as well as the issues and gaps and all of the interesting and disconcerting things you discover. Take advantage of every possible planned maintenance window to test your plan.
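If it helps to see the bookkeeping, here is a minimal sketch of pairing test phases with maintenance windows that are already on the calendar. Everything in it is an assumption made for illustration: the `MaintenanceWindow` and `Phase` classes, the `schedule_phases` helper, the dates, and the "database" phase are hypothetical, not part of any real tool or of my own plan.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class MaintenanceWindow:
    system: str   # system already scheduled to go down
    day: date     # date of the planned maintenance (made-up values below)


@dataclass
class Phase:
    system: str        # the slice of the DR/BC plan being exercised
    departments: list  # non-IT groups that must take part
    gaps: list = field(default_factory=list)  # what the war room writes down


def schedule_phases(windows, phases):
    """Pair each phase with the earliest planned window for its system."""
    earliest = {}
    for window in sorted(windows, key=lambda w: w.day):
        earliest.setdefault(window.system, window)  # first one seen is earliest
    scheduled = [(earliest[p.system], p) for p in phases if p.system in earliest]
    unscheduled = [p for p in phases if p.system not in earliest]
    return scheduled, unscheduled


# Hypothetical data: two phone maintenance windows, none for the database.
windows = [
    MaintenanceWindow("phones", date(2024, 3, 10)),
    MaintenanceWindow("phones", date(2024, 2, 1)),
]
phases = [
    Phase("phones", ["client support", "sales", "logistics"]),
    Phase("database", ["finance"]),
]
scheduled, unscheduled = schedule_phases(windows, phases)

# During the test, the war room records each gap against the phase.
scheduled[0][1].gaps.append("No way to check on employee status with phones down")
```

Anything that lands in `unscheduled` has no planned window to piggyback on, so it needs a different kind of test.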
DR/BC plan testing for mission-critical systems
But what about the systems and processes that can never go down (or that you can never let go down), even for planned maintenance? If those systems and processes are that mission critical, it is highly likely that they have some type of redundancy. In that case, test using the redundant versions of the system. Carve off a portion of the organization the system supports -- for example, 10% of the call center -- and connect them to the redundant system that you are about to test.
Remember, it is the DR/BC plan that you are testing and so you need to somehow create an end-to-end replica of the use of the entire system and process. Using a portion of the business accomplishes that goal without shutting down the entire organization.
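One way to carve off that portion is sketched below, under some loud assumptions: agents are identified by a string ID, and hashing that ID into one of 100 buckets is an acceptable way to pick the group. The function names `in_test_cohort` and `split_agents` are hypothetical, and this is just one reasonable selection method, not the only one.

```python
import hashlib


def in_test_cohort(agent_id: str, percent: int = 10) -> bool:
    """Return True if this agent should be connected to the redundant system."""
    digest = hashlib.sha256(agent_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket, 0..99
    return bucket < percent


def split_agents(agent_ids, percent=10):
    """Split agents into (test cohort, everyone else) without overlap."""
    cohort = [a for a in agent_ids if in_test_cohort(a, percent)]
    rest = [a for a in agent_ids if not in_test_cohort(a, percent)]
    return cohort, rest


# Hypothetical call center of 1,000 agents.
agents = [f"agent-{i}" for i in range(1000)]
cohort, rest = split_agents(agents)
```

Because the split depends only on the ID, the same people land in the test group every time you run it, so nobody is surprised on test day. The group comes out at roughly, not exactly, 10%.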
One last thing: The first few times you conduct your phased testing, there will be chaos, disorder, panic and frustration. So my final advice is to be prepared for the worst and then bring some popcorn to consume as you observe a portion of the organization recovering from an emergency -- a Keystone Kops movie pales in comparison to the antics that ensue.