Some days I long to be the benevolent dictator of all of our IT. Sure, I am the leader of IT, but I am not in complete control. My control over IT decreases as we distribute IT to our remote sites. I can specify the equipment they use, but I can't control what they do with the equipment. I can define processes and practices for how our remote sites use IT, but I can't monitor what our remote sites actually do. This has implications for my remote-site disaster recovery and business continuity plans.
But it begins with the everyday. For example, an employee at one of our remote sites considers himself, erroneously, to be quite adept at IT. So, when the site needs IT support, the site does not call the service desk. Instead, employees call Carl. Carl does his magic and shuts down the site. Carl then calls the service desk so we can bail him out. Try as I might, I cannot get the remote site to call us first. But if I were really in control of IT, I would exercise my rights as a benevolent dictator and have Carl put in stocks.
Back to disaster recovery and business continuity, which are among my remote-site sore spots. Let's face it: With the trouble we have managing backup and restore at our primary data centers, our remote sites barely have a chance. So, we need to make the process as simple and straightforward as possible.
I start by really understanding our systems and data, then sorting them into three categories based on how long the business can tolerate their being down:
- If A systems are down for a few minutes, the business is at risk.
- B systems can be down for a few hours before the business is at risk.
- C systems can be down for a long time before the business is at risk. (One time, one of our C systems was down for six weeks before anyone outside IT noticed.)
I focus both my data center and remote-site disaster recovery and business continuity plans and tools on the A systems and data. Sorting our systems this way sometimes means that I don't need to worry about remote-site backup and recovery at all -- and what is simpler than that?
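For readers who want to operationalize the stratification above, here is a minimal sketch in Python. The downtime thresholds, system names, and tolerances are hypothetical illustrations, not figures from this column; substitute your own business's numbers.

```python
# Sketch of the A/B/C stratification described above.
# Thresholds and system names below are hypothetical examples.

def classify(max_tolerable_downtime_minutes):
    """Return a recovery tier based on how long a system can be
    down before the business is at risk."""
    if max_tolerable_downtime_minutes <= 15:
        return "A"   # down a few minutes -> business at risk
    if max_tolerable_downtime_minutes <= 8 * 60:
        return "B"   # can be down a few hours
    return "C"       # can be down a long time

# Hypothetical inventory: system -> tolerable downtime in minutes
systems = {
    "order-entry": 5,
    "email": 240,
    "archive-reports": 60 * 24 * 30,
}

tiers = {name: classify(minutes) for name, minutes in systems.items()}

# Disaster recovery effort and tooling go to the "A" tier first.
a_tier = [name for name, tier in tiers.items() if tier == "A"]
```

The point of the exercise is the output of that last line: a short list of systems that actually deserve remote-site recovery planning, with everything else deferred or ignored.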
If, after stratifying systems into A, B and C categories, I still need to provide disaster recovery and business continuity to remote sites, I turn my attention to instructions that are simple, clear and well written. These cover backup and recovery procedures, how to communicate during an incident, what to tell customers and so on.
For our procedures, we first find a willing (or unwilling) person or team in IT to write the first draft. We include plenty of pictures to accompany our simply and clearly written procedures. We then have someone in IT try out the instructions to see whether he or she can accomplish the task. Once we fill in any gaps that this first test reveals, we ask one of our end users to follow the instructions. With that done, we send out the instructions and cross our fingers that nothing really goes wrong.
When we are feeling bold and brave, we have one of our remote sites test our procedures by simulating an event. Can people there get access to and install their data backups and get back in business? What happens if they lose network and Internet connectivity? Can they contact employees and customers if the phone system is out? With just a few scenarios, we can usually find and fix the holes we have in our processes. With the right attitude, these tests end up being somewhat entertaining -- especially if you enjoy watching people wander around the neighborhood trying to remember the location of their safe gathering spot.
Sort, document, and test. That is what I do to improve the chances my remote sites will recover from what can go wrong. Not quite the control of a benevolent dictator -- but I will take what I can get.
Niel Nickolaisen is CIO at Western Governors University in Salt Lake City. He is a frequent speaker, presenter and writer on IT's dual role enabling strategy and delivering operational excellence. Write to him at firstname.lastname@example.org.
This was first published in March 2009