What exactly is business continuity? For a long time, I thought of business continuity as being a subset of my disaster recovery plans. If there was a disaster, I would launch my business continuity plan in order to recover from the disaster.
Then, one of my CIO peers told me about his study of system downtime.
This CIO did an in-depth analysis of the root cause of each of his system downtime incidents. He traced each of the incidents all the way back to the reason for the downtime. His root cause analysis showed that over 70% of his system downtime was self-inflicted by his IT department.
For example, someone might make an untested or unvalidated change to a production system and temporarily bring the system down. Someone might deploy a new version of custom code that conflicts with the production version of the database. Someone might wonder where the other end of that power cord leads and decide to pull the cord from the outlet. This CIO quickly figured out that he could eliminate 70% of his system downtime if his IT staff just stopped doing things that brought down the production systems. Best of all, these improvements were completely within his and his staff's control.
After my friend shared his results with me, business continuity took on a new meaning. Rather than being a part of my disaster recovery plans, continuous business operations should be my standard mode of operation. I still need the ability to recover from a disaster, but that is a subset of my ability to run a credible, reliable IT department.
Setting a new goal of continuous business operations, I then performed my own analysis -- not an analysis of the reasons for our system downtime, but a gap analysis of our IT processes. I was looking for process weaknesses that could result in us not testing a patch before we applied it, in someone not knowing where a power cord was connected, or in someone not knowing about an incompatibility between development tools and the database.
For my gap analysis, I decided to find the person in IT who knew the least about technology. I commissioned that person to "walk" our various processes and identify the holes that the uninitiated might accidentally fall through. Fortunately, I had the perfect uninformed person -- me!
I did not launch grueling, exhaustive process reviews. Instead, I asked people to describe the process for applying a patch, moving a code change into production, verifying changes, etc. From these descriptions, our analysis asked three questions:
- First, how could our processes help people succeed? The purpose of the gap analysis wasn't to blame people -- blaming people rarely results in process improvements.
- Second, what holes in the process did we need to fill to reduce the opportunity for mistakes?
- Third, how could we simplify the process so everyone could understand and follow it? (I learned a long time ago that the more complex the process, the more likely the mistakes.)
As we implemented these process improvements, our business continuity improved. We still have unintended downtime, but it's rare, and each time it happens we have the opportunity to further refine our IT processes for change management, project management, service management, communication, etc.
And should we ever have a disaster to recover from, we have a tight set of processes that will help us recover better and faster.