When was the last time you reviewed and tested the disaster recovery plan for your data warehouse? Do you even have a disaster recovery plan for your data warehouse?
Times have changed
Ten years ago, there was little need to create a disaster recovery plan for data warehouses and the reports and applications they support. At the time, the vast majority of data warehouses were loaded in batch on a monthly basis from a half-dozen or so source systems. Most loads were fairly small and even the biggest data warehouses were less than a couple of hundred gigabytes in size. Not surprisingly, most data warehousing teams didn't have a disaster recovery plan, let alone a backup strategy. The common sentiment back then was that if the data warehouse crashed, you could simply refresh the data warehouse in its entirety from source systems once everything came back online.
How protected are you?
Research shows that a majority of organizations are confident in the resiliency of their IT systems. Most have disaster recovery plans that safeguard the business from short- and long-term disruptions. Maybe the disaster recovery plan even includes the data warehouse, the servers it runs on, and the reports and applications it supports. Since many data warehouses now run within corporate data centers governed by IT policies that include business continuity and disaster recovery planning, it is a good bet your organization has insured its data warehousing assets to some degree.
Unfortunately, most disaster recovery plans don't go far enough to protect an organization against costly disruptions. Disaster recovery planning is insurance, and most companies only insure what they can afford, not what they need.
Has your organization prioritized the business processes and applications that are critical to its operations? If the data warehouse is a top priority, what about the extract, transform and load (ETL) engines that populate the data warehouse and the BI servers that generate and distribute critical reports? A chain is only as strong as its weakest link, and a data warehouse is a complex environment that comprises multiple systems and applications and interdependencies with internal and external systems. The data warehousing environment can't be fully restored until every one of its components is brought back online.
When was the last time you really tested the disaster recovery plan for the data warehouse? If you practiced recovering from a database failure, you completed only part of the test. You need to restore clients, servers, networks, storage, applications and databases to fully simulate a recovery situation. And if you conducted your tests a year ago, there's a good chance that your plans are out of date. Since a data warehouse is an adaptable system, it is constantly changing to answer new questions that business people ask. So, the queries, reports, metadata, ETL workflows, aggregates and so on have probably changed since your previous test. Moreover, the questions that business people ask during an emergency may be very different from what they normally ask.
The key to resiliency is not just flexible, redundant systems; it's also the people. During a disaster, there is a lot of chaos and confusion. Many key personnel may be absent or unable to work or access systems. Thus, you need redundancy not only in your systems, but also in your staff assignments. Everyone on your team should be schooled in what to do in a variety of emergency situations -- and be ready to play multiple roles as needed.
Disaster recovery puts a premium on good-quality, up-to-date, end-to-end metadata, something that few organizations have successfully implemented. Metadata is critical for performing impact assessments -- when something in a source system changes, you need to know how it will affect every other component in the system down to metrics within end-user reports. In an emergency situation, data warehousing teams without access to a dynamic, comprehensive metadata management system can be seriously hamstrung in their ability to meet recovery time objectives (the time allowed to restore business functions), the critical data point (the point in time from which data must be recovered), and recovery point objectives (how much recent data, measured in time, the business can afford to lose).
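To make the impact assessment concrete, here is a minimal sketch that assumes the lineage metadata is available as a simple "feeds into" mapping. The component names and the `downstream_impact` helper are hypothetical, invented for illustration; a real metadata repository would expose lineage through its own interface.

```python
from collections import deque

# Hypothetical lineage metadata: each key feeds the components in its list.
LINEAGE = {
    "crm.customers": ["etl.load_customers"],
    "erp.orders": ["etl.load_orders"],
    "etl.load_customers": ["dw.dim_customer"],
    "etl.load_orders": ["dw.fact_sales"],
    "dw.dim_customer": ["report.sales_by_region"],
    "dw.fact_sales": ["report.sales_by_region", "report.daily_revenue"],
}


def downstream_impact(lineage, changed):
    """Return every component reachable from `changed`, breadth first."""
    seen = set()
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:  # visit each component only once
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # Which jobs, tables and reports does a change to the orders source touch?
    for component in downstream_impact(LINEAGE, "erp.orders"):
        print(component)
```

The same traversal run in an emergency tells the team which reports cannot be trusted until a given source or ETL job has been restored.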
Of course, the data is the heart and soul of a data warehousing environment, and organizations must devise a good strategy for safeguarding data against power failures, network outages, floods, storms or other disasters. Most organizations back up to low-cost tapes that are shipped and stored off site. While it takes a long time to recover a data warehouse from tape, most of this data is historical and doesn't have high value during an emergency. To protect more recent information, organizations should replicate or snapshot data as it moves through the ETL process and store it on disk in a disaster recovery system, which archives or deletes the data after an appropriate period of time, usually a few days or weeks. Most data warehousing teams understand the need to manage the lifecycle of data warehousing information.
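The retention rule for those disk-based snapshots can be sketched in a few lines. This is an illustration only: the snapshot names, the `split_by_retention` helper and the 14-day default are assumptions, chosen to fall within the "few days or weeks" range mentioned above.

```python
from datetime import datetime, timedelta


def split_by_retention(snapshots, now, keep_for=timedelta(days=14)):
    """Partition (name, taken_at) snapshots into names to keep and to expire.

    A snapshot is kept while its age is at most `keep_for`; older ones are
    candidates for archiving to tape or deletion.
    """
    keep, expire = [], []
    for name, taken_at in snapshots:
        if now - taken_at <= keep_for:
            keep.append(name)
        else:
            expire.append(name)
    return keep, expire


if __name__ == "__main__":
    now = datetime(2024, 3, 31)
    snapshots = [
        ("orders_stage_0301", datetime(2024, 3, 1)),
        ("orders_stage_0320", datetime(2024, 3, 20)),
        ("orders_stage_0330", datetime(2024, 3, 30)),
    ]
    keep, expire = split_by_retention(snapshots, now)
    print("keep:", keep)
    print("expire:", expire)
```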
Unfortunately, these teams often don't anticipate disaster striking twice. Ideally, the online backup system should be maintained off site so a data center problem doesn't disrupt both the primary and backup systems. (This obviously is more costly and requires high-speed network connections.) Many teams also have no backup for the backup if the off-site system goes down. Most also don't envision a disaster lasting more than a few days. Given that many businesses are still not fully functional in the wake of Katrina, we need to plan for disasters that last far longer than that. Finally, many off-site backup systems don't protect companies from viruses that propagate internally. An off-site system should have an internal gate that delays real-time propagation by several hours to safeguard against software attacks.
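One way to picture such a gate is as a holding area in which replicated batches must age before they are applied to the off-site copy. The sketch below is an assumption about how that might look, not a description of any product; the `DelayedGate` class, its method names and the six-hour delay are all hypothetical.

```python
from datetime import datetime, timedelta


class DelayedGate:
    """Hold replicated batches for a fixed delay before releasing them."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = []  # list of (received_at, batch_id)

    def receive(self, batch_id, received_at):
        """Record a batch arriving from the primary site."""
        self.pending.append((received_at, batch_id))

    def quarantine(self, batch_id):
        """Drop a batch flagged as suspect while it is still being held."""
        self.pending = [(t, b) for t, b in self.pending if b != batch_id]

    def release(self, now):
        """Return batches that have waited out the delay; keep the rest."""
        ready = [b for t, b in self.pending if now - t >= self.delay]
        self.pending = [(t, b) for t, b in self.pending if now - t < self.delay]
        return ready


if __name__ == "__main__":
    gate = DelayedGate(delay=timedelta(hours=6))  # assumed delay
    gate.receive("batch-001", datetime(2024, 1, 1, 0, 0))
    gate.receive("batch-002", datetime(2024, 1, 1, 4, 0))
    gate.quarantine("batch-002")  # e.g. flagged after an infection is found
    print(gate.release(datetime(2024, 1, 1, 7, 0)))
```

The delay buys time: if an infection is detected at the primary site within the window, the affected batches can be quarantined before they ever reach the backup.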
It's not fun being the voice of gloom and doom, and no one wants to spend money to avert something that may never happen. But it seems to me that we are witnessing an inflection point in the number of crises, disasters and geopolitical tensions caused by environmental degradation and political polarization. There is nothing like word of a good old-fashioned disaster to impel us to dust off our disaster recovery plans. It's better than waiting for a real-life disaster to test the effectiveness of our plans.
Wayne Eckerson is director of research at The Data Warehousing Institute, a provider of in-depth, high-quality training and education in the data warehousing and business intelligence fields. He can be reached at firstname.lastname@example.org.