Here is the story of one midmarket company that vastly improved the recovery times in its disaster recovery plans by replacing tape backups with virtualization and replication -- solving the bare-metal recovery problem of its 70-box Windows server farm in the process.
It is also the story of an IT executive in an industry so accustomed to managing risk (insurance) that he was able to get approval for an expenditure that was "a bit shocking" (approximately $900,000) by laying out what IT could and could not do and leaving it to the business to calculate the cost.
North Carolina Farm Bureau Mutual Insurance Company Inc. is a midsized business with an oversized need for speedy and reliable disaster recovery processes. Based in Raleigh, the property and casualty insurer operates in all 100 counties of the Tar Heel state. It has 650 employees at its home office, another 2,100 people spread out over the counties and direct written premiums of just less than $900 million a year.
"Our goal is to cover business-critical functions within 48 hours of a disaster declaration," said Steve Zeidman, information systems division manager.
Cash flow is king at insurance companies, Zeidman said. The ability to mail bills, to verify coverage for claims processors, to take loss notices from customers and to pay claims relies on digital processes. The IT infrastructure is multi-tiered: a mainframe and a large Windows farm with data centralized at the home office. For years, the disaster recovery plans involved backing up and recovering data from tapes, using Tivoli Storage Manager. Then about two years ago, the insurer began to reassess those capabilities because recovery times were too slow.
Bare-metal recovery problems, tape too time-consuming
With its time-tested backup procedures, the mainframe was a known quantity. IT could confidently assure the business that it would be back up within 24 hours of an outage, with 24 hours' worth of data lost. The 70 Windows servers were another story, Zeidman recalled.
"We found out the server farm was essentially unrecoverable. It would have taken weeks to recover it, if we had continued along with the methods we were using," he said.
Bare-metal recovery was a problem, Zeidman said, referring to the process of rebuilding a computer from scratch after a catastrophic failure. Some setups require that the hardware used to restore the data be identical to the hardware that was backed up -- a steep climb even when the servers are the same make.
"We could never get the hardware to match close enough without running into driver issues, which would cause the server to 'blue screen,'" he said.
In addition, recovering from tape proved far too time-consuming. "Due to tape mounts and unmounts, we found restores were consuming more than five hours per server. This was the point where we determined we absolutely needed data to be resident at the recovery center," Zeidman said.
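The arithmetic behind that conclusion can be sketched in a few lines. This is a rough illustration, not the insurer's own calculation: the 70 servers and the five-hour figure come from the article (where five hours is a floor, since restores took "more than" that), while the `parallel_streams` parameter is a hypothetical knob added here to show how the total shrinks if several restores can run at once.

```python
import math

SERVERS = 70            # Windows servers in the farm (from the article)
HOURS_PER_RESTORE = 5   # article says "more than five hours", so this is a floor


def total_restore_hours(servers, hours_each, parallel_streams=1):
    """Wall-clock hours to restore every server, in batches of parallel_streams.

    parallel_streams is a hypothetical parameter, not a figure from the article.
    """
    batches = math.ceil(servers / parallel_streams)
    return batches * hours_each


serial_hours = total_restore_hours(SERVERS, HOURS_PER_RESTORE)
print(serial_hours, "hours, or about", round(serial_hours / 24, 1), "days")
```

Restored one after another, the farm would need at least 350 hours, a little over two weeks of round-the-clock work, before counting any time lost to driver problems. That is consistent with Zeidman's estimate that recovery "would have taken weeks."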
Enter server virtualization, replication
IT decided to use virtualization and data replication to make that happen.
Thanks to physical-to-virtual conversion, the insurer's disaster recovery plans now involve five cold IBM servers running VMware, located at the firm's disaster recovery (DR) center 500 miles away in Sterling, N.Y. (The company initially tried to turn its IBM blade servers into virtual servers but found the blade technology "was not suited to that," Zeidman said, so it replaced them with the large IBM servers.) An OC-3 circuit connects headquarters to disk arrays that store data replicated via IBM Global Mirror and NetApp SnapMirror at the Sterling site, replacing backup tapes for the 12 terabytes of business-critical data.
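For a sense of scale, the sketch below estimates how long a one-time bulk copy of the 12 terabytes would take over the OC-3 circuit. It is a best-case, back-of-the-envelope figure under stated assumptions: the OC-3 line rate of 155.52 Mbps, decimal terabytes, and a hypothetical `efficiency` factor for protocol overhead. The article does not say how the insurer seeded its initial copy, and ongoing replication sends only changes, not the full data set.

```python
OC3_BITS_PER_SECOND = 155.52e6   # OC-3 line rate; usable payload is somewhat lower
DATA_BYTES = 12e12               # 12 TB, assuming decimal terabytes
SECONDS_PER_DAY = 86_400


def transfer_days(data_bytes, bits_per_second, efficiency=1.0):
    """Days needed to move data_bytes over a link at the given efficiency.

    efficiency is a hypothetical factor (0 to 1) for protocol overhead.
    """
    seconds = data_bytes * 8 / (bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


best_case_days = transfer_days(DATA_BYTES, OC3_BITS_PER_SECOND)
print(round(best_case_days, 1), "days at full line rate")
```

Even at full line rate a complete copy would take roughly a week, which suggests why keeping the data continuously replicated and resident at the recovery site matters more than moving it after a disaster is declared.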
On the mainframe, both the recovery point objective (RPO) and recovery time objective (RTO) have dropped from 24 hours to 15 minutes.
As for the recalcitrant server farm, since the VMware implementation in 2007, Zeidman's team has gotten recovery time down from 70 hours to 36 hours and is confident it can shave more hours off when it tests again next month. The RPO is 15 minutes.
Expert advice on virtualization and DR
Companies looking to server virtualization for their disaster recovery plans need to realize that quicker RTOs require more than just virtualization, one analyst cautioned. "Server virtualization by itself won't buy a user much if the production data and virtual system image are still being backed up to tape," said Gartner Inc. analyst John Morency, via email. "The real time savings comes from a combination of using disk-to-disk replication for most, if not all, of the production data, along with a standardized approach -- e.g., use of VMware Site Recovery Manager -- for virtual machine backup and failover (this approach also using disk-to-disk)."
Richard Jones, service director for Midvale, Utah-based Burton Group Inc.'s data center strategies group, said companies also find that virtualization saves time in areas other than recovery, citing one recent client. "To their surprise, their DR testing time -- they test every six months -- dropped by more than 50%, and the number of IT staff required to perform the test also decreased by nearly half," he said.
Gap analysis lays foundation for business decision
Speedy and reliable DR, no surprise, does not come cheap. Zeidman said the cost of his company's solution came to 0.1% of its direct written premiums, leaving the reader to do the math ($900,000). Zeidman did not do a cost analysis or a business impact analysis for the solution.
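The math the reader is left to do is a single multiplication. The sketch below uses the article's two figures; because premiums are described as "just less than" $900 million, the result is an upper-end approximation rather than an exact price.

```python
from decimal import Decimal

# Both inputs come from the article; the premium figure is rounded up slightly,
# since the article says "just less than $900 million".
direct_written_premiums = Decimal("900000000")
cost_share = Decimal("0.001")  # 0.1% expressed as a fraction

estimated_cost = direct_written_premiums * cost_share
print(f"${estimated_cost:,.0f}")
```

That works out to about $900,000, matching the figure cited at the top of the story.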
"My input was purely from an operational level. I presented management with my opinion of what we could do and could not do in a disaster recovery and what parts of the business would be missing. I told them the cost of getting to the point of where we needed to be and they accepted," Zeidman said.
"The initial cost was a bit shocking, but they understood the benefit," he added.
Business is not always so accommodating. A recent Harris poll of 220 IT managers and 277 business line leaders for SunGard Availability Services shows a significant disconnect between IT and business executives when it comes to disaster recovery preparedness. While both groups overwhelmingly agree that information availability is important to a business's success, 74% of IT managers believe DR and business continuity are important to that success, compared with 49% of business executives.
"We have very smart management in insurance. We buy reinsurance, for example, to make sure we can pay any catastrophic claim," Zeidman said. "They saw this as another layer of reinsurance."
Let us know what you think about the story; email: Linda Tucci, Senior News Writer