Under pressure to improve recovery time objectives and minimize data loss, IT veterans at MetLife Inc. took advantage of decreasing storage costs and an underleveraged data center to devise a multimillion-dollar disaster recovery strategy that met those goals, eliminated a security risk and will earn a double-digit ROI in five years.
MetLife's triple play mirrors several trends in disaster recovery right now, according to experts, including a movement among CIOs to bring DR back in-house, often from tape storage providers.
At MetLife Inc., the nation's largest life insurer, two IT vice presidents set out to improve recovery times for mainframe data. The company's recovery time objective (RTO), once a full three days at best when data was backed up only on tape and stored off-site, had improved to 28 hours through on-site data mirroring, with tapes still in use. But the company still stood to lose 17 to 41 hours' worth of data -- not good enough in MetLife's fast-changing environment, said Tom Meenan, vice president of IT risk management.
"That's almost two days of business operations," said Meenan, who led the project with Bob Zandoli, vice president of strategic planning services.
Moreover, the 40,000 tapes per year still being shipped off to Iron Mountain for storage represented a significant security risk should they be stolen or lost.
Thus the objectives of the disaster recovery strategy encompassed improving the RTO, limiting data loss and eliminating the off-site tape security risk -- plus paying for itself. And before the IT duo was done, they had reduced the RTO to four to six hours and the potential data loss to 15 minutes -- all by upgrading in-house technology and ending the age-old practice of using backup tapes.
A disaster recovery strategy for more than disasters
Financial services is not the only sector to scrutinize disaster recovery strategies through the lens of downtime. The cost of downtime topped the list of drivers for disaster recovery upgrades, along with keeping ahead of the competition, according to Forrester Research Inc.'s most recent survey of storage and DR decision makers.
"Times have changed," said Stephanie Balaouras, an analyst at the Cambridge, Mass.-based consultancy. "Companies now recognize that it's not just catastrophic disasters they need to worry about, but that even mundane threats can cause costly downtime."
Indeed, at MetLife, the 56,000-employee New York insurance giant, business had evolved from life insurance provider to full-blown financial services company, and with that transformation came a growing dependence on technology, increasing regulatory obligations and a reputation to protect.
Supporting MetLife was a primary Tier 3 data center in Rensselaer, N.Y., home to the mainframe, and a second Tier 3 data center in Scranton, Pa., 120 miles away, which mainly housed Unix and Windows servers.
Meenan and Zandoli proposed moving from a single site with the off-site tape strategy to using a second site with remote disk mirroring. The company had the floor space; it just needed a mainframe.
The project would also allow MetLife to bring DR in-house, an important consideration for an IT organization that puts a premium on customer service. According to the Forrester survey, the trend toward insourcing is gathering steam. Frustration with outside providers is one reason.
"Many companies are bringing DR back in-house due to more demanding recovery objectives than can be met with tape to a shared IT infrastructure and the affordability of lower-cost technologies," Balaouras said. That said, DR providers are also working to develop more affordable ways to meet stringent recovery objectives, she noted.
Affording mainframe mirroring
One more issue factored into the planning: Zandoli's team of engineers had set its sights on providing MetLife with the so-called agile data center touted by consultancies like Gartner Inc. This is an adaptable technology infrastructure that offers flexibility, high availability, efficiency, customer responsiveness and superior data protection as the business changes -- all in a cost-effective manner. To achieve this, the team would need to refresh the storage disks in the data center.
"We were able to put that disk refresh in the DR project, and with Moore's Law, we were able to make this self-funding because as we refreshed the disks, we were able to reduce the run rate," Zandoli said.
MetLife uses IBM storage: 199 terabytes (TB) of RISC and 177 TB of SISC, for a total of 376 TB. Refreshing those disks at a cost savings of 70% over the previous refresh three years earlier allowed the company to buy a new, comparable mainframe for close to what the storage disks alone had cost before, Zandoli said. Indeed, he framed the mainframe purchase, used primarily for replication, as "not costing a penny."
MetLife's primary data center runs about 18,000 MIPS every day. The secondary site, with the capacity to bring back all the data processing from the primary site, normally runs 2,000 MIPS daily. Because the engines are not running until you need them, Zandoli said, the expense of "software licensing does not become a huge issue." The company put in an OC-48 line between the two centers, but "again that cost was absorbed by the ability to renegotiate a great deal for storage," he said.
In addition to leveraging the Scranton data center facility, MetLife reaped savings by shrinking its storage infrastructure and by ending its use of third-party vendors to transport, store and recover the backup tapes in an emergency. Factoring in those savings, the multimillion-dollar project cost about $1.5 million in the first year, was self-funding by the second year, and will yield a double-digit return on investment after five years. The project took a year to complete.
Every good deal requires a concession or two
So what did the guys have to sacrifice?
The 120-mile distance between the two sites means they may lose a bit more data than if the two centers were within about 90 miles of each other, roughly the maximum distance at which synchronous data transfer is practical. But the additional loss is, "at worst, 15 minutes, and we believe we will lose one minute," Meenan said.
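The physics behind that distance trade-off can be sketched quickly: a synchronous write must wait for the remote site's acknowledgment, so round-trip latency grows with distance and eventually drags down every transaction. The fiber-speed figure below is a standard approximation, not from the article:

```python
# Why distance caps synchronous mirroring: each write waits for the
# remote acknowledgment, so the round trip adds latency to every I/O.
miles = 120
km = miles * 1.609                  # ~193 km between Rensselaer and Scranton
fiber_speed_km_s = 200_000          # light in fiber travels at roughly 2/3 of c

one_way_ms = km / fiber_speed_km_s * 1000
round_trip_ms = 2 * one_way_ms      # the acknowledgment has to come back

print(round(round_trip_ms, 2))      # ~1.93 ms added to every synchronous write
```

A couple of milliseconds per write is tolerable for some workloads but punishing for a high-volume mainframe, which is why asynchronous mirroring, with its small window of potential data loss, is the usual choice at this distance.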
On the other hand, the 120-mile separation falls short of the 200 miles sometimes recommended to guard against catastrophic disasters such as hurricanes, earthquakes or forest fires. Meenan and Zandoli, however, point to a 30-page risk assessment that identifies snow as the only major weather event likely to affect both sites at the same time.
"Guess what: We've already lived through snowstorms," Zandoli said, and they have time-tested plans that get people to their jobs.
They could have put a "bunker" data center in between the two sites as part of the disaster recovery strategy, adding millions of dollars to the cost, or built a new data center 90 miles away, both options they considered -- and rejected as not worth the money.
Let us know what you think about the story; email Linda Tucci, Senior News Writer.