Senior network administrators Dave Trupkin and Greg Hearn told attendees at Gartner's Data Center Conference that they had begun converting a select number of the utility's 200 servers to
As virtualization improved performance, Trupkin thought the results would increase acceptance of the technology in his organization. But the utility's developers were slow to accept it. Every problem the developers had was now a virtualization problem, whereas before it would have been an application problem, Trupkin said.
"We still didn't have full acceptance of VMware," Trupkin said.
"Until we had the disaster," Hearn added.
In late summer 2004, two power lines overheated and fused together, causing an electrical failure at the utility. Hearn and Trupkin found that half of their servers were down. Those that were hooked up to uninterruptible power supplies (UPS) had stayed online. When the power came back that day, the network administrators decided to bring all the servers back up. But the power began to flicker in and out.
Trupkin and Hearn quickly realized that they had to bring the entire infrastructure back down before the power went out again. They did not have enough time to shut everything down, and a VMware ESX host server was among the machines they could not get to.
When the power came back on a short time later, Trupkin realized that the ESX host was still running the way it was when it went down. The 25 virtual servers it hosted were cut off from their back-end storage, but instead of crashing, the servers simply "waited patiently until storage came back online and then the servers resumed from where they left off," Trupkin said.
Hearn added, "That was awesome."
But then the power went off a third time with a surge that burned out the facility's transformers and UPS devices.
The data center was without power for three days. Management acquired some generators and UPS devices to bring up critical systems, but there wasn't enough power to bring up everything. Management wanted the most critical applications brought back online first, such as the network, a customer billing application, email and some other line-of-business applications.
But Trupkin and Hearn realized they would also need to bring up the other infrastructure servers that these critical applications depended on.
"Thank God Dave had the foresight to move most of our DNS and domain controllers and a lot of our auxiliary services over to VMware," Hearn said. "Dave brought up the virtual machines, which allowed us to bring up what management wanted us to bring up. If we didn't have those two VMware hosts at that time, there would have been no way to bring up what we had brought up."
With the ruined transformers and UPS devices, the utility was unable to bring power back into the data center facility. It had to rely on generator power for eight weeks. It scrounged for more generators so it could bring all its systems back online.
After this event, the utility redesigned its disaster recovery (DR) plan with virtualization in mind.
"We had a DR plan in place, but the existing plan was kind of on paper only," Trupkin said. The utility took a "tape and pray" approach, with all servers on tape backups. The plan had been to restore all the servers from tape, which the administrators believed would take seven to 10 days.
But the ruggedness and portability of virtualization led the utility to leverage the technology in its new DR plan. The utility instituted a "virtual first" policy, where all applications must run on virtual servers unless the vendors can prove why they shouldn't. Trupkin and Hearn also expanded to 15 ESX host servers.
LVVWD opened a backup data center in an offsite colocation facility, where images of all the virtual servers are stored for quick recovery during a disaster.
Sheldon Lipinsky, chief of network and hardware support at Canada's Public Safety and Emergency Preparedness department, said his organization is starting to virtualize its 100 servers, but the idea of integrating the technology into a DR plan was a novel approach for him.
"When we started virtualizing our servers, the idea of them being part of a disaster recovery plan was not on my mind," he said.
Lipinsky said that, given his organization's legislative mandate to have a business continuity plan in place, he would begin discussing the concept with his colleagues and explore whether they should virtualize their critical applications and any systems they depend on.
Let us know what you think about the story; email: Shamus McGillicuddy, News Writer