This content is part of the Essential Guide: A CIO's guide to cloud computing investments

Amazon cloud outage: A CIO survivor's guide

Industry watchers sound off on cloud worries following Tuesday's disruption. Also in Searchlight: FCC issues stay on privacy rule; Wendy's will serve up self-ordering kiosks.

When websites slowed or failed to load Tuesday, the reaction started with "Is it just me?" posts on Twitter. It moved on to news headlines announcing that Amazon's cloud storage service was down, disrupting tens of thousands of sites, and then to worries about Amazon Web Services being too big to fail.

All this is bound to make CIOs, who have been turning to cloud computing for promised benefits such as agility and flexibility, think about alternative courses of cloud action. Should they spread their applications among different cloud providers, invest more in hybrid cloud -- which combines use of the public cloud with an internal private cloud -- or find a way for apps to function even when a cloud provider doesn't?

The first thing to do after an event like this week's Amazon cloud outage, industry observers say, is keep calm -- and evaluate your IT architecture.

"Decide what you're happy with, what you're not happy with," said Lydia Leong, an analyst at market research outfit Gartner.

What depends on where

One of the first things many IT organizations will examine, Leong said, is their incident response -- how they react to calls for help.

"One of the ironies of this outage was that for many organizations the IT team currently uses Slack to coordinate incident response when there's an outage of some sort," Leong said, referring to the popular messaging app. "Well, in this case, Slack was impacted by the AWS issues and so if you were trying to use Slack to coordinate, you had a problem."

The outage started with a slowdown of Amazon's Simple Storage Service (S3), which stores trillions of files, photos and videos. AWS issued a report Thursday on what happened: A team was working on the S3 billing system, which was running slowly, and needed to shut down a few servers; an incorrectly entered command ended up taking down many more. Getting them back up required a restart, so for almost four hours many websites couldn't reach the data they had stored on the service.

The failure happened in a data center in Amazon's large "US East-1" region, in Virginia; customers with data stored in other regions -- there are four in the U.S. -- were mostly unaffected.

"Everybody should just have a look at what their dependency is on a particular S3 region," said Dave Bartoletti, an analyst at Forrester Research. And then they should do failure tests on their most critical web applications "to see how [they] would respond to a loss of S3 in a particular region."

He gave the example of Netflix, which runs entirely on AWS. The online video network wrote code it called Chaos Monkey to "run around on their servers" and break things. That helps it see how critical applications can handle failure.
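The kind of failure test Bartoletti describes can be sketched in a few lines of Python. This is not Netflix's Chaos Monkey and it does not touch AWS; it is a minimal illustration that assumes an in-memory stand-in for regional storage. The names `ChaosStore`, `RegionDown` and `render_page` are hypothetical, invented for this example.

```python
import random


class RegionDown(Exception):
    """Simulated failure of a storage region."""


class ChaosStore:
    """Wraps an in-memory store and fails reads for regions marked broken."""

    def __init__(self, data, seed=None):
        self._data = data              # shape: {region: {key: value}}
        self._broken = set()
        self._rng = random.Random(seed)

    def break_random_region(self):
        """Pick one region at random and make every read from it fail."""
        region = self._rng.choice(sorted(self._data))
        self._broken.add(region)
        return region

    def get(self, region, key):
        if region in self._broken:
            raise RegionDown(region)
        return self._data[region][key]


def render_page(store, region):
    """Hypothetical page handler: degrade to a placeholder instead of crashing."""
    try:
        return "<img src='%s'>" % store.get(region, "banner")
    except RegionDown:
        return "<p>Image temporarily unavailable</p>"


# Failure test: knock out a region, then check the page still renders.
store = ChaosStore({"us-east-1": {"banner": "banner.png"}}, seed=0)
broken = store.break_random_region()
page = render_page(store, broken)
```

The point of the exercise is the last three lines: break something on purpose, then confirm the application returns a degraded page rather than an error.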

Building a stronger application

After failure testing, an organization needs to determine its risk tolerance, Bartoletti said: How much downtime could it absorb in the event of a future outage without hurting business?

Some will choose to design applications so that they are not reliant on just one region of a vendor's cloud. So if servers go down in one region, a website can access the information it needs from another one.

"But your app has to know how to find that," Bartoletti said. "It's not necessarily more expensive to run it that way. It just might take a little more time to architect your app properly."
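What "knowing how to find that" might look like can be sketched as follows, assuming the same object has already been copied to more than one region. The function `fetch_with_fallback` and the `fetch` callable it takes are hypothetical, not part of any AWS SDK; a real application would pass in a function that reads from its storage client.

```python
from typing import Callable, Sequence


class AllRegionsFailed(Exception):
    """Raised when no region could serve the requested object."""


def fetch_with_fallback(
    key: str,
    regions: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> bytes:
    """Try each region in order and return the first successful read."""
    errors = {}
    for region in regions:
        try:
            return fetch(region, key)
        except Exception as exc:  # a real app would catch narrower error types
            errors[region] = exc
    raise AllRegionsFailed(f"could not read {key!r}: {errors}")


def demo_fetch(region: str, key: str) -> bytes:
    """Stand-in reader that simulates an outage in one region."""
    if region == "us-east-1":
        raise ConnectionError("simulated regional outage")
    return b"data from " + region.encode()


# The first region fails, so the read is served from the second.
result = fetch_with_fallback("logo.png", ["us-east-1", "us-west-2"], demo_fetch)
```

The fallback logic itself is short. The architectural work Bartoletti alludes to lies in what this sketch takes for granted: keeping the copies in each region in sync.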

Leong said building applications for better resiliency -- meaning they can operate regardless of a failure at a cloud provider's data centers -- is inherently hard to do.

"Building a resilient application becomes more and more complex as you start to say, 'Well, I want to be in these multiple data centers, and I want to be able to recover in the following ways.' Every additional nine of availability, so to speak, gets exponentially more expensive," Leong said, referring to the "nines" of application availability.
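The arithmetic behind the "nines" is simple enough to work through. The sketch below assumes a 365-day year and treats this week's outage as a flat four hours (the article's figure is "almost four hours"); the function names are made up for the example. It shows only how the downtime budget shrinks with each nine, not what it costs to get there.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (3 -> 99.9%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


def availability_percent(downtime_hours: float) -> float:
    """Yearly availability, in percent, given total hours of downtime."""
    return 100 * (1 - downtime_hours * 60 / MINUTES_PER_YEAR)


for nines in range(2, 6):
    minutes = allowed_downtime_minutes(nines)
    print(f"{nines} nines: {minutes:.2f} minutes of downtime per year")

# Assumption: one four-hour outage and no other downtime that year.
print(f"Four-hour outage: {availability_percent(4):.2f}% availability")
```

Three nines leaves about 8.8 hours of downtime a year, four nines about 53 minutes and five nines just over 5 minutes. By the same math, a single four-hour outage with no other downtime works out to roughly 99.95% availability for the year.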

Is multiple too many?

Even more complex -- and costly -- she said, would be using multiple cloud vendors, an approach floated in headlines following the Amazon cloud outage. That means putting application data in AWS and the same data in Microsoft Azure, say, or Google Cloud Platform.

Each provider does things differently and has different feature sets, Leong said.

"So you have to decide, 'Well, do I need this feature? If I wanted to use, let's say, Azure, can I work around the fact that that feature doesn't exist [with another cloud provider]? Can I implement that feature myself? If I implement that feature myself, what is the cost and complexity of doing that? Does that make my overall solution less reliable?'"

It's not unusual for companies to run different applications in different vendors' clouds, but running one application across different providers is not the norm. When asked how many do, Leong said, "Almost none."

While Bartoletti said it's not "crazy" for companies, especially big ones with big budgets, to consider replicating an application in several providers' clouds, "I don't think that would necessarily give you that much more uptime."

"You don't see other providers gloating, saying, 'AWS failed. Come here; our storage is better,' because a lot of their storage platforms are designed at roughly the same availability level as Amazon's."

'Not a magical entity'

An important thing to keep in mind, Bartoletti said, is S3 doesn't go down often -- it has been knocked offline just a few times in the 10 years AWS has been running it.

"It's their oldest service, and so they've been making it better and better over time," he said.

Leong agreed that big blackouts are rare and will continue to be -- but she said it's helpful to see Tuesday's Amazon cloud outage as a reality check.

"Every time there is a cloud outage, people get reminded that the cloud isn't a magical entity. People are so used to close to 100% availability."

CIO news roundup for week of Feb. 27

The Amazon cloud outage was the "in" topic in tech headlines this week. Here's what else made news:

FCC halts data security requirements. The U.S. Federal Communications Commission on Wednesday issued a temporary stay on a data security regulation that was scheduled to go into effect this week. A privacy rule passed in October would have required broadband internet service providers to engage in data security practices like protecting customer data against unauthorized use and providing notification of data breaches. "We still believe that jurisdiction over broadband providers' privacy and data security practices should be returned to the FTC [Federal Trade Commission]. … The federal government shouldn't favor one set of companies over another -- and certainly not when it comes to a marketplace as dynamic as the internet," said FCC chairman Ajit Pai in a joint statement with FTC acting chairman Maureen Ohlhausen. The rule was not consistent with the FTC's privacy framework and the commissions will work together to establish a technology-neutral privacy framework, the statement added.

5G takes center stage at MWC. Commercial 5G networks will begin to be deployed at the beginning of the next decade, and there will be 1.1 billion 5G connections by 2025, according to a GSM Association (GSMA) report published at the Mobile World Congress on Monday. The report outlines five goals for the 5G era and predicts that 5G technology will transform mobile technology's role in society. "5G is an opportunity to create an agile, purpose-built network. … But it is vital that all stakeholders work together to ensure that 5G is successfully standardized, regulated and brought to market," Mats Granryd, director general of the GSMA, said in a statement.

Fast-food automation. In a bid to improve automation, enhance customer experience and reduce labor costs, fast-food chain Wendy's announced it will install self-ordering kiosks in about 1,000 of its stores by the end of 2017. Use of kiosks might result in shifting of labor to other areas instead of immediately replacing workers, Darren Tristano, vice president at Technomic, told The Columbus Dispatch. In other fast-food tech news, McDonald's unveiled its new global growth plan and announced it will roll out mobile ordering and payment at 20,000 restaurants worldwide by the end of the year. The company is also testing curbside pickup in the U.S. and will deploy "enhanced technology to elevate and modernize the customer experience," McDonald's CEO Steve Easterbrook said during the company's annual Investor Day event this week.

Assistant editor Mekhala Roy contributed to this week's news roundup.

Check out our previous Searchlight roundups on 5G wireless technology, the IoT explosion and Asilomar AI Principles.
