Q&A: Why did the internet ‘cloud’ burst, and will it happen again?

On Oct. 20, a routine update at Amazon Web Services, one of the world’s largest providers of virtual computer services, exposed an existing software flaw within the systems and temporarily broke much of the internet, disrupting consumer access to banks, social media, shopping sites and even dating apps.

Services were down for the better part of a day, with some sites taking longer to recover.

To find out how and why the outage occurred, and what users can do to avoid future outages, UVA Today reached out to Neal Magee, the faculty director of systems architecture and an associate professor at the University of Virginia School of Data Science.

Q. What is the cloud, and why is it there? 

A. The word “cloud” can mean different things to different people. For my own work, which relates to cloud services, the public cloud means providers like AWS, Google, Microsoft and others. They provide computing, storage and other services on demand for a fee.

So instead of buying a $40,000 server, waiting for it to be built and delivered, and then installing it, I can spin up a server in AWS within seconds with no long-term commitment. I could run it for five minutes or five months.

What’s great about that model is that you trade capital investments in infrastructure for ongoing, operational expenses of what you use. And you’re not stuck with servers the way you created them, either. You can resize to have more or less memory, CPU, GPU or storage. You can create computing resources that only run for five minutes every day, and you don’t pay for anything more than that.

AWS was created partly out of the need Jeff Bezos had to keep Amazon.com up and running when it experienced heavy user traffic. Back in the old days, Amazon sold mostly books, and Christmas shopping would cripple the site in November and December. So, they set up massive data centers around the globe in “regions,” with thousands of large, robust physical servers. 

Q. Why did an outage at one company crash the internet for so many businesses, services and people?

A. From the start, AWS taught a series of design principles for how to build in the cloud, and one of the most fundamental is that you should always expect failure. Sometimes the power goes out. Sometimes your internet drops out or your air conditioning stops working. Those same things can affect data centers and the services that run in them.

It appears the event a couple of weeks ago involved a very low-level service called DNS that many other AWS services rely on and which, in turn, all consumer services rely on. It failed, causing a “cascading outage,” where one system fails and then other systems that depend upon that first system fail as well.
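As a rough illustration of how a cascading outage spreads, here is a minimal Python sketch. The service names and the dependency map are hypothetical, invented only for this example; they do not describe how AWS is actually wired together. The function walks outward from the first failure and collects everything that depends on it, directly or indirectly.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration only):
# each service lists the services it relies on.
DEPENDS_ON = {
    "dns": [],
    "database": ["dns"],
    "login": ["database"],
    "shopping-site": ["login", "database"],
    "static-page": [],
}


def cascading_failures(depends_on, first_failure):
    """Return every service that goes down once `first_failure` fails."""
    # Invert the map so we can ask "who relies on this service?"
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first walk outward from the first failed service.
    failed = {first_failure}
    queue = deque([first_failure])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in failed:
                failed.add(service)
                queue.append(service)
    return failed


print(sorted(cascading_failures(DEPENDS_ON, "dns")))
# ['database', 'dns', 'login', 'shopping-site']
```

In this toy map, the only service that survives a DNS failure is the one with no dependency on it, which is why a fault in a low-level service can reach so far.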

Q. How can companies prevent or lessen the impact of such outages?

A. The way to mitigate this is to build your solutions anticipating failure at any level. AWS has the concept of regions, and each region is made up of sub-regions, or “availability zones.” US-East-1 is Amazon’s Eastern region; it’s the oldest and largest region, with seven availability zones in it. Each availability zone is made up of more than one distinct data center, so you’re talking about a massive array of computing infrastructure.

In this case, the outage affected an entire region, which meant that companies using that region would have to be prepared to pivot immediately to an entirely different region, such as California or Oregon or Ohio, to keep their services up. All of this, of course, costs money, engineering time and expertise.

It’s a balancing act between preparing for something that may only happen every X number of years and saving on the expense of that preparation.
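The regional pivot described above can be sketched in a few lines of Python. This is only a simplified illustration, not a description of any company's actual failover system: the `pick_region` helper and the health check are hypothetical, and a real setup would also need data replicated to the backup regions ahead of time. The region codes shown are the AWS identifiers for Northern Virginia, Ohio and Oregon.

```python
def pick_region(preferred_regions, is_healthy):
    """Return the first healthy region in preference order, or None."""
    for region in preferred_regions:
        if is_healthy(region):
            return region
    return None


# Preference order: primary region first, then fallbacks.
REGIONS = ["us-east-1", "us-east-2", "us-west-2"]

# Hypothetical health check (an assumption for this sketch): pretend
# the primary region is down. A real check might probe an endpoint.
regions_with_outage = {"us-east-1"}

active = pick_region(REGIONS, lambda region: region not in regions_with_outage)
print(active)  # us-east-2
```

The selection logic is the easy part; the cost Magee mentions comes from keeping the fallback regions ready to take traffic at a moment's notice.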

Q. Could companies run their own servers instead of relying on cloud services? 

A. There are some companies that own and run their own infrastructure and data centers, or lease space from providers like AT&T, Equinix, Iron Mountain, etc.

But internet-scale providers like AWS, Google and Microsoft Azure have such massive capacity, economies of scale and excellent engineering staff that they offer more reliability and redundancy than almost anyone can build on their own.

Before coming to UVA, I had clients who leased space in data centers and ran their own infrastructure. I managed a team of engineers, and we were incredibly relieved to move almost all of our client workloads into the cloud. It meant fewer outages, a greater ability to scale up and down, and the chance to pare down their services from a cost perspective.

The on-call pager hardly ever went off in the middle of the night, so I’m definitely a big fan of the cloud.

Q. How likely is it that a similar event will happen in the near future? Are there steps individuals and businesses can take to protect themselves from future outages, or at least minimize the impacts?

A. I can almost guarantee that it will! The real world has things that break and people who make mistakes. Things happen. Not all outages can take down an entire region this thoroughly, but service outages at some level happen numerous times a year for data centers, even though they have battery and generator backup power supplies and redundant internet connections.

As end users of so many services, we can take no specific steps other than making smart choices in our purchases from credible companies that take “uptime” seriously. You could certainly try to research which cloud provider a company uses, or whether they post any record of past outages, with some sort of explanation of what happened and how they addressed it going forward.

Media Contacts

Bryan McKenzie

Assistant Editor, UVA Today Office of University Communications