AWS This Week

AWS This Week: What happened with the AWS outage?

Episode description

Mattias is back with your AWS news! On December 7, things went wrong in us-east-1, causing a number of issues across AWS services including Route 53, API Gateway, EC2 and more. It broke global services homed in the region like AWS Account root logins and SSO, despite the AWS Status dashboard continuing to tell us things were still fine (spoiler: they were not). This week, we’ll take a look at this Amazon global outage, discuss what happened and what lessons we learned from Amazon not working.

Introduction (0:00)
Is Amazon down? What happened? (1:00)
AWS Status dashboard and support tickets impacted (3:12)
What caused the AWS outage: Internal issues (3:51)
Resolution (5:32)
Lessons/takeaways for AWS (6:49)
What could we learn from Amazon not working (7:30)
Summing up the AWS outage event (9:43)

Sign up for a free A Cloud Guru plan to get access to free courses, quizzes, learning paths, and web series

Subscribe to A Cloud Guru for weekly Amazon news and AWS services announcements

Like us on Facebook!

Follow us on Twitter!

Join the conversation on Discord!

Series description

Join our ACG hosts as they recap the most important developments in the AWS world from the past week. Keeping up with the ever-changing world of cloud can be difficult, so let us do the hard work sifting through announcements to bring you the best of what's new with AWS This Week.

Hello, cloud gurus. I'm Mattias Andersson, and you're watching AWS This Week. Now, if you're looking for announcements from re:Invent 2021, then pop down to the video description to get links to all of our onsite coverage from the actual event. Lots of good stuff there and way too much to even list, but this is the week after re:Invent. So let me tell you about all the new AWS announcements. Yeah. Post-re:Invent is always a slow week for announcements, but don't feel down because we still don't have an outage of news to cover.

That's right. We're going to dive into the AWS outage of December 7th, 2021. I'll start with a high-level description of what happened from the perspective of us AWS users. Then I'll walk through some of the behind-the-scenes goings-on that AWS has shared in their post-event summary, also linked in the description, by the way. And finally, I'll try to leave you with some hopefully valuable takeaways. So let's jump in.

On the morning of December 7th, 2021, at 10:30 AM Eastern time, or 7:30 AM Pacific time, as AWS uses PST in their report, things went wrong in Amazon's US-East-1 region, Northern Virginia. Over the next three minutes, which is pretty much all of a sudden from our external point of view, a number of AWS services in the region started having issues: the Management Console, Route 53, API Gateway, EventBridge (what used to be called CloudWatch Events), Amazon Connect, and EC2, to name just a few. Now, to be clear, the issue was not a complete outage for all of these services. For example, if you already had an EC2 instance running when the problem started, it would likely keep running just fine throughout the entire event.

However, what that running instance could do might well have been impacted. For example, an EC2 instance would've had trouble connecting through the no-longer-working VPC endpoints to the still-working S3 and DynamoDB. Furthermore, not only did the issue affect all availability zones in US-East-1, but it also broke a number of global services that happened to be homed in that region. This included AWS account root logins, single sign-on (SSO), and the Security Token Service (STS). The overall impact was broad, with the issue causing varying degrees of problems for services like Netflix, Disney Plus, Roomba, Ticketmaster, and the Wall Street Journal. It also affected many Amazon services, including Prime Music, Ring doorbells, logistics apps in their fulfillment centers, and some parts of the shopping site, which would instead show pictures of cute dogs. It was a big deal.

One Reddit user, ZeldaFanboy1988, wrote: "Since I can't get any work done, I decided to relax and order in some pizza. Then I tried ordering online from the Jet's Pizza site. 500 errors, lol. Looked at network request headers, it's AWS."

But no one seemed to be too upset at the companies impacted by the outage. In fact, as an example, several people in that thread instead took the opportunity to share how much they loved Jet's, and ZeldaFanboy1988 did get their pizza. Anyway, reporting back: "I ordered on the phone like a peasant. AWS is really ruining my day." But there were some other nasty problems too. First, despite all the issues, the AWS Status dashboard continued for far too long to show all green lights for all services. And second, it was no longer possible to log support tickets with Amazon, because their support contact center was broken too. This failure of customer communication made a lot of people pretty upset. It took almost an hour for the status dashboard to start reporting any issues, and support tickets stayed broken all the way until the underlying issues had been addressed and services were coming back online.

Now, speaking of underlying issues, let's rewind to the beginning and take a look at those. On the morning of December 7th, 2021, at 10:30 AM Eastern time, or 7:30 AM Pacific, an automated system in Amazon's US-East-1 region, Northern Virginia, tried to scale up an internal service running on AWS's private internal network, the one they use to control and monitor all of their Amazon web services. As AWS describes it, this "triggered an unexpected behavior from a large number of clients inside the internal network." Basically, AWS unintentionally triggered a distributed denial of service, or DDoS, attack on their own internal network. Yikes. As an analogy, it was as if every single person who lives in a particular city got into their car and drove downtown at the same time: instant gridlock, nothing moving, not even ambulances, news reporters, or traffic cops who would try to resolve the issue. Now, we do know how we should avoid network congestion problems like this.

We use exponential backoff and jitter; check out the linked AWS blog post about this. Unfortunately, this requires each client to do the right thing, and as AWS writes in their report, "a latent issue prevented these clients from adequately backing off during this event." Yeah, oops. So the AWS folks were sort of flying blind, because their internal monitoring had been taken out by the flood.
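To make the idea concrete, here's a minimal Python sketch of exponential backoff with "full jitter." This is just an illustration of the general technique; the function and parameter names are made up, not taken from AWS's report or SDKs:

```python
import random
import time


def fetch_with_backoff(request, max_retries=5, base=0.1, cap=5.0):
    """Retry a flaky call using exponential backoff with full jitter.

    Each failed attempt sleeps a random duration in
    [0, min(cap, base * 2**attempt)], so a crowd of clients spreads
    its retries out instead of hammering the server in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return request()
        except ConnectionError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```

The randomness is the important part: plain exponential backoff still lets all clients retry at the same instants, while jitter desynchronizes them. The latent bug AWS describes meant their internal clients weren't backing off like this, which is what turned one scaling event into a flood.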

So they looked at logs and figured that maybe it was DNS. It's always DNS, right? Well, two hours after the problems started, they had managed to fully recover internal DNS resolution. And although this reportedly did help, it did not solve everything. So, quite surprisingly, it was not DNS this time. For the next three hours after that, the AWS engineers worked frantically trying everything, or, as AWS puts it, "operators continued working on a set of remediation actions to reduce congestion on the internal network, including identifying the top sources of traffic to isolate to dedicated network devices, disabling some heavy network traffic services, and bringing additional networking capacity online." Then at 12:35 PM Pacific, or 3:35 PM Eastern, AWS operators disabled event delivery for EventBridge to reduce the load on the affected network devices.

And whether this was the linchpin or just one of the drops in the bucket, I can't tell, but things finally did start getting better. AWS reports that internal network congestion was improving by 1:15, significantly improved by 1:34, and all network devices fully recovered by 2:22 PM Pacific standard time. And although that resolved the network flood and their support contact center, it still took some more time for all the Amazon web services to come back online. API Gateway, Fargate, and EventBridge were among the slowest to fully stabilize, needing until at least 6:40 PM Pacific, or 9:40 PM Eastern. What a day, huh?

Okay, so AWS has called out some things that they've learned from this. One key thing, and it might seem obvious, is that they need to do a better job of communicating with customers during operational events, and of not letting those communication systems go down at the same time. They're planning some major upgrades here, but we'll have to wait and see how all that goes. Of course, they're also working to fix the backoff bug, plus some additional network-level mitigation to try to prevent another storm like this. They concluded their report with: "we will do everything we can to learn from this event and use it to improve our availability even further."

But what about us, then? What can we learn? Well, during the event, there were lots of responses, ranging from the throw-the-baby-out-with-the-bathwater type ("no cloud for you!") to the naive ("oh well, multi-cloud solves everything, right?"). Of course, lots more responses were moderate, things like, "Hey, we should at least use multi-region." But don't overreact, because knee-jerk architecture change is, by definition, ill-considered. Now, we've learned in IT that agility is the name of the game. Figure out what actually matters most to your users. And as Werner Vogels, CTO of Amazon, is famous for saying, everything fails, all the time. Now, it definitely is possible for us to come up with strategies and architectures that avoid us being impacted by a repeat of this particular problem. But if this were a simple thing to do in advance, whether through multi-region or whatever, then I'm sure Amazon would already have done that for things like their AWS Status dashboard, right? But as they point out, "networking congestion impaired our service health dashboard tooling from appropriately failing over to our standby region." Yes, they actually did have a multi-region setup, but their failover mechanism failed. It's like losing the key to your doomsday bunker and winding up locked out: prepared in theory, but not in practice.
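That failure mode, a standby that can only be activated by tooling living in the region it's supposed to replace, can be sketched as a toy model. This Python snippet is entirely hypothetical, not AWS's actual code; the flags just stand in for "the primary region is up" and "the failover tooling's home is up":

```python
def dashboard_status(primary_up: bool, failover_tooling_up: bool) -> str:
    """Toy model of a standby with a hidden dependency.

    The standby is real, but switching to it requires a control-plane
    action performed by tooling. If that tooling is hosted in the same
    region as the primary, then failover_tooling_up goes down exactly
    when you need it most.
    """

    def activate_standby():
        # Hypothetical control-plane call; fails when its host region is down.
        if not failover_tooling_up:
            raise RuntimeError("failover tooling unreachable")

    if primary_up:
        return "serving from primary"
    try:
        activate_standby()
        return "serving from standby"
    except RuntimeError:
        return "standby exists, but cannot be activated"
```

The lesson isn't "add more standbys"; it's that the machinery which performs the failover must not depend on the thing it's failing over from, and that the path has to be exercised before you actually need it.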

Those are seriously smart engineers they have, but they're also still human. In practice, it gets complicated, especially because you don't know how things will fail. Now, when we used to have to build and manage everything ourselves on simple instances, failures were a bit more predictable: instances would die or become unavailable. But when we take advantage of managed services, we wind up with rather different kinds of failures. Now, to be clear, it's foolish to ignore managed services just because they could possibly fail sometimes. That would be like deciding to only walk and swim your way around the world because you've heard that some planes have crashed and some boats have sunk. It's impossible to be agile and not build on the work of others. By the way, if you haven't already seen it, check out the hilarious bare metal video, linked in the video description.

Okay, finally, let's say you ask me how much this event has impacted my willingness to use and build on AWS, to rely on them. I'll answer: not really at all. Now, that's not to say that I like outages like this, because of course I don't, nor that I'll ignore the possibility of their happening again. But much like how I still confidently travel by air, and trust the pilots more than I trust myself to fly those planes, I am still way better off with AWS, the entire package that they offer, faults and all, than I am on my own. And I'm gonna say, with a pretty high confidence level, that you are too.

So don't overreact, and don't underreact either. Incorporate what you've learned as data points alongside all the others, and move past feelings and knee-jerk reactions to make rational decisions. And when, down the road, you recognize that not every decision you've made was perfect either, apply this same blameless post-mortem technique to learn from that situation and do better going forward. That's really all any of us can expect of ourselves, I think. So, take care of those around you. Embrace hug ops.

And keep being awesome Cloud Gurus.

