Over the past month, Microsoft Azure has experienced two outages. So how did these Azure outages happen? And what is Microsoft doing to prevent outages like these in the future?
March 2021 Azure AD outage
We’ve come to think of cloud as the magic sauce that holds together our mobile apps, our email access, and the authentication of users for company infrastructure. The cloud never stops computing, right?
But on March 15, an Azure AD failure led to widespread outages on the second-largest cloud computing platform. It impacted not only Azure services but also Office, Teams, Dynamics 365, Xbox Live, and more. What happened?
The roughly 14-hour outage began when a routine rotation of security keys for Azure Active Directory (Azure AD) was performed. These key rotations are a good thing and keep users safe.
But on this day, Microsoft was performing a complex data migration between cloud providers. They had marked one specific key as “don’t rotate.” This key needed to remain for the time being to finish the migration. But . . . the automatic rotation process ignored this “don’t rotate” marking, and as the new keys reached Azure services, users were unable to log in.
For a simple analogy, imagine a company changing out the swipe cards needed to enter the building — without telling anyone. Then, everyone shows up to the office the next day, and no one can get in.
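The failure mode described above can be sketched in a few lines of Python. This is only an illustration, not Microsoft's actual rotation system: the `SigningKey` class, the `retain` field, and the `honor_retain_flag` switch are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class SigningKey:
    key_id: str
    retain: bool = False  # hypothetical "don't rotate" marker


def rotate_keys(keys, honor_retain_flag=True):
    """Return the keys that survive a rotation pass.

    Keys marked retain=True survive only when the flag is honored.
    """
    kept = []
    for key in keys:
        if key.retain and honor_retain_flag:
            kept.append(key)  # still needed, e.g. by an in-flight migration
        # every other key is dropped and would be replaced by a new one
    return kept


keys = [SigningKey("key-a"), SigningKey("key-b", retain=True)]

correct = rotate_keys(keys)                         # flag respected
buggy = rotate_keys(keys, honor_retain_flag=False)  # flag ignored

print([k.key_id for k in correct])  # ['key-b']
print([k.key_id for k in buggy])    # []
```

In the buggy run, the key that was supposed to stay in place is gone, so anything still relying on it can no longer validate sign-ins.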
April 2021 Azure DNS outage
A smaller outage followed on April 1, 2021, when Azure DNS started to experience an availability issue that caused failures in accessing and managing Azure services for many customers across multiple regions.
This could have been played off as the world’s least-funny April Fools’ Day joke, but it was actually caused by Azure DNS servers experiencing a surge in DNS queries targeting a set of domains hosted on Azure. This sequence of events exposed a code defect in the DNS service. That led to an overload in the service and a decrease in availability.
According to Microsoft, the DNS services recovered automatically, but with a recovery time “exceeding [Microsoft’s] design goal.” Microsoft reports it’s already working on repairing the code defect and improving automatic detection and mitigation of unusual traffic.
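Microsoft has not said how its mitigation of unusual traffic works. As a generic illustration of the idea, here is a minimal token-bucket rate limiter, one common way to absorb a short burst of queries while rejecting a sustained surge. The `TokenBucket` class and its parameters are assumptions made for this sketch, and time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Allow bursts of up to `capacity` queries, refilled at `rate` per second."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # time of the previous call, in seconds

    def allow(self, now):
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # bucket empty: this query is rejected


bucket = TokenBucket(rate=1, capacity=3)

# Five queries arrive at the same instant: only the first three get through.
burst = [bucket.allow(now=0.0) for _ in range(5)]
print(burst)  # [True, True, True, False, False]

# Two seconds later the bucket has partly refilled.
print(bucket.allow(now=2.0))  # True
```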
How does this relate to the 2020 Azure AD outage?
Stop me if you’ve heard this one before. A similar outage in September 2020 (funnily enough) also hit Azure AD, just like the one in March. So why wasn’t it fixed then?
The March 2021 outage happened while Microsoft was still fixing the root cause of the September 2020 outage. That work wasn’t completed yet (major changes take time), but once finished it would have prevented the March outage. So why did it happen?
A software bug failed to acknowledge the flag that said “please don’t rotate this key.” A fix was implemented within a couple of hours, but it took time for services to clear their caches and for the old key to be propagated to all corners of Azure.
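Why would a fix take hours to reach everyone? A small sketch of a time-limited cache shows the effect. This is an assumption-laden illustration, not a description of Azure's internals: the `CachedKey` class, the 60-second lifetime, and the key names are all made up for the example.

```python
class CachedKey:
    """Cache a fetched value for `ttl` seconds before asking the source again."""

    def __init__(self, fetch, ttl):
        self.fetch = fetch
        self.ttl = ttl
        self.value = None
        self.expires = None  # no value cached yet

    def get(self, now):
        # Only go back to the source once the cached copy has expired.
        if self.expires is None or now >= self.expires:
            self.value = self.fetch()
            self.expires = now + self.ttl
        return self.value


source = {"key": "wrong-key"}  # state right after the bad rotation
cache = CachedKey(fetch=lambda: source["key"], ttl=60)

print(cache.get(now=0))   # wrong-key

source["key"] = "restored-key"  # the fix lands at the source

print(cache.get(now=30))  # wrong-key (cached copy has not expired)
print(cache.get(now=60))  # restored-key
```

Every service holding its own cached copy keeps using the stale value until that copy expires or is cleared, which is why recovery trails the fix.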
How can we be sure an outage won’t happen again?
As a wise person once said: “All software always has an infinite number of unknown bugs.”
Software will always have bugs, and no cloud platform can promise zero outages. But we can hope for speedier recoveries and for transparency into what happened and what is being done to avoid issues in the future.
For its part, Microsoft is taking steps to reduce future outages, writing:
“Azure AD is in a multi-phase effort to apply additional protections to the backend Safe Deployment Process (SDP) system to prevent a class of risks including this problem . . . We understand how incredibly impactful and unacceptable this incident is and apologize deeply. We are continuously taking steps to improve the Microsoft Azure platform and our processes to help ensure such incidents do not occur in the future.”
How frequent are cloud outages?
Platforms this mature should have millions of bug-catching functions and facilities, but here we are. These kinds of outages shouldn’t happen though, right?
There are way more benefits to cloud than drawbacks, but outages can happen.
Both Amazon Web Services (AWS) and Google Cloud Platform (GCP) also had recent outages, so it’s something we have to expect to a certain degree. The November 2020 AWS outage impacted sites and services ranging from Adobe to Roku, and a June 2019 GCP outage affected YouTube, G Suite, and Snapchat.