Navigating Infrastructure Outages: Battle Scars and Lessons Learned

One of my great passions is infrastructure operations. My first few jobs were in network engineering, and later on, as the cloud became more prevalent, that turned into what we now call infrastructure or cloud operations. Probably the most stressful times were during outages. Over the last two decades, I've been part of many outages, some worse than others, but I wear my operational battle scars proudly. In this blog, I'll share some learnings and observations from these incidents.

What triggered me to write this article

I could write many articles about the "epic" incidents I've been part of. While stressful at the time, they make for good war stories afterward. In recent years, companies like Facebook and Cloudflare have started publicly sharing their outage retrospectives (timelines, root causes, and lessons learned), which I enjoy reading. It's free learning!

A few days ago, Rogers posted its public incident summary, covering the typical aspects of an outage retrospective. The outage itself was significant: Rogers is Canada's second-largest ISP and cell phone carrier, and two years ago it experienced a country-wide, 24-hour outage that impacted all of its phone (including 911) and internet customers. Yes, hard down for a full day! So naturally, as both a Rogers customer and an internet operations enthusiast, when they posted their report (two years later!), I had to read it.

The Rogers outage

Though not the primary focus of this article, let's briefly review the outage. During planned maintenance, Rogers staff removed a network routing policy, which caused routing protocol (BGP or OSPF/IS-IS) routes to be flooded internally. This overwhelmed the control plane, eventually causing it to crash or otherwise rendering the core routers inoperable. A logical next step would be to try and undo the change; however, the routers were no longer manageable due to the control plane issues, and to make matters worse, the out-of-band (OOB) management network was also down. As it turns out, the OOB network depended on the production data network, so when production went down, so did the management path, making remote troubleshooting and remediation impossible. Eventually, engineers had to be dispatched to get console access manually.
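To make the architectural lesson concrete, here is a minimal Python sketch (the component names and topology are hypothetical, not Rogers' actual design) of why a management network that transitively depends on the production network is not truly out-of-band: once production fails, the remote remediation path fails with it.

```python
# Hypothetical dependency graph: which components each component relies on.
dependencies = {
    "core_routers":   ["production_network"],
    "oob_management": ["production_network"],  # the hidden dependency
    "console_access": [],                       # truly independent, but requires on-site staff
}

def available(component: str, failed: set) -> bool:
    """A component is usable only if it and all of its (transitive) dependencies are up."""
    if component in failed:
        return False
    return all(available(dep, failed) for dep in dependencies.get(component, []))

failed = {"production_network"}
for path in ("oob_management", "console_access"):
    print(f"{path}: {'available' if available(path, failed) else 'unavailable'}")
# oob_management: unavailable  -> remote troubleshooting is impossible
# console_access: available    -> engineers must be dispatched on-site
```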

Thoughts on the Rogers outage

Outages suck. They’re super stressful and, as the Rogers incident shows, can have significant real-world impacts. Conducting these retrospectives after the fact, getting to the root cause, and capturing the lessons learned is mandatory for any organization that wants to continuously improve.

One thing that stood out for me was the news article from CBC.ca (a major news outlet in Canada), the headline of which was: “Human error triggered massive 2022 Rogers service outage, report finds.” While it's arguably true that the change kicked off a sequence of events, declaring it the root cause is short-sighted. Many of these outages are like airplane crashes, where investigations typically find that multiple factors contributed to the failure, some set in motion years earlier or rooted in designs that were flawed from the start.

Reflecting on Outages: Can We Eliminate Them? Should We Want To?

It’s always easy to dunk on an outage from the sidelines, and to be clear, that’s not my intent. I’ve learned firsthand that shit happens! So, with the Rogers outage in mind, but also drawing on past experience, let’s extract some generic lessons.

First off, I’d say that the goal of eliminating outages altogether is likely too ambitious for most of us. Yes, part of an outage retrospective process should be to prevent the exact same outage from happening in the future, but architectures evolve, technologies change, and even a slight change in parameters can lead to a similar outage again.

So, although we can limit the chance of outages, we won’t be able to eliminate them entirely. In fact, given the cost of doing so, and depending on your industry (say, Facebook vs. flying a passenger airplane), you may not even want to aim for that. There are real financial, process, agility, and time-to-value costs to chasing complete outage elimination.

So, let’s accept that outages will happen and instead use that insight to focus on limiting their damage and impact! This approach allows us to prioritize effective strategies, which I'll discuss next.

Time to Detection - aka Be The First To Know!