Cascading Failures in Global Network Infrastructures

Prevent Surprising DDI Issues Before They Snowball with Sustainable Networking and Auditing Practices

Nov 23rd, 2021

Tiny human errors are the worst. No, really: you might fear DDoS, data leaks, shadow IT, or any of the scary monsters of network management. But the actual worst is the well-intentioned mistake that unravels into a tsunami of cascading failures. Proper testing and auditing are two proven ways to prevent these mistakes.

We're sure you watched Facebook go down for hours, losing billions of dollars. And we're sure you thought to yourself: "Could this happen to me?"

Chances are, it could.

When it rains, it pours. Cascadingly so.

The regular threats for networks (DDoS, shadow IT, etc.) all have one thing in common: they have an intent. With intent comes a target, and with a defined target, mitigation is much easier. You can still be more (or less) vulnerable, depending on what your infrastructure looks like and its resilience against different kinds of threats.

But human error has no intent. It's senseless and chaotic because there's not a defined target. Human error doesn't want anything; it just is.

When you can't find the target, you can't pinpoint the origin of the problem easily either. Human error isn't intentional and thus could be anywhere. A well-intentioned DNS change. An honest configuration update. Or just a bug that's running rampant in your entire global infrastructure because one of your users clicked the wrong button after entering incorrect data.

Human error plagues every discipline, but especially network management, where a single mistake can ripple through thousands of interconnected systems.

Case in point: Facebook

In October 2021, Facebook engineers pushed an update to their network configuration that caused the BGP routes for the IP addresses of Facebook's DNS servers to be withdrawn. What followed was a spectacular cascade of dependencies that brought a good chunk of the internet to its knees.

On a network, particularly on one as big and critical as the internet, nobody's alone. Not to use the butterfly-effect analogy, but what you do in your little corner of a connected infrastructure will reverberate across all of it.

In Facebook's case, once their DNS servers became unavailable, it wasn't "just" Facebook that went down. Turns out, since Facebook employees use Facebook Workplace for collaboration, they couldn't even talk to each other about the problem. Everything they do runs on their own infrastructure, so when a foundational piece of the network (DNS) was out, some people couldn't even enter the building. They were locked out not just from the internet but from the very places they needed to go to resolve the problem. Border Gateway Protocol indeed.
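To make the dependency chain concrete, here's a toy model in Python. The prefixes and service names are made up for illustration (documentation-range addresses, not Facebook's real topology): a name resolves only while some announced BGP prefix still covers a DNS server, so withdrawing that one route takes down every dependent service at once.

```python
import ipaddress

# Toy model: names resolve only while at least one DNS server's IP is
# covered by a currently announced BGP prefix. All addresses and service
# names are illustrative, not Facebook's real topology.
announced = {ipaddress.ip_network("192.0.2.0/24")}  # route covering the DNS servers
dns_servers = [ipaddress.ip_address("192.0.2.53")]

def resolvable() -> bool:
    """True if any DNS server is reachable via an announced prefix."""
    return any(ip in net for ip in dns_servers for net in announced)

services = ["public site", "internal chat", "badge readers"]

print(resolvable())   # routes announced: DNS reachable, everything works

announced.clear()     # the config push withdraws the route

# Every dependent service now fails at the DNS step, before any app logic runs.
down = [s for s in services if not resolvable()]
print(down)           # all three go down together
```

The point of the sketch: none of the services "broke" individually; they all shared one invisible dependency.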

And the outage didn't stop there. If it was only Facebook, that would've been painful enough. But the ripple continued:

·       DNS servers received a much higher load of queries at every level: not only were people looking for Facebook more often than usual (read: slamming that F5 button repeatedly), but so were the hundreds of thousands of websites with varying degrees of Facebook integration.

·       Other platforms started getting a massive influx of additional people: those who were looking for information, those who were looking to reconnect with their peers, and of course, those with the popcorn.

In some parts of the world, for all intents and purposes, "the internet" was down. And because of the substantial additional workload the rest of the internet platforms were dealing with, they started to experience slowdowns.
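A back-of-the-envelope sketch in Python shows why failed lookups amplify load so dramatically. The numbers are illustrative, not measurements: compare a client that retries on a fixed interval with one that backs off exponentially, then multiply the difference by millions of clients.

```python
# Rough comparison of per-client DNS query load during an outage:
# naive fixed-interval retries vs. exponential backoff.
# All parameters below are illustrative assumptions.

def naive_queries(outage_seconds: int, interval: int = 1) -> int:
    """A client that retries every `interval` seconds for the whole outage."""
    return outage_seconds // interval

def backoff_queries(outage_seconds: int, base: int = 1, cap: int = 64) -> int:
    """A client that doubles its wait after each failure, up to `cap` seconds."""
    elapsed, wait, queries = 0, base, 0
    while elapsed < outage_seconds:
        queries += 1
        elapsed += wait
        wait = min(wait * 2, cap)
    return queries

outage = 6 * 60 * 60  # the outage lasted roughly six hours

print(naive_queries(outage))    # tens of thousands of queries per client
print(backoff_queries(outage))  # a few hundred per client
```

When resolvers receive the naive pattern from an entire user base at once, the "failed" queries alone become a load spike comparable to a DDoS.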

The BGP butterfly flapped its wings on a Facebook engineer's computer, and the world trembled.

The problem with redundancy

Who among us hasn't mistyped a domain or IP address? While Facebook's engineers may be responsible for the snowball of an error that started it all, we cannot blame them for the avalanche that followed.

Not directly, anyhow.

In an effort to retain control over its infrastructure, Facebook neglected to build in the redundancy that would've mitigated a large part of the outage.

Redundancy is a tricky concept: it sounds good in theory but is incredibly challenging to implement in practice, mainly due to the cost of not just additional platforms but also the additional management overhead. The dawn (or high noon, really, at this point) of multicloud, edge computing, diverse distributed infrastructure, and dispersed teams further complicates the management of network infrastructures, to the point where it grows into a byzantine monster of its own.

Building usable redundancies with overlays

The first step in solving for usable redundancy is recognizing that eliminating single points of failure means not relying on a single vendor or infrastructure.
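One easily automated check in that spirit is a minimal Python sketch (the IPs below are made-up documentation-range addresses; in practice you'd feed it real NS-record lookups): if all of a zone's name servers sit in the same network prefix, a single route withdrawal takes them all out.

```python
import ipaddress
from collections import defaultdict

# Heuristic for one kind of single point of failure: do all of a zone's
# name servers live in the same network prefix? The IPs are illustrative.

def group_by_prefix(ns_ips, prefix_len=24):
    """Group name-server IPs by /prefix_len network; one group = one SPOF."""
    groups = defaultdict(list)
    for ip in ns_ips:
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[net].append(ip)
    return groups

risky = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]        # all in one /24
diverse = ["192.0.2.10", "198.51.100.7", "203.0.113.99"]  # three networks

print(len(group_by_prefix(risky)))    # 1 network: one withdrawal kills them all
print(len(group_by_prefix(diverse)))  # 3 networks: resolution survives a withdrawal
```

Prefix diversity is only a proxy, of course; a thorough audit also looks at shared upstreams, shared vendors, and shared operational tooling.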

No vendor has an incentive to be compatible with anyone other than its own ecosystem. Building a diverse environment that facilitates redundancy needs interplay between platforms, which is provided by third-party solutions. (Like Micetro for DDI.)

Third-party orchestrators like Micetro have no allegiance to any single ecosystem, only to the end customer who needs to manage those ecosystems. Their bottom line isn't tied to consumption but to facilitation. They are (or can be) the glue that holds together the delicate network balance, in which no single component is mission-critical to the functionality of the whole.

Including the orchestrator itself: true overlays like Micetro don't require hardware appliances that become single points of failure of their own. While providing high availability within its components (both server and database), the overlay itself isn't a mission-critical piece of the infrastructure either.

This sets Men&Mice apart from our competitors. And speaking of the unique value we bring to your networks:

Start your quest for redundancy with smart planning

Companies, particularly global enterprises with long histories, often struggle with digital transformation because of their size. Modernizing infrastructure can be a complex, drawn-out process. Even easy-to-apply overlays can be a challenge when the majority of the problems are "hiding" below the surface like an iceberg waiting for its moment.

The first step in solving any problem is recognizing there is one. That's why we created a service that does just that: auditing your network and identifying actual or potential problems.

Based on our experience developing the holistic DDI orchestrator Micetro and conducting DNS training at all levels, our DNS Audit service takes a step back. It looks at your core network components holistically and without bias. The audit provides you with insights and recommendations on topics including:

·       Single points of failure

·       Disaster recovery & resiliency

·       Firewall configuration

·       Platform configuration & security

·       Monitoring

·       Name server software configuration

·       Name server enhancements to improve environment operations and security

·       Change control process

The service is not dependent on Micetro: whatever services or platforms you may be using, we bring our experience with enterprise network management and orchestration.

You're welcome to contact us about the DNS Audit service, about training for your employees, or to take Micetro for a no-commitment test drive.

We hope all these resources will help you make better decisions on your way to transforming your network infrastructure and making it more resilient and sustainable.