Network Outages, Human Error and What You Can Do About It

Despite keeping a low profile in the news, human error still rules when it comes to causing common network outages.

Dec 18th, 2017

When your route leaks

Human error. As far as mainstream reporting on network outages goes, it’s the less flamboyant sidekick to DDoS and other cyber attacks. But in terms of consequences, it’s just as effective.

In early November, large parts of the US once again found themselves unable to access the internet because of one small error: a misconfiguration at Level 3, an ISP (Internet Service Provider) that underpins other, bigger networks.

According to reports, the outage was the result of what is known as a “route leak”. In short, a route leak occurs when internet traffic is routed in inefficient, or simply wrong, directions because of incorrect information provided by one or more Autonomous Systems (ASes). ASes are generally used by ISPs to keep track of IP addresses and their network locations. Packets of data are routed between ASes, which use the Border Gateway Protocol (BGP) to establish and communicate the most efficient routes so you can browse the whole internet, and not just the IP addresses on your particular ISP’s network.

Route leaks can be malicious, in which case they’re referred to as “route hijacks” or “BGP hijacks”. But in this case, it seems the cause of the outage was nothing more spectacular than a simple employee blunder: as speculation goes, a Level 3/CenturyLink engineer made a policy change that was applied in error to a single router while trying to configure BGP for an individual customer. This particular incident constitutes what the IETF defines as a Type 6 route leak, generally occurring when “an offending AS simply leaks its internal prefixes to one or more of its transit-provider ASes and/or ISP peers.”

Route leaks, small and large, are regular occurrences. They are part and parcel of the internet’s dependency on the basic BGP routing protocol, which is known to be insecure. Other recent high-impact route leaks include the so-called Google/Hathway leak in March 2015 and a misconfiguration at Telekom Malaysia in June 2015, which had a debilitating knock-on effect around the world.

To minimize the possibility of route leaks, ISPs use route filters that are supposed to catch any problems with the IP routes that peers and customers intend to use for the sending and receiving of packets of data.
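As a rough illustration of the idea behind such a filter, here is a minimal Python sketch. It is not how any particular ISP or router implements filtering: the allow-list, the function name `accept_announcement` and the maximum prefix length of /24 are all assumptions made for the example, and the addresses are documentation ranges.

```python
from ipaddress import ip_network

# Hypothetical allow-list: the prefixes one customer is authorized to announce.
CUSTOMER_ALLOWED = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]


def accept_announcement(prefix, allowed=CUSTOMER_ALLOWED, max_len=24):
    """Accept a route only if it sits inside an allowed prefix and is not
    more specific than max_len (an assumed policy limit)."""
    net = ip_network(prefix)
    if net.prefixlen > max_len:
        return False  # too specific: a typical sign of leaked internal routes
    # Check the IP version first, because subnet_of() raises on a mismatch.
    return any(net.version == a.version and net.subnet_of(a) for a in allowed)


for candidate in ["192.0.2.0/24", "192.0.2.128/25", "10.0.0.0/8"]:
    verdict = "accept" if accept_announcement(candidate) else "reject"
    print(candidate, verdict)
```

The point of the sketch is that the filter is only as good as the allow-list behind it: a list that is too broad, or a policy change applied to the wrong router, lets the bad announcement through.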

Other ways of combating route leaks include origin validation, NTT’s peer locking and commercial solutions. Additionally, the IETF is in the process of drafting proposals on route leaks.
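To make origin validation a little more concrete, the sketch below classifies an announcement as valid, invalid or not-found, in the spirit of RPKI-based origin validation. The `Roa` record, the sample data and the function name are assumptions for illustration; real validators work from signed data and are considerably more involved. The AS numbers and prefixes are reserved documentation values.

```python
from dataclasses import dataclass
from ipaddress import ip_network


@dataclass(frozen=True)
class Roa:
    """Hypothetical, simplified authorization record: which AS may
    originate a prefix, and how specific the announcement may be."""
    prefix: str
    max_length: int
    asn: int


# Illustrative data only.
ROAS = [Roa("192.0.2.0/24", 24, 64496)]


def validate_origin(prefix, origin_asn, roas=ROAS):
    route = ip_network(prefix)
    # Keep only the records whose prefix covers the announced route.
    covering = [
        r for r in roas
        if ip_network(r.prefix).version == route.version
        and route.subnet_of(ip_network(r.prefix))
    ]
    if not covering:
        return "not-found"  # nothing is known about this address space
    for r in covering:
        if r.asn == origin_asn and route.prefixlen <= r.max_length:
            return "valid"
    return "invalid"  # covered, but wrong origin AS or too specific


print(validate_origin("192.0.2.0/24", 64496))
print(validate_origin("192.0.2.0/24", 64497))
```

Note that a check like this only confirms who originates a prefix. It says nothing about the path the announcement took, which is why it is one measure among several rather than a complete answer to route leaks.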

Factoring in the human element

Tools and solutions aside, Level 3’s unfortunate misconfiguration once again highlights the fact that, despite keeping a low profile in the news, human error still rules when it comes to causing common network outages.

In an industry focused on how to design, build and maintain machines and systems that enable interconnected entities to send and receive millions of packets of data efficiently every second of every day, it’s maybe not all that odd that the humans behind all of this activity become of secondary importance. Though, as technology advances and systems become more automated, small human errors such as misconfiguring a server prefix are likely to have ever larger knock-on effects. At increasing rates, such incidents will roll out like digital tsunamis across oceans, instead of only flooding a couple of small, inflatable IP pools in your backyard.

Boost IT best practices - focus on humans

So outside of general IT best practices, what can you do to help the humans on your team to avoid human error?

Just as with any network, human interaction is based on established relationships. And just as in any network, a weak link, or a breakdown in the lines of communication, can lead to an outage. Humans who have to operate in an atmosphere of unclear instructions, tasks, responsibilities and communication can become ineffective and anxious. This eats away at employee morale and workflow efficiency and lays the groundwork for institutional inertia and the stalling of progress. At other times, a lack of defined task-setting and clear boundaries may result in employees showing initiative in the wrong places and at the wrong times.

To limit outages due to human error, just distributing a general set of best practices or relying on informally communicated guidelines amongst staff is simply not enough. While networking best practices always apply, the following four steps can be very effective in establishing the kind of human relationships needed to strengthen your network and optimize network availability.

1. Define

Draw up, and keep updated, a diagram of your network architecture (you do have one, don’t you?), and make sure you also have a workflow diagram for your teams: who is tasked with which responsibility, and where does their action fit into the overall process? What are the expected outcomes? And what alternative plans and processes are in place if something goes awry? Most importantly, match tasks and responsibilities with well-defined role-based access management.
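Matching responsibilities to access can be as simple as a table of roles and the actions each one may perform. The Python sketch below shows the principle only; the role names, the action names and the `is_allowed` helper are invented for the example and do not correspond to any specific product.

```python
# Hypothetical mapping of roles to the actions they may perform.
ROLE_PERMISSIONS = {
    "noc-operator": {"view-config"},
    "network-engineer": {"view-config", "edit-customer-bgp"},
    "senior-engineer": {"view-config", "edit-customer-bgp", "edit-routing-policy"},
}


def is_allowed(role, action):
    """Return True only if the role is known and explicitly grants the action."""
    return action in ROLE_PERMISSIONS.get(role, set())


print(is_allowed("network-engineer", "edit-customer-bgp"))
print(is_allowed("network-engineer", "edit-routing-policy"))
```

Denying by default, as the lookup does for unknown roles and unlisted actions, means a change outside someone's defined responsibility needs a deliberate second pair of eyes.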

2. Communicate

Does everyone on your team, and collaborating teams, know who is responsible for what, when and where, and how the processes flow? Is this information centrally accessible and kept up to date? Clarity, structure and effective communication empower your team members to accept responsibility and show initiative within bounds.

3. Train

Does everyone on your team know what’s expected of them, and did they receive appropriate training to complete their assignments properly and responsibly? Do they have the appropriate resources available to do what they need to do efficiently? Without training and tools in place, accidents are far more likely to occur.

4. Refresh

Don’t wait until team members run into trouble or run out of steam. Check in with each other regularly, and encourage a culture of knowledge sharing where individuals with different skill sets can have ample opportunity to develop new skills and understanding.

Finally

As the saying goes, a chain is only as strong as its weakest link. The same goes for networks.

At a time in history when we have more technological checks and balances available than ever before, it turns out the weakest networking link is, too often, a human. While we’re running systems for humans by humans, we may as well put in the extra effort to help humans do what they do, better. Our networking systems will be so much stronger for it.