Oops, I Broke The Internet

We've broken the internet by breaking decentralization

I consider the internet to be one of humankind’s greatest inventions. The notion of networking the entire globe together is incredibly ambitious, fraught with technical and political issues, requires unprecedented global cooperation, and, quite frankly, isn’t working out all that well.

Image credit: Alexas_Fotos

My career in internet technology now spans decades, and in that time I’ve come to realize that there is one single truth about the internet that trumps every other characteristic of it:

The fact that the internet works at all is a flat-out miracle.

The internet is very much like a military parade. It looks coordinated and impressive from afar. But as you get closer, you see that some members are disheveled, some have their ties on incorrectly, and some boots aren’t as shiny as the others. Some members are sweating buckets and some look as proud as can be just to be alive. The internet is the same: if you’re not close to it, it looks like the shiniest toy in the box. The reality is that it is extremely fragile.

A few days ago a company named CloudFlare experienced a 30-minute outage of its services. You may never have heard of CloudFlare, but it is an influential player on the internet and what it does matters. CloudFlare’s primary services are a content delivery network (CDN) and a web application firewall (WAF). Basically, it protects websites from attack and distributes their static content, such as images, to servers across the globe to improve its customers’ website load times. CloudFlare is a major player on the internet in both customer count and technology.

For most companies outside banking and health services, a half-hour outage would be a blip and hardly noticed. Not so with CloudFlare. Huge chunks of the internet stopped working around San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre. That’s a pretty big list, and it represents only a small portion of CloudFlare’s global coverage. It also represents hundreds of thousands of apps, services, and websites suddenly disappearing from the internet.

Outages like this are made worse because, in order to use CloudFlare services, customers must hand over control of all their Domain Name System (DNS) records to CloudFlare. This means that a problem that would normally only affect a customer’s website behind the WAF now extends to every other internet service that customer uses, such as email. The reasons for insisting on controlling a customer’s DNS are technical, and there is disagreement on whether WAF providers should operate that way, which may be the topic of another post.

Because of CloudFlare’s huge customer base, the entire world felt this outage. I’ll spare you the gory details and just tell you the cause of this outage: Cloudflare’s network engineering team updated the configuration on a router in Atlanta to alleviate congestion. And whoever did the update fat-fingered a command and brought down a large chunk of the internet for half an hour.

This is frustrating for me and pretty much everyone else who works on internet infrastructure and networking. Not because someone made a mistake - we’ve all done that. You can’t build a career on typing billions of lines of configuration and code without making a few typos. As they say, shit happens. It’s frustrating because the internet was designed specifically to withstand the kind of damage inflicted by war. It was designed to continue functioning even if critical internet convergence sites literally blow up and are reduced to smoking piles of rubble. It is designed to notice when a route is no longer functioning and to find another one. It is an incredibly resilient infrastructure, and it’s mind-blowing that we have gone out of our way to damage that resiliency by allowing individual companies and people to take control of the internet by building massive silos and sucking us all into them.
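To make the "find another route" idea concrete, here is a minimal sketch in Python. It is a toy illustration and not how real routing protocols work: the two networks (`MESH` and `HUB`), their node names, and the `find_route` helper are all made up for this example. It uses a plain breadth-first search to find a path that avoids failed nodes.

```python
from collections import deque


def find_route(links, src, dst, failed=frozenset()):
    """Breadth-first search for a path from src to dst that avoids failed nodes.

    Returns the path as a list of node names, or None if no route exists.
    """
    if src in failed or dst in failed:
        return None
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for nxt in links.get(node, ()):
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical decentralized network: two independent ways from A to D.
MESH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

# Hypothetical siloed network: everything goes through one provider.
HUB = {
    "A": ["HUB"],
    "D": ["HUB"],
    "HUB": ["A", "D"],
}

print(find_route(MESH, "A", "D"))                  # normal route
print(find_route(MESH, "A", "D", failed={"B"}))    # reroutes around B
print(find_route(HUB, "A", "D", failed={"HUB"}))   # no route left
```

In the mesh, losing one node just means taking the other path. In the hub layout, losing the hub leaves nothing to reroute through, which is the situation this post is complaining about.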

The internet was designed to handle the loss of critical convergence sites due to acts of war. How did we get from that to “one guy made a typo and took the internet down”?

The CloudFlare outage isn’t an isolated event.

2016 - DynDNS outage: Dyn provides DNS services for a very large number of companies. DNS is like the internet phone book. If your customers don’t have a phone book, there’s no way for them to look up your number and call you, even if you have a perfectly working phone. In this case, the companies’ websites were still up, but with Dyn down, there was no way for people to reach them. Sites like Amazon, Netflix, and Spotify were unreachable for most of the day, as were several government and journalism sites. Wikipedia’s list of affected services is likely incomplete, but we can be happy there were no critical health services impacted.

2017 - Amazon Web Services (AWS): AWS is the real name for “the cloud”. In reality, there are other companies offering cloud services, but Amazon is the granddaddy of them all and still eclipses all other contenders. Pretty much every internet service you’ve ever heard of uses AWS in some fashion. Some services host the entire thing there; others just use bits and pieces, like AWS’s storage service, S3. In 2017, a single engineer made a typo and brought down thousands of websites, apps, and other services hosted with AWS.

So far in 2020, Microsoft is leading the pack in major outages. Good ol’ MS can’t seem to keep its stuff running, which affects millions of users worldwide every time it has an issue with one of its products.

The internet has wholly departed from its early days of resiliency, and we are now living in silos where a single company or, indeed, a single person can cause major damage to the internet by simply typing a word wrong. This situation can be traced directly back to the root of all evil: money. There are huge profits to be made in internet services.

AWS is a great example. If I want to offer an internet service these days, it needs to be distributed globally and it has to have high availability so that it doesn’t go down. My options are to spend literally millions of dollars renting space in data centers around the world and filling it with my best guess at the hardware I’ll need, or I can just sign up for a free AWS account and start spinning up cloud servers in a pay-as-you-go model for a few bucks a month.

Likewise with CloudFlare. I need to protect my new web app, so I need some kind of security to ensure bad guys don’t upload malware to my service or launch a Distributed Denial of Service (DDoS) attack to take me down. Again, my options are to source, build, and maintain that myself, or just get a free(ish) CloudFlare account to do it for me. It’s a no-brainer.

I fully understand why companies and individuals use these services. But it’s not healthy. It’s not healthy because most people really have no clue what the internet actually does. Most people think “the internet” is Facebook and websites. The truth is, the things you’re probably thinking of right now as “the internet” are just the consumer-facing internet, and they represent only a fraction of the services that run over the internet.

The internet is a highway, and like any highway, it carries pleasure vehicles (that’s you on your Facebook app) and work vehicles (that’s almost everything else). The work vehicles are carrying parts, food, medical supplies, and a whole bunch of stuff that just magically shows up on the shelves in our stores. The internet is the same way. The worker traffic on the internet delivers our utilities to our houses and our paycheques to our bank accounts. When we call the cops or a tow truck, they are brought to us via internet communications and positioning services. When we phone someone, the call is routed over the internet. When we watch TV (even *regular* TV), it streams into our TV sets from the internet. The list is endless. The internet underpins every single thing we do in our lives. Even when we think we’re disconnected, we’re disconnecting using something that relied on the internet to be there for us to get all disconnecty on.

To restore the inherent resiliency of the internet, we need to stop purposely defeating it. If you’re a business owner with internet assets such as a website and email, you can help yourself by declining to use big silo services such as Google Accounts for your email and large internet infrastructure providers like CloudFlare for your DNS. I don’t wish to financially hurt these companies because some, like CloudFlare, are actually pretty good internet citizens and contribute lots of knowledge and code back to the community. But the “oops, I made a typo and now the internet is broken” routine has to stop, and the only way to do that is to go back to basics. That means allowing multiple, smaller services to handle our internet assets, which in turn provides the decentralization that underpins the resiliency of the internet as it was designed.
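Here is a rough sketch, in Python, of why spreading things across more than one service helps. It is a toy model and not a real DNS client: the provider names, the `resolve` helper, and the records (which use reserved documentation addresses) are all assumptions made for illustration. The loose real-world analogue is publishing name servers from more than one independent provider, so that a resolver can try another when one fails.

```python
def resolve(name, providers, down=frozenset()):
    """Return the answer from the first provider that is up and knows the name.

    providers maps a provider name to its table of records.
    down is the set of providers that are currently unreachable.
    Returns None if nobody can answer.
    """
    for provider, records in providers.items():
        if provider in down:
            continue
        if name in records:
            return records[name]
    return None


# Hypothetical records for one small business: a website and a mail server.
RECORDS = {
    "www.example.com": "192.0.2.10",
    "mail.example.com": "192.0.2.25",
}

# Everything in one silo versus the same records held by two providers.
SINGLE = {"provider-a": RECORDS}
DUAL = {"provider-a": RECORDS, "provider-b": RECORDS}

outage = {"provider-a"}
print(resolve("www.example.com", SINGLE, down=outage))   # website unreachable
print(resolve("mail.example.com", SINGLE, down=outage))  # email goes with it
print(resolve("www.example.com", DUAL, down=outage))     # still answers
```

With a single provider, one outage takes the website and the email down together, even though both servers are running fine. With two, the same outage goes unnoticed by customers.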

As a final thought regarding the danger of these internet silos, consider the ever-increasing frequency of DDoS attacks. Before we converged the internet into a fragile icicle, it would have been very hard for an attacker to, say, take down a government communication system, because they’d have to attack that system directly. These days, an attacker just has to attack a CloudFlare, or a Dyn, or any support service a government uses and poof - the whole internet blows up. In the aftermath, we don’t even know who the actual target of the attack was, because it was so much easier for the attacker to render the entire internet useless by attacking a silo provider than it was to attack the actual assets they were after. This internet centralization issue isn’t just a philosophical problem; it is a very real security problem as well.