Facebook’s outage
As most people will know, millions of users of Facebook, Instagram, and WhatsApp were left without service for almost six hours on 4 October 2021.
Facebook's engineers have published a good write-up of the events that led up to the global outage, but the contents may make heavy reading for those whose networking knowledge is limited.
So, for an explanation in layman’s terms, read on…
First of all, let's define a few networking terms:
- DNS – Domain Name System – this is the system which turns a web address into an IP address. Think of it like a telephone directory: you use the directory to look up the name of a shop or restaurant and retrieve the phone number for that business. DNS looks up the web address and retrieves the IP address for the service.
- Autonomous system (AS) – this is a term used to describe a collection of IP addresses belonging to a network or a collection of networks that are all managed, controlled and supervised by a single entity or organisation.
- CDN – Content Delivery Network – a group of geographically distributed servers that speed up the delivery of web content by bringing it closer to where users are.
- Routing protocol – the language routers use to exchange data about pathways to other networks. The protocol enables routers to “learn” about available routes between networks. Different routing protocols are used inside networks and between them.
- BGP – Border Gateway Protocol – this is the routing protocol used by Internet routers to exchange data about available routes.
- ISP – Internet Service Provider – The company which provides customers with access to the Internet
- NSP – Network Service Provider – A company which provides high-speed network connections to other networks
When you want to access a website such as Facebook, you enter the web address in your browser and a series of DNS searches is carried out to obtain the IP address of the Facebook server nearest to your location.
These searches take place on one of many DNS servers scattered all across the Internet. Once the IP address is located, it is passed back to your browser, which can now send its request directly to that machine.
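The telephone-directory analogy can be sketched in a few lines of Python. In real code you would ask a DNS server (e.g. via `socket.getaddrinfo`), but a toy in-memory directory illustrates the lookup step; the names and addresses below are made up for illustration:

```python
# A toy DNS "directory": hostname -> IP address.
# These entries are illustrative, not real Facebook addresses.
DIRECTORY = {
    "www.facebook.com": "157.240.0.35",
    "www.example.com": "93.184.216.34",
}

def resolve(name: str) -> str:
    """Look a hostname up in the directory, like a DNS query."""
    try:
        return DIRECTORY[name]
    except KeyError:
        # DNS calls this an NXDOMAIN response: "no such name".
        raise LookupError(f"NXDOMAIN: {name}")

print(resolve("www.facebook.com"))  # → 157.240.0.35
```

A name that is not in the directory raises an error, which mirrors what happens when a DNS server has nothing to return for a query.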
Once your browser has the IP address it needs, it sends its data to your router, which examines the destination IP address to work out how to get the data to the right place. Now, your home router only connects to one other network – that of your ISP – so the data is sent to your closest ISP edge router. This router will have multiple routes through the ISP network, so it needs to know which is the best one to take to get the data closer to its destination. This is where the routing protocols play their part.
The ISP's internal routing protocols are used to work out the fastest, most efficient way through the ISP's network to another edge router, which can forward the data to the next network. Knowledge of which networks can move your data closer and closer to Facebook is obtained via BGP updates.
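The route-selection step described above can be sketched with Python's standard `ipaddress` module. Routers pick the most specific matching prefix for a destination ("longest-prefix match"); the routing table and peer names here are invented for illustration:

```python
import ipaddress

# A simplified routing table: prefix -> next hop. All entries are illustrative.
ROUTES = {
    "0.0.0.0/0": "isp-default-gateway",     # default route: matches anything
    "157.240.0.0/16": "peer-facebook",
    "157.240.32.0/20": "peer-facebook-eu",  # a more specific sub-prefix
}

def next_hop(dest: str) -> str:
    """Pick the most specific (longest) matching prefix, as routers do."""
    addr = ipaddress.ip_address(dest)
    matches = [
        (net, hop)
        for net, hop in ((ipaddress.ip_network(p), h) for p, h in ROUTES.items())
        if addr in net
    ]
    # Longest prefix (largest prefixlen) wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(next_hop("157.240.35.1"))  # inside the /20, so the /20 route wins
print(next_hop("8.8.8.8"))       # only the default route matches
```

BGP is the mechanism by which entries like these appear in (and disappear from) the tables of routers across the Internet.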
Eventually, through this interconnection of networks, your data gets to its destination. All the above takes a fraction of a second to perform, and so within a really short time you start to see the webpage you were after.
Now, Facebook, like many large networks, doesn't have just one webserver; it has thousands scattered all over the world inside its CDN. In order for the networks to know which server you need access to, Facebook runs its own internal DNS, which lets the rest of the Facebook network identify which server a particular data item is currently accessible from. These DNS servers pass the IP data to the routers, which in turn advertise these routes via BGP to all the other routers across the globe.
In the recent Facebook outage, a configuration change on a system which was, ironically, being used to test the resilience of the Facebook network caused the backbone network connecting all the CDN servers to go offline. This in turn caused the DNS servers to stop advertising those CDN servers to the BGP routers. The BGP routers did their job of informing the rest of the Internet that those routes were no longer available – thus telling the entire Internet that Facebook was no more.
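The effect of a BGP withdrawal can be sketched as removing a prefix from a router's table: once the entry is gone, there is simply no path left for that destination. The prefixes and path names below are illustrative only:

```python
import ipaddress

# An illustrative routing table held by some Internet router.
routes = {
    ipaddress.ip_network("157.240.0.0/16"): "path-to-facebook",
    ipaddress.ip_network("93.184.216.0/24"): "path-to-example",
}

def lookup(dest: str) -> str:
    """Return the path for a destination, or fail if no route exists."""
    addr = ipaddress.ip_address(dest)
    for net, path in routes.items():
        if addr in net:
            return path
    raise LookupError("no route to host")

print(lookup("157.240.0.35"))  # reachable while the route is advertised

# A BGP withdrawal removes the announcement from the table...
del routes[ipaddress.ip_network("157.240.0.0/16")]

# ...and now the same lookup fails: the destination has vanished
# from this router's view of the Internet.
try:
    lookup("157.240.0.35")
except LookupError as e:
    print(e)
```

This is roughly what every BGP router on the Internet did in quick succession: the withdrawn routes were deleted, and traffic for Facebook's addresses had nowhere to go.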
Now, to compound matters, many of Facebook's own internal systems relied on those very same DNS servers to communicate with each other. Normally this would have allowed the problem to be fixed quickly, but because the DNS servers were offline, those systems couldn't be used. This apparently also included the physical security systems designed to allow or prevent access to the datacentres where the problems were happening, meaning engineers couldn't gain access to the servers to bring them back online.
Whilst all this was happening, the Internet was seeing a huge number of requests from automated systems and from users trying to connect to Facebook, refreshing and retrying over and over with no luck.
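Part of the reason the retry traffic piles up is that naive clients retry failures immediately. Well-behaved clients spread their retries out using exponential backoff with jitter; this is a general technique, not something specific to Facebook's clients, and the parameters here are arbitrary:

```python
import random

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0):
    """Exponential backoff with 'full jitter': the maximum delay doubles
    after each failed attempt (up to a cap), and the actual delay is a
    random value below that maximum, so retries spread out in time
    instead of arriving in synchronised waves."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(random.uniform(0, ceiling))
    return delays

# Six successive retry delays for one hypothetical client.
print(backoff_delays(6))
```

Without such backoff, every failed request is retried almost instantly, which is one reason outages of big services produce visible traffic spikes elsewhere on the Internet.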
Another report, from Cloudflare, details what their systems were seeing in terms of bandwidth usage as users tried in vain to access the Facebook network.
The video below is a recording from a utility called BGPlay, available on the RIPE website, which visualises BGP update announcements. This video captures the moment the Facebook network went down and the subsequent knock-on effects across the Internet. Facebook's AS is the one in the middle of the recording, highlighted in red.
The recording starts at 15:00 UTC, when all is normal with the world. The action begins at approx. 15:42 UTC, when you will see the networks connecting to Facebook suddenly start redirecting traffic as one route after another fails.
The recording ends at approx. 16:08 UTC, by which point Facebook is inaccessible to the entire Internet.
It just goes to show that the systems we rely on for everyday communications can sometimes be quite fragile.