On Monday, 4 October 2021, Facebook was hit by its largest outage since 2019. Facebook and its platforms, including WhatsApp, Instagram, and Messenger, went down for nearly six hours, impacting 3.5 billion users globally (Martin, 2021). The disruption was widespread: it affected not only external users but also Facebook’s internal communication networks, which went dark as well. The impact was felt hardest by content creators and small businesses that depend heavily on these platforms for their income (Lawler & Heath, 2021).
Facebook released a statement on Tuesday, saying that a faulty configuration change in the backend of the system was to blame (Janardhan, 2021). Normally, the system tells routers that are trying to access information where the relevant data centers are located. On Monday, however, it appeared to routers that these data centers did not exist, prompting a total blackout of Facebook’s internal and external servers.
Later that day, Cloudflare, a company that had itself recently experienced a similar outage, provided a more detailed explanation of what happened. Essentially, the internet relies on two systems: the Domain Name System (DNS) and the Border Gateway Protocol (BGP) (Taylor, 2021). DNS acts as an address book for all websites, storing the Internet Protocol (IP) address of each website. BGP, by contrast, resembles a GPS: it works out the most efficient route to that IP address. After an update from Facebook, BGP was unable to compute a route to the relevant IP addresses (Taylor, 2021). Consequently, millions of users were unable to access Facebook’s platforms.
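To make the address book and GPS analogy a bit more concrete, here is a minimal Python sketch. It is a toy model, not how DNS or BGP actually work internally: the site name, the IP address, and the function names are all made up for illustration. It only shows the idea that knowing an address is not enough if no route to it is advertised.

```python
import ipaddress

# Toy "address book" (the role DNS plays). Name and address are invented.
dns_records = {"example-social.test": "192.0.2.10"}

# Toy "route map" (the role BGP plays): prefixes that currently have a known path.
advertised_routes = {"192.0.2.0/24"}


def has_route(ip, routes):
    """Return True if any advertised prefix contains the given IP address."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(prefix) for prefix in routes)


def reach(name, records, routes):
    """Look up the address first, then check whether a path to it exists."""
    ip = records.get(name)
    if ip is None:
        return "no address found"
    if not has_route(ip, routes):
        return "address known, but no route to it"
    return f"reachable at {ip}"


before = reach("example-social.test", dns_records, advertised_routes)
advertised_routes.clear()  # simulate the routes disappearing after a bad update
after = reach("example-social.test", dns_records, advertised_routes)

print(before)
print(after)
```

In this simplified picture, the address book entry never changes. The site becomes unreachable only because the route to it is gone.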
Such malfunctions are quite common, and fixing them is normally rather straightforward. Some debugging and a reboot of the data centers are usually sufficient and can be done remotely. However, in Facebook’s case, the system used to access the data centers ran on the same network that had gone dark. The only solution was therefore for data scientists and experts to physically go to the affected data centers. Upon arrival, they encountered another issue: the access cards for these data centers also depended on the internal network, which was down (Martin, 2021). This caused a major delay, and it ultimately took the experts close to six hours to resolve the issue.
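The chain of failures described above can be sketched as a small dependency map. This is only an illustration under simplifying assumptions: the component names are hypothetical labels for the systems mentioned in the reporting, not Facebook’s actual architecture.

```python
# Hypothetical dependency map: each component lists what it needs in order to work.
depends_on = {
    "internal_network": [],
    "remote_access": ["internal_network"],
    "door_access_cards": ["internal_network"],
}


def unavailable(failed, depends_on):
    """Return every component that is down, directly or indirectly, once `failed` is down."""
    down = {failed}
    changed = True
    while changed:
        changed = False
        for component, deps in depends_on.items():
            if component not in down and any(dep in down for dep in deps):
                down.add(component)
                changed = True
    return down


down = unavailable("internal_network", depends_on)
print(sorted(down))
```

Because both recovery paths in this sketch rely on the same network, a single failure takes out the problem and the tools for fixing it at the same time.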
Facebook’s recent outage surprised many users and experts worldwide. It prompts the question: how is it possible that one of the most influential and widely used companies in the world runs its systems through the same network? Luckily, Facebook was able to react fairly quickly and resolve the issue within a few hours. It will be interesting to see what steps Facebook takes to assure users that such a widespread outage will not occur again.
Sources:
Janardhan, S. (2021). Update about the October 4th outage. Facebook Engineering. Retrieved 7 October 2021, from https://engineering.fb.com/2021/10/04/networking-traffic/outage/.
Jordan, B. (2020). Facebook [Image]. Retrieved 7 October 2021, from https://unsplash.com/photos/tWX_ho-328k.
Lawler, R., & Heath, A. (2021). Facebook is back online after a massive outage that also took down Instagram, WhatsApp, Messenger, and Oculus. The Verge. Retrieved 7 October 2021, from https://www.theverge.com/2021/10/4/22708989/instagram-facebook-outage-messenger-whatsapp-error.
Martin, A. (2021). Facebook outage: What actually caused WhatsApp and Instagram to go down? Sky News. Retrieved 7 October 2021, from https://news.sky.com/story/facebook-outage-what-actually-caused-whatsapp-and-instagram-to-go-down-12426383.
Taylor, J. (2021). Facebook outage: what went wrong and why did it take so long to fix after social platform went down? The Guardian. Retrieved 7 October 2021, from https://www.theguardian.com/technology/2021/oct/05/facebook-outage-what-went-wrong-and-why-did-it-take-so-long-to-fix.
Hi Keiko,
Super interesting article! You clearly explained how DNS and BGP are related, thank you for that! It is indeed crazy to think that a company as powerful as Facebook did not have the right processes in place to tackle such an event. As a consumer, you realise how dependent you are on these platforms in your day-to-day life. In my opinion, it is a good wake-up call that there is more to life than the online world. Moreover, it shows that even one of the most powerful companies can still overlook things and make mistakes. Another interesting side to this story is how other social media platforms, such as Twitter, suddenly got a lot more attention. This shows once more that when one party loses, another party benefits. I am curious what the long-term effect will be for Facebook and whether a similar event will happen again in the future.