Facebook's Massive Blackout: Here's What Data Centers Have Learned So Far
Facebook’s massive outage put data centers on front pages across the globe this week — and may pose some uncomfortable questions for operators moving forward.
When Facebook, Instagram, Facebook Messenger and WhatsApp went down for more than six hours on Monday, the outage created a cascading series of consequences around the world as the loss of messaging services interrupted communications, businesses relying on Facebook’s platforms saw sales grind to a halt, and companies and individuals lost access to devices and services logged in through Facebook.
The tech giant itself also lost control over many of its internal systems and infrastructure during the blackout, which reportedly cost the company more than $100M in lost revenue.
The cause of the outage — the largest downdetector.com says it has ever tracked — was a faulty configuration update involving the Border Gateway Protocol (BGP), the system that routes traffic between the company’s private networks and the outside internet and links the company’s data centers and digital infrastructure together.
“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication,” wrote Facebook's vice president for infrastructure, Santosh Janardhan, in a blog post.
“This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”
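The failure mode can be sketched in miniature: routers forward each packet to the most specific route that covers its destination, so when a configuration change withdraws the routes covering a network’s address space, peers simply have nowhere left to send that traffic. The toy longest-prefix-match table below is purely illustrative — the prefix, next-hop names, and two-entry table are hypothetical stand-ins, not Facebook’s actual configuration:

```python
import ipaddress

# Hypothetical routing table: prefix -> next hop. Real BGP tables
# hold hundreds of thousands of prefixes; two entries suffice here.
routes = {
    ipaddress.ip_network("157.240.0.0/16"): "edge-router-1",  # example prefix
    ipaddress.ip_network("0.0.0.0/0"): "upstream-isp",        # default route
}

def next_hop(addr, table):
    """Longest-prefix match: pick the most specific route containing addr."""
    ip = ipaddress.ip_address(addr)
    matches = [net for net in table if ip in net]
    if not matches:
        return None  # destination unreachable
    return table[max(matches, key=lambda net: net.prefixlen)]

print(next_hop("157.240.1.35", routes))  # -> edge-router-1

# A faulty update withdraws the announcement for the /16 ...
withdrawn = {net: hop for net, hop in routes.items()
             if net != ipaddress.ip_network("157.240.0.0/16")}
# ... and if no other route covers the destination — as effectively
# happened when the prefixes for Facebook's DNS servers vanished —
# the traffic has nowhere to go at all.
del withdrawn[ipaddress.ip_network("0.0.0.0/0")]
print(next_hop("157.240.1.35", withdrawn))  # -> None
```

The same mechanism explains the cascade: once the routes were gone, outside resolvers could no longer even find Facebook’s DNS servers, taking every dependent service down with them.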
The outage was caused by a network problem, rather than an issue with a data center itself, such as a power outage or a damaged fiber line. But the length of the outage — reportedly prolonged by the need to get engineers physically into a data center — may make this incident an important case study for an industry facing a labor shortage and increasingly looking toward a future with few on-site personnel.
Fixing the outage Monday required Facebook engineers to physically access a data center in Santa Clara, California, as the systems normally used for remote access failed. That proved difficult. First, the outage impacted the company’s keycard access systems. Perhaps more significantly, the engineers with the skills needed to perform the necessary reset were not anywhere near the Santa Clara data center, according to reports.
“The people with physical access [are] separate from the people with knowledge of how to actually authenticate the systems and people who know what to actually do,” a Facebook insider said, according to Interesting Engineering. “So there is now a logistical challenge with getting all that knowledge unified.”
Some sources blamed the lack of on-site staff available to perform the reset on pandemic-driven work-from-home policies, but data centers are facing a dire shortage of workers overall, particularly in markets far from major data center hubs. Increasingly, facilities are monitored from central operations centers that may be hundreds of miles away, and so-called dark data centers, which have no on-site staff at all, are growing more common.
Although industry leaders are increasingly investing in using artificial intelligence, remote monitoring and automation to try to reduce the human role in data center operations, Facebook’s outage may serve as evidence of the importance of expertise on-site or at least nearby, even in the most sophisticated data centers.
And with more than $100M in lost revenues and the company’s mistakes a top news story around the world, it also showed what's at stake.