Facebook global outage

zirias@

Developer
Since this was discussed in at least one other thread, where it was off-topic, I thought I'd just start a new thread about Facebook's clusterfuck of 2021-10-04 :cool:

I have a technical question about it. Word is now that they messed up BGP, which sure explains the scope of the problem, but: when FB was down, I had a quick look at DNS using drill -T. I found that all the domains they use have four nameservers. The interesting thing: trying to resolve the addresses of these nameservers failed as well. Therefore:
  • Shouldn't resolving the nameservers still work because of glue records in the TLD's zones? (see the query sketched below)
  • Shouldn't you put at least one nameserver in an AS you don't administrate yourself?
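
To illustrate the first question: a referral from the parent zone should still carry the glue, something like this sketch (any of the .com gTLD servers should do):

Code:
# Ask a .com gTLD server directly for the delegation of facebook.com;
# the glue addresses for *.ns.facebook.com come back in the ADDITIONAL
# section of the referral, even while Facebook's own servers are unreachable.
drill facebook.com @a.gtld-servers.net NS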
 
As far as I understood it, the BGP mess they made caused their DNS servers to drop off the internet. So even if those glue records were fine, there was no way to contact those DNS servers, and thus you would not be able to get any authoritative answers. Various DNS servers may have had some data cached for some period, but those caches are going to expire at some point. And I'm pretty sure the root DNS servers aren't caching their internal DNS structures.
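
You could actually watch that happen on a public recursive resolver: repeat the same query and the TTL in the answer counts down until the cached record expires. A sketch, using Cloudflare's 1.1.1.1 purely as an example:

Code:
# Query a public recursive resolver; while it still has facebook.com cached
# it answers from its cache, and the TTL shown in the ANSWER section counts
# down between repeats until the cached record finally expires.
drill facebook.com @1.1.1.1 A
sleep 30
drill facebook.com @1.1.1.1 A   # same record, roughly 30 seconds less TTL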

Apparently they used those DNS servers for their internal authentication and authorization too, because I heard you couldn't even enter the buildings anymore as the access gates no longer accepted employee passes. If you can't authenticate your own employees (even the ones trying to fix the mess), it would be virtually impossible to correct the mistakes too. Normally you would have some "master" password or key stored somewhere safe, to be able to access a local backup account. But perhaps they opted not to have that (if an attacker could get hold of that master key or account/password, they could potentially log in to everything). Or maybe they did have something in place, but it just took a long time to get hold of those stored keys/passwords (things can get a bit messy if you can't even enter the building or the safe to access them; a bit of a chicken-and-egg situation).

Shouldn't you put at least one nameserver in an AS you don't administrate yourself?
Sure, but that also means you have to trust the third party where it's hosted. That same nameserver could also be abused to gain access, so maybe they opted not to take that risk and to keep everything on their own premises.

Whatever actually caused this mess, you can be sure they will be having quite a few really long meetings trying to figure out a way to prevent this from ever happening again. And those aren't going to be fun meetings.
 
So even if those glue records were fine, there was no way to contact those DNS servers, and thus you would not be able to get any authoritative answers.
Okay, so … trying to resolve the nameserver itself with drill(1) won't give me any result, even if the glue records are fine, as long as it can't be verified by also asking an authoritative nameserver? 🤔

BTW, I already read about the rest of the story, but thanks for posting the "executive summary" again. It's just kind of hilarious 😅

Sure, but that also means you have to trust the third party where it's hosted. That same nameserver could also be abused to gain access, so maybe they opted not to take that risk and to keep everything on their own premises.
Hm yes, that might make sense. But kind of contradicts the redundancy of DNS…
 
Okay, so … trying to resolve the nameserver itself with drill(1) won't give me any result, even if the glue records are fine, as long as it can't be verified by also asking an authoritative nameserver?
The root DNS servers just have an entry that says "the facebook.com domain is that way". If that way leads to a dead end, you can't resolve anything from the facebook.com domain. I'm pretty sure they have a couple more domains, but the same applies to those too if you just get directed to what's essentially a blackhole.
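
You can follow that chain by hand, which is essentially what drill -T automates. A sketch:

Code:
# Walk the delegation manually, the same way drill -T does:
drill facebook.com @k.root-servers.net NS     # the root refers you to the .com servers
drill facebook.com @a.gtld-servers.net NS     # .com refers you to *.ns.facebook.com (with glue)
drill facebook.com @a.ns.facebook.com A       # final hop; during the outage this was a dead end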

Hm yes, that might make sense. But kind of contradicts the redundancy of DNS…
I'm sure they've spread their DNS servers across different ASes in different datacenters for redundancy. All of those ASes are just owned by Facebook, and that BGP mess took everything out in one go.
 
Yep, but that wasn't what I asked. I had a look yesterday: all the facebook-related domains have 4 nameservers in subdomains, e.g. a.ns.whatsapp.net etc. for whatsapp.com. So I'd expect an A (and probably AAAA) record to be present in the net. zone. Of course that's not authoritative. My question was: does drill -T just return nothing if it does find the glue record, but can't ask one of the authoritative nameservers? I would have somehow expected it to show the non-authoritative answer before showing the final error…
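
(What I mean by the glue in the net. zone, roughly; a sketch against one of the .net gTLD servers:)

Code:
# Ask a .net gTLD server for a.ns.whatsapp.net; since whatsapp.net is delegated,
# the reply is a referral whose ADDITIONAL section carries the glue A/AAAA
# records for the whatsapp nameservers.
drill a.ns.whatsapp.net @a.gtld-servers.net A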
I'm sure they've spread their DNS servers across different ASes in different datacenters for redundancy. All of those ASes are just owned by Facebook, and that BGP mess took everything out in one go.
That, of course, makes sense as well. Kind of a bummer that a single error (whatever it was) takes everything down 🥳
 
Apparently they used those DNS servers for their internal authentication and authorization too, because I heard you couldn't even enter the buildings anymore as the access gates no longer accepted employee passes.
I heard the same story through the Silicon Valley rumor mill. The version I heard is that even the entrance door security system at Facebook's data centers uses Facebook's internal network and DNS. So when all of their networks went down, it became impossible to even get into the building that holds the servers which needed to be restarted or reconfigured. According to the rumors, the problem was eventually solved by using a sledge hammer on a door, and bringing someone and something inside.

The other part of the rumor is that the fix was done by starting with the data center nearest to Facebook's main engineering facilities (which are in Menlo Park); I've heard both Redwood City and Santa Clara mentioned. From there, it became possible to restart networking infrastructure at other sites remotely. That probably used dedicated network links; all the big hyper-scalers have their own fiber networks which they control end to end (not rented bandwidth, but unshared dark fiber). The funny thing about this is that I didn't know that any of the big hyper-scalers had data centers in Silicon Valley. With the insanely high cost of real estate and electricity here, there are few data centers in the immediate vicinity, and the few that exist are run by wholesale colocation operators (like Equinix), usually serving smaller customers. Companies large enough to build dedicated data centers (and Facebook is definitely in that class) typically place them where real estate, electricity and cooling are cheaper, but not so remote that labor is unavailable.

Or maybe they did have something in place, but it just took a long time to get hold of those stored keys/passwords (things can get a bit messy if you can't even enter the building or the safe to access them; a bit of a chicken-and-egg situation).
I've heard stories that some of the most fundamental security keys (like the ultimate root password to all of Amazon AWS, just as a hypothetical example) are stored in a physical safe (a big steel box with thick walls) near the CEO's office, using a standalone security device. That safe uses traditional mechanical locks (the kind with a dial). I've also heard stories that some of those security devices rely on being unlocked by a pass phrase which is memorized by a small number of humans, but not recorded otherwise (not on a piece of hardware). Part of the long delay in getting Facebook back online might have been caused by the need for one of those humans to be brought to the correct location. If someone has some spare time, they could track what flights Facebook's corporate aircraft took yesterday; it might give us a clue.

Whatever actually caused this mess, you can be sure they will be having quite a few really long meetings trying to figure out a way to prevent this from ever happening again. And those aren't going to be fun meetings.
And everyone else in the industry will also have long meetings, to make sure that "this can't happen to us". Those meetings won't be quite as painful, but by no means amusing.
 
As I observed to some colleagues earlier today, this will be a case study in IT risk management for the ages! For many, there will be some urgency to that.

The solution goes all the way back to the ancient Chinese warlords who knew all about "defense in depth".

But first, you need to understand what needs defending. BGP has been a potential single point of failure for a lot of Autonomous Systems for a long time...

The best analysis I found was by Celso Martinho and Tom Strickx at Cloudflare.
 
I'm sure they've spread their DNS servers across different ASes in different datacenters for redundancy. All of those ASes are just owned by Facebook, and that BGP mess took everything out in one go.
Nope, they didn't. All their authoritative NS are in the same AS (32934), which didn't just disappear from the map due to lost BGP peerings; they withdrew the routes to all (or at least most) prefixes in that AS even before the peerings went down.
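
You can check the origin AS quickly with Team Cymru's whois service. A sketch, using the glue address I had noted for a.ns.facebook.com (from memory, so double-check it); the other three map to the same AS:

Code:
# Map an address to the AS that originates it via Team Cymru's whois;
# doing this for all four glue addresses should return AS32934 (FACEBOOK).
NS_ADDR=129.134.30.12    # a.ns.facebook.com's glue address at the time (from memory)
whois -h whois.cymru.com " -v ${NS_ADDR}"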

They pretty impressively demonstrated how a 'single point of failure' works on a large scale, plus some other seemingly bad designs in their infrastructure, like binding each and every system (access control, authentication, ...) to the same infrastructure without any fallbacks...
This incident is yet another great example to bring up if some PHB wants to hardwire everything "to the cloud".


edit: I found this screenshot from reddit that I took while this all unfolded, where someone from FB was posting some information shortly before his account was deleted and the messages disappeared:
[attached screenshot: 1633505030477.png]

(archived: https://archive.is/QvdmH )

The sentence of the year for me is "There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do". That's textbook clown-college there 😂
 
The sentence of the year for me is "There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do". That's textbook clown-college there
Not that uncommon though. We have people working in our datacenters; all they can do is keep the hardware itself in order and replace the things we tell them to replace. The people working in the datacenters can't log in to anything at all. I have access to the systems via various accounts (even access to the root accounts), but I'm not allowed to go into the datacenters and actually touch the hardware.
 
This is a good case study of why infrastructure needs to be cold startable.
 
This is a good case study of why infrastructure needs to be cold startable.
I have been involved in two cold starts of large data centres in my career. Neither was planned.
The problem is that such plans are very hard to test unless you have fully duplicated infrastructure.

Much later, in a grand push to virtualise everything, I do remember some clod asserting that "Microsoft says that there's nothing you can't virtualise".

I politely enquired if he saw any issues booting the SAN and server infrastructure without functioning name and time servers.
 
I didn't even notice the outage.

What do you guys think about "A Bad Solar Storm Could Cause an 'Internet Apocalypse'"? A real threat or just FUD?
Not FUD. Search for "Carrington Event".
Today, we would be so deep in the sticky stuff that we would need an industrial-grade depth gauge just to know which way is up. There is evidence that solar flares some orders of magnitude larger have hit Earth before, as they even left C-14 from their interaction with the atmosphere. And I recently read about our power grid not being cold startable, so you may need to get a diesel generator to the transformer stations to jump-start them. And maybe you need to acquire the transportation device from some Amish...
 
This is a good case study of why infrastructure needs to be cold startable.
The problem with that is that, even on a smaller scale, a lot of software higher up in the stack - and especially proprietary appliances and software - fails in the most stupid and/or spectacular ways if it doesn't find its surroundings *exactly* as it expects them to be. Therefore you will be hunting weird race conditions if some service or part of the network isn't up yet or routes are not fully established. The higher in the stack you get, the deeper this rabbit hole gets.
Just take the millions of dumb phone apps that (aggressively) tried to reach the FB servers and significantly increased the load on the root DNS servers by doing so.

You might be able to reboot your whole network and bring up L2 and even L3 without any problems, but some fallout higher up in the stack is almost inevitable, and it might even cause chain reactions or overload parts of your network.
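
The boring workaround at the application layer is to wait and retry instead of assuming the surroundings are already there. A crude sketch (the service and hostname are made up):

Code:
#!/bin/sh
# Crude "don't start until DNS works" guard; 'db.internal.example' and
# 'myapp' are hypothetical. Many appliances just assume this and fall over.
until drill db.internal.example A | grep -q "rcode: NOERROR"; do
    echo "waiting for DNS..." >&2
    sleep 5
done
service myapp onestart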
 
sko I've seen this problem on a ridiculously small scale in a student-operated "cafe" on the campus of my old university. They had mainly two Linux servers (on horribly old consumer hardware) and neither of them booted correctly when the other one was down. Lots of manual work to get that crap "up" again 🤪
 
sko I've seen this problem on a ridiculously small scale in a student-operated "cafe" on the campus of my old university. They had mainly two Linux servers (on horribly old consumer hardware) and neither of them booted correctly when the other one was down. Lots of manual work to get that crap "up" again 🤪

When rebooting a bunch of switches with STP-managed redundant links you can watch the network go down every time another switch comes up and, if it has a higher priority, takes over as STP root, briefly interrupting a bunch of those redundant links for a few seconds and often making other switches and everything above L2 freak out... It gets even funnier with stacked switches that take considerably different times to boot up, so sometimes the whole stack needs to reboot to elect a new master...
This alone can easily add up to an hour or more until even the L2 can be considered 'up' and stable. Now add a fleet of servers that barf their souls all over the network every time their links come up again during that phase...

Regarding dependent servers that cause a deadlock: you don't even need 2 separate machines; just put your DHCP server on a VM and let the hypervisor get its IP via DHCP...
 
Should've built the whole thing Peer to Peer. :)
I don't use Facebook myself but I feel your pain. Even at large companies it happens that someone flips the wrong switch occasionally. Especially at very centralized ones, it might have a global impact.

For me the amazing thing is that this happens so rarely. I would expect it once or twice a year, because people make mistakes. It's a law of nature.
 
When rebooting a bunch of switches with STP-managed redundant links you can watch the network go down every time another switch comes up and, if it has a higher priority, takes over as STP root, briefly interrupting a bunch of those redundant links for a few seconds and often making other switches and everything above L2 freak out... It gets even funnier with stacked switches that take considerably different times to boot up, so sometimes the whole stack needs to reboot to elect a new master...
Only if you haven't configured STP correctly. You specifically set your core switches to be the root bridge. You don't let it figure this out by itself, or you will definitely end up in a situation like the one you're describing. Without properly configured STP, even connecting a blank-configured switch anywhere on your network could result in the whole network going down for several minutes.
 
Only if you haven't configured STP correctly. You specifically set your core switches to be the root bridge. You don't let it figure this out by itself, or you will definitely end up in a situation like the one you're describing. Without properly configured STP, even connecting a blank-configured switch anywhere on your network could result in the whole network going down for several minutes.
Bridge IDs are configured manually everywhere so there is one distinct root bridge, but if we rebooted the root bridge AND the secondary at the same time, the remaining switches would - as expected/intended - still hand off the role of the root bridge according to their bridge IDs (which are never configured equally on different switches, of course). But I once had to reboot all 3 switches/stacks in our main building and chose to just issue the 'reload' on all of them at the same time. One of the remaining 2 at that site took over as root, and thanks to Murphy the 3 switches came up in the opposite order of their bridge IDs - so each time the next one finally came up, it took over as root, triggering an STP reconvergence. Once they were all up and the root bridge with the lowest ID had taken over, everything was running fine - but until then the network was completely useless and I just sat there for ~25 minutes watching the dust settle. This was during my weekly Monday evening maintenance window, so no screaming was involved...
We have since replaced this configuration with just one stack, so this is no longer an issue.
 
If someone has some spare time, they could track what flights Facebook's corporate aircraft took yesterday; it might give us a clue.
That takes knowing the FAA-registered tail numbers of those aircraft. They're not gonna wear a bright honu livery like ANA :p 😤
[attached photo: 1633529980804.jpeg]
 
But I once had to reboot all 3 switches/stacks in our main building and chose to just issue the 'reload' on all of them at the same time.
I'm betting that was a big "Oops" moment. You typically realize this the second you let go of the enter key.

Once they are all up and the root bridge with the lowest ID has taken over, everything was running fine - but until then the network was completely useless and I just sat there for ~25 minutes watching the dust settle.
STP is nice but it does have quite a few drawbacks. Once it starts recalculating there's really nothing else you can do but lean back and watch it happen.
 