On Tuesday evening, Facebook users around the world were unexpectedly logged out of their accounts and unable to get back in. When they tried to sign in again with their usual passwords, they were rejected, with no recovery option offered as there normally is on most social media platforms.
This heightened worry among Facebook users, many of whom suspected that their accounts had been hacked or otherwise compromised. Some, growing impatient, tried to open new accounts, but those attempts failed as well.
“This is the worst outage we’ve had in over four years, and we wanted to first of all apologize for it. We also wanted to provide much more technical detail on what happened and share one big lesson learned,” reads part of the official statement from Meta Engineering released on Tuesday, 17th December 2024.
The engineers revealed that the key flaw behind the outage was the unfortunate handling of an error condition: an automated system for verifying configuration values ended up causing far more damage than it fixed.
“The intent of the automated system is to check for configuration values that are invalid in the cache and replace them with updated values from the persistent store. This works well for a transient problem with the cache, but it doesn’t work when the persistent store is invalid.”
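In rough terms, the system described works like a cache-refresh check. The sketch below is purely illustrative, using hypothetical `cache`, `persistent_store` and `is_valid` helpers that Meta has not published; it only shows the shape of the logic: if a cached value looks invalid, it is overwritten with the copy from the persistent store.

```python
# Illustrative sketch only; object names and the validity check are
# assumptions, not Meta's actual code.

def is_valid(value):
    # Hypothetical validity check; the real criteria were not disclosed.
    return value is not None and value != ""

def check_config(key, cache, persistent_store):
    cached = cache.get(key)
    if is_valid(cached):
        return cached
    # Cached value looks bad: replace it with the persistent copy.
    # This works for a transient cache problem, but it silently assumes
    # the persistent store itself always holds a valid value.
    fresh = persistent_store.get(key)
    cache.set(key, fresh)
    return fresh
```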
According to the statement, on Tuesday they made a change to the persistent copy of a configuration value that was interpreted as invalid, which meant that every single client saw the invalid value and attempted to fix it.
“Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second. To make matters worse, every time a client got an error attempting to query one of the databases it interpreted it as an invalid value, and deleted the corresponding cache key.”
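The quoted failure mode can be illustrated with a short sketch. Again, this is an assumption-laden illustration rather than Facebook's actual code: the point is that a database error is handled the same way as an invalid value, so the cache key is deleted and the next read sends yet another query to the already overloaded cluster.

```python
# Illustrative sketch of the described bug; names are hypothetical.

class DatabaseError(Exception):
    """Stand-in for whatever error the database client raises under load."""

def fix_config(key, cache, db_cluster):
    try:
        fresh = db_cluster.query(key)  # hundreds of thousands of clients doing this per second
    except DatabaseError:
        # The bug described above: an error response is misread as "the
        # value is invalid", so the cache entry is deleted instead of the
        # query being retried later. The deletion guarantees another cache
        # miss, which triggers yet another database query.
        cache.delete(key)
        raise
    cache.set(key, fresh)
    return fresh
```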
This meant that even after the original problem had been fixed, the stream of queries continued. As long as the databases failed to service some of the requests, they caused even more requests to themselves, creating a feedback loop that didn’t allow the databases to recover.
“The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.”
The engineers noted that although the site has been brought back and is functioning properly, the automated system that attempts to correct configuration values has been turned off for now.
“We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.”
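One common pattern for handling feedback loops of this kind, which the statement hints at without giving details, is to treat a database error as a transient failure rather than as proof that the cached value is wrong, and to back off instead of retrying immediately. The sketch below shows that idea using the same hypothetical names as above; it is not Meta's actual redesign.

```python
import random
import time

class DatabaseError(Exception):
    """Stand-in for whatever error the database client raises under load."""

def refresh_with_backoff(key, cache, db_cluster, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            fresh = db_cluster.query(key)
            cache.set(key, fresh)
            return fresh
        except DatabaseError:
            # Keep the last-known-good cached value instead of deleting it,
            # and wait with capped, jittered exponential backoff before retrying.
            time.sleep(min(30, 2 ** attempt) + random.random())
    # Give up for now and serve the possibly stale cached value rather than
    # adding more load to a struggling database cluster.
    return cache.get(key)
```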
They further apologized for the site outage and asked users to remain calm while the remaining issues are fully resolved.