Facebook was down for a couple of hours yesterday, the second day in a row they’ve had problems. In “More Details on Today’s Outage,” Director of Software Engineering Robert Johnson explains:
The key flaw that caused this outage to be so severe was an unfortunate handling of an error condition. An automated system for verifying configuration values ended up causing much more damage than it fixed.
Don’t you just hate it when that happens?
It’s a very clear post, detailing how things went downhill:
We had entered a feedback loop that didn’t allow the databases to recover.
The way to stop the feedback cycle was quite painful – we had to stop all traffic to this database cluster, which meant turning off the site. Once the databases had recovered and the root cause had been fixed, we slowly allowed more people back onto the site.
Reliability issues … emergence of open source competitors … business practices that push the envelope … arrogance … hmm …
Reminds me a lot of Microsoft in the 1990s.
jon
M. Edward (Ed) Boras | 26-Sep-10 at 9:15 pm
When a fail-safe system fails, it fails by failing to fail safe. 😉