Incident
|
#19 at
Reddit on
2017/08/11 by u/gooeyblob (Infrastructure leader)
|
Full report
|
https://www.reddit.com/r/announcements/comments/4y0m56/why_reddit_was_down_on_aug_11/
|
How it happened
|
Before an upgrade of a Zookeeper system the autoscaler was manually turned off by making a configuration change, since it depends on Zookeeper for server health information. During the upgrade, the package manager reverted the configuration change (since it detected it had been made manually), turing back on the autoscaler. The autoscaler (based on partial Zookeeper data) terminated many healthy servers, including caching servers.
|
Architecture
|
Application servers and caching servers managed by an autoscaler (which uses Zookeeper for server health information).
|
Technologies
|
Apache ZooKeeper
|
Root cause
|
Autoscaler (unintentionally) running during a Zookeeper upgrade.
|
Failure
|
Terminated application and caching servers.
|
Impact
|
Service unavailable for 1.5 hours followed by an additional 1.5 hours of increased response time.
|
Mitigation
|
Engineers restored the servers (ending the outage) and waited for the caches to fill (ending the performance degradation period).
|