Why Reddit was down on Aug 11

“Autoscaler read the partially migrated Zookeeper data and terminated many of our application servers, which serve our website and API, and our caching servers, in 16 seconds.”
Incident #19 at Reddit on 2016/08/11 by u/gooeyblob (Infrastructure leader)
Full report: https://www.reddit.com/r/announcements/comments/4y0m56/why_reddit_was_down_on_aug_11/
How it happened: Before a ZooKeeper upgrade, the autoscaler was manually turned off through a configuration change, since it depends on ZooKeeper for server health information. During the upgrade, the package manager detected that the change had been made manually and reverted it, turning the autoscaler back on. Working from partially migrated ZooKeeper data, the autoscaler terminated many healthy servers, including caching servers.
Architecture: Application servers and caching servers are managed by an autoscaler, which uses ZooKeeper for server health information.
Technologies: Apache ZooKeeper
Root cause: The autoscaler was unintentionally running during a ZooKeeper upgrade.
Failure: Application and caching servers were terminated.
Impact: Service unavailable for 1.5 hours, followed by an additional 1.5 hours of increased response time.
Mitigation: Engineers restored the servers (ending the outage) and waited for the caches to fill (ending the period of degraded performance).
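
The failure mode above, an autoscaler acting on incomplete health data, can be illustrated with a small guard that refuses to terminate anything when the data looks partial or when the plan would remove a large share of the fleet at once. This is a minimal sketch under stated assumptions, not Reddit's autoscaler: HealthSnapshot, plan_terminations, and the thresholds max_fraction and min_coverage are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class HealthSnapshot:
    """Hypothetical view of what an autoscaler reads from its health store."""
    expected_servers: int        # servers the fleet inventory says should exist
    reported_servers: int        # servers that currently have a health record
    unhealthy: FrozenSet[str]    # server IDs the snapshot marks for termination


def plan_terminations(snapshot: HealthSnapshot,
                      autoscaler_enabled: bool,
                      max_fraction: float = 0.1,
                      min_coverage: float = 0.9) -> FrozenSet[str]:
    """Return the server IDs considered safe to terminate in this cycle.

    The thresholds are illustrative assumptions, not values from the report.
    """
    # Guard 1: honor the kill switch on every cycle.
    if not autoscaler_enabled:
        return frozenset()
    # Guard 2: without a known fleet size, nothing can be judged safely.
    if snapshot.expected_servers <= 0:
        return frozenset()
    # Guard 3: a half-migrated store reports far fewer servers than expected,
    # so treat low coverage as "data not trustworthy".
    coverage = snapshot.reported_servers / snapshot.expected_servers
    if coverage < min_coverage:
        return frozenset()
    # Guard 4: cap the blast radius of a single cycle.
    if len(snapshot.unhealthy) > max_fraction * snapshot.expected_servers:
        return frozenset()
    return snapshot.unhealthy


# Mid-migration: only 40 of 100 servers have records, so 60 look dead.
partial = HealthSnapshot(100, 40, frozenset(f"app-{i}" for i in range(60)))
print(len(plan_terminations(partial, autoscaler_enabled=True)))  # 0

# Steady state: full coverage and a single unhealthy server.
steady = HealthSnapshot(100, 100, frozenset({"app-17"}))
print(len(plan_terminations(steady, autoscaler_enabled=True)))   # 1
```

In this sketch the coverage check and the blast-radius cap are independent, so either one alone would have blocked the mass termination in the mid-migration case, even with the autoscaler re-enabled by mistake.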