Why Reddit was down on Aug 11

“Autoscaler read the partially migrated Zookeeper data and terminated many of our application servers, which serve our website and API, and our caching servers, in 16 seconds.”

Incident	#19 at Reddit on 2016/08/11 by u/gooeyblob (Infrastructure leader)
Full report	https://www.reddit.com/r/announcements/comments/4y0m56/why_reddit_was_down_on_aug_11/
How it happened	Before an upgrade of a Zookeeper system, the autoscaler was manually turned off through a configuration change, because it depends on Zookeeper for server health information. During the upgrade, the package manager detected that the change had been made manually and reverted it, turning the autoscaler back on. Acting on partially migrated Zookeeper data, the autoscaler terminated many healthy servers, including caching servers.
Architecture	Application servers and caching servers managed by an autoscaler (which uses Zookeeper for server health information).
Technologies	Apache ZooKeeper
Root cause	Autoscaler (unintentionally) running during a Zookeeper upgrade.
Failure	Terminated application and caching servers.
Impact	Service unavailable for 1.5 hours followed by an additional 1.5 hours of increased response time.
Mitigation	Engineers restored the servers (ending the outage) and waited for the caches to fill (ending the performance degradation period).
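
The failure mode summarized above, an autoscaler treating an incomplete health registry as proof that servers are gone, can be illustrated with a minimal sketch. This is not Reddit's code: the names `naive_plan`, `guarded_plan`, and `max_fraction`, the server names, and the 10% threshold are all hypothetical, and the guard is only one possible safeguard, assuming the autoscaler terminates any running server missing from the registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    terminate: frozenset  # server names the autoscaler would terminate
    blocked: bool         # True if the guard refused to act


def naive_plan(running, registered):
    """Terminate every running server that is absent from the health registry."""
    return Plan(frozenset(running) - frozenset(registered), blocked=False)


def guarded_plan(running, registered, max_fraction=0.1):
    """Same rule, but refuse to act when too much of the fleet would be terminated."""
    plan = naive_plan(running, registered)
    if running and len(plan.terminate) / len(running) > max_fraction:
        # A sudden mass "failure" more likely means bad registry data.
        return Plan(frozenset(), blocked=True)
    return plan


# Hypothetical fleet: eight application servers and two caching servers.
running = {f"app-{i}" for i in range(8)} | {"cache-0", "cache-1"}
# Hypothetical registry snapshot taken mid-migration: most entries are missing.
partial = {"app-0", "app-1"}

print(len(naive_plan(running, partial).terminate))  # 8
print(guarded_plan(running, partial).blocked)       # True
```

With the partial snapshot, the naive rule marks 8 of 10 healthy servers for termination, while the guarded variant declines to act and leaves the decision to an operator.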