Elastic Cloud January 18, 2019 Incident Report

“So, if the ZooKeeper server is loaded and causes heartbeat timeouts because of GC pauses, TreeCache will start flooding ZooKeeper with requests, making the situation worse and leading to a chain reaction that prevents the ZooKeeper servers from recovering, and can also kill client services.”
Incident #35 at Elastic on 2019/01/18 by Panagiotis Moustafellos (Tech Lead - SRE), Uri Cohen (Sr. Director - Product Management), Sylvain Wallez (Tech Lead - Software Engineer)
Full report https://www.elastic.co/blog/elastic-cloud-january-18-2019-incident-report
How it happened CPU load on the cluster manager hosts increased, causing some client disconnections; the disconnected clients then sent large requests to update their mirror of cluster node information. Many of these requests failed and were resent, with subsequent requests queuing up in the clients. The clients started experiencing resource starvation as these requests accumulated in memory, leading to out-of-memory errors and an outage.
Architecture A proxy/routing layer routes requests to cluster nodes managed by a ZooKeeper instance. The proxy/routing layer is a client of the ZooKeeper layer and maintains an in-memory mirror (TreeCache) of all ZooKeeper node information.
Technologies Apache ZooKeeper, TreeCache
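The report does not include the proxy layer's code; the following is a minimal sketch of what such an in-memory mirror can look like, assuming Apache Curator's TreeCache recipe, with a hypothetical connect string and znode path. On every (re)connection the cache re-reads the entire watched subtree, which is the kind of request amplification described in the quote above.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.TreeCache;
    import org.apache.curator.framework.recipes.cache.TreeCacheEvent;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ProxyRoutingMirror {
        public static void main(String[] args) throws Exception {
            // Connect to the ZooKeeper ensemble with a bounded, backoff-based retry policy.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",           // hypothetical connect string
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Keep an in-memory mirror of the subtree holding cluster node information.
            // On (re)connection, TreeCache re-reads the whole subtree from ZooKeeper,
            // which is the burst of read traffic the quoted passage describes.
            TreeCache cache = TreeCache.newBuilder(client, "/clusters") // hypothetical path
                    .setCacheData(true)
                    .build();

            cache.getListenable().addListener((c, event) -> {
                if (event.getType() == TreeCacheEvent.Type.NODE_ADDED
                        || event.getType() == TreeCacheEvent.Type.NODE_UPDATED
                        || event.getType() == TreeCacheEvent.Type.NODE_REMOVED) {
                    // Update the proxy's routing table from the cached view.
                }
            });
            cache.start();
        }
    }

Because every disconnected client performs this full re-sync on reconnect, an already loaded ensemble sees its read traffic multiplied, which is the chain reaction the quote warns about.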
Root cause When there is network instability between the cluster manager hosts and their clients (the proxy layer), the clients send large requests to refresh state, with retries and queuing of those requests.
Failure Clients (i.e., the proxy layer) became unavailable as the backlog of requests to ZooKeeper led to out-of-memory conditions.
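The report does not detail the client-side queuing code; purely to illustrate the failure mode, the sketch below shows one generic way (not described in the report as Elastic's actual fix) to cap a refresh backlog with a bounded executor, so excess work is rejected rather than accumulating on the heap. The class name, pool size, and queue capacity are hypothetical.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedRefreshQueue {
        // At most 4 concurrent refreshes and 1000 queued; anything beyond that is
        // rejected instead of growing the heap without bound.
        private final ThreadPoolExecutor refreshPool = new ThreadPoolExecutor(
                4, 4, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1000),
                new ThreadPoolExecutor.AbortPolicy());

        public void submitRefresh(Runnable refreshTask) {
            try {
                refreshPool.execute(refreshTask);
            } catch (RejectedExecutionException e) {
                // Drop or coalesce the refresh; a later successful sync supersedes it.
            }
        }
    }
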
Impact Customers experienced severely degraded access to the service for 3 hours and 20 minutes.
Mitigation Responders scaled up the proxy layer (to accommodate surging traffic), manually triggered a leader election, and stabilized the ZooKeeper ensemble.