Incident
|
#35 at Elastic on 2019/01/18 by Panagiotis Moustafellos (Tech Lead - SRE), Uri Cohen (Sr. Director - Product Management), Sylvain Wallez (Tech Lead - Software Engineer)
|
Full report
|
https://www.elastic.co/blog/elastic-cloud-january-18-2019-incident-report
|
How it happened
|
CPU load on the cluster manager hosts increased, causing some client disconnections. The disconnected clients then sent large requests to refresh their mirror of cluster node information. Many of these requests failed and were resent, while subsequent requests queued up in the client. As the queued requests accumulated in memory, clients suffered resource starvation, which led to out-of-memory errors and an outage.
|
Architecture
|
A proxy/routing layer routes requests to cluster nodes managed by a ZooKeeper instance. The proxy/routing layer is a client of the ZooKeeper layer and maintains an in-memory mirror (TreeCache) of all ZooKeeper node information.
|
Technologies
|
Apache ZooKeeper, TreeCache
|
Root cause
|
When there is network instability between the cluster manager hosts and their clients (the proxy layer), the clients send large requests to refresh state, and these requests are retried and queued.
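The retry-and-queue dynamic can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not code from the report: the request size, arrival rate, and service rate are invented numbers, and the `max_queue` bound is a hypothetical contrast case rather than a fix the responders described.

```python
from collections import deque

REQUEST_BYTES = 1_000_000  # assumed size of one state-refresh request (hypothetical)


def simulate(ticks, new_per_tick, served_per_tick, max_queue=None):
    """Return the peak number of bytes held in a client's pending-request queue.

    Each tick the client enqueues `new_per_tick` refresh requests (fresh
    requests plus resends of failed ones), and the overloaded server then
    completes `served_per_tick` of them. With max_queue=None the queue is
    unbounded; otherwise requests beyond the bound are dropped, not buffered.
    """
    queue = deque()
    peak = 0
    for _ in range(ticks):
        # Enqueue this tick's requests, respecting the bound if one is set.
        for _ in range(new_per_tick):
            if max_queue is None or len(queue) < max_queue:
                queue.append(REQUEST_BYTES)
        # Record memory held before the server drains anything.
        peak = max(peak, len(queue) * REQUEST_BYTES)
        # The server completes a limited number of requests per tick.
        for _ in range(min(served_per_tick, len(queue))):
            queue.popleft()
    return peak


if __name__ == "__main__":
    print("unbounded peak bytes:", simulate(100, 5, 1))
    print("bounded peak bytes:  ", simulate(100, 5, 1, max_queue=10))
```

With these made-up rates (five requests arriving per tick, one served), the unbounded queue grows by four requests every tick, so memory use climbs without limit for as long as the instability lasts. The bounded variant never holds more than `max_queue` requests.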
|
Failure
|
Clients (i.e., the proxy layer) became unavailable as the backlog of requests to ZooKeeper led to out-of-memory conditions.
|
Impact
|
Customers experienced severely degraded access to the service for 3 hours and 20 minutes.
|
Mitigation
|
Responders scaled up the proxy layer (to accommodate surging traffic), manually triggered a leader election, and stabilized the ZooKeeper ensemble.
|