Elastic Cloud Incident Report: February 4, 2019

“Service metrics had reported the hosts as healthy, thus signaling that it was safe to proceed with the maintenance; however, the metrics proved to be insufficient in conveying the state of individual hosts and of the coordination layer as a whole.”
Incident #40 at Elastic on 2019/02/04 by Panagiotis Moustafellos (Tech Lead - SRE), Ben Osborne (Site Reliability Engineer)
Full report https://www.elastic.co/blog/elastic-cloud-incident-report-feburary-4-2019
How it happened During an upgrade of hosts in the coordination layer (in which hosts were patched and then used to replace old hosts), high traffic and a defect led to CPU soft lockups and a ZooKeeper failure. Part of the high traffic came from reconnection attempts, which were triggered by the instability that high latency caused. A second set of services (Kibana dashboards) that depend on the ZooKeeper ensemble also failed, because a defect left unsuccessful connections open, and there were many of these while clients could not connect to the ZooKeeper ensemble.
Architecture A multi-layer application: (1) a Kibana frontend, (2) a proxy/routing layer that routes requests to cluster nodes, and (3) a coordination layer which maintains node state and location (implemented as a three-node Apache ZooKeeper ensemble).
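The size of the ensemble matters for the rest of the report: ZooKeeper needs a strict majority of its servers to form a quorum, so a three-node ensemble can lose only one node. A minimal sketch of that arithmetic follows; the function names are illustrative and do not come from the report.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```

With three nodes, replacing one host while a second is overloaded or locked up is enough to lose the majority.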
Technologies Apache ZooKeeper, Elasticsearch, Kibana
Root cause Two factors: during an upgrade, hosts in the coordination layer were under too much load (from client traffic, including reconnect attempts) to establish quorum, and there was a defect in runc.
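Reconnect attempts add load exactly when the ensemble is least able to absorb it. One common way for clients to limit this is exponential backoff with jitter. The report does not say which retry policy the clients used, so the sketch below is only an illustration; `backoff_delays`, `base`, and `cap` are hypothetical names and values.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized wait (in seconds) per reconnect attempt.

    Uses "full jitter": each delay is drawn uniformly between 0 and an
    exponentially growing ceiling, which is itself bounded by `cap`.
    All parameter values here are assumptions for illustration.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(6)):
        print(f"attempt {i}: wait {delay:.2f}s")
```

Spreading retries out this way keeps many clients from reconnecting in lockstep after a shared failure.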
Failure The coordination layer had increased latency and low availability, and cluster hosts experienced soft lockups; responders eliminated nearly all client traffic to help with mitigation.
Impact Customers experienced reduced functionality or a partial outage, and later a complete outage in one region.
Mitigation Rolled back ZooKeeper hosts to the previous version and removed client traffic to allow the ensemble to come up correctly. Restarted Kibana instances and applied limits to avoid keeping connections open.
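The report does not detail the limits that were applied. As a rough illustration of the general idea, the sketch below caps the number of open connections and always closes a connection whose use fails. `ConnectionLimiter`, `max_open`, and `opener` are hypothetical names, not Kibana or ZooKeeper APIs.

```python
import threading
from contextlib import contextmanager


class ConnectionLimiter:
    """Caps concurrent connections and closes them even on failure."""

    def __init__(self, max_open: int):
        self._slots = threading.BoundedSemaphore(max_open)
        self._lock = threading.Lock()
        self.open_count = 0

    @contextmanager
    def connect(self, opener):
        # Refuse immediately instead of queueing when the cap is reached.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("connection limit reached")
        conn = None
        try:
            conn = opener()  # any callable returning an object with close()
            with self._lock:
                self.open_count += 1
            try:
                yield conn
            finally:
                with self._lock:
                    self.open_count -= 1
        finally:
            # Runs on success and on error, so a failed session is not left open.
            if conn is not None:
                conn.close()
            self._slots.release()
```

A caller would wrap each session in `with limiter.connect(opener) as conn:`; an exception raised inside the block still closes the connection and frees its slot.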