Incident
|
#40 at Elastic on 2019/02/04 by Panagiotis Moustafellos (Tech Lead - SRE), Ben Osborne (Site Reliability Engineer)
|
Full report
|
https://www.elastic.co/blog/elastic-cloud-incident-report-feburary-4-2019
|
How it happened
|
During an upgrade of hosts in the coordination layer (in which hosts were patched and then used to replace old hosts), high traffic and a defect led to CPU soft locks and a ZooKeeper failure. Part of the high traffic came from clients attempting to reconnect in response to the instability and high latency. A second set of services (Kibana dashboards) that depend on the ZooKeeper ensemble also failed, due to a defect that left unsuccessful connections open; many such connections accumulated because of the repeated failures to connect to the ZooKeeper ensemble.
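
As a concrete illustration of the connection-leak pattern described above, here is a minimal Java sketch using the standard ZooKeeper client. The helper name and the five-second wait are assumptions for illustration, not code from the affected services.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Hypothetical sketch of the leak: the ZooKeeper constructor returns
    // immediately and connects in the background, so a caller that gives up
    // without calling close() leaves the client (socket and threads) alive.
    // Repeated failed attempts then pile up, as described in the report.
    public class ZkConnectSketch {

        static ZooKeeper connectOrClose(String connectString, int sessionTimeoutMs)
                throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(connectString, sessionTimeoutMs, (WatchedEvent e) -> {
                if (e.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            // Wait a bounded time for the session to establish.
            if (!connected.await(5, TimeUnit.SECONDS)) {
                // Without this close(), every failed attempt leaks a live client.
                zk.close();
                return null;
            }
            return zk;
        }
    }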
|
Architecture
|
A multi-layer application: (1) a Kibana frontend, (2) a proxy/routing layer that routes requests to cluster nodes, and (3) a coordination layer which maintains node state and location (implemented as a three-node Apache ZooKeeper ensemble).
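
A minimal sketch of how the proxy/routing layer might consult the coordination layer for a node's location, assuming the standard ZooKeeper Java client; the znode layout and payload format are hypothetical, since the report does not describe them.

    import java.nio.charset.StandardCharsets;

    import org.apache.zookeeper.ZooKeeper;

    // Illustrative only: looks up where a cluster node lives by reading a
    // znode. Path structure and value format are assumptions.
    class NodeLocator {
        private final ZooKeeper zk;

        NodeLocator(ZooKeeper zk) {
            this.zk = zk;
        }

        /** Returns e.g. "host-17:9300" for the given cluster node, or null if absent. */
        String locate(String clusterId, String nodeId) throws Exception {
            String path = "/clusters/" + clusterId + "/nodes/" + nodeId; // hypothetical layout
            if (zk.exists(path, false) == null) {
                return null;
            }
            byte[] data = zk.getData(path, false, null); // no watch, Stat not needed
            return new String(data, StandardCharsets.UTF_8);
        }
    }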
|
Technologies
|
Apache ZooKeeper, Elasticsearch, Kibana
|
Root cause
|
During the upgrade, hosts in the coordination layer were under too much load (from client traffic, including reconnect attempts) to establish quorum, and a defect in runc contributed to the CPU soft locks on those hosts.
|
Failure
|
The coordination layer had increased latency and low availability, and cluster hosts experienced soft locks; responders eliminated nearly all client traffic to aid mitigation.
|
Impact
|
Customers experienced reduced functionality or a partial outage, and later a complete outage in one region.
|
Mitigation
|
Rolled back the ZooKeeper hosts to the previous version and removed client traffic to allow the ensemble to come up correctly. Restarted Kibana instances and applied limits to avoid keeping connections open.
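
A hedged sketch of the "applied limits" idea: capping how many connection attempts a dependent service may have in flight, so failed attempts cannot accumulate without bound. The cap value, class names, and interface are assumptions, not taken from the report.

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.TimeUnit;

    // Hypothetical limit on concurrent connection attempts from a dependent
    // service (e.g. a Kibana-like frontend) toward the coordination layer.
    class BoundedConnector {
        private final Semaphore inFlight = new Semaphore(8); // hypothetical cap

        interface Attempt<T> {
            T run() throws Exception; // should close its own resources on failure
        }

        <T> T tryConnect(Attempt<T> attempt) throws Exception {
            // Refuse quickly instead of queueing yet another attempt.
            if (!inFlight.tryAcquire(1, TimeUnit.SECONDS)) {
                throw new IllegalStateException("too many connection attempts in flight");
            }
            try {
                return attempt.run();
            } finally {
                inFlight.release();
            }
        }
    }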
|