Incident
|
#9 at
Parse.ly on
2015/03/26 by Andrew Montalenti (Founder & Chief Product Officer)
|
Full report
|
https://blog.parse.ly/post/1738/kafkapocalypse/
|
How it happened
|
Data volumes hit a new high which increased the outbound network volumes beyond the capcity of the machines running cluster nodes (either a hardware or system software limit). As a node in the cluster failed, load increased on the remaining nodes leading to more nodes failing for the same reason. As new nodes were added (as part of the mitigation) clients stopped writing to the cluster due to a defect in the configuration management system that led to new nodes being added to the cluster with incorrect host names.
|
Architecture
|
Two analytics dashboard systems (one production and one beta) powered by two parallel backend data processing systems that process data from a JavaScript tracker and forward it to a Kafka cluster.
|
Technologies
|
Amazon Elastic Compute Cloud (EC2), Apache Kafka, Opscode Chef, Apache ZooKeeper
|
Root cause
|
Outbound network traffic on cluster nodes exceeded the capacity of the machines those nodes were running on (either a hardware or system software limit); and a configuration defect in the cluster management system which led to new nodes being assigned incorrect ip addresses.
|
Failure
|
Some nodes in the cluster failed and were removed from the cluster. Clients writing to the cluster (mistakenly) determined the cluster was unavailable and stopped writing to the cluster.
|
Impact
|
Several hours of no data processing and no new data appearing in the analytics dashboards. Data processing was delayed but no data was lost.
|
Mitigation
|
Corrected the configuration and then added new nodes to the cluster (bringing the per machine network usage under the limits). Restarted downstream data consumers of the cluster.
|