Billing Incident Post-Mortem: Breakdown, Analysis and Root Cause

“This caused all redis-slaves to reconnect and request full synchronization with the master at the same time. Receiving full sync requests from each redis-slave caused the master to suffer extreme load, resulting in performance degradation of the master and timeouts from redis-slaves to redis-master.”
Incident #28 at Twilio on 2013/07/18
Full report
How it happened A loss of network connectivity caused all redis-slaves to simultaneously disconnect from the master, then reconnect and request full synchronization with the master at the same time. The redis-master began to fail due to the load generated by the synchronization requests. The redis-master was restarted and, due to two configuration defects, it attempted to restore from a (non-existent) append-only file (AOF) instead of the intended binary snapshot, and it started as a slave instead of as the master.
Architecture An in-memory Redis cluster (storing account balances) with a single master and multiple slaves distributed across data-centers.
Technologies Redis
Root cause A loss of network connectivity led to all redis-slaves disconnecting from the redis-master and then reconnecting at the same time, each requesting full synchronization. Two configuration defects delayed recovery.
Failure The redis-master failed (due to load generated by the slave synchronization) as did dependent systems. When the redis-master was restarted it recovered incorrectly and started as a read-only slave.
Impact For 1.4% of customers, financial payments were not reflected in account balances; some payments were recharged, and some accounts were suspended due to those recharges.
Mitigation The configuration defects for the redis-master were corrected, the Redis cluster was restored, and refunds for recharges were issued. During mitigation a dependent system was deactivated; it was reactivated once mitigation was complete.
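
The two configuration defects described above (restoring from an AOF that does not exist, and starting as a slave) are the kind of problem a pre-restart check could flag. The following is a minimal sketch of such a check, not the tooling Twilio used. The function names `parse_redis_conf` and `check_master_conf` are hypothetical, and the sketch assumes the classic single-file AOF layout with the standard `appendonly`, `appendfilename`, and `slaveof`/`replicaof` directives. File existence is passed in as a set of names so the logic can be exercised without touching disk.

```python
def parse_redis_conf(text):
    """Parse redis.conf-style text into {directive: argument string}.

    Simplification for this sketch: blank lines and '#' comments are
    skipped, and when a directive repeats, the last occurrence is kept.
    """
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        name = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        directives[name] = value
    return directives


def check_master_conf(text, existing_files):
    """Return warnings for a config that is meant to start a master.

    existing_files: set of file names present in the Redis data directory.
    """
    conf = parse_redis_conf(text)
    warnings = []

    # Defect 1: the intended master is configured to replicate from another node.
    for key in ("slaveof", "replicaof"):
        if key in conf:
            warnings.append(
                f"'{key} {conf[key]}' is set: node would start as a slave"
            )

    # Defect 2: AOF is enabled but the AOF file is missing, so a restart
    # would not restore from the intended binary snapshot.
    if conf.get("appendonly", "no").lower() == "yes":
        aof_name = conf.get("appendfilename", "appendonly.aof").strip('"')
        if aof_name not in existing_files:
            warnings.append(
                f"appendonly is 'yes' but '{aof_name}' was not found"
            )

    return warnings
```

For example, calling `check_master_conf("appendonly yes\nslaveof 10.0.0.1 6379\n", {"dump.rdb"})` returns two warnings, one for each defect, while a config with `appendonly no` and no `slaveof` line returns an empty list. The address and file names here are illustrative only.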