Incident
|
#28 at Twilio on 2013/07/18
|
Full report
|
https://www.twilio.com/blog/2013/07/billing-incident-post-mortem-breakdown-analysis-and-root-cause.html
|
How it happened
|
A loss of network connectivity caused all redis-slaves to disconnect from the master simultaneously, then reconnect and request full synchronization at the same time. The redis-master began to fail under the load generated by these synchronization requests. When the redis-master was restarted, two configuration defects caused it to attempt to restore from a (non-existent) append-only file (AOF) instead of the intended binary snapshot, and to start as a slave instead of as the master.
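
To illustrate why simultaneous reconnection is costly, the following is a minimal sketch that counts how many full synchronizations would be in flight at once. The function name `peak_concurrent_syncs`, its parameters, and the idea of spreading reconnects with random jitter are assumptions made for illustration; the report does not say that Twilio used jitter.

```python
import random


def peak_concurrent_syncs(n_replicas, sync_seconds, max_jitter_seconds, seed=0):
    """Return the largest number of full syncs in flight at the same time.

    Hypothetical model: every replica starts one full sync after a random
    delay in [0, max_jitter_seconds], and each sync lasts sync_seconds.
    """
    if sync_seconds <= 0:
        raise ValueError("sync_seconds must be positive")

    rng = random.Random(seed)
    starts = [rng.uniform(0, max_jitter_seconds) for _ in range(n_replicas)]

    # Build +1 (sync starts) and -1 (sync ends) events on a timeline.
    events = []
    for start in starts:
        events.append((start, 1))
        events.append((start + sync_seconds, -1))

    # At equal timestamps, process ends (-1) before starts (+1).
    events.sort(key=lambda event: (event[0], event[1]))

    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# With no jitter, all replicas sync at once, as in the incident.
print(peak_concurrent_syncs(n_replicas=10, sync_seconds=5.0, max_jitter_seconds=0.0))
# With jitter, the peak can only be equal or lower.
print(peak_concurrent_syncs(n_replicas=10, sync_seconds=5.0, max_jitter_seconds=60.0))
```

With zero jitter the peak equals the number of replicas, which is the load pattern described above. Any positive jitter can only keep the peak the same or reduce it.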
|
Architecture
|
An in-memory Redis cluster (storing account balances) with a single master and multiple slaves distributed across data-centers.
|
Technologies
|
Redis
|
Root cause
|
A loss of network connectivity led the redis-slaves to disconnect from the redis-master and then reconnect, each requesting full synchronization. Two configuration defects delayed recovery.
|
Failure
|
The redis-master failed under the load generated by slave synchronization, as did dependent systems. When the redis-master was restarted, it recovered incorrectly and started as a read-only slave.
|
Impact
|
For 1.4% of customers, financial payments were not reflected in account balances; some payments were recharged, and some accounts were suspended due to those recharges.
|
Mitigation
|
Configuration defects for the redis-master were corrected, the Redis cluster was restored, and refunds for recharges were issued. During mitigation, a dependent system was deactivated and then reactivated once mitigation was complete.
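
The two configuration defects could, in principle, be caught before a restart. The following is a minimal sketch of such a pre-start check, not Twilio's actual fix. The function name `check_master_config` is hypothetical, and it assumes the older single-file AOF layout, in which `appendonly`, `appendfilename`, and `slaveof` (renamed `replicaof` in Redis 5) appear as plain `redis.conf` directives.

```python
import os


def check_master_config(conf_text, data_dir, default_aof="appendonly.aof"):
    """Return a list of problems found in a redis.conf meant for a master.

    Hypothetical helper: flags AOF being enabled with no AOF file on disk,
    and any directive that would start the node as a slave.
    """
    directives = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0].lower()
        value = parts[1].strip() if len(parts) > 1 else ""
        directives[key] = value

    problems = []

    # Defect 1: AOF enabled although no AOF file exists to restore from.
    if directives.get("appendonly", "no").lower() == "yes":
        aof_name = directives.get("appendfilename", default_aof).strip('"')
        if not os.path.exists(os.path.join(data_dir, aof_name)):
            problems.append("appendonly is enabled but %s is missing" % aof_name)

    # Defect 2: a master's config should not name another node to follow.
    for key in ("slaveof", "replicaof"):
        if key in directives:
            problems.append("%s is set; node would start as a slave" % key)

    return problems
```

A deployment script could refuse to restart the redis-master whenever this function returns a non-empty list.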
|