Incident
|
#11 at Stripe on 2019/07/10 by David Singleton (CTO)
|
Full report
|
https://stripe.com/rcas/2019-07-10
|
How it happened
|
Two database cluster nodes became stalled for unknown reasons: they stopped emitting the metrics that report their replication lag but continued to respond as healthy to health checks. The primary node for the database cluster then failed, and the cluster was unable to elect a new primary because of a database defect that only manifests in the presence of multiple stalled nodes. To prevent a repeat incident, responders rolled back the database election code, which caused a second failure due to an incompatible cluster configuration.
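A minimal sketch of the health-check gap described above, under stated assumptions (the class, function names, and freshness threshold are illustrative, not Stripe's actual tooling): a check that only asks whether a node responds will report a stalled node as healthy, while a check that also requires recent replication-lag metrics would have surfaced the stall.

```python
import time
from typing import Optional

STALE_AFTER_SECONDS = 60  # assumed freshness threshold for replication-lag metrics


class NodeStatus:
    def __init__(self, responds_to_ping: bool, last_replication_metric_at: Optional[float]):
        self.responds_to_ping = responds_to_ping
        self.last_replication_metric_at = last_replication_metric_at  # unix timestamp or None


def shallow_health_check(node: NodeStatus) -> bool:
    # Mirrors the failure mode above: a stalled node can still answer pings.
    return node.responds_to_ping


def deeper_health_check(node: NodeStatus, now: Optional[float] = None) -> bool:
    # Also require that the node has reported replication-lag metrics recently.
    now = time.time() if now is None else now
    if not node.responds_to_ping or node.last_replication_metric_at is None:
        return False
    return (now - node.last_replication_metric_at) <= STALE_AFTER_SECONDS


# A stalled node: still answers pings, but stopped emitting metrics ten minutes ago.
stalled = NodeStatus(responds_to_ping=True, last_replication_metric_at=time.time() - 600)
assert shallow_health_check(stalled) is True   # reported as healthy during the incident
assert deeper_health_check(stalled) is False   # the stall would have been flagged
```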
|
Architecture
|
A database cluster with multiple shards. Each shard has multiple redundant nodes.
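A minimal sketch of this topology (the names and structures are illustrative, not taken from the report): a cluster made of shards, each holding one primary and several redundant replica nodes. Writes for a shard go through its primary, which is why a shard that cannot elect a primary cannot accept writes.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    is_primary: bool = False


@dataclass
class Shard:
    name: str
    nodes: List[Node] = field(default_factory=list)

    def primary(self) -> Optional[Node]:
        return next((n for n in self.nodes if n.is_primary), None)

    def accepts_writes(self) -> bool:
        # Without an elected primary, no node may accept writes for this shard.
        return self.primary() is not None


cluster = [
    Shard("shard-a", [Node("a1", is_primary=True), Node("a2"), Node("a3")]),
    Shard("shard-b", [Node("b1"), Node("b2"), Node("b3")]),  # primary lost, none elected
]

assert cluster[0].accepts_writes() is True
assert cluster[1].accepts_writes() is False
```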
|
Technologies
|
|
Root cause
|
A defect in the database's election protocol that only manifests when there are multiple stalled nodes; and later a cluster configuration that was not compatible with the version of the election protocol that responders reverted to during the incident.
|
Failure
|
Database nodes failed for one shard. When that shard's primary node also failed, election of a new primary failed and the shard was unable to accept writes; and later, CPU starvation on multiple shards.
|
Impact
|
Applications that write to the shard began to time out, and the API returned errors to users.
|
Mitigation
|
Restarted all nodes in the database cluster, resulting in a successful election and a restoration of service; then updated the cluster configuration.
|