Root cause analysis: significantly elevated error rates on 2019-07-10

“The new version also introduced a subtle fault in the database’s failover system that only manifested in the presence of multiple stalled nodes. On the day of the events, one shard was in the specific state that triggered this fault.”
Incident #11 at Stripe on 2019/07/10 by David Singleton (CTO)
Full report https://stripe.com/rcas/2019-07-10
How it happened Two database cluster nodes stalled for unknown reasons: they stopped emitting replication-lag metrics but continued to respond as healthy to checks. When the shard's primary node later failed, the cluster was unable to elect a new primary because of a database defect that only manifested in the presence of multiple stalled nodes. To prevent a repeat incident, responders rolled back the database election code, which caused a second failure due to an incompatible cluster configuration.
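A contributing factor above is that the stalled nodes kept passing health checks even though they had stopped reporting replication-lag metrics. A minimal sketch of a check that treats stale metrics as a failure signal; all names and thresholds here are hypothetical, not Stripe's actual code:

```python
import time

METRIC_MAX_AGE_S = 60  # hypothetical threshold: metrics older than this are stale

def is_healthy(node, now=None):
    """A node is healthy only if it responds AND its metrics are fresh.

    A stalled node may still answer pings while emitting no replication-lag
    metrics, so responsiveness alone is not enough.
    """
    now = time.time() if now is None else now
    if not node.get("responds_to_ping", False):
        return False
    last_report = node.get("last_metrics_report")
    if last_report is None or now - last_report > METRIC_MAX_AGE_S:
        return False  # stalled: responsive, but no recent metrics
    return True
```

Under this check, a node that responds to pings but last reported metrics several minutes ago would be flagged unhealthy instead of masking the stall.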
Architecture A database cluster with multiple shards. Each shard has multiple redundant nodes.
Technologies
Root cause A database election-protocol defect that only manifests when there are multiple stalled nodes; and later, a cluster configuration that was incompatible with the election-protocol version the database was reverted to during the incident.
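The report does not detail the election defect itself, but one plausible mechanism, assuming a simple majority-quorum election, is that stalled nodes still count toward the cluster size while never casting votes:

```python
def elect_primary(cluster_size, votes_received):
    """Majority election sketch (assumption, not the actual protocol):
    a candidate wins only with more than half the cluster's votes.
    Stalled nodes count toward cluster_size because health checks still
    report them as healthy, but they never cast a vote."""
    return votes_received > cluster_size // 2

# Hypothetical 5-node shard: two stalled nodes plus the failed primary
# leave only two live voters, short of the required 3-vote majority.
```

With two voters out of five, `elect_primary(5, 2)` fails, and the shard cannot accept writes until the stalled nodes are restarted.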
Failure Database nodes stalled for one shard. When that shard's primary node also failed, election of a new primary failed and the shard was unable to accept writes; and later, CPU starvation on multiple shards.
Impact Applications that write to the shard began to time out, and the API returned errors to users.
Mitigation Restarted all nodes in the database cluster, resulting in a successful election and a restoration of service; and then updated the cluster configuration.