Incident
|
#10 at GoCardless on 2017/10/10 by Chris Sinjakli, Harry Panayiotou, Lawrence Jones, Norberto Lopes, Raul Naveiras (Site Reliability Engineers)
|
Full report
|
https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
|
How it happened
|
A disk array failed on the primary database node and, coincidentally, a database subprocess crashed on the synchronous replica. The cluster management system was then unable to promote one of the replicas to primary because of subtle interactions between three configurations.
|
Architecture
|
An API layer connects to a PostgreSQL cluster through a virtual IP address. The cluster has one primary node, one synchronous replica node, and one asynchronous replica node. Pacemaker manages both the cluster and the virtual IP address.
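As a rough illustration of this topology, the sketch below models the three nodes and picks a replica to promote when the primary is unhealthy. The node names, the `Node` class, and the `promotion_candidate` function are hypothetical, and the "prefer the synchronous replica" rule is an assumption made for the example; it is not Pacemaker's actual scoring logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str             # hypothetical node name, not from the report
    role: str             # "primary", "sync_replica", or "async_replica"
    healthy: bool = True


def promotion_candidate(nodes: List[Node]) -> Optional[Node]:
    """Return a healthy replica to promote, preferring the synchronous one.

    Illustrative preference order only (assumption): the synchronous
    replica is tried first, then the asynchronous replica.
    """
    for role in ("sync_replica", "async_replica"):
        for node in nodes:
            if node.role == role and node.healthy:
                return node
    return None


cluster = [
    Node("db-a", "primary"),
    Node("db-b", "sync_replica"),
    Node("db-c", "async_replica"),
]

# Simulate the primary failing; the synchronous replica becomes the candidate.
cluster[0].healthy = False
candidate = promotion_candidate(cluster)
print(candidate.name if candidate else "no candidate")  # prints db-b
```

In the incident itself no replica was promoted automatically, which is the case where a function like this would return `None` or never be reached.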
|
Technologies
|
PostgreSQL, Pacemaker
|
Root cause
|
A disk array failure on the primary database node, combined with a coincident crash of a database subprocess on the synchronous replica and some subtle interactions between cluster management configurations.
|
Failure
|
The primary database failed and the cluster (managed by Pacemaker) failed to promote a replica to be the new primary, leaving the database unavailable.
|
Impact
|
1 hour and 50 minute outage of the API and Dashboard.
|
Mitigation
|
Put the cluster into maintenance mode, configured the synchronous replica to be the primary, and manually started the database. Configured clients of the database with the (non-virtual) IP address of the new primary.
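The client-side part of this mitigation can be sketched as switching the database host from the virtual IP to the direct address of the new primary. The addresses, database name, and `build_dsn` helper below are hypothetical placeholders (the IPs come from the reserved documentation range); only the `host=... port=... dbname=...` keyword/value format is standard libpq connection string syntax.

```python
# Hypothetical addresses for illustration only.
VIRTUAL_IP = "192.0.2.10"       # normally managed by Pacemaker
NEW_PRIMARY_IP = "192.0.2.12"   # direct address of the promoted replica


def build_dsn(host: str, dbname: str = "app", port: int = 5432) -> str:
    """Build a libpq-style keyword/value connection string."""
    return f"host={host} port={port} dbname={dbname}"


# Normal operation: clients reach whichever node holds the virtual IP.
normal_dsn = build_dsn(VIRTUAL_IP)

# During the incident: clients are pointed straight at the new primary.
incident_dsn = build_dsn(NEW_PRIMARY_IP)

print(normal_dsn)
print(incident_dsn)
```

Pointing clients at a fixed address bypasses the failover mechanism, so a change like this would normally be reverted once the cluster is managing the virtual IP again.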
|