Incident review: API and Dashboard outage on 10 October 2017

“The Pacemaker cluster correctly observed that Postgres was unhealthy on the primary node. It repeatedly attempted to promote a new primary, but each time it couldn't decide where that primary should run.”
Incident #10 at GoCardless on 2017/10/10 by Chris Sinjakli, Harry Panayiotou, Lawrence Jones, Norberto Lopes, Raul Naveiras (Site Reliability Engineers)
Full report https://gocardless.com/blog/incident-review-api-and-dashboard-outage-on-10th-october/
How it happened A disk array failed on the primary database node and, coincidentally, a database subprocess crashed on the synchronous replica. The cluster management system was then unable to promote one of the replicas to primary because of subtle interactions between three configurations.
Architecture An API layer that connects to a PostgreSQL cluster using a virtual IP address. The cluster has 1 primary node, 1 synchronous replica node and 1 asynchronous replica node. The cluster and virtual IP address are managed by Pacemaker.
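As a rough illustration of this topology, the sketch below models the three nodes and a naive rule for choosing which replica to promote when the primary fails. It is not Pacemaker's actual placement logic. The node names (`pg-1` to `pg-3`), the `Node` class, and `pick_promotion_candidate` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str            # hypothetical node name, not from the incident report
    role: str            # "primary", "sync_replica", or "async_replica"
    healthy: bool = True


def pick_promotion_candidate(nodes: List[Node]) -> Optional[Node]:
    """Return the healthy replica to promote, or None if there is none.

    Assumption for this sketch: the synchronous replica is preferred
    because it should hold every committed transaction; the asynchronous
    replica is only a fallback.
    """
    for role in ("sync_replica", "async_replica"):
        for node in nodes:
            if node.role == role and node.healthy:
                return node
    return None


# The architecture described above: 1 primary, 1 sync replica, 1 async replica.
cluster = [
    Node("pg-1", "primary"),
    Node("pg-2", "sync_replica"),
    Node("pg-3", "async_replica"),
]

# Simulate the primary failing, then ask which node should take over.
cluster[0].healthy = False
candidate = pick_promotion_candidate(cluster)
```

In the incident the difficulty was not a rule like this one but the interaction of several cluster configurations, which left Pacemaker unable to settle on a node at all. The `None` return value is the closest analogue in this simplified model.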
Technologies PostgreSQL, Pacemaker
Root cause A disk array failure on the primary database node, combined with a coincident crash of a database subprocess on the synchronous replica and subtle interactions between cluster management configurations.
Failure Primary database failed and the cluster (managed by Pacemaker) failed to promote a replica to be the new primary, leaving the database unavailable.
Impact 1 hour and 50 minute outage of the API and Dashboard.
Mitigation Put the cluster into maintenance mode, configured the synchronous replica as the primary, and manually started the database. Configured clients of the database with the (non-virtual) IP address of the new primary.
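The last mitigation step, pointing clients at the new primary directly instead of at the virtual IP, can be sketched as a small configuration helper. This is a minimal sketch under stated assumptions, not how GoCardless's clients are configured. The addresses come from the reserved documentation range, and `VIRTUAL_IP`, `NEW_PRIMARY_IP`, `database_host`, `build_dsn`, and the `app` database name are all hypothetical.

```python
from typing import Optional

# Hypothetical addresses from the 192.0.2.0/24 documentation range.
VIRTUAL_IP = "192.0.2.10"      # normally managed by the cluster manager
NEW_PRIMARY_IP = "192.0.2.12"  # direct address of the manually promoted node


def database_host(override: Optional[str] = None) -> str:
    """Use the virtual IP unless an operator supplies a direct address."""
    return override or VIRTUAL_IP


def build_dsn(host: str, port: int = 5432, dbname: str = "app") -> str:
    """Build a libpq-style keyword/value connection string."""
    return f"host={host} port={port} dbname={dbname}"


# Normal operation: clients follow the virtual IP.
normal_dsn = build_dsn(database_host())

# During the incident: clients are pinned to the new primary's own address.
incident_dsn = build_dsn(database_host(override=NEW_PRIMARY_IP))
```

Pinning clients to a fixed address bypasses the cluster manager, so a setup like this would need the override removed once the virtual IP is healthy again.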