Incident        | #6 at GitLab on 2017/01/31 by Sid Sijbrandij (CEO)
Full report     | https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/
How it happened | Increased load on the database servers (due to spam and/or an automated maintenance script) caused replication between the primary and secondary to fall behind and then fail (see the replication-lag sketch below). During mitigation, an engineer accidentally deleted data from the primary database server, thinking they were operating on the secondary.
Architecture    | PostgreSQL database with primary and secondary servers. Azure disk snapshots, Logical Volume Manager (LVM) snapshots, and full backups uploaded to Amazon S3.
Technologies    | PostgreSQL, Azure Disk Snapshots
Root cause      | High load on the database servers; the accidental removal of 300 GB of data from the primary database server (during mitigation).
Failure         | Data replication between the primary and secondary servers fell behind and then failed. Data was removed from the primary database.
Impact          | Service outage and permanent data loss (5,000 projects, 5,000 comments, and 700 new accounts).
Mitigation      | Responders restored the database from the Logical Volume Manager (LVM) snapshot created 6 hours before the outage (see the restore sketch below).
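
The failure began with streaming replication falling behind and then stopping. The following is a minimal sketch of how lag of this kind can be monitored from the primary; it assumes the psycopg2 driver, a placeholder connection string, and an arbitrary alert threshold. GitLab ran PostgreSQL 9.6 at the time, so the 9.x function and column names (pg_xlog_*, *_location) are used here; PostgreSQL 10+ renames them to pg_wal_* and *_lsn.

```python
# Sketch: report streaming-replication lag as seen from the primary.
# Connection details and the alert threshold are hypothetical.
import psycopg2

LAG_ALERT_BYTES = 512 * 1024 * 1024  # arbitrary example threshold (512 MiB)

def check_replication_lag(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT application_name,
                   state,
                   pg_xlog_location_diff(pg_current_xlog_location(),
                                         replay_location) AS lag_bytes
            FROM pg_stat_replication
            """
        )
        rows = cur.fetchall()
        if not rows:
            # No connected standbys: replication has stopped entirely.
            print("WARNING: no standbys connected")
        for name, state, lag_bytes in rows:
            flag = "ALERT" if lag_bytes and lag_bytes > LAG_ALERT_BYTES else "ok"
            print(f"{flag}: standby={name} state={state} lag={lag_bytes or 0} bytes")

if __name__ == "__main__":
    # Hypothetical DSN; host, database, and user depend on the deployment.
    check_replication_lag("host=db-primary.example.com dbname=postgres user=monitor")
```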
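
Recovery relied on an LVM snapshot taken about six hours before the outage. The sketch below shows one common way to roll a PostgreSQL data volume back to an LVM snapshot by merging the snapshot into its origin; the volume names, mount point, and service name are hypothetical, and the specifics of GitLab's actual recovery differed and are described in the full report.

```python
# Sketch: roll a PostgreSQL data volume back to an LVM snapshot.
# Volume, mount point, and service names are hypothetical examples.
import subprocess

ORIGIN_LV = "/dev/vg_data/pgdata"          # hypothetical origin logical volume
SNAPSHOT_LV = "/dev/vg_data/pgdata_snap"   # hypothetical snapshot taken earlier

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def restore_from_snapshot() -> None:
    # 1. Stop PostgreSQL so nothing writes to the data directory.
    run(["systemctl", "stop", "postgresql"])
    # 2. Unmount the origin volume; the merge only proceeds while the
    #    origin is not in use.
    run(["umount", ORIGIN_LV])
    # 3. Merge the snapshot back into the origin, returning the volume
    #    to the state it had when the snapshot was created.
    run(["lvconvert", "--merge", SNAPSHOT_LV])
    # 4. Remount and restart; PostgreSQL performs crash recovery and
    #    comes up at the snapshot's point in time, so any writes made
    #    after the snapshot are lost.
    run(["mount", ORIGIN_LV, "/var/lib/postgresql"])
    run(["systemctl", "start", "postgresql"])

if __name__ == "__main__":
    restore_from_snapshot()
```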