Postmortem of database outage of January 31

“Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead.”
Incident #6 at GitLab on 2017/01/31, reported by Sid Sijbrandij (CEO)
Full report: https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/
How it happened: Increased load on the database servers (due to spam and/or an automated maintenance script) caused replication between the primary and secondary to fall behind and then fail (a monitoring sketch follows this summary). During mitigation an engineer accidentally deleted data from the primary database server, thinking they were operating on the secondary.
Architecture: PostgreSQL database with primary and secondary servers; backups via Azure disk snapshots, Logical Volume Manager (LVM) snapshots, and full backups uploaded to Amazon S3.
Technologies: PostgreSQL, Azure disk snapshots
Root cause: High load on the database servers; the accidental removal of 300 GB of data from the primary database server during mitigation (see the safeguard sketch after this summary).
Failure: Data replication between the primary and secondary servers fell behind and then failed; data was then removed from the primary database.
Impact: Service outage and permanent data loss (5,000 projects, 5,000 comments, and 700 new accounts).
Mitigation: Responders restored the database from the Logical Volume Manager (LVM) snapshot created 6 hours before the outage (see the snapshot sketch after this summary).
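
The replication failure described above is the kind of condition continuous monitoring can surface early. Below is a minimal sketch (not GitLab's actual tooling) that polls PostgreSQL's pg_stat_replication view on the primary and warns when a standby's replay lag grows or no standby is attached. The connection string, the 200 MB threshold, and the requirement of PostgreSQL 10+ are assumptions made for illustration.

# Illustrative sketch (not GitLab's tooling): poll pg_stat_replication on the
# primary and warn when a standby's replay lag grows too large or no standby
# is attached. Connection settings and the 200 MB threshold are assumptions.
import psycopg2

LAG_THRESHOLD_BYTES = 200 * 1024 * 1024  # warn above ~200 MB of unreplayed WAL

def check_replication_lag(dsn="dbname=postgres"):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Requires PostgreSQL 10+ for pg_current_wal_lsn()/replay_lsn.
            cur.execute("""
                SELECT client_addr, state,
                       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
                FROM pg_stat_replication
            """)
            rows = cur.fetchall()
    finally:
        conn.close()

    if not rows:
        print("WARNING: no standbys attached -- replication may have stopped")
    for addr, state, lag_bytes in rows:
        if lag_bytes is None or lag_bytes > LAG_THRESHOLD_BYTES:
            print(f"WARNING: standby {addr} ({state}) replay lag = {lag_bytes} bytes")

if __name__ == "__main__":
    check_replication_lag()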
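
The root cause hinged on a destructive command being run on the wrong host. The safeguard below is a minimal sketch, assuming a hostname allowlist and a hypothetical data directory path (neither is taken from the report): the script refuses to wipe the data directory unless the current host is a known standby and the operator retypes the hostname to confirm.

# Illustrative safeguard, not the procedure from the report: a destructive
# cleanup script that only runs when the current hostname is on an explicit
# allowlist of standby hosts. Hostname and data path are hypothetical.
import shutil
import socket
import sys

STANDBY_HOSTS = {"db2.example.internal"}   # hosts where wiping is allowed
DATA_DIR = "/var/opt/postgresql/data"      # hypothetical data directory

def wipe_standby_data_dir():
    host = socket.gethostname()
    if host not in STANDBY_HOSTS:
        sys.exit(f"Refusing to wipe {DATA_DIR}: {host} is not a known standby.")
    confirm = input(f"Type the hostname ({host}) to confirm the wipe: ")
    if confirm != host:
        sys.exit("Confirmation mismatch; aborting.")
    shutil.rmtree(DATA_DIR)  # destructive step, only reached on an allowlisted standby
    print(f"Removed {DATA_DIR} on {host}; the standby can now be re-seeded from the primary.")

if __name__ == "__main__":
    wipe_standby_data_dir()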
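
The restore relied on an LVM snapshot taken 6 hours earlier. The sketch below, assuming a volume group named vg0 and a logical volume named pg_data (names not from the report), shows how such a point-in-time snapshot could be created with lvcreate from a small Python wrapper.

# Minimal sketch, assuming an LVM volume group "vg0" with a logical volume
# "pg_data" holding the PostgreSQL data directory (names are assumptions).
# Creates a copy-on-write snapshot that could serve as a restore point.
import datetime
import subprocess

def create_lvm_snapshot(vg="vg0", lv="pg_data", size="10G"):
    snap_name = "pg-snap-" + datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    # --snapshot makes a copy-on-write snapshot of the origin volume;
    # --size reserves space for blocks that change after the snapshot is taken.
    subprocess.run(
        ["lvcreate", "--snapshot", "--size", size,
         "--name", snap_name, f"/dev/{vg}/{lv}"],
        check=True,
    )
    return snap_name

if __name__ == "__main__":
    print("Created snapshot:", create_lvm_snapshot())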