Incident        | #6 at GitLab on 2017/01/31 by Sid Sijbrandij (CEO)
Full report     | https://about.gitlab.com/blog/2017/02/10/postmortem-of-database-outage-of-january-31/
How it happened | Increased load on the database servers (due to spam and/or an automated maintenance script) caused replication between the primary and secondary to fall behind and then fail (see the replication-lag sketch below). During mitigation, an engineer accidentally deleted data from the primary database server, thinking they were operating on the secondary.
Architecture    | PostgreSQL database with primary and secondary servers. Azure disk snapshots, Logical Volume Manager (LVM) snapshots, and full backups uploaded to Amazon S3.
Technologies    | PostgreSQL, Azure Disk Snapshots
Root cause      | High load on the database servers; the accidental removal of 300 GB of data from the primary database server (during mitigation).
Failure         | Data replication between the primary and secondary servers fell behind and then failed. Data was removed from the primary database.
Impact          | Service outage and permanent data loss (5,000 projects, 5,000 comments, and 700 new accounts).
Mitigation      | Responders restored the database from the Logical Volume Manager (LVM) snapshot created 6 hours before the outage (see the restore sketch below).
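
The failure began with streaming replication falling behind and then stopping. The following is a minimal sketch of how lag of this kind can be monitored from the primary; it assumes the psycopg2 driver, a placeholder connection string, and an arbitrary alert threshold. GitLab ran PostgreSQL 9.6 at the time, so the 9.x function and column names (pg_xlog_*, *_location) are used here; PostgreSQL 10+ renames them to pg_wal_* and *_lsn.

```python
# Sketch: report streaming-replication lag as seen from the primary.
# Connection details and the alert threshold are hypothetical.
import psycopg2

LAG_ALERT_BYTES = 512 * 1024 * 1024  # arbitrary example threshold (512 MiB)

def check_replication_lag(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT application_name,
                   state,
                   pg_xlog_location_diff(pg_current_xlog_location(),
                                         replay_location) AS lag_bytes
            FROM pg_stat_replication
            """
        )
        rows = cur.fetchall()
        if not rows:
            # No connected standbys: replication has stopped entirely.
            print("WARNING: no standbys connected")
        for name, state, lag_bytes in rows:
            flag = "ALERT" if lag_bytes and lag_bytes > LAG_ALERT_BYTES else "ok"
            print(f"{flag}: standby={name} state={state} lag={lag_bytes or 0} bytes")

if __name__ == "__main__":
    # Hypothetical DSN; host, database, and user depend on the deployment.
    check_replication_lag("host=db-primary.example.com dbname=postgres user=monitor")
```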
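
Recovery relied on an LVM snapshot taken about six hours before the outage. The sketch below shows one common way to roll a PostgreSQL data volume back to an LVM snapshot by merging the snapshot into its origin; the volume names, mount point, and service name are hypothetical, and the specifics of GitLab's actual recovery differed and are described in the full report.

```python
# Sketch: roll a PostgreSQL data volume back to an LVM snapshot.
# Volume, mount point, and service names are hypothetical examples.
import subprocess

ORIGIN_LV = "/dev/vg_data/pgdata"          # hypothetical origin logical volume
SNAPSHOT_LV = "/dev/vg_data/pgdata_snap"   # hypothetical snapshot taken earlier

def run(cmd: list[str]) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def restore_from_snapshot() -> None:
    # 1. Stop PostgreSQL so nothing writes to the data directory.
    run(["systemctl", "stop", "postgresql"])
    # 2. Unmount the origin volume; the merge only proceeds while the
    #    origin is not in use.
    run(["umount", ORIGIN_LV])
    # 3. Merge the snapshot back into the origin, returning the volume
    #    to the state it had when the snapshot was created.
    run(["lvconvert", "--merge", SNAPSHOT_LV])
    # 4. Remount and restart; PostgreSQL performs crash recovery and
    #    comes up at the snapshot's point in time, so any writes made
    #    after the snapshot are lost.
    run(["mount", ORIGIN_LV, "/var/lib/postgresql"])
    run(["systemctl", "start", "postgresql"])

if __name__ == "__main__":
    restore_from_snapshot()
```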