Outage post-mortem

“For the past couple of days, we’ve been working around the clock to restore full access as soon as possible.”
Incident #20 at Dropbox on 2014/01/10 by Akhil Gupta (Head of Infrastructure)
Full report https://dropbox.tech/infrastructure/outage-post-mortem
How it happened During a database upgrade, a defect in the upgrade script led it to believe several active databases were inactive and so it performed the upgrade. Database transactions/replication actions were interrupted and master-replica pairs failed.
Architecture Thousands of databases, each with a master machine and two replica machines.
Technologies MySQL
Root cause An upgrade script had a defect in the way that it determined whether a database machine was active (and therefore not safe to upgrade).
Failure Multiple databases failed.
Impact An outage for the service depending on those databases.
Mitigation A recovery from backup for each affected database, which took from a few hours to 2 days depending on the database.