Incident
|
#20 at
Dropbox on
2014/01/10 by Akhil Gupta (Head of Infrastructure)
|
Full report
|
https://dropbox.tech/infrastructure/outage-post-mortem
|
How it happened
|
During a database upgrade, a defect in the upgrade script led it to believe several active databases were inactive and so it performed the upgrade. Database transactions/replication actions were interrupted and master-replica pairs failed.
|
Architecture
|
Thousands of databases, each with a master machine and two replica machines.
|
Technologies
|
MySQL
|
Root cause
|
An upgrade script had a defect in the way that it determined whether a database machine was active (and therefore not safe to upgrade).
|
Failure
|
Multiple databases failed.
|
Impact
|
An outage for the service depending on those databases.
|
Mitigation
|
A recovery from backup for each affected database, which took from a few hours to 2 days depending on the database.
|