Incident | #7 at Salesforce on 2016/05/09
Full report | https://help.salesforce.com/articleView?id=000315819&language=en_US&type=1&mode=1
How it happened | Due to a failed circuit breaker in the primary data center, responders manually switched the application instance from the primary to the secondary data center. The database in the secondary data center then became corrupted by a storage firmware defect triggered under high load (load from the automatic processes involved in establishing the new primary database, plus a backlog of traffic from the power-related downtime), and the corruption was replicated to the standby database (see the first sketch below). Once the database was corrupted, the database cluster failed and could not be restarted, and the latest available backup was from the previous day.
Architecture | Web application with primary and secondary data centers, backed by a database cluster
Technologies |
Root cause | A circuit breaker failure (cause unknown); a storage firmware defect triggered under high load
Failure | Database cluster was corrupted, failed, and could not be restarted
Impact | Complete outage of service for 16 hours, plus some degraded service time
Mitigation | Restored the database from the most recent backup and manually replayed missing transactions from the redo log until the database was needed for the next day's peak traffic, leaving approximately 3.5 hours of logs not re-applied (see the second sketch below). Competing, inessential activities were halted.
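
A minimal sketch of the failure mode described in "How it happened", assuming a simple block-shipping replication model; the `Primary`/`Standby` classes and `firmware_bug` are illustrative, not Salesforce's architecture. The point is that the standby applies whatever bytes the primary persists, so corruption introduced below the database (here, by a storage firmware defect) is replicated verbatim and neither copy is usable:

```python
# Illustrative only: replication offers no protection against corruption
# introduced below the database, because the standby cannot tell a
# corrupted block from a valid write.

def firmware_bug(data: bytes) -> bytes:
    """Stand-in for the load-triggered storage defect: silently mangle the block."""
    return data[:-1] + b"\x00"

class Standby:
    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}

    def apply(self, block_id: int, data: bytes) -> None:
        # Applies whatever arrives; corrupted blocks look like normal writes.
        self.blocks[block_id] = data

class Primary:
    def __init__(self, standby: Standby) -> None:
        self.blocks: dict[int, bytes] = {}
        self.standby = standby

    def write(self, block_id: int, data: bytes) -> None:
        data = firmware_bug(data)           # corruption happens below the database
        self.blocks[block_id] = data        # primary persists the bad block...
        self.standby.apply(block_id, data)  # ...and ships it to the standby

if __name__ == "__main__":
    standby = Standby()
    primary = Primary(standby)
    primary.write(7, b"ORDER-12345")
    assert primary.blocks[7] == standby.blocks[7]  # both copies are corrupted
    print("standby block:", standby.blocks[7])
```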
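The mitigation follows the standard point-in-time recovery pattern: restore the last good backup, then replay redo-log entries in order up to a cutoff. A minimal sketch, assuming a toy key-value store; the `RedoEntry`/`recover` names and the "key=value" change format are hypothetical, not Salesforce's tooling:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RedoEntry:
    timestamp: datetime
    statement: str  # the logged change; "key=value" in this toy model

def apply_statement(db: dict, statement: str) -> None:
    # A real redo record is a low-level physical/logical change, not SQL text.
    key, _, value = statement.partition("=")
    db[key] = value

def recover(backup: dict, redo_log: list[RedoEntry], cutoff: datetime):
    """Restore the backup image, then re-apply redo entries in order up to
    the cutoff; return the recovered state and the entries left unapplied."""
    db = dict(backup)  # start from the day-old backup
    unapplied = [e for e in redo_log if e.timestamp > cutoff]
    for entry in redo_log:
        if entry.timestamp <= cutoff:
            apply_statement(db, entry.statement)
    return db, unapplied

if __name__ == "__main__":
    start = datetime(2016, 5, 9, 0, 0)
    backup = {"orders": "day-old snapshot"}
    redo_log = [RedoEntry(start + timedelta(hours=h), f"k{h}=v{h}")
                for h in range(24)]
    # Stop replaying ~3.5 hours short of the end, as in this incident,
    # so the instance is ready before the next day's peak traffic.
    cutoff = start + timedelta(hours=20, minutes=30)
    db, skipped = recover(backup, redo_log, cutoff)
    print(f"recovered {len(db)} keys; {len(skipped)} redo entries not re-applied")
```

The cutoff is the key trade-off here: replaying the full log maximizes recovered data, but responders accepted losing roughly 3.5 hours of entries in exchange for having the instance ready before peak traffic.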