Incident | #7 at Salesforce on 2016/05/09
Full report | https://help.salesforce.com/articleView?id=000315819&language=en_US&type=1&mode=1
How it happened | Due to a failed circuit breaker in the primary data center, responders manually switched the application instance from the primary to the secondary data center. The database in the secondary data center then became corrupted by a storage firmware defect triggered under high load (load from the automatic processes involved in establishing the new primary database, plus a backlog of traffic from the power-related downtime), and the corruption was replicated to the standby database (see the first sketch below). Once the database was corrupted, the database cluster failed and could not be restarted, and the latest available backup was from the previous day.
Architecture | Web application with primary and secondary data centers, backed by a database cluster
Technologies |
Root cause | A circuit breaker failure (cause unknown); a storage firmware defect triggered under high load
Failure | Database cluster was corrupted, failed, and could not be restarted
Impact | Complete outage of service for 16 hours, plus some degraded service time
Mitigation | Restored the database from the most recent backup and manually replayed missing transactions from the redo log until the database was needed for the next day's peak traffic, leaving approximately 3.5 hours of logs not re-applied (see the second sketch below). Competing, inessential activities were halted.
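
A minimal sketch of the failure mode described in "How it happened", assuming a simple block-shipping replication model; the `Primary`/`Standby` classes and `firmware_bug` are illustrative, not Salesforce's architecture. The point is that the standby applies whatever bytes the primary persists, so corruption introduced below the database (here, by a storage firmware defect) is replicated verbatim and neither copy is usable:

```python
# Illustrative only: replication offers no protection against corruption
# introduced below the database, because the standby cannot tell a
# corrupted block from a valid write.

def firmware_bug(data: bytes) -> bytes:
    """Stand-in for the load-triggered storage defect: silently mangle the block."""
    return data[:-1] + b"\x00"

class Standby:
    def __init__(self) -> None:
        self.blocks: dict[int, bytes] = {}

    def apply(self, block_id: int, data: bytes) -> None:
        # Applies whatever arrives; corrupted blocks look like normal writes.
        self.blocks[block_id] = data

class Primary:
    def __init__(self, standby: Standby) -> None:
        self.blocks: dict[int, bytes] = {}
        self.standby = standby

    def write(self, block_id: int, data: bytes) -> None:
        data = firmware_bug(data)           # corruption happens below the database
        self.blocks[block_id] = data        # primary persists the bad block...
        self.standby.apply(block_id, data)  # ...and ships it to the standby

if __name__ == "__main__":
    standby = Standby()
    primary = Primary(standby)
    primary.write(7, b"ORDER-12345")
    assert primary.blocks[7] == standby.blocks[7]  # both copies are corrupted
    print("standby block:", standby.blocks[7])
```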
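The mitigation follows the standard point-in-time recovery pattern: restore the last good backup, then replay redo-log entries in order up to a cutoff. A minimal sketch, assuming a toy key-value store; the `RedoEntry`/`recover` names and the "key=value" change format are hypothetical, not Salesforce's tooling:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RedoEntry:
    timestamp: datetime
    statement: str  # the logged change; "key=value" in this toy model

def apply_statement(db: dict, statement: str) -> None:
    # A real redo record is a low-level physical/logical change, not SQL text.
    key, _, value = statement.partition("=")
    db[key] = value

def recover(backup: dict, redo_log: list[RedoEntry], cutoff: datetime):
    """Restore the backup image, then re-apply redo entries in order up to
    the cutoff; return the recovered state and the entries left unapplied."""
    db = dict(backup)  # start from the day-old backup
    unapplied = [e for e in redo_log if e.timestamp > cutoff]
    for entry in redo_log:
        if entry.timestamp <= cutoff:
            apply_statement(db, entry.statement)
    return db, unapplied

if __name__ == "__main__":
    start = datetime(2016, 5, 9, 0, 0)
    backup = {"orders": "day-old snapshot"}
    redo_log = [RedoEntry(start + timedelta(hours=h), f"k{h}=v{h}")
                for h in range(24)]
    # Stop replaying ~3.5 hours short of the end, as in this incident,
    # so the instance is ready before the next day's peak traffic.
    cutoff = start + timedelta(hours=20, minutes=30)
    db, skipped = recover(backup, redo_log, cutoff)
    print(f"recovered {len(db)} keys; {len(skipped)} redo entries not re-applied")
```

The cutoff is the key trade-off here: replaying the full log maximizes recovered data, but responders accepted losing roughly 3.5 hours of entries in exchange for having the instance ready before peak traffic.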