Outage Postmortem - January 24 2017

“It took us 2 minutes to notice the issue, 5 minutes to locate the source of the issue and 10 minutes to get service restored.”
Incident #29 at Stack Exchange on 2017/01/24
Full report https://stackstatus.net/post/156407746074/outage-postmortem-january-24-2017
How it happened A bugcheck in the primary SQL Server placed the primary in a read-only state, but application-level failover was disabled due to a code defect, so the SQL Server failed and the network went offline
Architecture Applications that depend on a SQL Server primary that has multiple standby servers
Technologies SQL Server
Root cause A SQL Server bugcheck caused by a (suspected) bad memory chip, and a failover-related code defect
Failure The service went into read-only mode for approximately 5 minutes and was offline for 12 minutes
Impact Service outage and 3.5 seconds of data loss
Mitigation A sanity check of SQL Server health was completed and the sites were put back into read-write mode
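
Illustration The application-level failover named under "How it happened" can be sketched roughly as below. This is a minimal, hypothetical model and not Stack Exchange's actual code: the names Node, Mode, choose_mode and failover_enabled are assumptions made for illustration, and a standby is treated as able to take writes as soon as it is selected, which skips the real promotion step. It only shows why a disabled failover path leaves a service stuck in read-only mode or offline when the primary stops accepting writes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Mode(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"
    OFFLINE = "offline"


@dataclass
class Node:
    # Hypothetical view of one database server as seen by the application.
    name: str
    reachable: bool = True
    writable: bool = True  # simplification: ignores the promotion step


def choose_mode(
    primary: Node,
    standbys: List[Node],
    failover_enabled: bool = True,
) -> Tuple[Mode, Optional[Node]]:
    """Pick the service mode and the node the application should talk to."""
    # Normal case: the primary is healthy and accepts writes.
    if primary.reachable and primary.writable:
        return Mode.READ_WRITE, primary

    # Application-level failover: move writes to the first healthy standby.
    if failover_enabled:
        for standby in standbys:
            if standby.reachable and standby.writable:
                return Mode.READ_WRITE, standby

    # No write target: degrade to read-only on any reachable node.
    for node in [primary, *standbys]:
        if node.reachable:
            return Mode.READ_ONLY, node

    # Nothing reachable at all.
    return Mode.OFFLINE, None


if __name__ == "__main__":
    primary = Node("primary", writable=False)  # primary stuck in read-only state
    standbys = [Node("standby-1"), Node("standby-2")]

    mode, node = choose_mode(primary, standbys)
    print(mode.value, node.name if node else None)

    # With failover disabled (the defect in this sketch), writes have nowhere to go.
    mode, node = choose_mode(primary, standbys, failover_enabled=False)
    print(mode.value, node.name if node else None)
```

With failover enabled, the sketch moves writes to the first healthy standby. With it disabled, the best available outcome is read-only service, and the service is offline once no node is reachable.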